College of Engineering, ITaP Research Computing team studies supercomputer reliability

May 5, 2020

Researchers running demanding computations, especially for projects like infectious disease modeling that must be re-run frequently as new data becomes available, rely on supercomputers to run efficiently with as few software failures as possible. The more jobs that fail, the less science gets done.

Understanding why some jobs fail and what can be done to make supercomputers more reliable is the focus of a recent project led by Saurabh Bagchi, a professor of electrical and computer engineering, and ITaP senior research scientist Carol Song.

The project, which began almost five years ago and was supported by three awards from the National Science Foundation (award numbers 1405906, 1513051, and 1513197) totaling over $1.1 million, analyzed data from supercomputer systems at Purdue, as well as the University of Illinois at Urbana-Champaign and the University of Texas-Austin. At Purdue, the Conte and Halstead community clusters were studied.

Among the conclusions Bagchi and Song have drawn:

  • Node-sharing doesn’t translate to a higher rate of job failure.
  • Memory-intensive applications can fail even before the rated memory of the node is reached, which suggests that close monitoring of the memory usage of applications may be necessary.
  • Careful allocation and scaling up of “remote” resources (such as parallel file systems and network connections to storage systems) is important as a cluster grows in size.

Bagchi says these are practical takeaways that supercomputer systems administrators can implement to make applications run on their computers more reliably.

In addition to supporting the team’s own data analysis, Bagchi and Song’s NSF funding paid for the development of an open access repository known as FRESCO, which stores systems data from Purdue’s clusters and UT-Austin’s Stampede supercomputer, along with the team’s conclusions and actionable suggestions for the people who run computer clusters. The team has also included simple scripts that let anyone run their own analysis on the data from the three schools. A similar repository houses the data from the Blue Waters supercomputer located at the National Center for Supercomputing Applications at the University of Illinois.

“We really want the computing community to benefit from this resource,” says Bagchi, of the open source repositories.

Rajesh Kalyanam, a software engineer on Song’s team, developed the technical infrastructure to collect data from supercomputers, and Stephen Harrell, a former ITaP scientific applications analyst, helped get the data from the Purdue clusters onto the FRESCO repository.

“FRESCO not only serves the computer systems researchers designing more dependable systems, it also has the potential to help researchers develop and test new big data algorithms, as well as train students in applying data science methods on real-world datasets,” says Song. “We in ITaP Research Computing are collaborating with faculty on both fronts.”

The team has published its findings in a paper to be presented at the upcoming Dependable Systems and Networks conference, which will be held virtually in June. That paper’s first author is Rakesh Kumar, one of Bagchi’s former graduate students who is now employed at Microsoft. Ravishankar Iyer, the George and Ann Fisher Distinguished Professor of Engineering and professor of electrical and computer engineering at the University of Illinois, is the lead investigator from Illinois. Other researchers on the team include Ashraf Mahgoub from Purdue; Saurabh Jha, Zbigniew Kalbarczyk, and William T. Kramer from the University of Illinois; and Todd Evans and Bill Barth from the University of Texas.

Originally posted: May 5, 2020  1:28pm