Regular maintenance, being memory conscious among factors that improve supercomputing success

  • October 25, 2018
  • Science Highlights

If you change your car's oil regularly, rotate the tires and otherwise keep up with routine maintenance, you're less likely to experience a breakdown on the road.

Turns out, the same thing could be said about your supercomputer, according to a study led by Purdue researchers Saurabh Bagchi, an electrical and computer engineering professor, and Carol Song, a senior scientist who heads the ITaP Research Computing Scientific Solutions Group.

Computational jobs fail less often on high-performance computing systems whose nodes receive regular maintenance, an analysis of the biggest public data set on supercomputer failures indicates.

Code that uses memory excessively – near or over 50 percent of a node's raw memory capacity – also is linked to job failures, according to the analysis of the data set by the Purdue team and its collaborators, who are part of a National Science Foundation-funded project to make supercomputing systems even more reliable.
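The 50 percent threshold suggests a simple screening rule an administrator or user could apply before a job runs. The function below is an illustrative sketch of that idea only; the node capacity and the flagging policy are assumptions, not part of the study's methodology.

```python
def memory_risk(requested_gb: float, node_capacity_gb: float,
                threshold: float = 0.50) -> bool:
    """Flag a job whose memory request is at or above the risk threshold.

    The 50 percent figure reflects the study's reported finding; treating
    it as a hard flagging rule is an illustrative assumption.
    """
    return requested_gb / node_capacity_gb >= threshold

# Hypothetical 64 GB node: a 40 GB request crosses the 50 percent line.
print(memory_risk(40, 64))  # True
print(memory_risk(20, 64))  # False
```

A scheduler or submission filter could use such a check to steer memory-heavy jobs toward nodes with more free memory, along the lines of the administrator action described later in the article.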

With the use of high-performance computers burgeoning in computational science, engineering and social science research, and the systems now integral in business and industry as well, the project has national and international implications.

The analysis found that 13 percent of the more than 3 million jobs terminated with a non-success return code, Bagchi says.

"What we want to know," Bagchi says, "is why do systems fail? Does it relate to resource usage, how long the node has been up; and how long does the node take to come back?"

The data, collected over two years from Purdue’s Conte community cluster research supercomputer, is intended to let any researcher with an interest explore those questions, while also serving as a resource for validating algorithms against a common data set.

The repository, dubbed FRESCO, “lets other people ask and answer these questions on by far the largest publicly available data set,” Bagchi says.

Bagchi and Song are co-principal investigators on the project, which also will examine data from Blue Waters, a massive supercomputer located at the National Center for Supercomputing Applications (NCSA) in Illinois. Professors Ravi Iyer and Zbigniew Kalbarczyk from the University of Illinois at Urbana-Champaign are collaborators on the project.

Bagchi’s graduate student Rakesh Kumar and undergraduate student Natat Sombuntham analyzed the data for the Conte study and Rajesh Kalyanam from Song’s team handled most of the data processing. Stephen Harrell, a scientific application analyst for ITaP Research Computing, assisted in understanding the operation of the cluster.

In addition to cluster maintenance and memory usage, the researchers looked at the effect of software libraries in building reliable applications. They found that keeping library software updated also increased job success rates, although interim updates in a series sometimes degraded performance until fixed by a later update.

“Remote” resources, such as the network to which a supercomputer is connected or the parallel file system, did not show up as having a significant impact on job failures, Bagchi says.

Bagchi and Kumar say the findings are probably most useful to people who set specifications for, build and administer high-performance computing systems. A system administrator might, for example, shift a job pushing the memory limits on one node to another with more memory free. ITaP Research Computing already has applied some of the findings to improve the operation of Purdue’s research supercomputers.

But there are pointers for supercomputer users in the analysis as well. One big reason for failures is underestimating the “wall time” – an estimate of how long it will take a job to run – in a job submission. Just by estimating better, a user can substantially increase the chances for successful completion.
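One practical way to act on that advice is to pad a best-guess runtime by a safety factor before writing it into the job submission. The helper below is a hypothetical sketch; the 1.5x safety factor is an illustrative choice, not a value recommended by the study.

```python
def padded_walltime(estimated_minutes: float, safety_factor: float = 1.5) -> str:
    """Return an HH:MM:SS wall-time request padded by a safety factor.

    The safety factor is an illustrative assumption; pick one based on
    how variable your job's runtime actually is.
    """
    total_seconds = int(round(estimated_minutes * safety_factor)) * 60
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# A job expected to run 90 minutes gets a 2 hour 15 minute request.
print(padded_walltime(90))  # 02:15:00
```

The padded value would then go into the scheduler's wall-time field (for example, a `walltime` resource request in a batch script), trading a slightly longer queue wait for a much lower chance of the job being killed short of completion.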

To assemble the data set and do the analysis, the researchers collected records of job submissions; resource utilization, such as CPU and memory, for every node on Conte; libraries used by each job; and downtimes or outages for individual nodes or the whole Conte cluster, both scheduled and unscheduled. The project anonymizes data that could identify individual users or details of the file system on any specific machine.
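A common way to anonymize user identifiers in a shared data set like this is to replace each one with a salted one-way hash, so records for the same user remain linkable without revealing who the user is. The article does not describe FRESCO's actual anonymization method, so the snippet below is purely an illustrative sketch of the general technique; the salt value and field names are invented.

```python
import hashlib

def anonymize(user_id: str, salt: str = "example-salt") -> str:
    """Replace an identifier with a short salted one-way hash.

    Illustrative only; the salt and digest length here are assumptions,
    not details of the FRESCO pipeline.
    """
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:12]

# Hypothetical job record: the user field is pseudonymized, the rest kept.
record = {"user": "alice", "node": "node-a042", "mem_pct": 47.3}
record["user"] = anonymize(record["user"])
print(record["user"])  # stable 12-character hex pseudonym
```

Because the hash is deterministic, every job submitted by the same user maps to the same pseudonym, which preserves per-user failure statistics while stripping the identity itself.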

The NSF has supported the research through two awards from its Computer and Information Science and Engineering (CISE) Research Infrastructure (CRI) program.

Bagchi discussed the research at an NSF Grand Challenges in Computer Systems Research workshop in March and a PI meeting in October. The team has already presented at the SC16 supercomputing conference and the 2016 IEEE International Workshop on Program Debugging. The latter presentation, based on an earlier six-month data set from Purdue, also included an analysis by Lawrence Livermore National Lab in California of data from some of its high-performance computing systems, which yielded similar results.

Bagchi’s research focuses on software systems that make heterogeneous distributed computing systems like supercomputers more reliable and secure. He recently won the prestigious international Humboldt Bessel Research Award for his work in dependable systems.

Originally posted: April 23, 2018 1:40pm EDT