Purdue supercomputer usage and failure data research could make supercomputers even more super
Purdue’s Community Cluster Program supercomputers are used by campus researchers to look at everything from the molecular machinery of viruses to the origins of the universe and myriad science, engineering and social science problems in between.
But a National Science Foundation-funded project spearheaded by Professor Saurabh Bagchi, with ITaP as a partner, is employing Purdue’s community clusters and other supercomputers in research focused on the machines themselves — with the goal of making them work better for other researchers. ITaP senior scientist Carol Song and Preston Smith, ITaP’s director of research services and support, are leading personnel on the project.
The project is building a repository of usage and failure data from supercomputers on and off campus. Analysis of that data can help researchers run their code on the machines more efficiently and reliably and get results faster. ITaP Research Computing staff have already begun drawing on some of Bagchi’s findings to assist community cluster users at Purdue.
“Data is king in this,” says Bagchi, a professor in the School of Electrical and Computer Engineering. “I like to build my solutions with some idea of what the real problem is and it turns out that finding failure data on real computer systems is very, very difficult. This project steps toward remedying that situation.”
With research in computational science, engineering and social science that relies on high-performance computers burgeoning, and with such systems now integral to business and industry as well, the project has national and international implications.
Bagchi’s research focuses on software systems to make heterogeneous distributed computing systems like the community clusters more reliable and secure. He started the usage and failure data project with a pilot collecting and analyzing data from Purdue’s Conte community cluster, which was deployed in 2014.
The success of the pilot prompted the National Science Foundation (NSF) to expand the project. It will now include Conte, the new Rice community cluster and the Hansen cluster at Purdue, along with Blue Waters, a massive supercomputer for use by researchers around the country, located at the National Center for Supercomputing Applications (NCSA) in Illinois. Professors Ravi Iyer and Zbigniew Kalbarczyk from the University of Illinois at Urbana-Champaign are collaborators on the project.
“For us I think a direct benefit is being able to better identify problems and to better support users,” says Song, who heads ITaP Research Computing’s Scientific Solutions Group and is a co-principal investigator on the project.
Stephen Harrell, a scientific application analyst for ITaP Research Computing, already has been able to use some results from the pilot to identify ways to help Conte users run their computations better.
Meanwhile, Song represents Purdue in a consortium of high-performance computing and other advanced digital resource providers, which includes NCSA and Blue Waters, designed to share best practices. Song’s Purdue group also supports users of national supercomputers that are part of the NSF’s Extreme Science and Engineering Discovery Environment (XSEDE), in which Purdue is a partner.
The Purdue team is developing a repository for storing, accessing and analyzing the data on Purdue’s DiaGrid hub and also plans to develop some online analysis and visualization tools.
The data also should be valuable in designing Purdue’s next Community Cluster Program research supercomputer, say Song and Bagchi. Purdue builds a new supercomputing system annually (eight since 2008). The machines are used by hundreds of faculty and their students from throughout the campus to develop new treatments for cancer, improve crop yields to better feed the planet, engineer quieter aircraft, study global climate change, and much more.
Usage and failure data is hard to collect for both technological and psychological reasons, Bagchi says.
From a technical standpoint, a variety of tools exist for monitoring and analyzing such data, and because they record and report it differently, comparing apples to apples across systems is challenging. Moreover, those operating supercomputers are reluctant to introduce monitoring software that may slow down or cause instability in their systems.
The Purdue project is focused on finding the most stable and least demanding combination of tools to yield accurate monitored data, and on standardizing and annotating those results with a library of documentation to make them easier for any researcher to download and use.
Psychologically, the challenge is a natural desire not to advertise usage and, especially, failure data.
“People are very loath to share data about failures,” Bagchi says. “It’s bad news, so people want to keep that kind of data, that kind of publicity, away from the public layer.”
To overcome that tendency, the project anonymizes data that could identify individual users, such as user and application names or the identifying number of the remote computer with which a user accesses a supercomputing system.
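The article does not say how the project anonymizes its records, so the following is only a minimal sketch of one common approach, not the team's actual method. It assumes each job record is a dictionary and replaces identifying fields with a keyed hash (HMAC-SHA256), so the same user always maps to the same pseudonym without the real name appearing in the shared data. The field names `user`, `application` and `remote_ip`, and the sample record, are hypothetical.

```python
import hashlib
import hmac

# Hypothetical field names; the article does not describe the real log schema.
SENSITIVE_FIELDS = ("user", "application", "remote_ip")


def pseudonymize(value: str, key: bytes) -> str:
    """Return a stable, keyed pseudonym for one identifying value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate for readability; 16 hex characters is an arbitrary choice here.
    return digest[:16]


def anonymize_record(record: dict, key: bytes) -> dict:
    """Copy a log record, replacing identifying fields with pseudonyms."""
    cleaned = dict(record)  # leave the original record untouched
    for field in SENSITIVE_FIELDS:
        if field in cleaned:
            cleaned[field] = pseudonymize(str(cleaned[field]), key)
    return cleaned


if __name__ == "__main__":
    # The key must stay private; anyone holding it could re-derive pseudonyms.
    secret_key = b"replace-with-a-secret-kept-out-of-the-repository"
    job = {
        "user": "alice",
        "application": "my_solver",
        "remote_ip": "192.0.2.10",
        "exit_code": 137,
        "nodes": 16,
    }
    print(anonymize_record(job, secret_key))
```

A keyed hash, rather than a plain one, is used in this sketch because user names and addresses are easy to guess and hash by brute force. Non-identifying fields such as the exit code and node count pass through unchanged, which keeps the failure analysis itself intact.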