Research using DiaGrid and community clusters helps make computer systems we rely on more reliable

  • August 20, 2012
  • Science Highlights

International treaties that ban detonating nuclear weapons for testing are generally viewed as a good thing, but they certainly create challenges for people charged with ensuring the weapons' reliability, in storage as well as in use.

Electrical and computer engineering Professor Saurabh Bagchi's lab is collaborating with the National Nuclear Security Administration's (NNSA) Lawrence Livermore National Laboratory in California on extremely accurate computer simulations that can take the place of live testing.

Run on tens of thousands of processors in powerful supercomputers — NNSA's Sequoia supercomputer, developed by IBM, was named the world’s fastest in June 2012 — these simulations, as with any complex software, pose their own reliability problems. Bagchi and his students are developing methods to identify and locate bugs automatically, allowing fixes to be implemented rapidly.

Bagchi's lab uses supercomputers at Lawrence Livermore, as well as some of the nation's other top machines through the National Science Foundation's XSEDE research network. Purdue is an XSEDE partner, providing hardware, custom software and expert user support through ITaP Research Computing.

But the researchers also employ Purdue's community clusters and the University's DiaGrid distributed computing system in developing, testing and refining their work. In the case of the bug tracking system, glitches often don’t appear until a certain number of processors come into play, making access to large-scale clusters like Purdue's vital for development and testing.

"Going forward this work is going to become more and more important because the complexity and scale of the software systems we rely on is increasing," Bagchi says.

The research isn't limited to national security. Bagchi has worked with AT&T on making smartphones smarter, able to tell in advance when a signal from a tower is about to drop, interrupting a call or a data stream, and to adjust accordingly, for instance by switching to another tower, even if the signal from the alternate tower is not as strong at the moment.

"Our work in all of this is driven by actual systems which have problems," Bagchi says. "We try to predict when a failure is coming down the pike, detect when a failure has happened and then, importantly, pinpoint what that failure is due to."

Bagchi's research focuses on software that makes heterogeneous distributed computing systems more robust, reliable and secure. That can range from the mix of software and data run on cluster supercomputers to the mix of hardware in a distributed system like DiaGrid, which pools a variety of machines in offices, student computer labs, server rooms and clusters. It also includes networks of embedded nodes cooperating for information gathering and analysis, whether radio frequency identification (RFID) tags that retailers use to track inventory, not to mention buying habits, or location sensors attached to wild zebras to study their movements.

As these networks of small nodes become almost ubiquitous, Bagchi sees them as a research area with ample potential for advancing the technology. Take a project his lab did building a carbon dioxide sensor network in Purdue’s Pao Hall of Visual and Performing Arts. The system accurately detects CO2 levels within the building. But what if it also could automatically activate the ventilation system to pump in fresh air when the level reaches a certain point?

This kind of "actuation" is the next step for such systems, which present some interesting challenges, Bagchi says. Fault tolerance is one, especially given the constraints on power, bandwidth and memory in tiny devices like RFID tags.
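As a rough illustration of what such actuation could look like, the Python sketch below turns a ventilation fan on and off from a stream of CO2 readings. It is not the lab's system: the function name, the two thresholds (1000 and 800 ppm) and the rule of holding the previous state when a reading is lost are all assumptions chosen for the example.

```python
def ventilation_states(readings_ppm, on_at=1000, off_at=800):
    """Return the fan state (True = on) after each CO2 reading.

    Hypothetical rule: switch on at or above `on_at`, switch off at or
    below `off_at`, and keep the previous state in between. A reading of
    None stands for a dropped sensor message and also keeps the state.
    """
    fan_on = False
    states = []
    for ppm in readings_ppm:
        if ppm is None:
            pass  # lost reading: tolerate the fault, hold the last state
        elif ppm >= on_at:
            fan_on = True
        elif ppm <= off_at:
            fan_on = False
        states.append(fan_on)
    return states


readings = [600, 950, 1020, None, 900, 790, 850]
print(ventilation_states(readings))
```

Using two thresholds instead of one keeps the fan from flickering when the level hovers near the trigger point, and holding the last state on a missing reading is one simple way to cope with an unreliable node.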

Bagchi and his students also are developing technology important on today's cutting edge of high-performance computing, and on the next one. Petascale and exascale computing involve machines with millions of processors. NNSA's petascale Sequoia, the current state of the art, has 1.6 million cores.

Supercomputers typically checkpoint jobs at regular intervals, akin to auto-save in a program like Microsoft Word, so everything doesn't have to be done all over again if the work is interrupted. This happens in the background and is unnoticeable when a few, or a few hundred, processors generate checkpoints and save them to relatively slow static memory. However, with thousands, or hundreds of thousands, or millions of cores in play, it becomes a major bottleneck. Bagchi's lab has developed a way to bundle and compress similar checkpoint data and speed up the process significantly.
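The benefit of bundling similar checkpoint data before compressing it can be seen in a small Python sketch. This is not the lab's method, only a toy model: the checkpoints are made-up arrays that differ in a single entry per process, the standard-library zlib compressor stands in for whatever compression a real system would use, and all function names are hypothetical.

```python
import random
import struct
import zlib

VALUES_PER_RANK = 2000  # hypothetical checkpoint size, in 32-bit integers


def make_checkpoint(rank):
    """Build a fake checkpoint: shared state plus one rank-specific entry."""
    rng = random.Random(42)  # same seed, so every rank shares most values
    values = [rng.randrange(1000) for _ in range(VALUES_PER_RANK)]
    values[rank] = -1  # the only entry that differs between ranks
    return struct.pack(f"<{VALUES_PER_RANK}i", *values)


def compress_separately(checkpoints):
    """Compress each process's checkpoint on its own."""
    return [zlib.compress(c) for c in checkpoints]


def compress_bundled(checkpoints):
    """Concatenate similar checkpoints, then compress them as one stream."""
    return zlib.compress(b"".join(checkpoints))


def restore_bundled(blob, size):
    """Decompress a bundle and split it back into per-process checkpoints."""
    data = zlib.decompress(blob)
    return [data[i:i + size] for i in range(0, len(data), size)]


checkpoints = [make_checkpoint(rank) for rank in range(8)]
separate_bytes = sum(len(b) for b in compress_separately(checkpoints))
bundle = compress_bundled(checkpoints)
print("separate:", separate_bytes, "bytes; bundled:", len(bundle), "bytes")
```

Because the compressor sees all eight nearly identical checkpoints in one stream, it can encode the repeats as references to data it has already seen, so the bundle comes out smaller than the eight separately compressed files and there is less data to write.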

More Information

  • Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, and Rudolf Eigenmann, "mcrEngine: A Scalable Checkpointing System using Data-Aware Aggregation and Compression," accepted for the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing), pp. 1-10, Salt Lake City, Utah, November 10-16, 2012. (One of eight finalists for the best student paper.)
  • Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Saurabh Bagchi, and Todd Gamblin, "Probabilistic Diagnosis of Performance Faults in Large Scale Parallel Applications," 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 1-10, September 19-23, 2012, Minneapolis, Minnesota.
  • Matthew Tan Creti, Mohammad Sajjad Hossain, Saurabh Bagchi and Vijay Raghunathan, "AVEKSHA: A Hardware-Software Approach for Non-intrusive Tracing and Profiling of Wireless Embedded Systems," 9th ACM Conference on Embedded Networked Sensor Systems (SenSys), pp. 288-301, Seattle, Washington, November 1-4, 2011. (Winner of best paper award.)

Originally posted: July 1, 2014 4:16pm EDT