Industrial Engineering students present their RCAC Capstone Project

Six students from the Edwardson School of Industrial Engineering recently completed their senior capstone project, which focused on predicting and reducing downtimes for RCAC’s computing systems. The group worked toward creating a predictive-maintenance artificial intelligence (AI) tool for monitoring the high-performance computing (HPC) infrastructure, and presented their work at an Industrial Engineering student poster session.

Unplanned downtimes, meaning interruptions in HPC, storage, and network services, are unfortunate yet not uncommon in data centers worldwide. They can cost researchers computing time and results, and they can also incur significant financial costs for the host facility. By their nature, unplanned downtimes stem from issues that no one predicted. If a data center could predict such incidents, its staff could take preemptive action to avoid outages. This is precisely the problem the Industrial Engineering students set out to solve.

The six students who took on the project were Zechen Wei, Hongchen Liu, Zachary Ramirez, Nicolai Cronin, Carlos Cordova, and Justin Ha. The team worked under the supervision of RCAC staff members Kyle Purple, Ashish, and Samuel Weekly. The project was a continuation of a previous semester’s capstone project, with this group picking up where the earlier group left off. They focused their efforts on the Anvil supercomputer, Purdue’s powerful, nationally resourced, NSF-funded system. Anvil is available to researchers nationwide, providing them with advanced computing capabilities via the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program. Because Anvil has such a large and diverse pool of users, it was an easy choice as the system the students would tailor their work toward.

Throughout the semester, the team of students analyzed historical failure data from the Anvil system to identify problem areas and pain points. The group found that the most common problem on Anvil stemmed from a lack of communication between the nodes and Zabbix, the open-source software used to monitor and track the performance of the system. They then used four different machine learning approaches to create their predictive model. Once the model was complete, the group assessed its accuracy, determined its limitations, and created a Grafana visualization showcasing the model data.
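For readers curious what the first step of such an analysis might look like, the following is a minimal sketch, not the students' actual code or model. It assumes historical incidents are available as simple (node, category, timestamp) records; the record layout, the category names, the function names, and the sample data are all hypothetical and made up for illustration. It does not call Zabbix or Grafana. It ranks failure categories by frequency and flags nodes that repeat the same failure several times within a time window.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical incident records: (node, failure category, timestamp).
# These values are invented for illustration; they are not Anvil data.
INCIDENTS = [
    ("node-a", "monitoring_unreachable", datetime(2024, 1, 1, 8, 0)),
    ("node-a", "monitoring_unreachable", datetime(2024, 1, 1, 14, 0)),
    ("node-a", "monitoring_unreachable", datetime(2024, 1, 2, 2, 0)),
    ("node-b", "monitoring_unreachable", datetime(2024, 1, 1, 9, 0)),
    ("node-b", "monitoring_unreachable", datetime(2024, 1, 5, 9, 0)),
    ("node-b", "disk_error", datetime(2024, 1, 3, 12, 0)),
    ("node-c", "network_timeout", datetime(2024, 1, 2, 16, 0)),
]


def rank_failure_categories(incidents):
    """Return (category, count) pairs, most frequent category first."""
    return Counter(category for _, category, _ in incidents).most_common()


def nodes_with_repeat_failures(incidents, category, window, min_count):
    """Return nodes with at least min_count incidents of the given
    category inside any time span no longer than window."""
    times_by_node = defaultdict(list)
    for node, cat, when in incidents:
        if cat == category:
            times_by_node[node].append(when)

    flagged = []
    for node, times in times_by_node.items():
        times.sort()
        start = 0
        # Slide a window over the sorted timestamps.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= min_count:
                flagged.append(node)
                break
    return sorted(flagged)


if __name__ == "__main__":
    print(rank_failure_categories(INCIDENTS))
    print(nodes_with_repeat_failures(
        INCIDENTS, "monitoring_unreachable", timedelta(hours=24), 3))
```

A frequency ranking like this only identifies where problems cluster. A predictive model such as the one the students built would need to go further and learn from the signals that precede a failure.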

“Working with the Edwardson School students was a highly rewarding experience,” says Samuel Weekly, Associate Research Solutions Engineer for RCAC and mentor to the students during their project. “The team successfully tackled numerous challenges while learning new technologies and developed a monitoring solution that RCAC will implement in our systems. This solution provides a solid foundation to be further enhanced, serving to improve insight into our data center environment and HPC systems.”

While the students were happy with what they accomplished during the semester, they noted that they wished they had more time to work on the project. Their current model works with historical data. This was essential for building and assessing the predictive model, but the group wants to move to live data, which would give RCAC a real-time automation tool. The ultimate goal of their work is an automated alert system that can trigger immediate actions to prevent downtimes, not only for the Anvil system but for all of RCAC’s HPC resources.

The six students ended the semester by showcasing their work at the Industrial Engineering poster event. RCAC staff members stopped by to view their poster presentation and were thrilled with how well the group presented. Overall, the team’s efforts were a resounding success, resulting in a tool that RCAC can use and build upon in the future.

RCAC has a robust student employment program, CI-XP (Cyber Infrastructure-eXperience), with multiple opportunities for student workers across a wide range of teams and departments within RCAC. To learn more about the CI-XP program, please visit: https://www.rcac.purdue.edu/ci-xp

Written by: Jonathan Poole, poole43@purdue.edu
