Unscheduled outage on multiple clusters

April 30, 2021 1:45pm EDT UPDATE:

As of 1:45pm EDT, the WSC Hadoop cluster has been returned to full service as well. Please report any issues to rcac-help@purdue.edu.

April 29, 2021 10:16pm EDT UPDATE:

As of 10:16pm EDT, the CMS, Bell, Brown, Gilbreth, Halstead, WCERES, and WSC Hadoop clusters have been returned to normal service. Job queues have been enabled and job scheduling has resumed. Please note that during the outage many nodes may have been powered off or lost their connection to scratch storage while jobs were running, so you may want to check the status and output of your jobs and resubmit them if necessary.

The WSC Hadoop cluster is currently operating with reduced functionality due to issues with the Spark history server. Troubleshooting is ongoing, and we will provide another update by 2pm tomorrow.

We apologize for the disruption of service. Please report any issues to rcac-help@purdue.edu.

April 29, 2021 8:28pm EDT UPDATE:

Work continues on bringing the CMS, Bell, Brown, Gilbreth, Halstead, WCERES, and WSC Hadoop clusters back to normal operation. We will provide another update by midnight.

ORIGINAL:

Due to problems with the cooling system in the MATH datacenter, the CMS, Bell, Brown, Gilbreth, Halstead, WCERES, and WSC Hadoop clusters began experiencing issues around 4:00pm EDT. Multiple front-end, compute, and storage services are affected. Engineers are currently diagnosing the issue and working to identify a fix. Subsets of cluster nodes have been powered off and job scheduling has been paused in order to limit the thermal footprint of the affected systems.

We will provide an update by 9pm.
