Unscheduled outage on multiple clusters

UPDATE: April 30, 2021 1:45pm

As of 1:45pm, the WSC Hadoop cluster has been returned to full service as well. Please report any issues to rcac-help@purdue.edu.

UPDATE: April 29, 2021 10:16pm

As of 10:16pm, the CMS, Bell, Gilbreth, Brown, Halstead, WCERES, and WSC Hadoop clusters have been returned to normal service. Job queues have been enabled and job scheduling has resumed. Please note that during the outage many nodes may have been powered off or lost connection to scratch storage while jobs were running, so we recommend checking the status and output of your jobs and resubmitting them if necessary.

The WSC Hadoop cluster is currently operating with reduced functionality due to issues with the Spark history server. Troubleshooting is ongoing, and we will provide another update by 2pm tomorrow.

We apologize for the disruption of service. Please report any issues to rcac-help@purdue.edu.

UPDATE: April 29, 2021 8:28pm

Work continues on bringing the CMS, Bell, Gilbreth, Brown, Halstead, WCERES, and WSC Hadoop clusters back to normal operation. We will provide another update by midnight.

ORIGINAL: April 29, 2021 4:00pm - April 30, 2021 1:45pm EDT

Due to problems with the cooling system in the MATH datacenter, the CMS, Bell, Gilbreth, Brown, Halstead, WCERES, and WSC Hadoop clusters began experiencing issues around 4:00pm. Multiple front-end, compute, and storage services are affected. Engineers are currently diagnosing the issue and working to identify a fix. Subsets of cluster nodes have been powered off and job scheduling has been paused in order to limit the thermal footprint of the affected systems.

We will provide an update by 9pm.

Originally posted: April 29, 2021 4:19pm EDT