Unscheduled outage on multiple clusters

April 29, 2021 4:00pm - April 30, 2021 1:45pm EDT
Outages
Bell, Brown, Gilbreth, Halstead, WCERES, WSC Hadoop

UPDATE: April 30, 2021 1:45pm EDT

As of 1:45pm EDT, the WSC Hadoop cluster has been returned to full service as well. Please report any issues to rcac-help@purdue.edu.

UPDATE: April 29, 2021 10:16pm EDT

As of 10:16pm EDT, the CMS, Bell, Brown, Gilbreth, Halstead, WCERES, and WSC Hadoop clusters have been returned to normal service. Job queues have been enabled and job scheduling has been resumed. Please note that during the outage many nodes may have been powered off or lost connection to scratch storage mid-flight, so you might want to check the status and output of your jobs, and resubmit if necessary.

The WSC Hadoop cluster is currently operating with reduced functionality due to issues with the Spark history server. Troubleshooting is ongoing, and we will provide another update by 2pm tomorrow.

We apologize for the disruption of service. Please report any issues to rcac-help@purdue.edu.

UPDATE: April 29, 2021 8:28pm EDT

Work continues on bringing the CMS, Bell, Brown, Gilbreth, Halstead, WCERES, and WSC Hadoop clusters back to normal operation. We will provide another update by midnight.

ORIGINAL: April 29, 2021 4:00pm EDT

Due to problems with the cooling system in the MATH datacenter, the CMS, Bell, Brown, Gilbreth, Halstead, WCERES, and WSC Hadoop clusters began experiencing issues around 4:00pm EDT. Multiple front-end, compute, and storage services are affected. Engineers are currently diagnosing the issue and working to identify a fix. Subsets of cluster nodes have been powered off and job scheduling has been paused in order to limit the thermal footprint of the affected systems.

We will provide an update by 9pm.

Originally posted: April 29, 2021 4:19pm EDT
Last updated: April 30, 2021 1:45pm EDT
