Unscheduled outage on multiple clusters
UPDATE: April 30, 2021 1:45pm
As of 1:45pm, the WSC Hadoop cluster has been returned to full service as well. Please report any issues to firstname.lastname@example.org.
UPDATE: April 29, 2021 10:16pm
As of 10:16pm, the CMS, Bell, Gilbreth, Brown, Halstead, WCERES, and WSC Hadoop clusters have been returned to normal service. Job queues have been enabled and job scheduling has resumed. Please note that during the outage many nodes may have been powered off or lost connection to scratch storage while jobs were running, so please check the status and output of your jobs and resubmit if necessary.
The WSC Hadoop cluster is currently operating in a reduced-functionality mode due to issues with the Spark history server. Troubleshooting is ongoing; we will provide another update by 2pm tomorrow.
We apologize for the disruption of service. Please report any issues to email@example.com.
UPDATE: April 29, 2021 8:28pm
Work continues on bringing the CMS, Bell, Gilbreth, Brown, Halstead, WCERES, and WSC Hadoop clusters back to normal operation. We will provide another update by midnight.
ORIGINAL: April 29, 2021 4:00pm - April 30, 2021 1:45pm EDT
Due to problems with the cooling system in the MATH datacenter, the CMS, Bell, Gilbreth, Brown, Halstead, WCERES, and WSC Hadoop clusters began experiencing issues around 4:00pm. Multiple front-end, compute, and storage services are affected. Engineers are currently diagnosing the issue and working to identify a fix. Subsets of cluster nodes have been powered off and job scheduling has been paused in order to limit the thermal footprint of the affected systems.
We will provide an update by 9pm.