Halstead and Brown unscheduled outage

February 11, 2019  8:40am – February 12, 2019  4:00pm
Brown, BrownGPU, Halstead, HalsteadGPU

UPDATE: February 12, 2019  4:23pm

As of 4:00 pm, the Halstead and HalsteadGPU scratch system and cluster have been returned to normal service. Job queues have been enabled and job scheduling has resumed. Please report any issues to rcac-help@purdue.edu.

UPDATE: February 12, 2019  10:51am

Brown and BrownGPU scratch has been returned to normal service. Job scheduling has been restarted, so Brown and BrownGPU are back in full production. If you see any lingering issues, please let us know at rcac-help@purdue.edu.

Storage engineers and the vendor continue to work on bringing Halstead/HalsteadGPU scratch back to service. We will provide another update on Halstead by 2 pm today.

UPDATE: February 11, 2019  4:44pm

Both the Halstead and Brown scratch filesystems (each shared with its respective GPU system) were damaged by a power spike during the power outage earlier today. Storage engineers and engineers from the vendor are continuing repair work into this evening.

Job scheduling remains paused. Scratch purges are also canceled this week for the Brown and Halstead scratch filesystems.

We will provide another update by 10:00 am tomorrow morning.

ORIGINAL: February 11, 2019  1:24pm

Halstead, HalsteadGPU, Brown, and BrownGPU went offline during a campus power event around 8:40 am this morning. Engineers are working to bring the compute nodes and the scratch systems back online. Other systems are back online at this time. Job scheduling is currently paused.

We will provide an update by 5 pm this afternoon.

Originally posted: February 11, 2019  1:24pm