Unscheduled Gilbreth outage

January 3, 2020  11:30am – January 5, 2020  8:35pm
Gilbreth

UPDATE: January 5, 2020  8:35pm

As of 8:35pm, the Gilbreth cluster has been returned to normal service. Job queues have been enabled and job scheduling has been resumed. For a limited time, scratch performance may be somewhat degraded while the file system continues to recover from the failure.

We apologize for the disruption of service. Please report any issues to rcac-help@purdue.edu.


UPDATE: January 4, 2020  8:10pm

Storage engineers have replaced the malfunctioning scratch controller hardware, and work continues on system verification.

We will provide another update as verification progresses.


UPDATE: January 3, 2020  1:57pm

Engineers are engaged with the vendor, and work continues on troubleshooting the unresponsive Gilbreth scratch file system. The problem appears to be a hardware failure, and the outage is likely to extend into the weekend. We do not expect any data loss at this time and will continue to monitor the safety of the data.

We will provide another update as more information becomes available.


ORIGINAL: January 3, 2020  11:42am

The Gilbreth cluster began experiencing issues with its scratch filesystem around 11:30am. Engineers are currently diagnosing the issue and are working to identify a fix. Job scheduling has been paused while this issue is being addressed.

We will provide an update by 2pm.

Originally posted: January 3, 2020  11:42am