Unscheduled Gilbreth outage
January 3, 2020 11:30am – January 5, 2020 8:35pm
As of 8:35pm, the Gilbreth cluster has been returned to normal service. Job queues have been enabled and job scheduling has been resumed. For a limited time, scratch performance may be somewhat degraded while the file system continues to recover from the failure.
We apologize for the disruption of service. Please report any issues to firstname.lastname@example.org.
UPDATE: January 4, 2020 8:10pm
Storage engineers have replaced the malfunctioning scratch controller hardware, and work continues on system verification.
We will provide another update as verification progresses.
UPDATE: January 3, 2020 1:57pm
Engineers are engaged with the vendor, and work continues on troubleshooting the unresponsive scratch file system on Gilbreth. The problem appears to be a hardware failure, and the outage is likely to extend into the weekend. We do not expect any data loss at this time and, as always, will continue to monitor the safety of your data.
We will provide another update as more information becomes available.
ORIGINAL: January 3, 2020 11:42am
The Gilbreth cluster began experiencing issues with its scratch file system around 11:30am. Engineers are diagnosing the issue and working to identify a fix. Job scheduling has been paused while the issue is being addressed.
We will provide an update by 2pm.