Unscheduled Bell outage
January 10, 2021 9:00pm – January 13, 2021 1:00am
As of 12:45am, engineers resolved the Bell scratch issue and the cluster has been returned to normal service. Job queues have been enabled and job scheduling has been resumed. We apologize for the disruption of service. Please report any issues to email@example.com.
UPDATE: January 12, 2021 7:10pm
At about 5:45pm on Tuesday, January 12th, 2021, the problem with Bell scratch returned. Work continues with our storage vendor on troubleshooting and fixing the issue. Job scheduling on Bell has been stopped again as of 6:50pm.
We appreciate your patience and will provide another update by noon tomorrow.
UPDATE: January 12, 2021 2:51pm
As of 2:30 pm, this issue has been resolved by our engineers working with the storage vendor.
Bell has returned to full operation and job scheduling has resumed.
UPDATE: January 12, 2021 10:29am
As of 10:00 am, work is still ongoing on this issue. Job scheduling on Bell is still paused.
We will post an update by 6:00 pm today.
UPDATE: January 11, 2021 6:09pm
As of 6:00 pm, engineers are continuing to work with the storage system vendor to resolve this problem. Job scheduling on Bell is still paused.
We will post an update by 10 am tomorrow (12 January).
UPDATE: January 11, 2021 11:42am
Engineers are working with the system vendor for Bell scratch to troubleshoot and identify the problem. Scheduling for new jobs is still paused.
We will post an update here by 6:00 pm.
ORIGINAL: January 10, 2021 11:50pm
The Bell cluster began experiencing issues with metadata on its scratch filesystem around 9:00pm. The problem manifests itself as the ls -l command hanging indefinitely, while plain ls (and stat FILE) appear to be working.
Engineers are currently diagnosing the issue and have opened a ticket with the vendor to identify a fix. Job scheduling has been paused while this issue is being addressed.
We will provide an update by noon tomorrow.
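For users who want to check for the symptom described above without leaving a shell stuck on a hung listing, the following is a minimal sketch only, not a tool provided or endorsed by the cluster staff. It runs a command with a time limit and reports whether the command finished. The function name completes_within, the 10-second default, and the path in the usage comment are hypothetical placeholders. A process blocked inside the kernel on filesystem I/O may not exit promptly even after it is killed, so the sketch does not wait for it.

```python
import subprocess


def completes_within(cmd, timeout_s=10.0):
    """Return True if cmd exits within timeout_s seconds, else False.

    Only completion is checked, not the exit status: a command that
    fails quickly still counts as "completed".
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Best effort: a process stuck on filesystem I/O may linger,
        # so we send the kill signal but do not block waiting for it.
        proc.kill()
        return False


# Example usage (the path is a placeholder, substitute your own directory):
#   completes_within(["ls", "-l", "/path/to/your/scratch"])
#   completes_within(["stat", "/path/to/your/scratch"])
```

If the long listing times out while the stat call returns, that matches the behavior reported in this notice.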