Unscheduled Bell outage

January 10, 2021  9:00pm – January 13, 2021  1:00am
Bell

UPDATE: January 13, 2021  12:52am

As of 12:45am, engineers have resolved the Bell scratch issue and the cluster has returned to normal service. Job queues have been enabled and job scheduling has resumed. We apologize for the disruption. Please report any issues to rcac-help@purdue.edu.


UPDATE: January 12, 2021  7:10pm

At about 5:45pm on Tuesday, January 12th, 2021, the problem with Bell scratch returned. We are continuing to work with our storage vendor to troubleshoot and fix the issue. Job scheduling on Bell has been stopped again as of 6:50pm.

We appreciate your patience and will provide another update by noon tomorrow.


UPDATE: January 12, 2021  2:51pm

As of 2:30 pm, this issue has been resolved by our engineers working with the storage vendor.

Bell has returned to full operation and job scheduling has resumed.


UPDATE: January 12, 2021  10:29am

As of 10:00 am, work is still ongoing on this issue. Job scheduling on Bell is still paused.

We will post an update by 6:00 pm today.


UPDATE: January 11, 2021  6:09pm

As of 6:00 pm, engineers are continuing to work with the storage system vendor to resolve this problem. Job scheduling on Bell is still paused.

We will post an update by 10 am tomorrow (12 January).


UPDATE: January 11, 2021  11:42am

Engineers are working with the system vendor for Bell scratch to troubleshoot and identify the problem. Scheduling for new jobs is still paused.

We will post an update here by 6:00 pm.


ORIGINAL: January 10, 2021  11:50pm

The Bell cluster began experiencing issues with metadata on its scratch filesystem around 9:00pm. The problem manifests as the ls -l command hanging indefinitely, while plain ls (or \ls, or stat FILE) appears to work.
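
For users who want to check whether a directory shows the symptom described above without tying up a terminal, the following is a minimal sketch and not an official RCAC tool. It runs a command with a time limit and reports whether it finished or appeared to hang. The helper name probe, the 10-second default limit, and the choice of commands are assumptions made for illustration; pass your own scratch directory as the argument.

```python
import subprocess
import sys


def probe(cmd, timeout_s=10.0):
    """Run cmd and return 'ok', 'failed', or 'hung' (no exit within timeout_s).

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in uninterruptible I/O may
        # linger after kill(), so we deliberately do not wait for it here.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "failed"


if __name__ == "__main__":
    # Directory to test; defaults to the current directory.
    path = sys.argv[1] if len(sys.argv) > 1 else "."
    for cmd in (["ls", path], ["ls", "-l", path]):
        print(" ".join(cmd), "->", probe(cmd))
```

A result of "ok" for ls together with "hung" for ls -l would match the behavior reported in this notice. A "hung" result only means the command did not finish within the time limit; it does not by itself identify the cause.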

Engineers are currently diagnosing the issue and have opened a ticket with the vendor to identify a fix. Job scheduling has been paused while this issue is being addressed.

We will provide an update by noon tomorrow.

Originally posted: January 10, 2021  11:50pm