Unscheduled Rice scratch outage

September 6, 2018  4:30pm – September 7, 2018  5:30pm

UPDATE: September 10, 2018  8:48am

After working with ITaP storage engineers, vendor experts have identified a bug in the Lustre software that created an inconsistency in the filesystem's accounting of which server was providing access to which piece of data.

The immediate condition was remedied, and a patch will be installed on Rice's storage backends to prevent this issue from returning.

UPDATE: September 7, 2018  6:29pm

As of 5:25pm, the issues with the Rice cluster scratch filesystem were resolved and the cluster has been returned to normal service. Job queues have been enabled and job scheduling has resumed. We apologize for the disruption of service. Please report any issues to rcac-help@purdue.edu.

UPDATE: September 7, 2018  3:17pm

Both our own engineers and vendor engineers have been working to isolate the issue with the Rice scratch storage all day. Unfortunately, the cause of the overload on the filesystem servers remains unidentified. We have been forced to take increasingly drastic measures with regard to stopping or killing running jobs on Rice in order to restore normal filesystem function. Rest assured there is no concern for the data in the filesystem, but all work active on Rice may need to be restarted in order for this to be fully resolved.

We are pursuing several approaches to isolate the issue, and we will update this post by 6:00pm or sooner.

UPDATE: September 7, 2018  9:52am

Rice scratch is stable, but interactivity remains slow, and queues are still paused.

Engineers from the vendor have been engaged overnight and work continues on isolating the source of the responsiveness issue.

UPDATE: September 6, 2018  10:11pm

Work continues on bringing Rice scratch back to normal operation. At the moment, some filesystem operations are returning normally, but for others scratch remains unresponsive. Engineers continue to work into the evening to resolve the problem.

Job scheduling will remain paused until scratch is fully stabilized.

We will provide another update by tomorrow morning at 10 am, if not sooner.

ORIGINAL: September 6, 2018  7:36pm

The Rice cluster began experiencing issues with its scratch system around 4:30pm. Engineers are currently diagnosing the issue and are working to identify a fix. Job scheduling has been paused while this issue is being addressed.

We will provide an update by 10 pm.

Originally posted: September 6, 2018  7:36pm