Unscheduled Rice scratch outage

September 6, 2018 4:30pm - September 7, 2018 5:30pm EDT
Outages
Rice

UPDATE: September 10, 2018 8:48am EDT

After working with ITaP storage engineers, vendor experts have identified a bug in the Lustre software which created an inconsistency within the filesystem's accounting of which server was providing access to which piece of data.

The immediate condition was remedied, and a patch will be installed on Rice's storage backends to prevent this issue from returning.

UPDATE: September 7, 2018 6:29pm EDT

As of 5:25pm, the issues with the Rice cluster scratch filesystem were resolved and the cluster has been returned to normal service. Job queues have been enabled and job scheduling has resumed. We apologize for the disruption of service. Please report any issues to rcac-help@purdue.edu.

UPDATE: September 7, 2018 3:17pm EDT

Both our own engineers and vendor engineers have been working to isolate the issue with the Rice scratch storage all day. Unfortunately, the cause of the overload on the filesystem servers remains unidentified. We have been forced to take increasingly drastic measures with regard to stopping or killing running jobs on Rice in order to restore normal filesystem function. Rest assured there is no concern for the data in the filesystem, but all work active on Rice may need to be restarted in order for this to be fully resolved.

We continue to pursue several approaches to isolating the issue, and we will update this post by 6:00pm or sooner.

UPDATE: September 7, 2018 9:52am EDT

Rice scratch is stable, but interactivity remains slow, and queues are still paused.

Engineers from the vendor have been engaged overnight and work continues on isolating the source of the responsiveness issue.

UPDATE: September 6, 2018 10:11pm EDT

Work continues on bringing Rice scratch back to normal operation. At the moment, some filesystem operations are returning normally, but for others scratch remains unresponsive. Engineers continue to work into the evening to resolve the problem.

Job scheduling will remain paused until scratch is fully stabilized.

We will provide another update by tomorrow morning at 10 am, if not sooner.

ORIGINAL: September 6, 2018 4:30pm EDT

The Rice cluster began experiencing issues with its scratch system around 4:30pm EDT. Engineers are currently diagnosing the issue and are working to identify a fix. Job scheduling has been paused while this issue is being addressed.

We will provide an update by 10 pm.

Originally posted: September 6, 2018 7:36pm EDT
Last updated: September 10, 2018 8:48am EDT
