Unscheduled Data Depot outage on the clusters

April 17, 2020  5:00pm – April 22, 2020  3:40pm
Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, Workbench

UPDATE: April 22, 2020  3:40pm

As of April 22, 2020  3:40pm, the Data Depot filesystem on the Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench clusters has been returned to normal service. All jobs that we temporarily held have been released (jobs that were manually held by their owners have not been released).

We apologize for the disruption of service and thank you for your patience. Please report any issues to rcac-help@purdue.edu.


UPDATE: April 22, 2020  11:54am

Data Depot is now available on front-ends and compute nodes on Gilbreth, Snyder, Scholar, Rice and Workbench. Work continues on bringing it back on the rest of the clusters.

We will provide another update by 4pm or as soon as we have any additional information.


UPDATE: April 22, 2020  9:00am

Additional filesystem checks and overnight stability tests on Data Depot were successful. Systems engineers will begin the process of restoring the mount on compute nodes on a per-cluster basis, while continuing to monitor the health of the Data Depot.

We will provide another update by noon.


UPDATE: April 21, 2020  9:34pm

Work continues on bringing Data Depot to normal operation.

Please note that SMB/Windows Network Drive access may currently suffer from intermittent failures.

Job scheduling is enabled on Brown, Gilbreth, Halstead, Hammer, Rice, Scholar and Snyder while Research Data Depot is temporarily unavailable on cluster compute nodes. This entails several unusual consequences which you should be aware of. Please refer to the following article for detailed explanations: Running Jobs on Community Clusters While Data Depot is Unavailable.

We will provide another update by 9am tomorrow or as soon as we have any additional information.


UPDATE: April 21, 2020  3:00pm

Work continues on bringing Data Depot to normal operation.

Please note that SMB/Windows Network Drive access may currently suffer from intermittent failures.

Job scheduling is enabled on Brown, Gilbreth, Halstead, Hammer, Rice, Scholar and Snyder while Research Data Depot is temporarily unavailable on cluster compute nodes. This entails several unusual consequences which you should be aware of. Please refer to the following article for detailed explanations: Running Jobs on Community Clusters While Data Depot is Unavailable.

We will provide another update by 10pm tonight or as soon as we have any additional information.


UPDATE: April 21, 2020  9:55am

The Data Depot filesystem check has completed. Work continues on bringing Data Depot back to normal operation on the clusters.

Job scheduling has been enabled on Brown, Gilbreth, Halstead, Hammer, Rice, Scholar and Snyder while Research Data Depot is temporarily unavailable on cluster compute nodes. This entails several unusual consequences which you should be aware of. Please refer to the following article for detailed explanations: Running Jobs on Community Clusters While Data Depot is Unavailable.

We will provide another update by 4pm or as soon as we have any additional information.


UPDATE: April 20, 2020  8:55pm

The Data Depot filesystem check is progressing per the vendor-recommended procedure.

In order to return the clusters to service while this process continues, we will resume job scheduling on Brown, Gilbreth, Halstead, Hammer, Rice, Scholar and Snyder while Research Data Depot is temporarily unavailable on cluster compute nodes. This will entail several unusual consequences which you should be aware of. Please refer to the following article for detailed explanations: Running Jobs on Community Clusters While Data Depot is Unavailable.

We will provide another update by 10 am tomorrow or as soon as we have any additional information.


UPDATE: April 20, 2020  4:00pm

The Data Depot filesystem check is progressing slowly but steadily and is currently at 83.7%. Data Depot remains unavailable on the Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench clusters. Job scheduling on the clusters remains stopped.

We will provide another update by 10 am tomorrow or as soon as we have any additional information.


UPDATE: April 20, 2020  9:58am

Work continues on bringing Data Depot on Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench clusters back to normal operation. Job scheduling remains stopped.

For low-impact access to your saved files and documents, you can use the SMB/Windows Network Drive method.

We will provide another update by 4pm or as soon as we have any additional information.


UPDATE: April 19, 2020  4:01pm

The Data Depot filesystem check is progressing slowly but steadily and is currently at 73.5% (close to 3PB out of 3.9PB). Data Depot remains unavailable on the Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench clusters, and job scheduling remains stopped.

For low-impact access to your saved files and documents, you can use the SMB/Windows Network Drive method.

We will provide another update by 10am tomorrow or as soon as we have any additional information.


UPDATE: April 19, 2020  9:50am

Data Depot is still down on the Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench clusters, and the filesystem check continues. Job scheduling on the clusters remains stopped.

We will provide another update by 4pm or as soon as we have any additional information.


UPDATE: April 18, 2020  4:12pm

Work continues on bringing up the Data Depot filesystem on the affected clusters. The filesystem check is progressing well, but at a lower rate than initially anticipated.

For low-impact access to your saved files and documents, you can use the SMB/Windows Network Drive method.

We appreciate how critical this service is for users of the clusters and are working around the clock to restore service as soon as possible. We will provide another update by 10am tomorrow or as soon as we have any additional information.


UPDATE: April 18, 2020  10:05am

The repair process for the Data Depot filesystem on the clusters continues as expected. We will provide another update by 4pm today.


UPDATE: April 17, 2020  10:47pm

Work continues on bringing Data Depot on Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench clusters back to normal operation. Engineers have identified the source of the problem and are currently working on the fix. This process is expected to continue through the night.

We will provide another update by 10am tomorrow.


UPDATE: April 17, 2020  7:57pm

Work continues on diagnosing Data Depot problems on Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench and bringing the clusters back to normal operation. We will provide another update by midnight.


ORIGINAL: April 17, 2020  6:12pm

The Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, and Workbench clusters began experiencing issues connecting to the Data Depot filesystem around 5:00pm on Friday, April 17th, 2020. Engineers are currently diagnosing the issue and working to identify a fix. Job scheduling has been paused while this issue is being addressed.

We will provide an update by 8pm.

Originally posted: April 17, 2020  6:12pm