Unscheduled Brown scratch outage

June 11, 2020  12:00pm – June 19, 2020  8:30pm
Brown

UPDATE: June 19, 2020  8:30pm

As of 8:30pm, the scratch filesystem on the Brown cluster has been brought back online and the cluster has been returned to normal service. Job queues have been enabled and job scheduling has resumed. We apologize for the disruption of service.

Please report any issues to rcac-help@purdue.edu.


UPDATE: June 19, 2020  5:12pm

Work on the Brown scratch filesystem is progressing successfully. Low-level verification and consistency checks of the disk pools succeeded earlier today. Subsequently, at the recommendation of the vendor, the filesystem-level checks have been started and are currently in progress.

We will continue to post updates on this page as more details become available.


UPDATE: June 18, 2020  12:51pm

Filesystem internal consistency checks continue progressing slowly but steadily. Making sure they finish successfully is a mandatory prerequisite for moving forward with the vendor procedure.


UPDATE: June 17, 2020  11:45am

Engineers began implementing a vendor-recommended procedure for gradually bringing the filesystem up in a careful step-by-step fashion with multiple internal consistency checks.

It is unclear at the moment how long these steps will take, but we will be providing updates as more details become available later tonight.


UPDATE: June 16, 2020  6:39pm

Engineers continue working with vendor support and development teams on deep troubleshooting, hardware module replacements, and low-level system log analysis of Brown scratch. The filesystem remains down, and Brown scheduling is still paused.

We understand the disruption this brings to your research projects, and we highly appreciate your patience. We do not currently have an ETA, but we are making every effort to bring the cluster back as soon as possible. As usual, status updates will be posted on this page and emailed periodically.

Please reach out to rcac-help@purdue.edu if you have any concerns.


UPDATE: June 14, 2020  4:32pm

Replacement hardware failed to install and communicate with the rest of the infrastructure properly. Engineers continue working with multiple tiers of vendor support to troubleshoot and analyze hardware diagnostics and software logs.


UPDATE: June 13, 2020  6:00pm

Work is continuing with vendor support teams on bringing the replacement hardware online.


UPDATE: June 13, 2020  12:43pm

The replacement hardware has arrived and is being prepped for installation.


UPDATE: June 12, 2020  1:54pm

Work continues on troubleshooting the Brown scratch filesystem problems. Engineers are collaborating with the vendor support team to analyze and identify the source of the hardware issues.

We appreciate your patience during this process. We will provide another update by 10am tomorrow or as soon as we have any additional information.


UPDATE: June 11, 2020  4:22pm

Work continues on bringing Brown scratch back to normal operation. Engineers are engaged with the vendor support team on identifying and troubleshooting the source of the problem.

We will provide another update by 10am tomorrow or as soon as we have additional information.


ORIGINAL: June 11, 2020  12:11pm

The Brown cluster began experiencing issues with its scratch filesystem around 12:00pm. Engineers are currently diagnosing the issue and are working to identify a fix. Job scheduling has been paused while this issue is being addressed.

We will provide an update by 4pm.

Originally posted: June 11, 2020  12:11pm