Unscheduled Brown scratch outage

June 11, 2020 12:00pm - June 19, 2020 8:30pm EDT
Outages
Brown

UPDATE: June 19, 2020 8:30pm EDT

As of 8:30pm EDT, the scratch filesystem on the Brown cluster has been brought back online and the cluster has been returned to normal service. Job queues have been enabled and job scheduling has resumed. We apologize for the disruption of service.

Please report any issues to rcac-help@purdue.edu.

UPDATE: June 19, 2020 5:12pm EDT

Work on the Brown scratch filesystem is progressing well. Low-level disk pool verification and consistency checks completed successfully earlier today. Subsequently, at the vendor's recommendation, filesystem-level checks were started and are currently in progress.

We will continue to post updates on this page as more details become available.

UPDATE: June 18, 2020 12:51pm EDT

The filesystem's internal consistency checks continue to progress slowly but steadily. Their successful completion is a mandatory prerequisite for moving forward with the vendor procedure.

UPDATE: June 17, 2020 11:45am EDT

Engineers began implementing a vendor-recommended procedure for gradually bringing the filesystem up in a careful step-by-step fashion with multiple internal consistency checks.

It is unclear at the moment how long these steps will take, but we will be providing updates as more details become available later tonight.

UPDATE: June 16, 2020 6:39pm EDT

Engineers continue working with vendor support and development teams on deep troubleshooting, hardware module replacements, and low-level system log analysis of Brown scratch. The filesystem remains down, and Brown scheduling is still paused.

We understand the disruption this brings to your research projects, and we highly appreciate your patience. We do not currently have an ETA, but we are making every effort to bring the cluster back as soon as possible. As usual, status updates will be posted on this page and emailed periodically.

Please reach out to rcac-help@purdue.edu if you have any concerns.

UPDATE: June 14, 2020 4:32pm EDT

The replacement hardware failed to install and communicate properly with the rest of the infrastructure. Engineers continue working with multiple tiers of vendor support to troubleshoot and analyze hardware diagnostics and software logs.

UPDATE: June 13, 2020 6:00pm EDT

Work continues with vendor support teams to bring the replacement hardware online.

UPDATE: June 13, 2020 12:43pm EDT

The replacement hardware has arrived and is being prepped for installation.

UPDATE: June 12, 2020 1:54pm EDT

Work continues on troubleshooting the Brown scratch filesystem problems. Engineers are collaborating with the vendor support team to analyze and identify the source of the hardware issues.

We appreciate your patience during this process. We will provide another update by 10am tomorrow or as soon as we have any additional information.

UPDATE: June 11, 2020 4:22pm EDT

Work continues on bringing Brown scratch back to normal operation. Engineers are working with the vendor support team to identify and troubleshoot the source of the problem.

We will provide another update by 10am tomorrow or as soon as we have additional information.

ORIGINAL: June 11, 2020 12:00pm EDT

The Brown cluster began experiencing issues with its scratch filesystem around 12:00pm EDT. Engineers are currently diagnosing the issue and are working to identify a fix. Job scheduling has been paused while this issue is being addressed.

We will provide an update by 4pm.

Originally posted: June 11, 2020 12:11pm EDT
Last updated: June 19, 2020 8:30pm EDT