Unscheduled Brown scratch outage
As of 8:30pm EDT, the scratch filesystem on the Brown cluster has been brought back online and the cluster has been returned to normal service. Job queues have been enabled and job scheduling has resumed. We apologize for the disruption of service.
Please report any issues to email@example.com.
Work on the Brown scratch filesystem is progressing successfully. Low-level disk pool verification and consistency checks succeeded earlier today. Subsequently, at the recommendation of the vendor, filesystem-level checks were started and are currently in progress.
We will continue to post updates on this page as more details become available.
Filesystem internal consistency checks continue progressing slowly but steadily. Making sure they finish successfully is a mandatory prerequisite for moving forward with the vendor procedure.
Engineers began implementing a vendor-recommended procedure for gradually bringing the filesystem up in a careful step-by-step fashion with multiple internal consistency checks.
It is unclear at the moment how long these steps will take, but we will be providing updates as more details become available later tonight.
Engineers continue working with vendor support and development teams on deep troubleshooting, hardware module replacements, and low-level system log analysis for Brown scratch. The filesystem remains down, and Brown scheduling is still paused.
We understand the disruption this brings to your research projects, and we highly appreciate your patience. We do not currently have an ETA, but we are making every effort to bring the cluster back as soon as possible. As usual, status updates will be posted on this page and emailed periodically.
Please reach out to firstname.lastname@example.org if you have any concerns.
The replacement hardware did not install and communicate properly with the rest of the infrastructure. Engineers continue working with multiple tiers of vendor support to troubleshoot and analyze hardware diagnostics and software logs.
Work is continuing with vendor support teams in bringing the replacement hardware online.
The replacement hardware has arrived and is being prepped for installation.
Work continues on troubleshooting the Brown scratch filesystem problems. Engineers are collaborating with the vendor support team to analyze and identify the source of the hardware issues.
We appreciate your patience during this process. We will provide another update by 10am tomorrow or as soon as we have any additional information.
Work continues on bringing Brown scratch back to normal operation. Engineers are engaged with the vendor support team on identifying and troubleshooting the source of the problem.
We will provide another update by 10am tomorrow or as soon as we have additional information.
ORIGINAL:
The Brown cluster began experiencing issues with its scratch filesystem around 12:00pm EDT. Engineers are currently diagnosing the issue and are working to identify a fix. Job scheduling has been paused while this issue is being addressed.
We will provide an update by 4pm.