Home and Applications Filesystem Maintenance - All Clusters
UPDATE: November 4, 2020 9:33am
As of 9:33am, the WCERES cluster has been returned to normal operations.
This concludes the emergency /apps maintenance and all-floor downtime. Please report any issues to email@example.com.
UPDATE: November 4, 2020 9:10am
As of 9:10am, patching was completed successfully on the shared /apps file systems. The file systems are healthy and mounted on all affected clusters and other systems.
The Brown, Gilbreth, Halstead, Hammer, Rice, Snyder, Workbench, and WSC Hadoop clusters have been returned to normal operations. Job queues have been enabled and job scheduling has resumed.
The WCERES cluster remains down while systems engineers continue working to bring it back online.
Thank you for your patience and understanding during this emergency maintenance. Please report any issues to firstname.lastname@example.org.
UPDATE: November 4, 2020 7:47am
The gpu queue on the Scholar cluster has been re-enabled.
UPDATE: November 3, 2020 11:20am
As of 11:20am, the Scholar cluster has been returned to normal service with a dedicated /home and new dedicated /apps. Job queues have been enabled and job scheduling has resumed, with the exception of the gpu queue, which remains down.
The rest of the clusters (Brown, Gilbreth, Halstead, Hammer, Rice, Snyder, WCERES, Workbench, and WSC Hadoop) and other systems remain unavailable pending work on their shared /apps file systems.
We appreciate your patience. Please report any issues to email@example.com.
ORIGINAL: October 27, 2020 12:48pm
Most of the research computing clusters (Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, WCERES, Workbench, and WSC Hadoop) as well as some other minor systems will be unavailable beginning Tuesday, November 3rd, 2020 at 9:00am for urgent maintenance on a storage system providing the shared /apps file systems. Clusters will return to full production by Wednesday, November 4th, 2020 at 9:00am.
Our storage vendor has alerted us to a potentially serious bug on our global home and applications directory file system. In order to avoid loss of data, it is imperative that we patch the bug in a timely manner. Patching cannot be done live and requires bringing these file systems down and unmounting them from all clusters that use them. With /apps unavailable, users will not be able to log in to the clusters or use any of the provided software. Any service or system that requires access to these shared RCAC-wide file systems will be affected.
Research Data Depot and GitHub services are not affected. The Bell cluster has dedicated /apps file systems and will not be directly affected by this outage, but it will undergo a separate scheduled maintenance during a partially overlapping time window.
Any SLURM job whose requested walltime would take it past Tuesday, November 3rd, 2020 at 9:00am will not start and will remain in the queue until after the maintenance is completed.
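To make the walltime rule above concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration and not how the scheduler itself is implemented: the function names are hypothetical, and it assumes the job would begin running at the moment given as `now`, with no time spent waiting in the queue.

```python
from datetime import datetime, timedelta

# Maintenance start as stated in this announcement:
# Tuesday, November 3rd, 2020 at 9:00am (local cluster time assumed).
MAINTENANCE_START = datetime(2020, 11, 3, 9, 0)


def max_walltime(now: datetime, start: datetime = MAINTENANCE_START) -> timedelta:
    """Longest walltime a job beginning at `now` can request and still end by `start`.

    Hypothetical helper for illustration; returns zero once the window has begun.
    """
    return max(start - now, timedelta(0))


def would_start(now: datetime, walltime: timedelta,
                start: datetime = MAINTENANCE_START) -> bool:
    """True if a job beginning at `now` would finish no later than `start`.

    Assumes the job begins running immediately at `now` (no queue wait).
    """
    return now + walltime <= start
```

For example, under these assumptions a job that begins on November 2nd at 9:00pm with a 4-hour walltime would end at 1:00am on November 3rd and so could start, while the same job requesting 24 hours would be held in the queue until the maintenance is over.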
We appreciate your understanding and apologize for the short notice, which is due to the narrow timeline the vendor suggested based on the potential severity of the bug. Preserving your data is of utmost importance to us.