Home and Applications Filesystem Maintenance - All Clusters

November 3, 2020 9:00am - November 4, 2020 9:00am EST
Maintenance
Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, WCERES, Workbench, WSC Hadoop

UPDATE: November 4, 2020 9:33am EST

As of 9:33am EST, the WCERES cluster has been returned to normal operations.

This concludes the emergency /home and /apps maintenance and all-floor downtime. Please report any issues to rcac-help@purdue.edu.

UPDATE: November 4, 2020 9:10am EST

As of 9:10am EST, patching was completed successfully on the shared /home and /apps file systems. The file systems are healthy and mounted on all affected clusters and other systems.

The Brown, Gilbreth, Halstead, Hammer, Rice, Snyder, Workbench, and WSC Hadoop clusters have been returned to normal operations. Job queues have been enabled and job scheduling has resumed.

The WCERES cluster remains down while systems engineers continue working to bring it back online.

Thank you for your patience and understanding during this emergency maintenance. Please report any issues to rcac-help@purdue.edu.

UPDATE: November 4, 2020 7:47am EST

The Scholar cluster's gpu queue has been re-enabled.

UPDATE: November 3, 2020 11:20am EST

As of 11:20am EST, the Scholar cluster has been returned to normal service with a dedicated /home and a new dedicated /apps. Job queues have been enabled and job scheduling has resumed, with the exception of the gpu queue, which remains down.

The rest of the clusters (Brown, Gilbreth, Halstead, Hammer, Rice, Snyder, WCERES, Workbench, and WSC Hadoop) and other systems remain unavailable pending work on their shared /home and /apps file systems.

We appreciate your patience. Please report any issues to rcac-help@purdue.edu.

ORIGINAL: November 3, 2020 9:00am EST

Most of the research computing clusters (Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, WCERES, Workbench, and WSC Hadoop), as well as some other minor systems, will be unavailable beginning Tuesday, November 3rd, 2020 at 9:00am EST for urgent maintenance on a storage system providing the shared /home and /apps file systems. Clusters will return to full production by %enddatetime%.

Our storage vendor has alerted us to a potentially serious bug affecting our global home and applications directory file system. To avoid loss of data, it is imperative that we patch the bug in a timely manner. Patching cannot be done live: it requires bringing these file systems down and unmounting them from all clusters that use them. With /home and /apps unavailable, users will not be able to log in to the clusters or use any of the provided software. Any service or system that requires access to these shared RCAC-wide file systems will be affected.

Research Data Depot and GitHub services are not affected. The Bell cluster has dedicated /home and /apps file systems and will not be directly affected by this outage, but it will undergo a separate scheduled maintenance during a partially overlapping time window.

Any SLURM job whose requested walltime would extend past Tuesday, November 3rd, 2020 at 9:00am EST will not start and will remain in the queue until after the maintenance is completed.
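
As a rough illustration of the rule above, the following Python sketch checks whether a job submitted now with a given walltime would finish before the maintenance begins. It is a simplified model only, not the scheduler's actual logic: the function names are hypothetical, only the HH:MM:SS and D-HH:MM:SS walltime forms are handled, and all times are assumed to be local cluster time.

```python
from datetime import datetime, timedelta

# Maintenance start as stated in this notice (local time, EST).
MAINTENANCE_START = datetime(2020, 11, 3, 9, 0)


def parse_walltime(walltime: str) -> timedelta:
    """Parse 'HH:MM:SS' or 'D-HH:MM:SS' into a timedelta (hypothetical helper)."""
    days = 0
    if "-" in walltime:
        day_part, walltime = walltime.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def can_start_before_maintenance(
    now: datetime,
    walltime: str,
    maintenance_start: datetime = MAINTENANCE_START,
) -> bool:
    """Return True if a job starting at `now` would end no later than the maintenance start."""
    return now + parse_walltime(walltime) <= maintenance_start


if __name__ == "__main__":
    submit_time = datetime(2020, 11, 3, 5, 0)
    for requested in ("04:00:00", "04:00:01", "1-00:00:00"):
        fits = can_start_before_maintenance(submit_time, requested)
        print(requested, "can start" if fits else "will wait in the queue")
```

Under this model, requesting a shorter walltime is what lets a job fit into the time remaining before the maintenance window.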

We appreciate your understanding and apologize for the short notice, which results from the narrow timeline the vendor recommended given the potential severity of the bug. Preserving your data is of utmost importance to us.

Originally posted: October 27, 2020 12:48pm EDT
Last updated: November 4, 2020 9:33am EST