Home and Applications Filesystem Maintenance - All Clusters

November 3, 2020  9:00am – November 4, 2020  9:00am
Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, WCERES, Workbench, WSC Hadoop

UPDATE: November 4, 2020  9:33am

As of 9:33am, the WCERES cluster has been returned to normal operations.

This concludes the emergency /home and /apps maintenance and all-floor downtime. Please report any issues to rcac-help@purdue.edu.


UPDATE: November 4, 2020  9:10am

As of 9:10am, patching was completed successfully on the shared /home and /apps file systems. The file systems are healthy and mounted on all affected clusters and other systems.

The Brown, Gilbreth, Halstead, Hammer, Rice, Snyder, Workbench, and WSC Hadoop clusters have been returned to normal operations. Job queues have been enabled and job scheduling has resumed.

The WCERES cluster remains down while systems engineers continue working to bring it back online.

Thank you for your patience and understanding during this emergency maintenance. Please report any issues to rcac-help@purdue.edu.


UPDATE: November 4, 2020  7:47am

The Scholar cluster's gpu queue has been re-enabled.


UPDATE: November 3, 2020  11:20am

As of 11:20am, the Scholar cluster has been returned to normal service with a dedicated /home and a new dedicated /apps. Job queues have been enabled and job scheduling has resumed, with the exception of the gpu queue, which remains down.

The rest of the clusters (Brown, Gilbreth, Halstead, Hammer, Rice, Snyder, WCERES, Workbench, and WSC Hadoop) and other systems remain unavailable pending work on their shared /home and /apps file systems.

We appreciate your patience. Please report any issues to rcac-help@purdue.edu.


ORIGINAL: October 27, 2020  12:48pm

Most of the research computing clusters (Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder, WCERES, Workbench, and WSC Hadoop), as well as some other minor systems, will be unavailable beginning Tuesday, November 3rd, 2020 at 9:00am for urgent maintenance on the storage system that provides the shared /home and /apps file systems. Clusters will return to full production by Wednesday, November 4th, 2020 at 9:00am.

Our storage vendor has alerted us to a potentially serious bug affecting our global home and applications directory file system. To avoid loss of data, it is imperative that we patch the bug in a timely manner. Patching cannot be done live and requires bringing these file systems down and unmounting them from all clusters that use them. With /home and /apps unavailable, users will not be able to log in to the clusters or use any of the provided software. Any service or system that requires access to these shared RCAC-wide file systems will be affected.

Research Data Depot and GitHub services are not affected. The Bell cluster has dedicated /home and /apps file systems and will not be directly affected by this outage, but it will undergo a separate scheduled maintenance during a partially overlapping time window.

Any SLURM jobs whose requested walltime would take them past Tuesday, November 3rd, 2020 at 9:00am will not start and will remain in the queue until after the maintenance is completed.
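
As a rough illustration of the walltime rule above, the following Python sketch checks whether a job with a given requested walltime could still finish before the maintenance begins. It is not the scheduler's actual logic. The function names are hypothetical, times are assumed to be local cluster time, and a job ending exactly at 9:00am is assumed to be acceptable.

```python
from datetime import datetime, timedelta

# Start of the maintenance window as stated in this notice (local time assumed).
MAINTENANCE_START = datetime(2020, 11, 3, 9, 0)


def fits_before_maintenance(start: datetime, walltime: timedelta,
                            maintenance_start: datetime = MAINTENANCE_START) -> bool:
    """Return True if a job starting at `start` would end by the window start.

    Hypothetical helper; the inclusive boundary is an assumption.
    """
    return start + walltime <= maintenance_start


def longest_walltime(start: datetime,
                     maintenance_start: datetime = MAINTENANCE_START) -> timedelta:
    """Longest walltime that still ends by the window start (zero if too late)."""
    return max(maintenance_start - start, timedelta(0))


# Hypothetical example: a job that could start on November 2 at 9:00pm.
candidate_start = datetime(2020, 11, 2, 21, 0)
print(fits_before_maintenance(candidate_start, timedelta(hours=4)))   # True
print(fits_before_maintenance(candidate_start, timedelta(hours=24)))  # False
print(longest_walltime(candidate_start))                              # 12:00:00
```

In practice, requesting a walltime no longer than the time remaining before the maintenance gives a job a chance to start; a job requesting more will simply wait in the queue.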

We appreciate your understanding and apologize for the short notice, which is due to the narrow timeline the vendor recommended given the potential severity of the bug. Preserving your data is of utmost importance to us.
