RCAC Whole-Floor Downtime and Power Work

UPDATE: August 2, 2021  12:24pm

With the POD data center issue resolved, the Weber cluster has been returned to normal service as of 12:24pm. All queues have been enabled and jobs have resumed scheduling. Please report any issues to rcac-help@purdue.edu.

This concludes the whole-floor downtime and maintenance. Thank you for your patience!

UPDATE: August 2, 2021  11:59am

The Weber networking problem has been resolved and the cluster is ready to be returned to service. The return-to-service (RTS) process is currently pending resolution of the sudden cooling issue in the POD data center.

We will provide another update by 6pm, or sooner if the cooling problem is resolved before then.

UPDATE: August 1, 2021  5:57pm

Engineers continue troubleshooting the Weber cluster networking issue that is preventing the cluster from returning to service. We will provide an update by noon tomorrow, August 2nd.

UPDATE: August 1, 2021  2:35pm

As of 2:35pm, the Geddes cluster has been returned to normal service. Please report any issues to rcac-help@purdue.edu.

Work continues on bringing the Weber cluster back. We appreciate your patience and will provide an update by 6pm tonight.

UPDATE: August 1, 2021  2:10pm

As of 2:10pm, the Halstead cluster has been returned to normal service. All queues have been enabled and jobs have resumed scheduling. Please report any issues to rcac-help@purdue.edu.

Work continues on bringing the Geddes and Weber clusters back. We appreciate your patience and will provide an update by 6pm tonight.

UPDATE: August 1, 2021  11:55am

As of 11:55am, the required data center power work has been completed successfully.

The Bell, Brown, CMS, Hammer, Gilbreth, Scholar and Workbench clusters have been returned to normal service. All queues have been enabled and jobs have resumed scheduling. Please report any issues to rcac-help@purdue.edu.

Work continues on bringing the Halstead, Geddes and Weber clusters back. We appreciate your patience and will provide an update by 6pm tonight.

UPDATE: July 28, 2021  4:09pm

This is a reminder of the maintenance downtime for most Research Computing resources, starting this coming Friday, July 30, 2021. Please note in the attached schedule that some clusters (Brown, Hammer, and Weber) will go down on Friday, while the others will not go down until Saturday.

The Data Depot will remain available for non-cluster access.

ORIGINAL: July 30, 2021 7:00am - August 1, 2021 12:00pm EDT

The majority of the Research Computing computational resources will be unavailable July 30, 2021 7:00am - August 1, 2021 12:00pm EDT for a whole-floor downtime due to electrical power work in the MATH and POD data centers. Along with required preventive maintenance, the work will provide the power equipment upgrades necessary to house the upcoming NSF-funded Anvil supercomputer.

Due to the nature and extent of the work, some resources will be affected longer than others. The following table provides tentative maintenance start and end times for the various Research Computing systems:

Start time                            End time                              Resources
Friday, July 30th, 2021 at 7:00am     Sunday, August 1st, 2021 at 12:00pm   Brown, Hammer, Weber
Saturday, July 31st, 2021 at 7:00am   Sunday, August 1st, 2021 at 12:00pm   Bell, CMS, Halstead, Geddes, Gilbreth, Scholar, Workbench
Not affected                          Not affected                          Data Depot, WSC, WCERES, customer VMs and servers

All systems will return to full production by Sunday, August 1st, 2021 at 12:00pm.

Any SLURM job whose requested walltime would extend past the maintenance start time for its cluster will not start and will remain in the queue until after the maintenance is completed.
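
To check ahead of time whether a job would be held by this rule, the comparison can be sketched as follows. This is only an illustration of the rule as stated above, not the scheduler's actual logic; the function names and the MAINTENANCE_START table are hypothetical, with the start times copied from the schedule in this notice.

```python
from datetime import datetime, timedelta

# Maintenance start times from the schedule above (local Eastern time).
# Hypothetical lookup table built for this example only.
FRIDAY_START = datetime(2021, 7, 30, 7, 0)
SATURDAY_START = datetime(2021, 7, 31, 7, 0)
MAINTENANCE_START = {
    "brown": FRIDAY_START, "hammer": FRIDAY_START, "weber": FRIDAY_START,
    "bell": SATURDAY_START, "cms": SATURDAY_START, "halstead": SATURDAY_START,
    "geddes": SATURDAY_START, "gilbreth": SATURDAY_START,
    "scholar": SATURDAY_START, "workbench": SATURDAY_START,
}

def fits_before_maintenance(cluster, now, walltime):
    """True if a job starting at `now` would end by the cluster's maintenance start."""
    return now + walltime <= MAINTENANCE_START[cluster.lower()]

def longest_walltime(cluster, now):
    """Largest walltime that still fits before maintenance (zero if none is left)."""
    remaining = MAINTENANCE_START[cluster.lower()] - now
    return max(remaining, timedelta(0))

if __name__ == "__main__":
    now = datetime(2021, 7, 29, 12, 0)
    # A 24-hour request on Brown would run past Friday 7:00am, so it would wait.
    print(fits_before_maintenance("brown", now, timedelta(hours=24)))
    print(longest_walltime("brown", now))
```

A job that would otherwise wait can often run before the downtime if it requests a shorter walltime, for example through the --time option of sbatch.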

Originally posted: July 7, 2021 11:58am EDT