Multiple clusters outage

October 17, 2023 8:30am - October 20, 2023 5:45pm EDT
Outages
Bell, Brown, Geddes, Gilbreth, Negishi

UPDATE: October 20, 2023 5:44pm EDT

The Gilbreth cluster has been returned to service and scheduled user jobs have begun running. At this time, all clusters affected by the water issue in the DCM have been returned to service, and jobs are being scheduled and running. Additionally, the Globus Endpoint for Bell's Home directories is now available. We will continue working on the Globus Endpoint issues affecting other clusters and, in the meantime, recommend using alternative approaches for transferring data to and from those clusters.

UPDATE: October 20, 2023 4:15pm EDT

The Negishi cluster has been returned to service and scheduled user jobs have begun running. We are aware that Negishi's Globus endpoint is unavailable and we are continuing to work on it alongside other priorities. We will provide the next update before 6 pm today.

UPDATE: October 20, 2023 11:48am EDT

The Bell and Brown clusters are back in operation. We are aware that the Globus Endpoints for Brown Scratch directories and Bell Home directories are not available yet, but that should not impact the vast majority of users and jobs. Purdue IT engineers are working on bootstrapping the Negishi and Gilbreth community clusters and addressing the remaining Globus issue. We will provide the next update before 5 pm today.

UPDATE: October 19, 2023 5:41pm EDT

Datacenter power and cooling have been restored. The Geddes composable platform is back in operation.

Purdue IT engineers are working on bootstrapping the remaining community cluster systems and anticipate that they will return to normal operation over the next 24 hours.

UPDATE: October 17, 2023 5:17pm EDT

After this morning’s power issue across campus, most community cluster supercomputers were shut down because a 3-inch water main burst, spraying water above the data center floor tiles and onto Negishi and nearby power distribution units.

We are currently awaiting clearance from campus electricians, who must verify that the electrical equipment is dry enough for the affected HPC systems to be safely powered back on. We anticipate this clearance within the next 48-72 hours.

ORIGINAL: October 17, 2023 8:30am - October 18, 2023 5:00pm EDT

Multiple clusters have been powered off in the MATH G109 datacenter due to a water issue in the building. Affected systems are Bell, Brown, Geddes, Gilbreth, and Negishi.

We will provide an update by 5:00 PM today.

Originally posted: October 17, 2023 9:44am EDT