Unscheduled outage to MATH datacenter

March 30, 2012 – April 1, 2012

Update - 9:30pm, 4/1/2012:

As of about 9:30pm, Sunday, 1 April, ITaP systems staff have returned Hansen to production status, and job scheduling is re-enabled.

The scratch filesystem on Hansen has been restored with no apparent loss of files; if you do encounter any problems, please let us know at rcac-help@purdue.edu.

Thank you for your patience during this outage.

Update - 5:00 pm, 4/1/2012:

The LustreC filesystem is still unavailable. ITaP engineers continue to work with the vendor to bring the system back online.

Job scheduling remains disabled on Hansen. We will update the status on Hansen storage at 9:00 pm, 4/1/2012.

Update - 2:45pm, 4/1/2012:

The LustreA filesystem has been restored, and Coates and Rossmann are back in production mode with jobs running. Storage engineers are still working on LustreC, so job scheduling on Hansen remains disabled. We will update the status on LustreC and Hansen at 5:00pm, 4/1/2012.

Update - 12:00am, 4/1/2012:

ITaP engineers continue to work with storage vendors to bring LustreA and LustreC filesystems back online. Job scheduling on Coates, Rossmann, and Hansen remains disabled.

The next update on the status of Coates, Rossmann, and Hansen storage will come by 12:00 noon, 4/1/2012.

Update - 4pm, 3/31/2012:

LustreB has been returned to production on Steele. LustreA (Rossmann, Coates) and LustreC (Hansen) are still unavailable; ITaP engineers are working carefully with vendor assistance to repair the filesystems while avoiding data corruption. The current goal for return to production is 9pm.

Update - 3:30am, 3/31/2012:

  • Radon - In production
  • Moffett - In production
  • Steele - In production
  • Hansen - Available for login, job starting is disabled. LustreC unavailable - time to repair unknown.
  • Rossmann - Available for login, job starting is disabled. LustreA unavailable - time to repair unknown.
  • Coates - Available for login, job starting is disabled. LustreA unavailable - time to repair unknown.

ITaP storage engineers have escalated Lustre issues to the storage vendor. We will send an update on the cluster status by 12 noon, 3/31/2012.

Original Message:

At approximately 11:10 PM, March 30, 2012, part of the West Lafayette campus experienced an unexpected interruption to electrical service.

The part of campus affected included the MATH datacenter housing Coates, Rossmann, and Hansen clusters. The Steele cluster is not affected.

Login nodes for Coates, Rossmann, and Hansen are currently unavailable.

Systems engineers are currently assessing the extent of this unscheduled outage.

At this time, there is no estimate of a return to production.

Please contact us at rcac-help@purdue.edu for any questions regarding this outage.

Originally posted: April 2, 2012