Unscheduled Power Outage in Math Datacenter

August 20, 2012 1:00pm – August 21, 2012 8:30pm
Carter, Coates, Hansen, Radon, WinHPC

Update: 10:00pm Tuesday

As of 8:30pm Tuesday 21 August 2012, the LustreB filesystem has been returned to full service. Our storage engineers, with assistance from the vendor, have verified that the system is stable. If you encounter any issues, please contact us at rcac-help@purdue.edu.

Update: 6:00pm Tuesday

Work is continuing on the LustreB fileservers, but all Steele queues have been re-enabled. Jobs that refer to any directories or files within the /scratch/lustreB tree will probably fail and should not be submitted.

Update: 10:00am Tuesday

As of 10:00 am Tuesday 21 August 2012, the LustreB file servers are up, but the vendor is doing further work on them to ensure they will remain stable before returning them to service. Queues on Steele whose owners do not use LustreB are running and accepting new jobs; the 'standby' queue and the queues of owners whose users require LustreB remain stopped pending the return of the LustreB storage. The next update will be at 5:00 pm today.

Update: 11:30pm Monday

As of 11:30 pm Monday 20 August 2012, the LustreB file servers are still being worked on. Queues on Steele whose owners do not use LustreB are running and accepting new jobs; the 'standby' queue is still stopped pending the return of the LustreB storage. The next update will be at 10:00 am tomorrow.

Update: 6:20pm Monday

As of 6:20 pm Monday, 20 August 2012, Coates, Rossmann, Hansen, and Carter have been returned to full service. All jobs running when the power went out will need to be resubmitted, as the systems were forced to reboot by the interruption.

Some queues on Steele, including standby, are currently stopped while storage engineers bring the LustreB filesystem back online. We will have another update prior to 9pm regarding Steele.

Update: 5:30pm Monday

As of 5:20 pm Monday, 20 August 2012, the Radon cluster and Moffett SiCortex system have been returned to normal service. All jobs running when the power went out will need to be resubmitted, as the systems were forced to reboot by the interruption.

Monday afternoon, 20 August 2012, the MATH building suffered a brief unexpected power outage. This has had a significant impact on all of RCAC's research computing resources. System staff are bringing the clusters back online at this time.

All Lustre scratch systems were affected; we don't believe any files or data were lost, but jobs using that scratch space were killed, and will need to be resubmitted when the clusters resume full production status. Jobs on Steele that did not use Lustre file systems continued to run, and are probably unaffected, but new jobs will not be scheduled until the system is fully back in service.

Systems known to be affected are Carter, Coates, Hansen, Moffett, Radon, and WinHPC.

Originally posted: August 21, 2012

