Unscheduled power outage in MATH Tuesday, 2/10

February 12, 2009

RCAC experienced an unscheduled power failure in its MATH data center at approximately 8:20pm Tuesday, February 10.

The Pete, Steele, and Radon Linux clusters, and the Julius/Caesar SGI Altix 4700 were affected and went out of service. 

The Rossmann, Prospero, and Venice Linux clusters, the Condor servers, the CMS systems in MANN Hall, the Moffett SiCortex system, and the RCAC storage servers including the DXUL/fortress archival storage system remained in operation.

As of 9:15pm, electricians and RCAC systems staff were on-site in MATH but were unable to estimate when power would be restored and the systems returned to service.

The next update on this outage is scheduled for 12 midnight Tuesday, February 10.

Update as of midnight Tuesday, February 10: Electricians restored power in the MATH data center at approximately 10:35pm, at which point RCAC system engineers began rebooting the affected systems. Batch job scheduling resumed on a cluster-by-cluster basis starting at 11:15pm, and all systems were back in service by 11:45pm.

We regret any inconvenience caused by this outage and will conduct a review to determine its cause and to explore ways to mitigate the impact of future power-related problems.
