Unscheduled outage to Rossmann cluster
March 15 – 20, 2012
At approximately 10:50pm, Thursday, March 15, the power distribution to large portions of the Rossmann cluster failed. These feeds also power the login nodes for the cluster, and while those nodes are down, Rossmann is unavailable for use.
Power was restored to the Rossmann racks in the datacenter this morning (Friday, March 16), and ITaP engineers returned Rossmann to service at 9:30am.
Around 2:00pm, the problem that took approximately half the nodes and the front-ends offline last night recurred, and Rossmann again became unreachable, though many jobs continued to run.
As of 2:50pm, half of Rossmann's nodes are still down due to the power failure, but one front-end has been restored to service. It should be possible to monitor any active jobs, and some jobs may still be able to run effectively, though queue waits will be longer than usual.
Engineers are continuing to work on restoring power to the remainder of Rossmann, and we will post an update by 5:00pm if not sooner.
As of 5:00pm, we are still running with about half the Rossmann compute nodes down. The front-end login hosts are up, but we have seen intermittent problems causing poor interactive response, and network engineers continue to work on this issue. Our current plan is to issue another update tomorrow (Saturday) about 5:00. Until then we will try to maintain Rossmann in its current state.
New jobs can still be submitted, and running jobs can be monitored, but the queue wait time may be increased due to the reduced node count, and responsiveness of the login hosts may be poor at times.
On Sunday, 18 March, the system administrators implemented a workaround for the interactive response problem.
On Monday, 19 March, the power distribution issue was fixed.
As of Tuesday, 20 March, Rossmann is operating at full capacity again.