Pete cluster temporarily unavailable Tuesday, September 2

September 03, 2008

The pete.rcac back-end server for the Pete cluster hung earlier today, Tuesday, September 2.  As a result, pete.rcac had to be rebooted and PBS jobs executing at the time had to be requeued.

When the PBS server restarts, it must reestablish communication with the PBS software running on each compute node.  Under normal circumstances, reestablishing communication with all of a cluster's nodes should take 15-20 minutes.

However, this operation is taking much longer than that on the Pete cluster.  Our current estimate is that it will complete at approximately 5pm, at which point PBS job scheduling will be reenabled.  We have escalated this issue to our software vendor, and are working toward faster recovery in this sort of situation.

Update:  PBS job scheduling on the Pete cluster was reenabled at 5pm Tuesday, September 2.

Please refer questions about this outage to rcac-help@purdue.edu.
