Emergency Carter Cluster Maintenance
March 15, 2017 12:00pm – March 16, 2017 11:59pm
Owner queues on Carter have been restarted. While Carter is currently deemed stable, performance is still impacted. Engineers are closely monitoring the situation and will take corrective action if necessary.
At this time, only Carter’s standby queues remain enabled as engineers continue to monitor the scratch file system. Performance has improved and we are working to restore full owner access as soon as possible.
We will provide another update by 5pm today, March 16th.
Systems engineers have alleviated the performance problems on the scratch filesystem and brought the cluster back online, so users can log in to the front-ends and submit jobs. The standby queue is enabled; however, owners' queues are still temporarily paused. We will continue monitoring scratch performance and gradually release owners' queues as conditions allow.
We will provide the next update no later than 10am on Thursday, March 16, 2017.
The Carter cluster will be taken down on Wednesday, March 15th, 2017 at 12:00pm for emergency maintenance. The scratch storage system that serves Carter is not performing correctly. For days, engineers have made several changes to try to isolate and resolve this issue, including pausing standby jobs to reduce load, but we believe the issue cannot be resolved while the cluster is in production.
Any PBS jobs currently in the queue whose requested walltime would take them past Wednesday, March 15th, 2017 at 12:00pm will not start and will remain in the queue until the maintenance is completed. Any jobs that have already started and do not complete by Wednesday, March 15th, 2017 at 12:00pm will be forcibly stopped and requeued.
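As a rough illustration of the walltime rule above, the following sketch checks whether a job starting at a given time would finish by the maintenance start. It is not the scheduler's actual logic: the function names are hypothetical, and it assumes the walltime is written in HH:MM:SS form.

```python
from datetime import datetime, timedelta

# Maintenance start taken from this notice: Wednesday, March 15, 2017 at 12:00pm.
MAINTENANCE_START = datetime(2017, 3, 15, 12, 0)


def parse_walltime(walltime: str) -> timedelta:
    """Convert an HH:MM:SS walltime string into a timedelta (assumed format)."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def fits_before_maintenance(walltime: str, start: datetime) -> bool:
    """Return True if a job starting at `start` would end by the maintenance start."""
    return start + parse_walltime(walltime) <= MAINTENANCE_START
```

For example, a job requesting 04:00:00 that could start at 7:00am on March 15 would end at 11:00am and fit, while the same request starting at 9:00am would run past 12:00pm and would be expected to wait in the queue.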
We will post an update on the status of this work by 5:00pm on March 15.