Unscheduled Outage on Conte
November 4, 2015 12:20pm – 9:15pm
Update - 9:20pm
Conte has been returned to full production as of 9:15pm.
During the failure earlier today, the scheduler's internal job tracking on Conte was corrupted. Unfortunately, this caused all running and pending jobs to be dropped from their queues. Engineers have restored service and performed extended testing of the scheduler's hardware to ensure it is stable, and we will continue to monitor it closely.
Any jobs you had running or waiting in a queue at the time of the failure will need to be resubmitted.
Update - 5:30pm
The Lustre filesystem is back in production. Job scheduling remains paused as ITaP engineers troubleshoot issues with the PBS job scheduler.
The scratch filesystem and the PBS server serving Conte are currently unavailable.
Any attempts to access files in scratch will block until the filesystem is back online. Job commands such as "qsub" and "qstat" may return error messages or incorrect information.
Job scheduling on Conte has been paused while ITaP engineers address the issue.