Unscheduled Outage on Conte

November 4, 2015  12:20pm – 9:15pm
Conte

Update - 9:20pm

Conte has been returned to full production as of 9:15pm.

During the failure earlier today, the scheduler's internal job tracking on Conte was corrupted. Unfortunately, this resulted in all running and pending jobs being dropped from their queues. Engineers have restored service and performed extended testing of the scheduler's hardware to ensure it is stable, and we will continue to monitor it closely.

Any jobs you had waiting in queue will now need to be resubmitted.

Update - 5:30pm

The Lustre filesystem is back in production. Job scheduling remains paused as ITaP engineers troubleshoot issues with the PBS job scheduler.

Original Message

The scratch filesystem and the PBS server serving Conte are currently unavailable.

Attempts to access files in scratch will block until the filesystem is back online. Job commands such as "qsub" and "qstat" may return error messages or incorrect information.

Job scheduling on Conte has been paused while ITaP engineers address the issue.

Originally posted: November 4, 2015  1:42pm