Article #798: Unscheduled scratch outage on Conte
The scratch filesystem has been restored to full service and all queues have been restarted. Original Message: The scratch filesystem serving Conte i...
Carter has been returned to normal operations. All queues have been enabled. Update: December 2, 2015 12:15pm Carter is mostly ready to return to serv...
The scratch filesystem serving Hammer, Rice, and Snyder has been restored to normal operations, and all queues have been re-enabled. Original Message:...
As of 12:46, December 2, the home filesystem serving Conte, Hammer, Hansen, Hathi, Peregrine1, Radon, Rice, and Snyder was restored to normal operatio...
January 7, 2016, 6pm: The Fortress move has completed and Fortress has been returned to production. Original Due to a failure in the notice system, the earlier...
Carter has been returned to normal operation. Update: January 20, 2016 3:26pm: We are doing return-to-service testing now and expect Carter to return...
As of 10:40pm, the Snyder cluster was returned to normal service in the POD. This concludes this maintenance. Update: February 5, 2016 8:54pm As of...
The Hammer scratch filesystem has now returned to normal operations. Original Message: During the maintenance of the Rice and Snyder clusters this wee...
The underlying issues affecting Carter are resolved and job scheduling has been resumed. Many individual nodes remain offline for corrective action,...
Fortress will be unavailable from 8:00am to 9:00am Wednesday, 3 February, 2016 for routine maintenance.
The Hathi and WinHPC clusters will be unavailable beginning Thursday, February 4th, 2016 at 6:00am EST for scheduled maintenance to the power feed...
The cause turned out to be a power loss to Carter's scratch filesystem and portions of the Data Depot, which has now been restored. Carter no...
Most of the impact turned out to be on the Depot storage system, which has now been restored to normal operations. All the other affected sys...
The scheduler issue has been resolved, and Conte has been returned to normal operations as of Wednesday, February 10th, 2016 at 9:30pm EST. Update: Fe...
An issue with the cluster's gateway switches prevented the InfiniBand network from carrying IP over InfiniBand traffic. This also caused an instabil...
As of 9:15 PM, the Snyder and Rice clusters have been brought back into service after cooling was brought back online. Front-ends are operational and...
The Depot filesystem checks have all completed cleanly and the Depot has been fully returned to normal operations. All queues on all clusters are sch...
Engineering Computing Network (ECN) will be performing staged patching and reboots of all of ECN's Red Hat Linux workstations and servers to protect ag...
As of Monday, March 7th, 2016 at 12:30pm EST, the Peregrine1 cluster is unavailable due to a failed network switch in its datacenter. This switch is...
Outage RESOLVED: A misconfiguration that caused an unneeded IB driver to be loaded has been fixed. Peregrine1 is back online. Job scheduling is on. Origi...