Article #957: Emergency Carter Cluster Maintenance
Update: Owner queues on Carter have been restarted. While Carter is currently deemed stable, performance is still impacted. Engineers are closely moni...
Update: Owner queues on Carter have been restarted. While Carter is currently deemed stable, performance is still impacted. Engineers are closely moni...
The scratch filesystems serving Carter, Hammer, Rice, Scholar, and Snyder started behaving abnormally this morning. This may have affected some jobs,...
Due to a recent security vulnerability, the Carter, Halstead, Hammer, Radon, Rice, Scholar, and Snyder clusters will have their operating system upgra...
The maintenance for Carter cluster was cancelled and will be rescheduled at a later date. The cluster has remained in service. Original Notice The Ca...
The Carter Cluster was returned to production at 10:45pm on November 7. We apologize for this extended outage. Update: November 7, 2016 6:01pm Work...
Measures taken within the first two hours of this problem seem to have resolved the issue. Original Message: A portion of the systems serving the Rese...
UPDATE As of about 6:30 pm, the new scratch system was brought back online, and scheduling has been restarted on Carter. Original Message The new scra...
Conte has been returned to normal operations as well now. This concludes the home directory maintenance on all systems. Update: September 27, 2016 1...
UPDATE: ITaP engineers have implemented a temporary solution so that work may continue on Carter until the scheduled upcoming maintenance window on Tu...
We have seen a significant wave of these events this morning, September 21. For the most part, this wave seems to have been linked to a storage probl...
Engineering Computing Network (ECN) will be performing scheduled maintenance this weekend on several ECN server resulting in their unavailability for...
Carter and Scholar are back online for use as of 6:25am, though they will be operating with many nodes still offline. Staff will be working through W...
The Isilon filesystem was restored to normal service and all affected clusters had it remounted as quickly as was sustainable by the filesystem. This...
UPDATE: The issue with Carter's scratch filesystem has been resolved. The filesystem is now available. Job scheduling on the cluster has been resum...
The scratch storage on Carter and Scholar has been returned to normal operations. The rebuild process will be continuing in the background, so we wil...
Engineering Computing Network (ECN) will be performing staged patching and reboots of all of ECN's RedHat Linux workstations and servers to protect ag...
There was an issue with the cluster's gateway switches, causing infiniband traffic to be incapable of IP over infiniband. This also caused an instabil...
The cause of this turned out to be a power loss to Carter's scratch filesystem and portions of the Data Depot, which has been restored now. Carter no...
Most of the impact of this turned out to be to the Depot storage system, which has now been restored to normal operations. All the other affected sys...
The underlying issues affecting Carter are resolved and job scheduling has been resumed. Many individual nodes remain offline for corrective action,...