Article #794: Cluster Maintenance - Carter
Carter has been returned to normal operations. All queues have been enabled. Update: December 2, 2015 12:15pm Carter is mostly ready to return to serv...
Carter has been returned to normal operation. Update: January 20, 2016 3:26pm: We are doing return to service testing now and expect Carter to return...
The underlying issues affecting Carter are resolved and job scheduling has been resumed. Many individual nodes remain offline for corrective action,...
The cause of this turned out to be a power loss to Carter's scratch filesystem and portions of the Data Depot, which has been restored now. Carter no...
Most of the impact of this turned out to be to the Depot storage system, which has now been restored to normal operations. All the other affected sys...
There was an issue with the cluster's gateway switches that left the InfiniBand network unable to carry IP over InfiniBand traffic. This also caused an instabil...
Engineering Computing Network (ECN) will be performing staged patching and reboots of all of ECN's RedHat Linux workstations and servers to protect ag...
The scratch storage on Carter and Scholar has been returned to normal operations. The rebuild process will be continuing in the background, so we wil...
UPDATE: The issue with Carter's scratch filesystem has been resolved. The filesystem is now available. Job scheduling on the cluster has been resum...
A new web-based quota monitoring tool is available to all Research Cluster and Data Depot users. This tool is a web equivalent of the myquota tool on...
On Monday, May 9, 2016 the environment module system on Carter, Conte, Hansen, and Hathi will be upgraded to Lmod, bringing all compute clusters up to...
The Isilon filesystem was restored to normal service and all affected clusters had it remounted as quickly as was sustainable by the filesystem. This...
Carter and Scholar are back online for use as of 6:25am, though they will be operating with many nodes still offline. Staff will be working through W...
Engineering Computing Network (ECN) will be performing scheduled maintenance this weekend on several ECN servers, resulting in their unavailability for...
We have seen a significant wave of these events this morning, September 21. For the most part, this wave seems to have been linked to a storage probl...
UPDATE: ITaP engineers have implemented a temporary solution so that work may continue on Carter until the scheduled upcoming maintenance window on Tu...
During the Home Filesystem Maintenance - All Clusters maintenance on September 27th, several upgrades and changes will be made to the software stack o...
We are seeing some issues with the systems in the warp-scratch set of hosts. You may encounter an error with your home directory and/or a message abo...
Conte has been returned to normal operations as well now. This concludes the home directory maintenance on all systems. Update: September 27, 2016 1...
On September 27th, 2016, the Carter cluster scratch filesystem, which had been suffering from numerous issues, was replaced by an entirely new system....