Unscheduled outage on Carter
February 2, 2016 6:00pm – February 3, 2016 10:50pm
The underlying issues affecting Carter are resolved and job scheduling has resumed. Many individual nodes remain offline for corrective action and will return to service gradually as engineers repair them. In the interim, Carter's capacity will be diminished and wait times may be slightly longer than normal.
Update: February 3, 2016 6:06pm
Work on the fiber runs to Carter appears to have improved the network situation within the Carter cluster. Engineers are now trying to bring nodes back to a normal operational state. We will issue another update by 11:00pm at the latest.
Update: February 3, 2016 2:45pm
While the network core is operational again, Carter's internal InfiniBand networking has not been able to recover from the incident. Some of the fiber connecting Carter to the research core has been taken offline and is being investigated, and engineers are continuing to try to bring the nodes back to a working state even with a slightly reduced fiber path. We will issue another update by 6:00pm if the situation is not resolved.
Since Tuesday, February 2, 2016 at 6:00pm, Carter has been running at diminished capacity, and many nodes have been experiencing errors due to a network core failure. Unfortunately, the network failure prevented us from communicating this issue sooner, but our staff have been working on it since it started. Scheduling on Carter has been paused to lessen the impact on jobs.
The network core itself has been restored to service, but issues remain with Carter nodes. Our engineers are addressing these problems, and we will return the cluster to normal operations as soon as possible.