Degraded performance of several systems

September 13 – 21, 2016
Carter, Conte, Data Depot, EXRC, Fortress, Hammer, Hansen, Hathi, Peregrine1, Radon, Rice, Scholar, Snyder

We saw a significant wave of these events this morning, September 21. For the most part, this wave seems to have been linked to a storage problem that has since been resolved. However, we are implementing new monitoring and response procedures today so that any recurrence is caught and dealt with much more quickly.

Original Message:

System, Network, Storage, and Support staff are working to diagnose and correct issues that have been seen recently within ITaP's Research Computing systems.

Reported symptoms include an apparent complete freeze of open sessions, an inability to open new login sessions, difficulties using text editors, and disruptions in file access. In the cases we have seen, these events seem to last about 3-5 minutes and then clear up. However, there may be ongoing effects on jobs running on the Research Clusters, including job failures due to the storage access disruption.

We are actively examining log files and monitoring processes, and are working to correlate the timing of these events across our systems. We expect to identify a fundamental cause that we can then correct. At this time, however, we do not have an estimated time for a fix.

Please follow this news item for further information.

Originally posted: September 13, 2016 5:01pm