Outages

  • Unscheduled outage on Peregrine-1

    Outage RESOLVED A misconfiguration that caused an unneeded IB driver to be loaded was fixed. Peregrine-1 is back online. Job scheduling is on. Original Message: The Peregrine-1 cluster is currently offline due to problems with the cluster nodes' op...

  • Unscheduled outage for Peregrine1

    As of Monday, March 7th, 2016 at 12:30pm EST, the Peregrine1 cluster is unavailable due to a failed network switch in its datacenter. This switch is currently in the process of being replaced. Estimated time to complete this work and bring the clu...

  • ECN services outage - ITaP Research Computing systems impacted

    Engineering Computing Network (ECN) will be performing staged patching and reboots of all of ECN's Red Hat Linux workstations and servers to protect against a serious vulnerability in the glibc system library. A significant number of ECN services will be...

  • Unscheduled Outage on Data Depot

    The Depot filesystem checks have all completed cleanly and the Depot has been fully returned to normal operations. All queues on all clusters are scheduling new jobs again. Any existing jobs which had been waiting for Depot access may also resume....

  • Unscheduled outage on Rice and Snyder

    As of 9:15 PM, the Snyder and Rice clusters have been brought back into service after cooling was brought back online. Front-ends are operational and scheduling has been resumed. Original Message: At about 7:30 pm Wednesday, 17 February, 2016, the fr...

  • Unscheduled scratch outage on Carter

    There was an issue with the cluster's gateway switches that left InfiniBand traffic unable to carry IP over InfiniBand. This also destabilized the Lustre scratch servers, which had to be rebooted. Jobs that were using scratc...

  • Unscheduled Outage in Math Data Center

    Most of the impact turned out to be on the Depot storage system, which has now been restored to normal operations. All the other affected systems are now showing a return to normal operations. Original Message: As of Thursday, February 4th,...

  • Unscheduled outage on Carter

    The cause turned out to be a power loss to Carter's scratch filesystem and portions of the Data Depot. Power has now been restored, and Carter nodes are returning to normal operations. Original Message: As of Thursday, February 4th, 2016 at...

  • Unscheduled outage on Carter

    The underlying issues affecting Carter are resolved and job scheduling has been resumed. Many individual nodes remain offline for corrective action, and these will be returning to service gradually as engineers are able to fix them. In the interim,...

  • Unscheduled scratch outage on Hammer

    The Hammer scratch filesystem has now returned to normal operations. Original Message: During the maintenance of the Rice and Snyder clusters this week, it became necessary to shut down the scratch filesystem which these clusters currently share with...

  • Unscheduled Home Filesystem Outage

    As of 12:46, December 2, the home filesystem serving Conte, Hammer, Hansen, Hathi, Peregrine1, Radon, Rice, and Snyder was restored to normal operation. All queues have been re-enabled. As of Wednesday, December 2nd, 2015 at 12:00pm EST, Conte, Hamm...

  • Unscheduled scratch outage on Rice, Hammer, and Snyder

    The scratch filesystem serving Hammer, Rice, and Snyder has been restored to normal operations, and all queues have been re-enabled. Original Message: The scratch filesystem serving Hammer, Rice, and Snyder is partially unavailable. Both currently ru...

  • Unscheduled scratch outage on Conte

    The scratch filesystem has been restored to full service and all queues have been restarted. Original Message: The scratch filesystem serving Conte is currently unavailable. Both currently running jobs and attempts to access files in scratch will bl...

  • Unscheduled Outage on Conte

    Update - 9:20pm Conte has been returned to full production as of 9:15pm. During the failure earlier today, the internal tracking of jobs within the scheduler on Conte was corrupted. Unfortunately, this resulted in all running and pending jobs being...

  • Unscheduled outage for Samba/Windows

    Service was restored around 7:30pm today. Engineers changed the way Samba authenticates users to avoid this problem going forward. -- Service was restored around 10:30am today, but has since failed again. Engineers are working on the problem, and we...

  • Scratch Issues on Carter

    October 30, 2015 11:00am ITaP Engineers have made additional timeout changes to the scratch filesystem, which have increased stability. Additional work is being scheduled for Tuesday, December 1, 2015 from 7:00am to 7:00pm. October 8, 2015 5:00pm An e...

  • Unscheduled scratch outage on Rossmann

    Update (August 25, 2015, 9:00 pm): On Monday, August 24, a disk tray in the Rossmann scratch storage system suffered multiple failures, and despite great effort by both ITaP storage engineers and the system vendor, this portion of the scratch system...

  • Unscheduled scratch outage on Rossmann

    UPDATE As of 8pm on August 15, 2015 the scratch filesystem serving Rossmann is back in full production. Original message: The scratch filesystem serving Rossmann is currently unavailable. Both currently running jobs and attempts to access files in sc...

  • ECN Service Interruption

    Due to power work in the MSEE building, most ECN services will be unavailable between 6:30am and 9:00pm EDT on Saturday, August 15, 2015. For Research Computing users this means that software packages licensed through ECN servers will not be able to ch...

  • Data Depot connectivity issues

    ITaP engineers have identified issues causing intermittent failures on Carter. Engineers are currently tuning parameters on the Depot system that have been identified as potential fixes to the issues. Access to Depot on Carter has been stable since tunin...