Outages
-
Unscheduled outage on Rice and Snyder
As of 9:15 PM, the Snyder and Rice clusters have been brought back into service after cooling was brought back online. Front-ends are operational and scheduling has been resumed. Original Message: At about 7:30 pm Wednesday, 17 February, 2016, the fr...
-
Unscheduled scratch outage on Carter
An issue with the cluster's gateway switches prevented the InfiniBand network from carrying IP over InfiniBand traffic. This also caused instability in the Lustre scratch servers, which required that they be rebooted. Jobs that were using scratc...
-
Unscheduled Outage in Math Data Center
Most of the impact turned out to be on the Depot storage system, which has now been restored to normal operations. All the other affected systems are now returning to normal operations. Original Message: As of Thursday, February 4th,...
-
The cause turned out to be a power loss to Carter's scratch filesystem and portions of the Data Depot; power has now been restored. Carter nodes are now returning to normal operations. Original Message: As of Thursday, February 4th, 2016 at...
-
The underlying issues affecting Carter are resolved and job scheduling has been resumed. Many individual nodes remain offline for corrective action, and these will be returning to service gradually as engineers are able to fix them. In the interim,...
-
Unscheduled scratch outage on Hammer
The Hammer scratch filesystem has now returned to normal operations. Original Message: During the maintenance of the Rice and Snyder clusters this week, it became necessary to shut down the scratch filesystem which these clusters currently share with...
-
Unscheduled Home Filesystem Outage
As of 12:46, December 2, the home filesystem serving Conte, Hammer, Hansen, Hathi, Peregrine1, Radon, Rice, and Snyder was restored to normal operation. All queues have been re-enabled. As of Wednesday, December 2nd, 2015 at 12:00pm EST, Conte, Hamm...
-
Unscheduled scratch outage on Rice, Hammer, and Snyder
The scratch filesystem serving Hammer, Rice, and Snyder has been restored to normal operations, and all queues have been re-enabled. Original Message: The scratch filesystem serving Hammer, Rice, and Snyder is partially unavailable. Both currently ru...
-
Unscheduled scratch outage on Conte
The scratch filesystem has been restored to full service and all queues have been restarted. Original Message: The scratch filesystem serving Conte is currently unavailable. Both currently running jobs and attempts to access files in scratch will bl...
-
Update - 9:20pm Conte has been returned to full production as of 9:15pm. During the failure earlier today, the internal tracking of jobs within the scheduler on Conte was corrupted. Unfortunately, this resulted in all running and pending jobs being...
-
Unscheduled outage for Samba/Windows
Service was restored around 7:30pm today. Engineers changed the way Samba authenticates users to avoid this problem going forward. -- Service was restored around 10:30am today, but has since failed again. Engineers are working on the problem, and we...
-
October 30, 2015 11:00am ITaP engineers have made additional timeout changes to the scratch filesystem, which have increased stability. Additional work is being scheduled for Tuesday, December 1, 2015 from 7:00am to 7:00pm. October 8, 2015 5:00pm An e...
-
Unscheduled scratch outage on Rossmann
**Update: August 25, 2015 9:00 pm** On Monday, August 24, a disk tray in the Rossmann scratch storage system suffered multiple failures and, despite great effort by both ITaP storage engineers and the system vendor, this portion of the scratch system...
-
Unscheduled scratch outage on Rossmann
UPDATE As of 8pm on August 15, 2015 the scratch filesystem serving Rossmann is back in full production. Original message: The scratch filesystem serving Rossmann is currently unavailable. Both currently running jobs and attempts to access files in sc...
-
Due to power work in the MSEE building, most ECN services will be unavailable between 6:30am – 9:00pm EDT on Saturday, August 15, 2015. For Research Computing users this means that software packages licensed through ECN servers will not be able to ch...
-
Data Depot connectivity issues
ITaP engineers have identified issues causing intermittent failures on Carter. Engineers are currently tuning parameters on the Depot system that have been identified as potential fixes to the issues. Access to Depot on Carter has been stable since tunin...
-
Rice job submission failing for some users
Update: The scheduling server has been rebooted and job submissions appear to be working normally again. Please let us know at rcac-help@purdue.edu if you see any further issues. Thanks again for your patience! Job submissions for at least some users...
-
Due to power work in the MSEE building, most ECN services will be unavailable between 5:30 pm Thursday, 11 June, 2015 and 8:00 am Friday 12 June 2015. In particular, for Research Computing users this means that software packages licensed through ECN...
-
Fortress Samba service has been restored as of 10:15am on Monday, June 8th. We apologize for any inconvenience this has caused, and thank you for your patience. Beginning Friday afternoon, the Fortress Samba mounts became unavailable due to an issue with...
-
Hathi Hadoop cluster planned outage
The Hathi Hadoop cluster will be unavailable Monday, 13 April, 2015 from 9:00 am to 1:00 pm. During that time, the cluster hardware will be upgraded with new network interfaces. The cluster will go offline at 9:00 am, and we expect the work to be com...