Outages and Maintenance
-
Fortress will be unavailable from 8:00am to 9:00am Wednesday, 3 February, 2016 for routine maintenance.
-
The underlying issues affecting Carter are resolved and job scheduling has been resumed. Many individual nodes remain offline for corrective action, and these will be returning to service gradually as engineers are able to fix them. In the interim,...
-
Rice and Snyder Cluster Maintenance
As of 10:40pm, the Snyder cluster was returned to normal service in the POD. This concludes this maintenance. Update: February 5, 2016 8:54pm As of 8:25 pm, Friday, 5 Feb 2016, the Rice cluster maintenance has completed and the system is returning...
-
Unscheduled scratch outage on Hammer
The Hammer scratch filesystem has now returned to normal operations. Original Message: During the maintenance of the Rice and Snyder clusters this week, it became necessary to shut down the scratch filesystem which these clusters currently share with...
-
Carter has been returned to normal operation. Update: January 20, 2016 3:26pm: We are doing return-to-service testing now and expect Carter to return to production by 7:00pm. Update: January 20, 2016 12:00pm: Work is being wrapped up on Carter and...
-
January 7, 2016, 6pm The Fortress move has been completed and Fortress has been returned to production. Original Due to a failure in the notice system, the earlier notices of this work, sent on Dec 7th and Jan 3rd, were not delivered. The Fortr...
-
Unscheduled Home Filesystem Outage
As of 12:46, December 2, the home filesystem serving Conte, Hammer, Hansen, Hathi, Peregrine1, Radon, Rice, and Snyder was restored to normal operation. All queues have been re-enabled. As of Wednesday, December 2nd, 2015 at 12:00pm EST, Conte, Hamm...
-
Unscheduled scratch outage on Rice, Hammer, and Snyder
The scratch filesystem serving Hammer, Rice, and Snyder has been restored to normal operations, and all queues have been re-enabled. Original Message: The scratch filesystem serving Hammer, Rice, and Snyder is partially unavailable. Both currently ru...
-
Carter has been returned to normal operations. All queues have been enabled. Update: December 2, 2015 12:15pm Carter is mostly ready to return to service, but the site-wide home filesystem has suffered a failure which is preventing this from being co...
-
Unscheduled scratch outage on Conte
The scratch filesystem has been restored to full service and all queues have been restarted. Original Message: The scratch filesystem serving Conte is currently unavailable. Both currently running jobs and attempts to access files in scratch will bl...
-
Update - 9:20pm Conte has been returned to full production as of 9:15pm. During the failure earlier today, the internal tracking of jobs within the scheduler on Conte was corrupted. Unfortunately, this resulted in all running and pending jobs being...
-
The Fortress archive service will be unavailable for regular maintenance on Wednesday, November 4th, 2015 from 6:00am to 8:00am EST. During this time, access via HSI, HTAR, Globus of...
-
November 3, 2015 6:15pm The maintenance for Radon is completed and the cluster has been returned to production. Original The Radon cluster will be unavailable on Tuesday, November 3, 2015 from 7:00am to 7:00pm EST, for scheduled maintenance....
-
Unscheduled outage for Samba/Windows
Service was restored around 7:30pm today. Engineers changed the way Samba authenticates users to avoid this problem going forward. -- Service was restored around 10:30am today, but has since failed again. Engineers are working on the problem, and we...
-
October 22, 2015 9:15pm All services have been restored and Hammer is now in production. October 22, 2015 7:00pm Engineers continue to work through issues relating to the move. Another update will be sent at 9pm. Original The Hammer cluster will be...
-
October 30, 2015 11:00am ITaP engineers have made additional timeout changes to the scratch filesystem, which have increased stability. Additional work is being scheduled for Tuesday, December 1, 2015 from 7:00am to 7:00pm. October 8, 2015 5:00pm An e...
-
Emergency scratch maintenance on Carter and Scholar
The scratch filesystem serving Carter/Scholar underwent emergency maintenance through Friday night and well into Saturday. We expect this work to resolve the periodic hangs this filesystem has been experiencing for the last two days. Job scheduling...
-
Cluster Maintenance - Hansen/Peregrine1
Update: September 22, 2015 1pm The work affecting the Hansen and Peregrine1 scratch filesystems has been completed and the clusters are back in full production. Original The Hansen and Peregrine1 clusters will be unavailable beginning on Tuesday, Septembe...
-
Update: September 23, 2015 8am Shortly after 2am, engineers were able to complete the file transfer and return Carter to production. Update: September 22, 2015 11pm The file transfer continues and will last well into the night. The next update...
-
Unscheduled scratch outage on Rossmann
**Update: August 25, 2015 9:00 pm** On Monday, August 24, a disk tray in the Rossmann scratch storage system suffered multiple failures and, despite great effort by both ITaP storage engineers and the system vendor, this portion of the scratch system...