Outages
-
Scheduling paused on ITaP research clusters
During scheduled maintenance on network equipment connecting storage to ITaP clusters, all scheduling will be paused from 4-6pm. Running jobs will continue to execute, and new jobs may be submitted to PBS queues, but no new jobs will start u...
-
Unexpected Power Outage in MATH
Update: Noon, 1/8/13 The power issue in MATH has been resolved. Power has been restored to the nodes in the Coates-A subcluster affected by the outage. ITaP engineers have verified that the Coates-A subcluster is operating correctly, and have restart...
-
Campus chilled water serving the MATH data center is experiencing above-normal temperatures, and as a precaution, scheduling on the Coates, Rossmann, Hansen, Carter, and Radon clusters has been stopped. Steele is not affected. There should be no impa...
-
Update: As of about 11:00 am, the problem with the chilled water has been corrected, and scheduling has resumed on all RCAC clusters. Thank you for your patience. If you encounter any issues or have questions, please contact us at rcac-help@purdue.ed...
-
As of 9:00am, we are seeing a problem with the LustreC scratch filesystem that serves Carter, Hansen, and Peregrine1. To prevent any more jobs from running into this, we have temporarily suspended scheduling of new jobs, though you may still submit to...
-
Update: ITaP engineers have corrected the issue affecting the LustreC filesystem. The system is back in production. Job scheduling on Carter, Hansen and Peregrine1 has been restarted. As always, thank you for your patience. If you encounter any issue...
-
Scheduling paused on Carter cluster
Update: 8:12pm Scheduling on Carter has been resumed, and Carter is back in full production. Original Message: Beginning the morning of April 16, a number of compute nodes on the Carter cluster are experiencing a connectivity issue. While ITaP engine...
-
Network outage affecting Peregrine1 cluster
On April 24, 2013, network engineers will be relocating fiber optics that connect the Peregrine1 cluster to infrastructure in West Lafayette. This outage is scheduled for 12:00am through 5:00am. This will leave Peregrine1 unable to run jobs. Any PBS j...
-
Resolved: As of about 4:45pm ET, the connectivity issue affecting the Fortress archive has been resolved. The HPSS archive is back in full production. If you encounter any issues, please contact us at rcac-help@purdue.edu. Update: ITaP Storage Enginee...
-
LustreC filesystem unavailable
Update: May 13, 2013 11:00pm: LustreC has been returned to service. Carter, Hansen, and Peregrine1 are back in production with queues enabled. Update: May 13, 2013 3:00pm: storage engineers are continuing to work with vendor support to return Lustre...
-
Fortress HPSS Archive Unavailable
Update - 10:15 am Fortress is back in full production. Original Message: As of 8:00am, Thursday, September 19, the Fortress HPSS is temporarily unavailable due to issues with communicating with its tape drives. Storage engineers are working to return...
-
Partial scratch96 filesystem outage
On the evening of 10/10/2013, the fileserver providing the "scratch96" filesystem serving some users of the Steele and Radon clusters suffered a permanent failure of its second-tier storage. This means that files on scratch96 that are older th...
-
Update: 11:00pm, Nov. 12, 2013 ITaP storage engineers have returned the offline hardware to production and LustreC is back in production. Queues on Hansen and Carter have been restarted as of 11:45pm. Update: 5:00pm Following consultation with vendor...
-
Nearly all major clusters operated by ITaP Research Computing are stopped due to issues with their storage systems relating to the power loss on the West Lafayette campus in the wake of the severe weather Sunday night. This includes: Conte, Carter,...
-
The Fortress HPSS Archive is offline due to issues with its storage systems relating to the power loss on the West Lafayette campus in the wake of the severe weather Sunday night. Engineers are investigating the problem now, but until this is reso...
-
All ITaP Research Computing systems are currently experiencing an issue with accessing network filesystems. A case has been opened with our vendor as ITaP engineers troubleshoot the issue. Cluster users may experience issues accessing files in /home,...
-
LustreD filesystem unavailable
Update - 2:25pm, 12/16/2013 The LustreD scratch filesystem has been returned to service and both the filesystem and scheduler appear to be working properly. Conte has been returned to normal production service as of 2:20pm. Update - 10:30am, 12/16/2...
-
The Lustre D filesystem, serving the Conte cluster, has become unavailable as of about 8:00 pm Thursday 13 Feb, 2014. System engineers are working to bring the system back to 100% operation. Currently running jobs should be able to continue, but sch...
-
Scheduling Paused on Hansen and Carter
The scratch filesystem on Hansen and Carter is currently unavailable due to a hardware issue. Attempts to access scratch will block until the filesystem is back online. Job scheduling on Hansen and Carter has been paused while storage engineers addre...
-
UPDATE: Fortress was successfully returned to service as of 7:35 pm Wednesday, 15 July. As of 8:30am on July 15, 2014, the Fortress HPSS Archive is unavailable due to a hardware issue. Access to Fortress via HSI, HTAR, Globus, or CIFS is not availabl...