Outages
-
Unscheduled Outage on Data Depot
As of 7:30 pm, all methods for connecting to Data Depot have been restored to working order. All connections with Samba (Network Drive mappings: datadepot.rcac.purdue.edu, samba.rcac.purdue.edu) are working normally again. More Rice and Snyder nodes...
-
Self-service management web tool outage
As of 3:20 pm, the self-service tool is back in action. An issue with the database backing authentication was discovered and repaired. Original message: The self-service management tool (user management) is experiencing issues with authentication. Att...
-
Unscheduled Outage on Data Depot
UPDATE: As of 5:30 pm, Friday, 5 August 2016, we believe the problem affecting access to the Data Depot has been corrected. Thank you for your patience, and I apologize for the disruption this caused. Original Message: Access to the Data Depot is curr...
-
Degraded performance of several systems
We have seen a significant wave of these events this morning, September 21. For the most part, this wave seems to have been linked to a storage problem that has been resolved. However, we are implementing new monitoring and response procedures toda...
-
Unscheduled scratch outage on Carter
UPDATE: ITaP engineers have implemented a temporary solution so that work may continue on Carter until the upcoming scheduled maintenance window on Tuesday. Any running jobs that were using the scratch space have been stopped in order to allow for t...
-
Unscheduled Scratch Outage on Carter
UPDATE: As of about 6:30 pm, the new scratch system was brought back online, and scheduling has been restarted on Carter. Original Message: The new scratch filesystem serving Carter that was just activated on Tuesday night is currently unavailable. Bot...
-
Measures taken within the first two hours of this problem seem to have resolved the issue. Original Message: A portion of the systems serving the Research Data Depot have suffered a failure. Some systems using Depot have been affected, particularly...
-
GitHub issues - Internal Server Error and Web Interface Not Updating
The issue with the GitHub web interface was resolved late yesterday evening. The website is now reflecting changes made to git repositories as normal. Please let us know at rcac-help@purdue.edu if you see any continuing issues. The sporadic "Int...
-
This issue has been resolved. Original Message: A portion of the systems serving the Research Data Depot have suffered a failure. Some systems using Depot have been affected, particularly Carter, Snyder, and systems accessing Depot over NFS. Some sys...
-
Job scheduling paused on Radon
Job scheduling was paused on Radon between 6 pm and 7 pm this evening. Node monitoring processes marked most nodes offline around 6 pm, preventing new jobs from starting. System engineers cleared the fault in the node monitoring, and nodes came back...
-
Update: Engineers were able to isolate the problem and restart the necessary systems. The Data Depot should be available again. Halstead users should double-check their running work. Original Message: A portion of the systems serving the Research Data Depot have suffe...
-
UPDATE: As of 7:50 pm, Wednesday, 14 December 2016, this issue is completely resolved. UPDATE: As of about 6:00 pm, another problem has been found in the EXRC scheduler code. We will update this news item once we have more details. Original Item: The EXR...
-
Unscheduled Outage for EXRC Cluster
Following the restoration of power to the affected building, the EXRC cluster was returned to service on Thursday, December 22nd, 2016 at 2:45pm EST. Original article: As of Tuesday, December 20th, 2016 at 12:00pm EST, EXRC is unavailable due to...
-
Connectivity issues to Research Data Depot
System monitoring has revealed intermittent issues connecting to the Research Data Depot on Thursday January 19. When this issue occurs, users will experience pauses when working in a UNIX shell on community cluster systems, or as interrupted or drop...
-
Unscheduled scratch outage on Conte
The scratch filesystem serving Conte is currently unavailable. Both currently running jobs and attempts to access files in scratch will block until the filesystem is back online. Job scheduling on Conte has been paused while storage engineers addres...
-
Halstead MPI problem, scheduling paused
Following the security updates on Halstead, an issue was discovered that prevented multi-node MPI jobs from running properly. Scheduling on Halstead has been stopped, and systems engineers are working on fixing the issue. We will provide further stat...
-
Unscheduled scratch outage on Rice, Snyder, and Hammer
The scratch filesystem serving Hammer, Rice, and Snyder is currently unavailable. Both currently running jobs and attempts to access files in scratch will block until the filesystem is back online. Job scheduling on Hammer, Rice, and Snyder has been...
-
The Research Data Depot has been restored to service. Original Message: A portion of the systems serving the Research Data Depot have suffered a failure. Some systems using Depot have been affected, particularly research clusters and users accessing the Depot over NFS...
-
Partial scratch outages on Rice, Snyder, Carter, Scholar and Hammer
The scratch filesystems serving Carter, Hammer, Rice, Scholar, and Snyder started behaving abnormally this morning. This may have affected some jobs, and anyone using one of the login nodes for these clusters may have had sessions freeze or seen dela...
-
The Fortress archival storage system is currently experiencing intermittent connectivity. We expect the situation to be resolved by approximately 1pm. UPDATE: Storage engineers have resolved the connectivity problems and Fortress is back in full prod...