Unscheduled Outage on Data Depot

July 20, 2016  11:00am – 11:00pm
Data Depot, Rice, Snyder

As of 7:30 pm, all methods for connecting to Data Depot have been restored to working order. All connections with Samba (Network Drive mappings: datadepot.rcac.purdue.edu, samba.rcac.purdue.edu) are working normally again.

More Rice and Snyder nodes have also been repaired and are back online. System engineers will continue to work on bringing the remaining offline nodes back online, and as they do so, job scheduling will return to its normal pace.

All Firebox Virtual Machines have had their Data Depot mounts repaired.

At this time, Data Depot is back to normal operations. If you encounter any lingering issues, please let us know at rcac-help@purdue.edu.

Update July 20, 2016 6:25pm

As of 6:25 pm, connections to Data Depot through Globus are functioning and stable again. Work continues on addressing the remaining issues.

Update July 20, 2016 6:01pm

System engineers are continuing into the evening to diagnose and resolve issues with Data Depot.

At this time, Samba connections (Network Drive mappings) to Data Depot are still having difficulties. Connections may work for a minute or two before freezing, or may lock up when trying to write a file. Connections to Data Depot through Globus are also experiencing intermittent problems.

In addition, a number of Rice and Snyder nodes are marked offline due to missing Data Depot mounts. With these nodes unavailable for new jobs, you may experience slower than normal job scheduling on Rice and Snyder.

Data Depot connections have been repaired on most Firebox Virtual Machines; however, a few VMs continue to have issues with Data Depot.

All cluster front-ends, however, are functioning and able to access Data Depot at this time. Work continues on addressing the remaining problems.

Another update to this outage will be posted by 11 pm this evening, at the latest.

Original Message

At around 11:00am on Wednesday, July 20th, 2016, a problem was discovered with some back-end machines serving Data Depot to Snyder, some Rice nodes, Samba (Network Drive mappings: datadepot.rcac.purdue.edu, samba.rcac.purdue.edu), and some ancillary services. This issue arose during routine updates to the back-end servers.

This disruption has resulted in unreliable performance, connection drops, or unresponsiveness from Data Depot on all of these services. Some software that relies on Data Depot is also affected.

Some compute nodes on Rice and Snyder were also affected, which may have impacted some running jobs that rely on Data Depot. Please monitor your job output for any unexpected error messages or problems.

System engineers are currently working on bringing services back up to speed and assessing any remaining issues. There is currently no estimated time to full recovery, but we will post an update no later than 6pm, or sooner if service is fully recovered.

If you have any concerns, or are seeing issues with Data Depot on other services, please let us know at rcac-help@purdue.edu.

Originally posted: July 20, 2016  3:55pm