Running Jobs on Community Clusters While Data Depot is Unavailable

April 20, 2020 9:00pm - April 22, 2020 3:40pm EDT
Outages
Brown, Gilbreth, Halstead, Hammer, Rice, Scholar, Snyder

Since Friday, April 17, the Research Data Depot filesystem has been unavailable on community cluster systems due to an ongoing filesystem verification. While we don't believe there is any danger of data loss, the filesystem verification will continue for some time per vendor-recommended procedure.

To return the clusters to service while this process continues, we will resume job scheduling while Research Data Depot is temporarily unavailable on cluster compute nodes. This has several unusual consequences that you should be aware of.

Running Jobs

Any operation that refers to /depot in any way will fail, while running jobs that do not touch /depot will not be affected. Referring to /depot includes the following scenarios:
- Anything that loads bioinformatics modules (module load bioinfo, module use /group/bioinfo, etc.)
- Anything that explicitly refers to /depot (your lab’s shared settings, programs, scripts or data) in your job script. This includes (but is not limited to) variants of cd /depot/..., source /depot/mylab/..., export PATH=/depot/mylab/..., or module use /depot/mylab/...
- Anything that explicitly refers to /depot/ in your shell startup files (such as ~/.bashrc, ~/.bash_profile, ~/.profile, ~/.cshrc, ~/.login, etc.). This includes (but is not limited to) variants of source /depot/mylab/..., export PATH=/depot/mylab/..., or module use /depot/mylab/...
- In general, anything that tries to read or write to /depot/...
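One way to look for the references listed above is a plain text search over your job scripts and shell startup files. The following is a minimal sketch, not an official tool: the function name depot_refs is hypothetical, and the search pattern only covers the literal strings named in this notice, so it cannot catch /depot paths that are built from variables or reached through symbolic links.

```shell
# Hypothetical helper: print every line in the given files that mentions
# /depot or the bioinformatics modules named in this notice.
depot_refs() {
    # -H prints the file name, -n the line number, -E enables alternation.
    # Error messages for missing files are discarded.
    grep -HnE '/depot|/group/bioinfo|module[[:space:]]+load[[:space:]]+bioinfo' "$@" 2>/dev/null
}

# Example usage (file names are illustrative):
#   depot_refs ~/.bashrc ~/.bash_profile ~/.profile ~/.cshrc ~/.login
#   depot_refs myjob.sub
```

No output means no literal match was found in those files; it does not prove that a job never touches /depot.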

Jobs Waiting in Queues

All jobs that are now in the queue will be held. This means they will stay in the queue, but the scheduler will not try to start them.
Jobs that you are certain do not use the /depot filesystem for any purpose can be released to the scheduler. You can do this for your own jobs with the command scontrol release JobID.
Jobs that do refer to /depot should not be released, because they will fail while the /depot filesystem is unavailable. You can leave these jobs in the 'held' state until /depot is remounted on the nodes, or, if it is more convenient for you, you can cancel them with the scancel JobID command. We will leave those decisions up to you.
If you do nothing, when the Data Depot is returned to full service and is available to compute nodes, any jobs we have held will be released, and the scheduler will treat them normally.
If you are not sure whether /depot is referred to in your jobs, we recommend you either cancel them or leave them held.
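The release-or-hold decision described above can be sketched as a small shell function. This is an illustration under stated assumptions: the function name suggest_action is hypothetical, it expects you to supply the job ID and the path of the script you submitted, and it only prints a suggested command instead of running it. A script with no literal match can still reach /depot indirectly (for example through a shell startup file), so treat a "release" suggestion as a prompt to double-check, not as a guarantee.

```shell
# List your pending jobs with job ID and reason (held jobs show a
# "JobHeld..." reason):
#   squeue -u "$USER" -t PENDING -h -o '%i %r'

# Hypothetical helper: given a job ID and the path to its job script,
# print a suggested action. Nothing is released or cancelled here.
suggest_action() {
    local job_id=$1 script=$2
    if [ ! -r "$script" ]; then
        # Unsure cases stay held, as this notice recommends.
        echo "# job $job_id: cannot read $script, leave held"
    elif grep -qE '/depot|/group/bioinfo|module[[:space:]]+load[[:space:]]+bioinfo' "$script"; then
        echo "# job $job_id: refers to /depot, leave held (or: scancel $job_id)"
    else
        echo "scontrol release $job_id"
    fi
}
```

For example, suggest_action 12345 myjob.sub prints scontrol release 12345 only when myjob.sub contains none of the searched strings; the job ID and file name here are placeholders.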

Submitting New Jobs

New jobs can be submitted, but until the emergency outage is over, the caveats above will apply. Specifically, any new job that refers to /depot will fail if it runs before Data Depot is returned to full service.

Getting Data To and From Your Depot Space

For low-impact access to your saved files and documents, you can use the SMB/Windows Network Drive method.
Data needed for processing can be staged from your Depot space to your cluster scratch space using Globus.
Please note that performance is strongly affected at the moment, and we advise refraining from unnecessary large transfers.

We appreciate your patience during this difficult time. As always, please contact us at rcac-help@purdue.edu if you have any questions or concerns.

Originally posted: April 20, 2020 7:58pm EDT
