Unscheduled Data Depot and community clusters outage

At about 9:30am, load on the Data Depot servers began climbing steadily. Combined with ongoing scaling issues in the metadata subsystem, this made Data Depot increasingly unresponsive for community cluster users, network drive users, and all other access methods.

Engineers traced the load to a specific set of cluster jobs and processes that were accessing the filesystem with highly inefficient patterns. After these misconfigured jobs were terminated, the load on the Data Depot servers returned to normal levels and Data Depot performance was restored.

As always, but particularly during this period of limited metadata capacity, we strongly recommend that researchers avoid performing heavy I/O operations against the Data Depot filesystem. Carrying out most of your intermediate processing in high-performance cluster scratch spaces (and only copying final results to Data Depot at the end) will likely yield much higher overall throughput. Please reach out to us if you would like to discuss your lab's data workflows and brainstorm possible improvements.
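
As a rough illustration of the scratch-first pattern recommended above, here is a minimal Python sketch. It is not an official tool: the function name `process_in_scratch`, the `scratch_root` and `depot_dir` arguments, and the chunk-file "analysis" are all hypothetical placeholders, and the real scratch and Data Depot paths depend on your cluster and lab. The idea is only that many small intermediate reads and writes stay in scratch, and Data Depot sees a single copy of the finished result.

```python
import shutil
import tempfile
from pathlib import Path


def process_in_scratch(scratch_root, depot_dir, n_chunks=4):
    """Do I/O-heavy intermediate work in scratch; copy only the final file out.

    scratch_root and depot_dir are hypothetical arguments: pass your own
    scratch directory and your lab's Data Depot directory.
    """
    scratch_root = Path(scratch_root)
    depot_dir = Path(depot_dir)
    depot_dir.mkdir(parents=True, exist_ok=True)

    # Private working directory in scratch for all intermediate files.
    work = Path(tempfile.mkdtemp(prefix="job_", dir=scratch_root))
    try:
        # Placeholder for the real analysis: many small writes and reads,
        # which is the kind of access pattern to keep off Data Depot.
        for i in range(n_chunks):
            (work / f"chunk_{i}.tmp").write_text(f"{i * i}\n")
        total = sum(int(p.read_text()) for p in sorted(work.glob("chunk_*.tmp")))

        final = work / "result.txt"
        final.write_text(f"{total}\n")

        # One sequential copy of the finished result to Data Depot.
        return Path(shutil.copy2(final, depot_dir / final.name))
    finally:
        # Remove intermediates so they never reach Data Depot.
        shutil.rmtree(work, ignore_errors=True)
```

In a real job, the loop would be replaced by your own processing step, and the two directories would come from your job script or environment rather than being hard-coded.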

We greatly appreciate your patience and understanding during this transition period. Please contact rcac-help@purdue.edu if you have any questions or concerns.

Originally posted: August 19, 2021 5:44pm EDT