Partial scratch96 filesystem outage

October 10, 2013  8:00am – 5:00pm

On the evening of 10/10/2013, the fileserver providing the "scratch96" filesystem, which serves some users of the Steele and Radon clusters, suffered a permanent failure of its second-tier storage. As a result, files on scratch96 that are older than 30 days may be permanently lost. Affected files may still appear in the output of "ls" and other commands, but may return I/O errors when accessed. Recently used files (newer than 30 days) are stored on a different tier of storage and are unaffected.

You can determine which fileserver hosts your scratch directory by examining the contents of the $RCAC_SCRATCH environment variable. If the path returned contains "scratch96", then your scratch space may be affected.

-bash-4.1$ echo $RCAC_SCRATCH
/scratch/scratch96/p/psmith
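
The same check can be scripted. The following is a minimal sketch, not an official RCAC tool: the function name is_on_scratch96 is made up for illustration, and the check assumes only that an affected path contains the string "scratch96", as described above.

```shell
# Sketch: report whether a scratch path appears to live on scratch96.
# is_on_scratch96 is a hypothetical helper name, not an RCAC-provided command.
is_on_scratch96() {
    case "$1" in
        *scratch96*) return 0 ;;   # path mentions scratch96: possibly affected
        *)           return 1 ;;   # any other path (e.g. scratch95, lustreB)
    esac
}

# Check the current user's scratch path; an unset variable counts as "not on scratch96".
if is_on_scratch96 "${RCAC_SCRATCH:-}"; then
    echo "Your scratch directory may be affected: ${RCAC_SCRATCH}"
else
    echo "Your scratch directory does not appear to be on scratch96."
fi
```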

The filesystems "scratch95" and "lustreB" are NOT affected.

If you are a user of scratch96, please examine the contents of your scratch directory to identify the impact of this event on your files.
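
One possible way to carry out this examination is sketched below. It is an illustration under stated assumptions rather than a supported procedure: the function name find_unreadable is hypothetical, "older than 30 days" is interpreted here as a modification time more than 30 days ago, and the sketch assumes that attempting to read the first byte of a damaged file is enough to trigger the I/O error. File names containing newlines are not handled.

```shell
# Sketch: print regular files under a directory that cannot be read.
# find_unreadable is a hypothetical helper name, not an RCAC-provided command.
find_unreadable() {
    # Assumption: "older than 30 days" means last modified more than 30 days ago.
    find "$1" -type f -mtime +30 -print 2>/dev/null |
    while IFS= read -r f; do
        # Try to read the first byte; a failure suggests the file is damaged
        # (or otherwise unreadable, e.g. because of its permissions).
        if ! head -c 1 "$f" > /dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, running find_unreadable "$RCAC_SCRATCH" > "$HOME/scratch96_unreadable.txt" would save the list of files that failed the read test for later review. On a large directory tree this scan may take some time.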

As a reminder, scratch filesystems are not backed up and offer no protection against hardware failures or accidental deletions. Please be sure to use the Fortress HPSS archive to permanently store your data and results.


Originally posted: October 10, 2013