Research Data Depot Consistency Checking Underway, Snapshot Schedule Affected

June 5, 2020
Data Depot

The Research Data Depot is currently undergoing a full filesystem consistency scan. While the scan is running, all backend servers in the GPFS filesystem are under additional load from checking the filesystem on top of serving data during normal operations. Until the scan completes, users may encounter the following symptoms:

  • Slow file access
  • Increased latency
  • File access returning incorrect data

Impact on Snapshots

Additionally, due to the increased load created by the filesystem check, regular snapshots may not complete successfully. This means that options for recovering accidentally deleted files will be limited for the coming weeks. We strongly urge the regular use of Fortress to maintain long-term archives of any important research data.

Background

In May 2020, scattered reports emerged of files whose content did not match what was expected. For example, reading a genome sequence file might yield a JPEG image.
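One way a user might spot this kind of mismatch is to check whether a file that should be plain text begins with the signature of a well-known binary format. The sketch below is illustrative only and is not a tool provided by ITaP; the function name looks_like_wrong_type and the short signature table are assumptions made for the example.

```python
# Leading "magic" bytes of a few common binary formats (not exhaustive).
BINARY_SIGNATURES = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
}


def looks_like_wrong_type(path):
    """Return the name of a binary format whose signature starts the file, or None."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for signature, name in BINARY_SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None
```

A non-None result for a file that should hold sequence text is a reason to look closer. A None result proves nothing, since a text file replaced by different text would pass this check.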

Conference calls were immediately initiated with IBM triage managers, and meetings with GPFS developers were held twice a week on average.

After consulting with IBM GPFS support staff, ITaP storage engineers determined the following:

  • As noted in "Unscheduled Data Depot outage on the clusters," the Research Data Depot suffered a disk failure on one half of the filesystem’s mirror in late April, which required the disk to be repaired (mmchdisk) and then brought back in sync.
  • The version of GPFS running on Data Depot had a bug in mmchdisk, in which the command may report success without synchronizing all of the replicas. Subsequent attempts to read files may retrieve data from an out-of-date or uninitialized replica.

Actions

ITaP engineers and IBM performed the following actions to remedy the GPFS bug:

  • May 15-19 - Upgraded GPFS software on all storage servers to patch level 4.2.3.21.
  • May 20 - With GPFS upgraded, the mmrestripefs command was executed to reconcile the out-of-date replicas.
    • Attempts to run mmrestripefs repeatedly failed with an I/O error, and various efforts were made to troubleshoot the failures.
  • May 29 - IBM engineers were at last able to identify an issue with an additional disk within the filesystem and return it to an “on” state. With that final issue resolved, mmrestripefs was able to begin. Its output confirmed to the Purdue/IBM team that the suspected issue (described above) did in fact exist, showed the extent of the data affected, and identified a path to resolution.
    • Data from mmrestripefs indicates that only 0.02% (two hundredths of a percent) of files in the system may have an out-of-date replica.
  • May 29 through present - mmrestripefs run ongoing across the entire Depot filesystem.
  • June 22 - 41% of all files have been scanned, and 25k out-of-date replicas have been identified.
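As a rough sense of scale, the figures above can be combined in a back-of-envelope calculation. This is a sketch under stated assumptions, not an ITaP estimate: it rounds "nearly a billion" up to exactly one billion files, and it assumes out-of-date replicas are spread evenly across scanned and unscanned files, which the notice does not state.

```python
TOTAL_FILES = 1_000_000_000   # "nearly a billion", rounded up (assumption)
AFFECTED_FRACTION = 0.0002    # 0.02% expressed as a fraction
SCANNED_FRACTION = 0.41       # 41% of files scanned as of June 22
FOUND_SO_FAR = 25_000         # "25k" out-of-date replicas identified

# Ceiling implied by the 0.02% figure.
upper_bound = TOTAL_FILES * AFFECTED_FRACTION

# Linear projection from scan progress (assumes an even distribution).
projected_total = FOUND_SO_FAR / SCANNED_FRACTION

print(f"Upper bound from the 0.02% figure: {upper_bound:,.0f} files")
print(f"Projection from scan progress: {projected_total:,.0f} replicas")
```

Under these assumptions the 0.02% figure corresponds to about 200,000 files, while the scan progress projects to roughly 61,000 out-of-date replicas. The two numbers count slightly different things (files versus replicas), so they are best read as orders of magnitude only.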

Outcomes

The Data Depot has nearly a billion files, and all must be scanned. Many files have already been corrected. The scan will flag files for repair as it encounters them, but completion will take several weeks. However, we understand how critical your research data on Depot is to your work, and we will not wait for the scan to finish before stepping in to correct any issues you find.

Any Depot file found pointing to the wrong replica can be repaired immediately.

Please report any file pointing to an incorrect replica to rcac-help@purdue.edu, and we will correct the replica.
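For users who already keep their own checksums of important files, a comparison like the one below may help find files worth reporting. This is a sketch under assumptions: the manifest format (a dictionary mapping each path to its recorded SHA-256 hex digest) and the function names sha256_of and find_mismatches are hypothetical, and this notice does not describe any such procedure.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest):
    """Return the paths whose current digest differs from the recorded one."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

A mismatch only shows that the file's content differs from what was recorded; it could also reflect a legitimate edit made after the checksum was taken, so it is worth ruling that out before reporting the file.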

Originally posted: June 6, 2020  12:09pm