Data Depot Hardware Replacement and Migration

  • May 11, 2021 5:00pm - May 12, 2021 11:00pm EDT
  • Outages and Maintenance
  • Data Depot

UPDATE: October 25, 2021  8:10pm

As of October 24, 2021, the Data Depot migration project is complete! The last remaining space was migrated to new hardware this past weekend. It has been a long ride, not without a few bumps, and we greatly appreciate your patience and understanding during this entire process!

Please contact rcac-help@purdue.edu with any comments, feedback or observations.

UPDATE: October 8, 2021  1:08pm

As of October 7, the back-ordered metadata subsystem hardware has arrived. Additional flash storage was immediately installed and brought online on the new Data Depot server, and the filesystem performance metrics have drastically improved.

Research Computing staff have resumed migration of the remaining Data Depot spaces. Please contact rcac-help@purdue.edu with any comments or observations.

UPDATE: August 18, 2021  1:02pm

In the past few weeks we have received a number of reports about decreased performance, slow access and frequent disconnects when working with migrated Data Depot spaces (both on the clusters and via mapped network drives on personal computers). We would like to provide an update and status report on the issue.

The root cause of this problem has been identified as subpar performance of the metadata subsystem in the new Depot, which causes many metadata lookup operations to time out or retry. Per the vendor's recommendation, an order for additional SSDs for the metadata subsystem was placed more than a month ago. Unfortunately, the global semiconductor shortage has gotten in the way, and the vendor is currently unable to procure the necessary drives or fulfill the order. We are not alone in this sad state, with several large projects in peer organizations waiting on their vendors for flash storage tiers. We are in active discussion with the vendor about alternative procurement sources and options.

In addition, the vendor is actively troubleshooting the frequent disconnects our network drive users have reported. We believe we are approaching a solution to this issue (within the confines of the overarching metadata issue).

While this may not be suitable for every workflow, we highly recommend Globus for large-scale data transfers to/from Data Depot with added reliability and resilience. Please see the Globus section of the user guide for more details on Globus transfers. Cluster users may additionally benefit from shifting the bulk of their intermediate processing from Data Depot to the much more performant cluster scratch filesystems (and copying final results back to Depot at the end of processing). Please reach out to us if you would like to discuss your lab data workflows and brainstorm possible enhancements to them.
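
As an illustration only, the scratch staging pattern described above might be sketched in Python as follows. This is a minimal sketch, not an official RCAC tool: the function name run_staged, the "input" and "results" subdirectory names, and the process callback are all hypothetical, and the Depot and scratch paths you pass in depend on your own lab space and cluster.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(depot_dir, scratch_root, process):
    """Stage inputs from Depot to scratch, process there, copy results back.

    Hypothetical layout (an assumption, not an RCAC convention):
    depot_dir/input holds the inputs, depot_dir/results receives the outputs.
    """
    depot_dir = Path(depot_dir)
    # Work in a private temporary directory on the scratch filesystem.
    work = Path(tempfile.mkdtemp(prefix="stage_", dir=scratch_root))
    try:
        inputs = work / "input"
        outputs = work / "output"
        # One bulk read from Depot.
        shutil.copytree(depot_dir / "input", inputs)
        outputs.mkdir()
        # All intermediate I/O happens on scratch, not on Depot.
        process(inputs, outputs)
        # One bulk write of the final results back to Depot.
        shutil.copytree(outputs, depot_dir / "results", dirs_exist_ok=True)
    finally:
        # Remove the scratch working directory whether or not processing succeeded.
        shutil.rmtree(work, ignore_errors=True)
    return depot_dir / "results"
```

The point of the pattern is that Depot sees only two bulk copies, while the many small reads and writes of intermediate processing hit scratch. The process argument is any callable that takes the staged input and output directories. For very large datasets, Globus is likely the better transfer mechanism, as noted above.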

We greatly appreciate your patience and understanding during this transition period. Migrating 2.5 PB of live data is no easy feat. Things will get better, but we still have some rough seas ahead of us at the moment. Please contact rcac-help@purdue.edu if you have any questions or concerns.

UPDATE: July 6, 2021  3:38pm

The background scan task that was severely affecting Data Depot performance during the past couple of weeks completed successfully over the weekend of July 4th. Data Depot responsiveness and performance are now back to normal. Please let us know if you still see any abnormalities.

UPDATE: June 22, 2021  1:31pm

In the last few days we have received a number of reports about decreased performance and slow access to migrated Data Depot spaces. This is a known issue caused by high load from a background filesystem scanning task. The task must be allowed to run to completion as part of the hardware migration process to re-confirm the integrity and correctness of all transfers.

The scanning task has been running since June 11. We do not have an exact estimate of when it will finish, but we anticipate that it will be done within a few days to a week or so.

We appreciate your patience and understanding during this transition period. Please reach out to us if you have any questions or concerns.

UPDATE: May 13, 2021  11:28am

Earlier this morning we received a number of tickets about issues with mounting migrated Data Depot spaces as Windows/Mac network drives.

After troubleshooting with vendor support, the issue is now resolved. If you had problems earlier, please try to re-map the drives following the instructions in the User Guide section.

UPDATE: May 12, 2021  9:29pm

As of 9:29pm, the first stage of the Data Depot migration is complete. 614 out of 751 spaces (totaling 247 TB in 162 million files) have been migrated to the new hardware at an average rate of 3 GB/second.
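
For readers who want to sanity-check these figures, the short Python sketch below works through the arithmetic. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which the announcement does not specify, and the function names are purely illustrative.

```python
def transfer_hours(total_tb, rate_gb_per_s):
    """Hours needed to move total_tb terabytes at rate_gb_per_s gigabytes/second.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 GB = 1e9 bytes.
    """
    seconds = (total_tb * 1e12) / (rate_gb_per_s * 1e9)
    return seconds / 3600


def percent_migrated(done, total):
    """Share of spaces migrated, as a percentage."""
    return 100.0 * done / total


if __name__ == "__main__":
    # Figures quoted in the update above.
    print(f"Transfer time: {transfer_hours(247, 3):.1f} hours")        # about 22.9
    print(f"Spaces migrated: {percent_migrated(614, 751):.1f}%")       # about 81.8
    print(f"Average file size: {247e12 / 162e6 / 1e6:.2f} MB")         # about 1.52
```

Under these assumptions, moving 247 TB at a sustained 3 GB/second takes a little under 23 hours, and 614 of 751 spaces is roughly 82% of all spaces.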

Data Depot has been returned to full production service. Most access methods will continue working without change for both migrated and not-yet-migrated spaces. However, accessing migrated spaces via Windows/Mac network drives (SMB/CIFS) will require a change on your part. Please see the Data Depot Migration FAQ for detailed information on all aspects of the migration and on accessing your Data Depot spaces.

We will be contacting owners of the remaining spaces individually to schedule their migration in the coming weeks.

UPDATE: May 12, 2021  10:12am

Data Depot migration continues as planned, with over 600 out of 751 spaces migrated in the first 13 hours. Data Depot and community clusters remain unavailable while data is being moved.

We will provide another update by the end of the maintenance (11pm as planned).

ORIGINAL: May 11, 2021 5:00pm - May 12, 2021 11:00pm EDT

From May 11, 2021 5:00pm to May 12, 2021 11:00pm EDT, the Data Depot storage service will be unavailable while it is transitioned to new hardware. All Depot access methods (SCP/SFTP, Windows network drives, Globus, NFS exports, direct mounts on Research Computing clusters, etc.) will be affected.

Current Status: Data Depot was brought online in 2014 and has been instrumental in providing an affordable and reliable shared storage solution to the Purdue research community, while tightly integrating with the Community Cluster cyberinfrastructure. The system is currently utilized by 740 research labs and collaborative projects, using close to 95% of its size and file count capacity.

The lifecycle hardware replacement and expansion project for Data Depot has been underway for quite some time now, with the goal of providing an even larger, faster and more feature-rich storage system. As you can imagine, syncing 2.5 PB of live, changing data is no small feat. Data has been continuously replicated from the current filesystem onto the new one for the last six months. The first rounds of file syncs have completed, and regular syncs will continue until the cut-over is executed.

Storage Resiliency: The quickly growing file count has caused a few issues recently. The filesystem, being nearly full, affects performance and some important functionality of the storage system. Specifically, due to increased load and limited metadata resources, we had to resort to suspending all regular snapshot operations on the current Data Depot filesystem. Please note that this does not in any way affect the resilient and redundant distributed nature of Data Depot storage. Your data remains fully safe, secure and protected against hardware and software failures across multiple disk trays, storage servers and datacenters, but options for self-recovering accidentally deleted files are temporarily unavailable. However, data is copied to and snapshotted on the new hardware, and RCAC staff are happy to recover files for you. As always, we strongly encourage regular use of Fortress to maintain long-term archives of any important research data.

Migration Plan: We strive to make the switch as transparent and straightforward as possible, with little to no changes in your everyday operations. On the day of the transition, Data Depot will require a downtime. During this period, we will migrate Depot spaces for the majority of current labs, so at the end of this downtime most of you should be able to simply return to your usual operation (now on the new Depot). A handful of very large or metadata-intensive spaces may not be fully converted during this initial run. For such spaces, the old Depot spaces will remain functional and we will individually coordinate with affected research groups for best times to complete their transition. Once everyone is migrated, we will need one final downtime (likely in the mid-July or early August time frame) to finalize the flip on the back-end.

For those of you who also use Community Clusters, please note that clusters will be unavailable during the two transitional downtimes.

We greatly appreciate your patience and your continuing support and use of our services. Please reach out to us if you have any questions or concerns.

Originally posted: April 20, 2021 10:44am EDT