
New Protected Data Filesystem is available at RCAC

The Rosen Center for Advanced Computing (RCAC) announces new capabilities to enable scientific computing with protected life science data within the community cluster program. Interim CIO Christian Theumer stated that Purdue IT invested in life science computing because demand for sector-specific capabilities has increased. "We observed a burgeoning demand for life sciences computing and enhanced data protection services, specifically with the One Health initiative ramping up. Purdue IT has invested in new resources for our community cluster program in order to address these needs."

Purdue’s open science community cluster program has long supported a full portfolio of I/O subsystems for high-performance computing. Capabilities include high-speed POSIX scratch, project space in the Data Depot, S3-compatible object storage, and the Fortress archive. These I/O subsystems must support a wide set of workloads from over 60 departments and every academic college at Purdue, ranging from physics-based modeling and simulation to bioinformatics, Cryo-EM microscopy, imaging, and AI. While many of these workloads resemble those in open science, the data use agreements (DUAs) for many life or social science datasets frequently require higher levels of security, such as heightened access control and encryption.

To address this gap, RCAC has deployed the Protected Data Filesystem (PDFS) as a resource for sensitive or restricted data. Available today on the Negishi cluster, this filesystem provides over 2.5 PB of fully encrypted high-performance Lustre file storage and can expand as demand requires. The PDFS is well suited for all HPC applications and performs on par with the main scratch filesystem on Negishi. Community cluster user-management tools allow access controls to be managed in alignment with contractual requirements.

A high-assurance instance of the Globus data transfer tool provides fully audited and encrypted transfers into and out of the filesystem. Later this year, new offerings in the ecosystem will include a fully protected HPSS archive for long-term storage of controlled data.

Research Associate Professor Nadia Lanman of the Department of Comparative Pathobiology uses community clusters for large-scale bioinformatics workloads. “Our group requires the ability to process patient sequencing data, which could be potentially identifiable and is thus protected data. These capabilities on our campus HPC are essential to be able to work with human genetic data and sequencing information to enable the discovery of driver mutations, epigenetic modifications, and other features that can lead to oncogenesis or disease progression.”

As with the Data Depot, access to the PDFS is available at an annual per-TB rate for non-purged high-performance storage. Purdue researchers with data use agreements requiring higher levels of data security may request the PDFS in their data security plan (DSP) during the review process.
To streamline the process and offer a default option for protected data, the PDFS and all Purdue IT high-performance computing resources are built and operated in line with the best practices of NIST SP 800-223, "High-Performance Computing Security: Architecture, Threat Analysis, and Security Posture," and are certified to ISO 27001. According to RCAC Executive Director Preston Smith, “Aligning our data security to standards like ISO 27001 or NIST controls gives us a common vocabulary with which we may easily communicate to sponsors that our information security posture meets their terms.”

The PDFS has been approved to host a variety of datasets, including but not limited to data subject to the NIH Genomic Data Sharing Policy, Database of Genotypes and Phenotypes (dbGaP), the UK Biobank, human cancer genomic data, and more.

Purdue IT encourages PIs with any sort of protected dataset requiring analysis to reach out to RCAC or Information Assurance to discuss their DUA’s data security requirements and potentially utilize the filesystem. While appropriate for a wide variety of sensitive and restricted data, the PDFS is NOT intended for export-controlled data subject to ITAR, or for controlled unclassified information (CUI).

Please visit RCAC’s Storage page to learn more about the available storage resources within RCAC’s community cluster ecosystem.

To learn more about High-Performance Computing, please visit our “Why HPC?” page. To stay up-to-date on all RCAC projects and updates, please visit our “News” page.

Originally posted: