RCAC student employee successfully defends Master’s Thesis
Yiqing Qu, a Graduate Research Assistant at the Rosen Center for Advanced Computing (RCAC), recently earned her Master of Science (MS) degree in Computer and Information Technology. Her thesis grew out of the work she conducted at RCAC, which established a way to measure and improve adherence to the FAIR principles of scientific data management.
Qu began working on her research project in September of 2022, when she first joined RCAC. Her work is part of the GeoEDF (Extensible Geospatial Data Framework), an NSF-funded project with the goal of providing seamless connections among platforms, data, and tools and making large scientific and social geospatial datasets directly usable in scientific models and tools. Essentially, GeoEDF allows researchers to easily find, combine, and use multi-scale geospatial data sets directly in their scientific workflows without time-consuming data wrangling across multiple platforms. Part of the GeoEDF project is to develop a resource data management portal that allows researchers to then publish their workflows and their results, allowing other colleagues to reproduce their work. A key strategy in ensuring such reproducibility is to adopt and adhere to the FAIR data principles.
The ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data in 2016. FAIR is an acronym for:
- Findability—Ensuring digital assets are easy for both humans and computers to locate through unique identifiers and rich metadata.
- Accessibility—Guaranteeing digital assets can be retrieved by users with clear and accessible data and metadata, regardless of the user's location.
- Interoperability—Facilitating the integration, exchange, and analysis of data across various formats and platforms, ensuring they can interact seamlessly.
- Reusability—Maximizing the utility of digital assets by ensuring they are well-described and maintain their value over time, allowing them to be reused in different contexts or for different purposes.
Without the use of FAIR principles, it is incredibly difficult for scientists to find and utilize the most relevant data for their research. Rajesh Kalyanam, a Senior Research Scientist for RCAC and Qu’s mentor for her project, speaks to the importance of including FAIR principles within data portals:
“So, essentially, there are what's called FAIR principles, which is an abbreviation of findable, accessible, interoperable, and reusable. These are principles for scientific data management. So when you're putting data up online—in order for it to actually be usable by other researchers—you want to make sure that you adhere to these principles. Any data portal or any website that's hosting research data needs to try and match or adhere to the FAIR principles, otherwise, that data could end up having limited impact outside of the original project for which it was created.”
The problem with FAIRness comes from its implementation. Research Software Engineers had no clear guidelines for adhering to the FAIR principles when creating a new data repository from scratch. So, in theory, incorporating FAIRness into data portals and data sets is a wonderful idea that everyone should start doing immediately. In practice, it is more akin to conquering the Chimera—or at least it was.
After discussing the idea with Kalyanam and her academic advisor, Dr. Baijian Yang, the Associate Dean for Research and a Professor at the Purdue Polytechnic Institute, Qu decided to tackle the problem of FAIRness for her Master’s thesis using the GeoEDF project as the functional basis of her research. The ultimate goal of her project was to create a methodology for evaluating FAIRness and to develop a structured approach to implementing improvements that would lead to the creation of FAIR-compliant data portals. If successful, researchers could then use her work to build their own data repositories that adhere to the FAIR principles.
“For working on my Master’s thesis,” says Qu, “I wanted to solve two questions: How to evaluate the FAIRness of the project, and how to improve the FAIRness of the project? I decided to spend two years working on these research questions, and chose this particular project due to its real-world applications and its close alignment with my personal research experience.”
Starting with a bare-bones Django portal framework built by the Globus team, Qu needed to:
- Evaluate the FAIRness of the data portal.
- Implement new features on the portal to improve the FAIRness score.
- Re-evaluate the FAIRness score of the improved data portal and iterate.
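The evaluate, improve, and re-evaluate cycle listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Qu's actual method or the F-UJI API: the `Metric` class, the metric names, the point values, and the target score are all hypothetical, and "fixing" a metric simply stands in for implementing a portal feature.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """One hypothetical FAIR metric with points earned out of points available."""
    name: str
    earned: float
    total: float

    @property
    def gap(self) -> float:
        # Points still missing for this metric.
        return self.total - self.earned


def fair_score(metrics):
    """Percentage of available points earned across all metrics."""
    total = sum(m.total for m in metrics)
    earned = sum(m.earned for m in metrics)
    return 100.0 * earned / total


def improve_until(metrics, target):
    """Evaluate, fix the metric with the largest gap, re-evaluate, repeat."""
    history = [fair_score(metrics)]
    while history[-1] < target:
        worst = max(metrics, key=lambda m: m.gap)
        if worst.gap == 0:
            break  # nothing left to improve
        worst.earned = worst.total  # stand-in for implementing a portal feature
        history.append(fair_score(metrics))
    return history


# Made-up metrics and points, purely for illustration.
metrics = [
    Metric("persistent_identifier", 0, 2),
    Metric("descriptive_metadata", 1, 2),
    Metric("license_information", 0, 1),
    Metric("standard_file_format", 1, 1),
    Metric("provenance", 2, 4),
]
history = improve_until(metrics, target=80.0)
print(history)
```

Picking the largest gap first is one simple way to prioritize; a real project would also weigh how much effort each feature takes to implement.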
Once she decided to use the Globus portal framework as a starting point, Qu began by deploying the portal on the Anvil Composable Subsystem, a Kubernetes-based private cloud managed with Rancher that provides a platform for creating composable infrastructure on demand. Anvil is an NSF-funded shared computing resource and Purdue’s most powerful supercomputer. By using Anvil, Qu ensured that her work would be immediately available to researchers nationwide. Qu then moved on to the first FAIRness evaluation. She tested three separate evaluation tools and determined that a tool known as F-UJI was best for the project. Qu used F-UJI to score the FAIRness of the bare-bones Globus framework, which received a score of 47%. As a benchmark, Qu chose a well-known, mature, FAIR-compliant platform with diverse data types and well-designed metadata. That platform, HydroShare, scored 64% in the FAIRness evaluation. Now that Qu had a target score to aim for, she began implementing new features for her new data repository, named the GeoEDF Data Portal.
In addition to a FAIRness score, F-UJI also provided feedback on what improvements could be made. Qu reviewed the feedback and prioritized the features that would have the greatest overall impact on the score. She then systematically added new features to the portal, increasing its FAIRness score from 47% to 60%, a substantial improvement that puts the portal roughly on par with HydroShare's 64% in the FAIRness assessment. Qu and her mentors were thrilled with this result. Qu is now working on a paper based on her work and hopes to present it to other portal developers and researchers who are building similar data portals.
“This project contained a lot of components,” says Qu. “When Rajesh was first presenting it to me, I could not imagine being able to complete it in two years. But we worked step-by-step to implement each of the components, and it ended up being a great success.”
“We are beyond excited by what Yiqing was able to do while working with us,” says Kalyanam. “You would be hard-pressed to find anyone who could complete so much in such little time, yet she did it while going to graduate school. She is one of the hardest working students I’ve ever had the pleasure of working with, and it was a joy to have her as part of the team.”
In her two years at RCAC, Qu accomplished an astonishing amount, learned a great deal, and had a positive experience with the organization.
“I really enjoy working at the RCAC,” says Qu. “Not only working on this specific project, but all of the activities I had there. There were many opportunities to share our work with others and have our projects seen. I participated in poster sessions and Lightning Talks, and was able to discuss my project with others and gain useful insights from their work as well. Also, I learned a lot. The mentorship at RCAC is very good. Rajesh worked with me over the two years and was very supportive. He knows a lot about doing the research as well as the actual software engineering. He was also very good at providing guidance and advice for all aspects of the project.”
Now that Qu has successfully defended her thesis and graduated, she will be transitioning to her new job at Klaviyo, a company that provides intelligent marketing automation powered by customer data. In her new role, Qu will be working on Klaviyo's real-time data pipeline, facilitating the ingestion, processing, and movement of data points that power Klaviyo's core functionalities. While RCAC is sad to see Qu go, they are very excited for her and know that she has a promising career ahead.
To learn more about HPC and how it can help you, please visit our “Why HPC?” page.
Anvil is Purdue University’s most powerful supercomputer, providing researchers from diverse backgrounds with advanced computing capabilities. Built through a $10 million system acquisition grant from the National Science Foundation (NSF), Anvil supports scientific discovery by providing resources through the NSF’s Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS), a program that serves tens of thousands of researchers across the United States.
Researchers may request access to Anvil via the ACCESS allocations process. More information about Anvil is available on Purdue’s Anvil website. Anyone with questions should contact anvil@purdue.edu. Anvil is funded under NSF award No. 2005632. GeoEDF is funded under NSF award No. 1835822.
P.S.—Yiqing: On behalf of the entire RCAC department, congratulations and best of luck!
Written by: Jonathan Poole, poole43@purdue.edu