DataCenterHub addresses challenges with preserving, sharing research data

November 11, 2016

All too familiar with the phrase “publish or perish,” researchers aren’t shy about sharing their results with colleagues, but they often don’t effectively store and share the full set of data that leads to those results.

“This is a large problem, and it needs a solution,” says Santiago Pujol, a professor of civil engineering at Purdue and the academic director for ITaP Research Computing, who worries that challenges in preserving research data can result in lost knowledge and unnecessary duplication of work. Pujol and ITaP research scientists are developing a platform, called DataCenterHub, that can help mitigate such concerns.

While published books and journal articles will likely be preserved and accessible for a long time to come, it’s unclear whether data stored on a medium such as a DVD will, like a book, still be available hundreds or thousands of years from now.

Even if the DVD survives and future generations have the appropriate technology to read it – a big “if,” given how fast technology changes – such a localized means of storage renders the data inaccessible to researchers working anywhere else in the world, who might notice something in the data that the original researchers didn’t. By preserving and sharing only a researcher’s final conclusions and overlooking all of the raw data that went into producing those results, we are potentially missing out on a vast amount of knowledge, Pujol says.

People have created databases for storing research data, but they’re typically specific to an individual field of study and inadequate for other researchers’ needs. Pujol and his then-graduate student Lucas Laughery, who is now a postdoctoral researcher at Purdue, realized the need for a new solution.

Most data repositories are organized in a way that forces a researcher to dig through layers of file hierarchy to find a single piece of data, which makes side-by-side comparison of experiments difficult. Users also choose their own format for the uploaded data, creating variability and making results hard to read and interpret.

Pujol and Laughery needed a platform that would not only preserve their data, but also enable it to be viewed and shared with others in a useful fashion. They turned to ITaP Senior Research Scientist Ann Christine Catlin and her team for help building a solution. The result of that collaboration is DataCenterHub, a repository that preserves data from all kinds of experiments and presents it to researchers in a clear, easily accessible way.

Rather than requiring a user to click through multiple screens to learn additional details about an experiment, DataCenterHub presents its uploaded datasets in a table, with each experiment in its own row and columns for attributes such as experiment title, source and date. A user can sort the datasets by any of these attributes or perform a keyword search. This makes finding a particular dataset much easier and also enables comparisons between datasets.
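
For readers who think in code, the flat-table idea can be sketched in a few lines of Python. This is only an illustration of the concept, not DataCenterHub's actual implementation: the Experiment class, the sample rows and the sort_by and keyword_search helpers are all hypothetical names and data invented for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Experiment:
    """One row of the table: a single experiment and its attributes."""
    title: str
    source: str
    date: date


# Hypothetical rows, invented purely for illustration.
EXPERIMENTS = [
    Experiment("Column cyclic test", "Lab A", date(2014, 3, 2)),
    Experiment("Beam shear test", "Lab B", date(2012, 7, 15)),
    Experiment("Wall shake-table test", "Lab A", date(2015, 1, 20)),
]


def sort_by(rows, attribute):
    """Return the rows ordered by any one attribute (title, source or date)."""
    return sorted(rows, key=lambda row: getattr(row, attribute))


def keyword_search(rows, keyword):
    """Return rows whose title or source contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [
        row for row in rows
        if needle in row.title.lower() or needle in row.source.lower()
    ]


if __name__ == "__main__":
    for row in sort_by(EXPERIMENTS, "date"):
        print(row.date, row.title, row.source)
    print(keyword_search(EXPERIMENTS, "shear"))
```

Because every experiment sits in the same flat list, sorting and searching each take a single pass over the rows, with no folder hierarchy to walk through first.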

DataCenterHub also organizes different kinds of data into separate groups, and provides formatting guidelines for certain types of files, resulting in a standardized format for uploaded data that eliminates inconsistencies between datasets uploaded by different users.

Pujol and Laughery have reached out to other research groups to learn how they are managing their data and to let researchers know about the benefits of using DataCenterHub.

“We’re trying to show researchers that instead of putting their data in a safe and setting that safe off to the side, they can put it into a (digital) safe that has windows into it and has shelves that help you organize it,” says Laughery.

Today, DataCenterHub has expanded beyond its roots in earthquake simulation and includes data from a broad variety of fields, including agriculture and entomology. DataCenterHub currently hosts almost 30 terabytes of research data, with more being added every day.

Faculty or graduate students interested in attending a presentation to learn more about DataCenterHub and how it might help with their research data needs should contact Lucas Laughery at llaugher@purdue.edu.

Originally posted: November 11, 2016  9:22am