Cancer researchers learn about big data analysis using Anvil

The BigCare 2023 Summer Workshop, an on-site program that utilized the Anvil supercomputer to help train cancer researchers on how to harness the power of big data, has recently concluded, and based on participant feedback, it was a huge success!

The BigCare workshop, otherwise known as “Big Data Training for Cancer Research,” is a program funded by the National Cancer Institute (NCI). The purpose of the workshop is to help cancer researchers develop the requisite skills for managing, visualizing, analyzing, and integrating various types of “omics” data in cancer studies. BigCare was founded in 2020 by Min Zhang, PhD, a professor of epidemiology and biostatistics at the University of California, Irvine’s Program in Public Health, as well as the biostatistics shared resources director for the UCI Chao Family Comprehensive Cancer Center. Also on the team of organizers are the two co-investigators, Dr. Nadia Lanman and Dr. Dabao Zhang, and the senior IT specialist, Doug Crabill. Together, the team has worked hard over the past four years to ensure the BigCare workshop is a success.

This year, the BigCare on-site program was a 10-day intensive workshop hosted at Purdue University. For those who could not make it to West Lafayette, a self-paced online course is currently being offered. The Rosen Center for Advanced Computing’s (RCAC) Anvil supercomputer is the high-performance computing (HPC) resource used to support both versions of the workshop, allowing students and researchers to easily access, manage, and analyze the massive amount of data involved in their research.

Cancer research is a vast field, and scientists study the disease in many different ways. The BigCare 2023 Summer Workshop focused on the analysis and interpretation of genomic and genetic data, but the principles taught in the program can be applied to any omics data research. Omics data refers to any information obtained from research in biological fields that end with the suffix -omics: genomics, proteomics, transcriptomics, etc. Omics studies generally involve information at the biomolecular level, so the research yields immense amounts of data—too much to be efficiently handled on personal computers. This field of research thus presents two major issues:
  1. Most cancer researchers are not data scientists. Even when equipped with the proper computing resources, they may not know how to handle the amount of data required to draw actionable conclusions from the results.
  2. Most cancer researchers are not computer science professionals. They do not know how to write code, develop or manipulate software, or run jobs on HPC systems.

The BigCare workshop provides solutions to both of these problems by improving the participants’ practical bioinformatics skills and their computational competency. Dr. Min Zhang, the principal investigator on the NCI-funded project, teaches the participants the skills needed to analyze their research data, while Anvil provides an HPC environment that has a very low barrier to entry, ensuring that non-HPC professionals can quickly and easily complete their research without having to become experts in computing.

“All of our participants are cancer researchers, so they have zero experience in computing and minimal experience in statistics,” says Zhang. “The data we use in the workshop is essentially the data they collected in their own research, and a lot of times, the participants have already sent it to a company and paid money to have it analyzed for them. Then, every time they need something else from that data, the company always asks them to pay more. So the idea is, since they generated the data by themselves, why not enable them to analyze the data too?”

Giving researchers the ability to manage and analyze their own data gives them a much higher degree of control over their research, allows for the data to be used over time in many different ways, and decreases the likelihood of a research project being hamstrung due to insufficient funding. By teaching these skills and introducing researchers to resources such as Anvil, the BigCare workshop helps enable cancer research to grow at a faster rate, potentially leading to a breakthrough in cancer care.

A technical problem that many cancer researchers face is the sheer size of the data sets that their research produces. Beyond knowing how to manage this data, a researcher also needs access to a computer powerful enough to store and process it. Personal and work computers simply won’t cut it, which is where supercomputers like Anvil step in. Anvil is an extraordinarily fast HPC system that is available to researchers across the nation. It is funded by the National Science Foundation (NSF), meaning that researchers can gain access to Anvil for free. But Anvil’s power is not its only highlight: it was also designed to be used by those with little to no HPC experience, which makes Anvil a desirable option for cancer researchers.

“We don’t want to turn everyone into a computer scientist, because they have more important things to do,” says Zhang. “Previously, we had to teach users the front end, back end, command lines, all this kind of stuff, and now it’s all gone! Life is so much easier. And everyone was so excited that they wanted to take Anvil to their own institution. Some of them would even say, ‘We do have HPC, we do have cloud, but it’s not as user-friendly as Anvil.’”

Anvil was so helpful for the workshop that Zhang intends to renew it as the resource for supporting BigCare for the foreseeable future. “It definitely made our workshop run much better and much smoother, and attracts many more researchers. So I think we will carry on this collaboration for not only next year, but the next five years. The cancer researchers will benefit a lot from the Anvil computing environment.”

The participants from this year’s on-site workshop varied tremendously, with both students and professionals attending the event. BigCare accepts applications from all qualified faculty, clinicians, postdocs, and post-prelim graduate students in cancer research and specifically encourages individuals from underrepresented groups to apply. According to Zhang, “The cancer researchers can be from many different levels: senior grad students, postdocs, junior faculty, senior faculty. But what they share is a common interest in wanting to learn how to analyze big data.” And there is certainly a lot for participants to learn from the program. Key topics from this year’s workshop include:

  • Using public databases and tools to better understand the molecular basis of cancer
  • Differential expression analysis using bulk and single-cell RNA-seq data
  • Use of Next Generation Sequencing data for ChIP-seq and epigenetics
  • Visualization and functional assessment of data
  • Network construction and meta-analysis
  • Genome variation and genome-wide association study

Another unique aspect of the workshop is that the data sets used to teach these topics are brought in by the participants themselves from research they have already conducted. This is immensely useful—not only do the researchers get to learn with their own real-world data, they also have an excellent jumping-off point for when they return home.

“To finish a project is more like homework after the workshop,” says Zhang. “They are so excited, they bring in their own data and ask, ‘How do I analyze this, how do I analyze that?’ Essentially, it teaches them to wrangle their own data after they’ve built up the skills during the workshop, which is the final goal.”

Participants from the event were thrilled by all that they learned at the BigCare workshop this year. In a follow-up questionnaire, one participant—Dr. Xiulei Mo from the Department of Pharmacology and Chemical Biology at Emory University—had this to say:

"The 2023 BigCare Workshop has been a transformative experience, providing me with a solid foundation in cutting-edge computational approaches. Through lectures, hands-on training, and group projects, I gained a holistic understanding of various omics data types, such as bulk RNA-seq, single-cell RNA-seq, and ChIP-seq. This knowledge has proven invaluable in my current research projects, where I investigate how genetic mutations in pancreatic cancer influence cellular phenotypes like proliferation, invasion, and drug response."

Many of the participants from the workshop have already requested Anvil allocations so they can continue to utilize the supercomputer in their research. In the end, giving researchers the power to do what they need with their research is what Zhang hopes to achieve:

“We are trying to break the barrier between the researchers and the data,” says Zhang, “to at least give them an opportunity to see their data, to work with their data, and to extract information from their data.”


More information about the BigCare 2023 Summer Workshop can be found on the program’s “Big Data Training for Cancer Research” webpage. Information about the Anvil supercomputer can be found on Purdue’s Anvil website.

For more information regarding HPC and how it can help you, please visit our “Why HPC?” page.

Anvil is funded under NSF award No. 2005632. Researchers may request access to Anvil via the ACCESS allocations process.

Written by: Jonathan Poole,

Originally posted: