Hadoop cluster now available for Purdue researchers analyzing big data
September 16, 2014
Professor William Cleveland and colleagues analyze terabytes of cybersecurity data looking for new ways to identify and combat spammers, data thieves and other Internet bad guys.
To do it, Cleveland, Purdue’s Shanti S. Gupta Professor of Statistics, employs the popular and versatile R statistical programming language and Hadoop, software for storing and processing huge data sets on cluster supercomputers. Hadoop helps the researchers break up big problems, solve many pieces at the same time on supercomputers and merge the results into a unified answer.
“That enables a major speedup, enough to make what we do practical, let’s put it that way,” Cleveland says.
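The split/compute/merge pattern Cleveland describes is the essence of Hadoop's MapReduce model. As a rough illustration only (not Hadoop's actual API), the same idea can be sketched in a few lines of Python: a map step runs on each chunk of data in parallel, and a reduce step merges the partial results. The word-count task and sample strings here are hypothetical stand-ins for the terabytes of log data a real cluster would process.

```python
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count words in one piece of the data."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    """Reduce step: combine the per-chunk counts into one answer."""
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    # Hypothetical stand-in for a huge data set split into chunks;
    # on a Hadoop cluster each chunk would live on a different node.
    chunks = ["spam spam ham", "ham eggs spam"]
    with Pool(2) as pool:
        partials = pool.map(count_words, chunks)  # many pieces at once
    print(merge(partials))  # {'spam': 3, 'ham': 2, 'eggs': 1}
```

On a cluster, the speedup comes from the map step: each node works on only the data stored locally, so the problem scales out rather than up.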
Now, ITaP Research Computing (RCAC) is making a stand-alone cluster specially set up for Hadoop jobs available to any Purdue researcher involved in big data analysis. Faculty interested in using the new Hadoop cluster can send an email to firstname.lastname@example.org.
The cluster is named Hathi, Hindi for elephant and the name of a character, a bull elephant, in “The Jungle Book.” Hadoop’s mascot is an elephant.
Both Cleveland and Preston Smith, RCAC's manager of research support, expect the resource to be popular. Big data analysis, after all, is the name of the game today.
“This is happening across all of science and technology and business,” says Cleveland, whose lab plays a leading role in the development of Tessera, an open source environment combining R and Hadoop to enable deep analysis of large complex data sets.
Tessera, which Cleveland started in the Purdue Department of Statistics and which now includes Pacific Northwest National Laboratory and Mozilla as partners, is a driving force in his research and a reason he has been using Hadoop almost since it became available, including on a Hadoop cluster RCAC built and has operated for him.
Through the Community Cluster Program, Purdue researchers have come to rely on ITaP for high-performance computing resources, and Smith says RCAC has regularly received faculty requests for a Hadoop-specific resource.
“Hadoop is widely used and useful to probably anybody that’s got large amounts of data, especially unstructured data,” Smith says.
Cleveland also is teaching a class using Hadoop for the first time this fall. RCAC is supporting the classroom use through the Scholar cluster, which is open to instructors from any field whose classes include assignments that could make use of supercomputing.
The Hadoop cluster is part of a series of improvements brought online by ITaP this fall to aid Purdue researchers in the kind of world-changing research called for in President Mitch Daniels’ Purdue Moves initiative.
That includes an upgraded campus research network 58 times faster than before and designed to speed the movement of research data on campus and off. It also includes the new Research Data Depot, which makes available over 2 petabytes of storage to Purdue faculty and campus units in need of a high-capacity central solution for storing large, active research data sets at a competitive price.