What Are the Computational Keys to Future Scientific Discoveries?

NERSC Develops a Data Intensive Pilot Program to Help Scientists Find Out

August 23, 2012

Linda Vu, lvu@lbl.gov, +1 510 495 2402

Advanced Light Source at the Lawrence Berkeley National Laboratory. (Photo by: Roy Kaltschmidt, Berkeley Lab)

A new camera at the hard x-ray tomography beamline of Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) Advanced Light Source (ALS) allows scientists to study a variety of structures as a function of time—from bones to rocks, plants, and even metallic alloys—in unprecedented detail.

According to ALS Scientist Dula Parkinson, the new camera generates data 50 times faster than the one it replaced, and researchers hope that all of this information may someday lead to more effective methods of storing carbon underground, treating diseases, or creating stronger materials for supersonic jets. But before these breakthroughs can happen, ALS scientists must figure out how to manage, store and share the torrent of data being generated.

And they are not alone. From astronomy to genomics, increasingly sophisticated instruments are producing data at staggering rates, and many scientists are struggling to find the right computational and storage strategies to deal with the deluge. To help researchers in this effort, the Department of Energy’s National Energy Research Scientific Computing Center (NERSC) developed a Data Intensive Computing Pilot.

“Many of the big data challenges that have long existed in the particle and high energy physics world are now percolating other areas of science. At NERSC we’ve seen an increase in user requests for more computing resources and bigger storage allocations to deal with bigger datasets,” says David Skinner, who leads NERSC’s Outreach, Software and Programming Group. “So the goal of this pilot is to see how new technologies and software environments will help these scientists better manage, analyze, store and share their growing datasets.”

The 18-month program offers select research collaborations priority access to Hadoop-style computing resources on NERSC’s Carver InfiniBand cluster with access to a 1 terabyte memory node; up to 1 petabyte of disk and tape storage; priority access to a 6 terabyte flash-based file system with 15 gigabits per second transfer speeds; and the ability to remotely access data stored at NERSC via Science Gateways. In anticipation of this program, the facility’s Storage Systems Group has worked closely with IBM to support growing the facility’s General Parallel File System (GPFS) storage capacity from its current level of about 8 petabytes to about 100 petabytes by 2019. NERSC and IBM recently agreed to put that support in place through 2019, which is a key enabler of the Data Intensive Computing Pilot.

“Two years ago the hard x-ray tomography beamline at Berkeley Lab’s ALS generated about 100 gigabytes of data per week, but we got a faster camera and now we are generating anywhere from 2 to 5 terabytes of data per week,” says Parkinson. “This is pushing the limit of what our current infrastructure can handle.”
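To put those figures in perspective, the growth in weekly data volume can be worked out directly from the numbers Parkinson quotes. The short Python sketch below is illustrative only: the function name is hypothetical, and it assumes decimal units (1 terabyte = 1,000 gigabytes), which the article does not specify.

```python
# Illustrative arithmetic only; the helper name and the decimal-unit
# convention (1 TB = 1,000 GB) are assumptions, not from the article.
GB_PER_TB = 1000


def growth_factor(old_gb_per_week: float, new_tb_per_week: float) -> float:
    """Return how many times larger the new weekly volume is than the old."""
    return new_tb_per_week * GB_PER_TB / old_gb_per_week


# Figures quoted in the article: about 100 GB per week two years ago,
# and 2 to 5 TB per week with the new camera.
low = growth_factor(100, 2)
high = growth_factor(100, 5)
print(f"Weekly data volume grew roughly {low:.0f}x to {high:.0f}x")
```

Under these assumptions, the beamline's weekly output has grown by a factor of roughly 20 to 50, and the upper end matches the 50-fold increase in the camera's data rate.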

According to Parkinson, in the current system, a typical ALS user creates a folder on a data storage server connected to the instrument and saves raw data to this folder. In many cases, users do some initial processing on desktop computers at the ALS and save these files on the facility’s storage server as well. Upon leaving the facility, researchers copy their data onto an external hard drive and carry it home for further analysis. The files and raw data initially saved on the ALS storage server are typically left behind for the facility’s staff to manage.

“Before, the datasets were small enough that collecting data to our servers then carrying it home on a USB drive wasn’t really a problem. But now that each data set can be collected in less than a minute and is several gigabytes, it is becoming increasingly clear that we need to automate our process for archiving data,” says Parkinson. “ALS users and staff are primarily scientists. Most of us don’t know how to set up and maintain the large computational resources for transferring and archiving data, and so we really appreciate the ability to leverage NERSC’s expertise.”

In addition to storage, he notes that ALS users need access to more computing power for analysis as their data grows. “Most people currently do analysis on desktop computers, but what they are able to do is limited by their computer’s memory and by the large amount of time it takes to process large data sets. With the increased data rate, many of our users are having a really hard time analyzing their data,” says Parkinson. “So having access to a supercomputer would be extremely useful.”

Other scientists who will benefit from NERSC’s exploration of big data solutions include Josh Lerman and Edward O’Brien, graduate students at the University of California, San Diego (UCSD). Lerman, O’Brien, and their advisor Bernhard Palsson, UCSD Professor of Bioengineering, are continuing development of a method for simultaneously modeling an organism’s metabolism (to identify the biochemical reactions taking place) and underlying gene expression (to understand the “machinery” producing those reactions). Lerman notes that this technique could be a boon to areas like metabolic engineering by giving engineers a near-complete accounting of the material and energy costs associated with new strain designs before a huge investment is made.

Although Lerman has been using NERSC resources since 2010 to develop his method, he acknowledges that his datasets are about to get a lot bigger. Soon he hopes to integrate sequencing data into his models. This work will allow researchers to validate UCSD’s method and models, and map the sequencing data to an underlying biochemical network of chemical reactions.

“The amount of throughput that we get at NERSC systems for handling sequencing data and running models is just incredible. In our laboratory we max out at about 80 cores and are limited by memory,” says Lerman.

Beyond the computing and storage resources, Lerman notes that another appealing aspect of the Data Intensive Computing Pilot is access to science gateways. “It would be nice to have researchers anywhere in the world use our model to run experiments and then share it with the rest of the community through an online portal,” he says.

In addition to Lerman, O’Brien, and Palsson’s research, 18 other projects were selected to participate in the program. Although the ALS is not participating in the pilot, NERSC staff are working closely with ALS scientists to develop an infrastructure for processing, analyzing, storing and sharing experimental datasets.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.