NERSC and ESnet Take Scientific Data Collaboration to the Next Level
New Data Sharing Tool Boosts Researchers’ Ability to Work With Ultra Large Data Sets
April 12, 2021
When the Dark Energy Science Collaboration (DESC), one of the projects associated with the upcoming Legacy Survey of Space and Time (LSST), needed a fast, secure, and convenient way to share large data sets with scientists around the world, they turned to the National Energy Research Scientific Computing Center (NERSC) for help. Having used NERSC resources to generate simulations and store data, DESC was ready to share that data outside the collaboration. Working with NERSC staff, they found a solution that is also a leap forward for anyone sharing scientific data: the Modern Research Data Portal (MRDP), a high-performance networking resource developed by engineers at the U.S. Department of Energy’s Energy Sciences Network (ESnet) in collaboration with computer scientists at Argonne National Laboratory and the University of Chicago.
The ten-year LSST, conducted by the Vera C. Rubin Observatory under construction in northern Chile, will soon be producing vast amounts of imaging data. By conducting the largest-ever astronomical survey using the world’s largest digital camera, the science collaborations working with Rubin Observatory hope to answer foundational questions about the universe.
Observatory systems will need to be ready to receive and analyze massive amounts of data when the facility is complete; in preparation, DESC is currently simulating portions of those data sets, producing realistic faux data to mimic the workings of the finished observatory.
“LSST will observe about half the sky for ten years. We picked a 300-square-degree patch within that survey area and simulated a five-year survey,” said DESC Computing Coordinator Katrin Heitmann. “After we simulated the data, we used the Rubin Science Pipelines to process it. It was quite an elaborate process, but we wanted to test the processing pipelines in detail and generate an LSST-like data set. Now the science collaboration has started running their analysis pipelines on the simulated data and trying to get cosmology out.”
With the release of these data sets, scientists around the world, both inside and outside LSST DESC, can view and download them. “Since this is a rather rich data set,” said Heitmann, “we thought it would be nice to share it [with scientists] beyond the collaboration.” That’s where the MRDP comes in.
A Faster Way to Transfer Data
Many scientists still use traditional, monolithic data portals to move and share data. These portals are based on the standard HTTP protocol, but the software and hardware of the “civilian” Internet can’t always keep up with the increasing size of data sets or what it takes to move them around. Transfers may be slow, and transfer errors become more likely as data sets grow larger.
The MRDP — based on the Science DMZ, a high-performance network architecture developed at ESnet — is an increasingly common method for sharing data; it separates the administrative elements of data sharing, such as user authentication and request processing, from the task of data storage. DMZ is a term used in computing architecture to indicate a zone somewhere between internal and external networks; the Science DMZ allows data transfers requiring high performance, such as large data sets, to reach the outside world without competing with other traffic. The Science DMZ employs multiple data transfer nodes (DTNs), specialized devices dedicated to high-speed data transfer and running high-performance data transfer tools.
When a user initiates a data transfer using an MRDP portal, “the data request is outsourced to systems that are specialized to do the data transfer,” said NERSC data architect Rollin Thomas. “It’s a different model, but it’s better because you can achieve much faster rates of transfer and much higher bandwidth. It’s generally the way to transfer large scientific data sets, of which we have many more than we used to.” The separation of functions and the use of a separate high-speed network, such as ESnet, allow for faster data transfer, stronger security, more convenient access for users, and ultimately more capacity for complexity and scaling up.
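The division of labor described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how NERSC's portal or any real transfer service is implemented: the `Portal` and `TransferJob` classes, the user and data set names, and the round-robin choice of node are all hypothetical. The point it shows is that the portal handles authorization and hands back a job for a data transfer node, and never moves the bytes itself.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class TransferJob:
    """A request handed off to a data transfer node (hypothetical type)."""
    user: str
    dataset: str
    dtn: str


class Portal:
    """Control plane only: checks the user, then delegates the transfer."""

    def __init__(self, allowed_users, dtn_names):
        self._allowed = set(allowed_users)
        # Naive round-robin over nodes; a real deployment would do more.
        self._dtns = cycle(dtn_names)

    def request(self, user, dataset):
        if user not in self._allowed:
            raise PermissionError(f"{user} is not authorized")
        # The portal returns a job description; it never touches the data.
        return TransferJob(user, dataset, next(self._dtns))


# Hypothetical names, for illustration only.
portal = Portal({"alice"}, ["dtn01", "dtn02"])
job1 = portal.request("alice", "simulated-images")
job2 = portal.request("alice", "simulated-catalogs")
print(job1.dtn, job2.dtn)
```

Because the portal only issues job descriptions, transfer capacity in this model grows by adding nodes, without changing the web-facing part.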
Part of the appeal of the MRDP design is its speed; the Science DMZ can transfer one terabyte (the equivalent of 200 DVDs’ worth of data) every 20 minutes. “Using older systems, it could have taken hours, days, or weeks to move this kind of data around,” said ESnet Science Engagement Engineer Jason Zurawski. “This faster access means we can do more science.”
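For a sense of scale, the quoted figures can be turned into a sustained network rate with some quick arithmetic. The sketch below assumes a decimal terabyte (10^12 bytes) and a 4.7 GB single-layer DVD; neither assumption comes from the article, and the function names are illustrative.

```python
# Back-of-the-envelope check of the quoted transfer figures.
TERABYTE_BYTES = 10**12       # assumption: decimal terabyte
DVD_BYTES = 4.7 * 10**9       # assumption: single-layer DVD capacity
TRANSFER_SECONDS = 20 * 60    # 20 minutes, as quoted


def sustained_gbps(num_bytes, seconds):
    """Average rate in gigabits per second needed to move num_bytes."""
    return num_bytes * 8 / seconds / 1e9


def dvd_equivalents(num_bytes, dvd_bytes=DVD_BYTES):
    """Number of DVDs needed to hold num_bytes."""
    return num_bytes / dvd_bytes


rate = sustained_gbps(TERABYTE_BYTES, TRANSFER_SECONDS)
dvds = dvd_equivalents(TERABYTE_BYTES)
print(f"{rate:.2f} Gbit/s sustained, about {dvds:.0f} DVDs")
```

Under these assumptions, one terabyte every 20 minutes works out to a sustained rate of roughly 6.7 gigabits per second, and the DVD count comes out close to the article's round figure of 200.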
When DESC Met MRDP
When DESC needed a platform to share their data, software developer Heather Kelly reached out to the NERSC team for consultation. Together, they agreed: an MRDP setup, the first in NERSC history, would be the best way to share and showcase their data before and after the Rubin Observatory is completed. “No matter where the processing or the simulating was happening, ultimately all the data is stored [at NERSC]. So when we wanted to serve up the data ... it seemed like a natural choice,” said Kelly.
With some guidance from NERSC staff, Kelly and her team set up their MRDP instance within a month, using the NERSC service platform Spin and leveraging NERSC’s Globus Sharing endpoint for transferring data. “We’re fortunate in that we have storage space at NERSC and collaborators all over the world,” she said. “But when we want to share outside that group, we have to provide access to the data in a way that’s easy. People find this web interface simple to use, and it makes it easier for us to make the data publicly available.”
A Data Portal for the Future
The MRDP used by DESC is the first of its kind at NERSC, but almost certainly not the last. As experimental science yields more and more data, more scientists may be drawn to these accessible, high-performance setups, and NERSC will be ready for them.
“I think the work this team has been doing with the modern research data portal is the first example of how we’re going to see an increasing number of scientists using NERSC,” said NERSC Data Science Engagement group lead Debbie Bard. “It’s the first step of a much bigger trend: science is producing ever-larger amounts of data from simulations and experiments and sharing that data with their community.”
In turn, this new design pattern is shaping the way NERSC prepares for future users, she added. “We can create tools and technologies that can handle [the challenge of big data transfers], and then those tools can be used by all our other users as well, so it scales out the support. So we’re not just doing this work for this science team; all our other science teams can take advantage as well.”
So where do NERSC and the MRDP go from here? After working with DESC, NERSC staff are prepared to help users create their own modern research data portals, avoid obstacles along the way, and work more efficiently.
“We learned about where users are going to get stuck building an MRDP and how we can help them stand these things up in a much more self-serve kind of way,” said Thomas. “We’re going to have people not just able to share their data but make it available in the best way possible that the technology allows. That’s the Science DMZ model and the modern research data portal.”
ESnet and NERSC are DOE Office of Science user facilities based at Lawrence Berkeley National Laboratory.
About Computing Sciences at Berkeley Lab
High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.