Science Data Pilots: An Infrastructure to Harness Big Science Data
All talks are in booth #1939
Science breakthroughs in the 21st century will depend on how well researchers in a variety of disciplines—from biology to the physical sciences—harness the massive datasets that have been cultivated over decades of experiments, observations and simulations. But for many researchers, taking full advantage of these scientific troves requires technologies and a robust computational infrastructure that don't currently exist for their fields.
Recognizing this need, the Department of Energy’s Office of Science—the single largest supporter of basic research in the physical sciences in the United States—is bringing together researchers, network engineers and computer and computational scientists to build the tools and infrastructure for modern scientific discovery. The following projects, which are led by Lawrence Berkeley National Laboratory researchers, reflect some of that progress so far.
Tuesday, Nov. 18 10:30-11 a.m.
This data technology project demonstrated the ability to use a central scientific data facility serving data from multiple experimental facilities. Data from experiments at the Advanced Light Source (ALS), the Advanced Photon Source, the Linac Coherent Light Source and the National Synchrotron Light Source was moved to the National Energy Research Scientific Computing Center (NERSC) via ESnet, DOE’s Energy Sciences Network. To reflect the variety of scientific data produced by user facilities operated by the Office of Basic Energy Sciences, the project used three distinct X-ray methods, four different light source facilities and four different beamlines.
Participants are Craig Tull, Eli Dart, Dilworth Parkinson, Nicholas Sauter and David Skinner, LBNL; Amber Boehnlein, SLAC; Francesco De Carlo and Ian Foster, ANL; and Dantong Yu, BNL.
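The central-facility idea above can be sketched in a few lines: datasets arriving from several light sources are registered in one shared catalog, which can then be queried across facilities by X-ray method. This is a hypothetical toy model for illustration only; the class, field names and dataset-ID scheme are invented and do not reflect the project's actual software.

```python
from dataclasses import dataclass, field

@dataclass
class CentralDataFacility:
    """Toy model of a central data facility (such as NERSC) ingesting
    datasets transferred from multiple light sources over a network
    like ESnet. All names and fields here are illustrative."""
    catalog: dict = field(default_factory=dict)

    def ingest(self, facility: str, beamline: str, method: str, payload: bytes) -> str:
        # Assign a simple catalog ID and record minimal metadata.
        dataset_id = f"{facility}/{beamline}/{len(self.catalog):04d}"
        self.catalog[dataset_id] = {"method": method, "size": len(payload)}
        return dataset_id

    def datasets_by_method(self, method: str) -> list:
        # Cross-facility query: all datasets produced with one X-ray method.
        return [k for k, v in self.catalog.items() if v["method"] == method]

# Usage: four facilities feeding one shared catalog.
nersc = CentralDataFacility()
nersc.ingest("ALS", "8.3.2", "tomography", b"...")
nersc.ingest("APS", "2-BM", "tomography", b"...")
nersc.ingest("LCLS", "CXI", "crystallography", b"...")
nersc.ingest("NSLS", "X9", "scattering", b"...")
```

The payoff of centralizing is in the last method: a single query spans data from every participating facility, something no individual beamline's storage could answer alone.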
Tuesday, Nov. 18 11-11:30 a.m.
This project conducted a multi-facility data technology demonstration illustrating a concept known as a "super facility," which supports the seamless integration of multiple, complementary DOE Office of Science user facilities into a virtual facility that presents a fundamentally greater capability for users. The facilities involved are the ALS, NERSC, the Oak Ridge Leadership Computing Facility and ESnet. Enabled by the network connectivity provided by ESnet between ALS, NERSC and OLCF and using specialized software, the project demonstrated the capability for researchers in organic photovoltaics to not only expose their samples at the ALS and see real-time feedback on all their samples through the SPOT application running on NERSC, but also to see near real-time analysis of their samples running at the largest scale on the Titan supercomputer at OLCF. This allowed researchers, for the first time, to understand their samples sufficiently during beamtime experiments to adjust the experiment to maximize their scientific results.
Participants are Craig Tull, Shane Canon, Eli Dart, Alex Hexemer and James Sethian, LBNL; Ian Foster, ANL; and Galen Shipman, ORNL.
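The feedback loop described above can be sketched as: a fast check at the computing center tells the researcher during beamtime whether a sample needs adjusting, and only then is it queued for deep analysis at leadership scale. The function names, metrics and threshold below are invented stand-ins for illustration, not the SPOT or Titan pipelines themselves.

```python
def quick_feedback(image_counts):
    """Fast real-time check (a stand-in for SPOT-style feedback at
    NERSC): estimate signal level from raw detector counts."""
    return sum(image_counts) / len(image_counts)

def deep_analysis(image_counts):
    """Stand-in for large-scale analysis at OLCF: here, just the
    median count as a 'refined' result."""
    return sorted(image_counts)[len(image_counts) // 2]

def beamtime_loop(samples, min_signal=10):
    """Adjust the experiment mid-run: samples whose quick feedback
    falls below threshold are re-exposed (modeled here as doubling
    the counts) before deep analysis."""
    results = {}
    for name, counts in samples.items():
        if quick_feedback(counts) < min_signal:
            counts = [c * 2 for c in counts]  # e.g. lengthen the exposure
        results[name] = deep_analysis(counts)
    return results

# Usage: one weak sample gets corrected during beamtime.
results = beamtime_loop({"opv-1": [4, 4, 4], "opv-2": [20, 30, 40]})
# → {'opv-1': 8, 'opv-2': 30}
```

The point of the sketch is the ordering: the cheap check runs inside the experiment's time window, so the expensive analysis is only ever spent on data worth keeping.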
Wednesday, Nov. 19, 2:30-3:15 p.m.
In recent years astrophysics and cosmology have undergone a renaissance, transforming from data-starved to data-driven sciences. A new generation of ongoing and near-future survey experiments will gather massive data sets that will provide more than an order of magnitude improvement in our understanding of cosmology and the evolution of the universe. Their analysis requires leading-edge high performance computing resources and novel techniques to handle the many petabytes of data generated throughout these surveys. Furthermore, interpreting these observations is impossible without a modeling and simulation effort that will generate orders of magnitude more simulation data, which will be used to directly understand and constrain systematic uncertainties in these experiments. This project developed an example of this pipeline and what a future set of data facilities in the DOE complex could deliver in terms of significantly enhanced scientific reach and turnaround time.
Participants are Peter Nugent and Shane Canon, LBNL; Salman Habib, ANL; Michael Ernst and Anže Slosar, BNL; and Bronson Messer, ORNL.
Wednesday, Nov. 19, 3-4 p.m.
HPC facilities present unique opportunities and challenges for high energy physics event processing. The massive scale of many HPC systems means that fractionally small utilizations can yield large returns in throughput. Parallel applications that can dynamically and efficiently fill any scheduling opportunities the resource presents benefit both the facility (maximal utilization) and the compute-limited science. We will demonstrate an enabling framework for such applications: Yoda, a novel fine-grained data processing system for HEP-like event processing tailored to HPCs. Yoda is a specialization of an Event Service workflow engine designed for the efficient exploitation of distributed and architecturally diverse computing resources. It was developed in the ATLAS experiment, where the compute-limited physics program stands to benefit greatly from opportunistic computing resources, which can enable computationally intensive physics such as rare searches that would otherwise be impossible within the available resources. The Event Service is also designed for highly efficient data handling in data-intensive processing, utilizing dynamic data movement across powerful networks to minimize expensive disk storage demands. The data-intensive, network-centric, platform-agnostic computing embodied by the Event Service and Yoda represents an increasingly important paradigm within the scientific computing community. We expect the system to integrate well with emerging data-intensive platforms.
Participants are Torre Wenaus, BNL; and Vakho Tsulaia, LBNL.
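The fine-grained scheduling idea behind event-level processing can be illustrated with a minimal work-queue sketch: workers pull individual events from a shared queue, so any free slot immediately drains more work instead of idling. This is only an assumption-laden illustration of the scheduling pattern; the real Yoda/Event Service distributes events across HPC nodes (e.g. via MPI and PanDA services), and none of the names below come from its actual code.

```python
from queue import Empty, Queue
from threading import Thread

def event_service(event_ids, n_workers=4):
    """Sketch of fine-grained, event-level work distribution: each
    worker repeatedly claims one event at a time from a shared queue,
    so faster workers naturally process more events and no scheduling
    slot goes unused while work remains."""
    todo = Queue()
    for e in event_ids:
        todo.put(e)
    done = []  # list.append is atomic in CPython, safe for this sketch

    def worker():
        while True:
            try:
                event = todo.get_nowait()
            except Empty:
                return  # queue drained: this slot frees up immediately
            done.append(event * event)  # stand-in for per-event physics processing

    threads = [Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Pulling work at event granularity, rather than assigning whole files or jobs up front, is what lets such a system back-fill arbitrary scheduling gaps on a large machine.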
Wednesday, Nov. 19, 4-4:30 p.m.
Major facilities and science teams across the DOE laboratory system are increasingly dependent on the ability to efficiently capture, integrate and steward large volumes of diverse data. These data-intensive workloads are often composed as complex scientific workflows that require computational and data services across multiple facilities. ASCR’s current computational environment will need to be expanded to include new services to enable this. This project demonstrated a few core services that illustrated how a Virtual Data Facility could build upon ASCR’s computational infrastructure to better meet the needs of the DOE experimental and observational facilities and research teams.
Participants are Shane Canon and Brian Tierney, LBNL; Dan Olson, ANL; Michael Ernst, BNL; Kerstin Kleese van Dam, PNNL; and Galen Shipman, ORNL.
About Computing Sciences at Berkeley Lab
The Computing Sciences Area at Lawrence Berkeley National Laboratory provides the computing and networking resources and expertise critical to advancing Department of Energy Office of Science research missions: developing new energy sources, improving energy efficiency, developing new materials, and increasing our understanding of ourselves, our world, and our universe.
Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 13 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.