Extreme Data Science
The scientific computing community was grappling with extreme data long before "big data" became a popular concept. Computing Sciences researchers have been developing the tools and technologies, as well as providing the critical networking and computer resources, to help scientists access, share, analyze, visualize and understand their data.
For the Department of Energy, extreme data is associated with the concept of big science, which relies on powerful instruments to tackle some of the most challenging scientific problems of our time. At the National Energy Research Scientific Computing Center (NERSC), extreme data is defined as the massive datasets generated by experimental facilities or by large-scale modeling and simulation on supercomputers. The DOE Office of Science supports a number of unique experimental facilities, each serving a broad user community and producing valuable scientific data. Such facilities include the Joint Genome Institute, the Advanced Light Source, the Relativistic Heavy Ion Collider, the Large Hadron Collider in Switzerland, and telescopes and space probes. As the Office of Science's primary scientific computing facility, supporting 4,500 users at national labs and universities, NERSC both generates massive datasets and serves as a repository for data that is shared and analyzed by users around the world. Such experimental and computational facilities are beyond the scope of what individual institutions can provide.
The Energy Sciences Network, or ESnet, is the Department of Energy's high-performance network, managed by Berkeley Lab. In late 2012, ESnet rolled out a 100 gigabits-per-second national network to accommodate the growing scale of scientific data. Traffic on ESnet is growing twice as fast as on the commercial Internet, and ESnet staff have developed specialized applications and architectures to speed the transfer of massive datasets.
In the Computational Research Division, the Scientific Data Management and Visualization groups are leaders in developing technologies that help scientists search, analyze and visualize their data. For example, the FastBit indexing application developed at the lab allows users to search huge databases 100 times faster than commercial tools. Berkeley Lab also leads DOE's Scalable Data Management, Analysis and Visualization (SDAV) Institute, which develops new tools that help scientists manage and visualize data and speed discovery.
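FastBit's speed comes from bitmap indexing: each distinct value in a column gets a bitmap marking the rows where it occurs, so queries become fast bitwise operations instead of row-by-row scans. The sketch below illustrates that core idea only; the function names and toy data are hypothetical, and real FastBit is a C++ library that additionally compresses its bitmaps.

```python
# A minimal, uncompressed bitmap-index sketch (illustrative only,
# not the FastBit API). Each distinct column value maps to an integer
# whose bits mark the rows containing that value.

def build_bitmap_index(values):
    """Map each distinct value to a bitmap (as a Python int) over row numbers."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def query(index, wanted):
    """Return sorted row numbers matching any value in `wanted` (bitwise OR)."""
    bits = 0
    for v in wanted:
        bits |= index.get(v, 0)
    return [row for row in range(bits.bit_length()) if bits >> row & 1]

# Example: index a small hypothetical "energy level" column and query it.
levels = ["low", "high", "mid", "high", "low", "high"]
idx = build_bitmap_index(levels)
print(query(idx, ["high"]))        # → [1, 3, 5]
print(query(idx, ["low", "mid"]))  # → [0, 2, 4]
```

In a production index like FastBit's, the bitmaps are run-length compressed so that queries over billions of rows stay both fast and compact.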
In short, the nation's research community faces a deluge of scientific data, and a new generation of supercomputers capable of exaflops performance will be needed to analyze it. In this regard, exascale computing is a subset of the technical challenges posed by extreme data.