Berkeley Lab Explores Frontiers of Deep Learning for Science
Computational Researchers Testing Advanced Machine Learning Tools for HPC at NERSC
December 8, 2015
Contact: Kathy Kincade, firstname.lastname@example.org, 510-495-2124
Deep learning is not a new concept in academic circles or behind the scenes at “Big Data” companies like Google and Facebook, where algorithms for automated pattern recognition are a fundamental part of the infrastructure. But when it comes to applying these same tools to the extra-large scientific datasets that pass through the supercomputers at NERSC on a daily basis, it’s a different story.
Now a collaborative effort at Berkeley Lab is working to change this scenario by applying deep learning software tools developed for high performance computing environments to a number of “grand challenge” science problems running computations at NERSC and other supercomputing facilities.
“We are trying to assess if deep learning can be successfully applied to scientific datasets in climate studies, neutrino experiments, neuroscience and more,” said Prabhat, Data and Analytics Services Group Lead at Berkeley Lab. “Scientists routinely produce massive datasets with supercomputers and experimental and observational devices. The key challenge is how can we automatically mine this data for patterns, and deep learning does exactly that.”
Deep learning, a subset of machine learning, is the latest iteration of neural network approaches to machine learning problems. Machine learning algorithms enable a computer to analyze pointed examples in a given dataset and find patterns in those datasets, plus make predictions about what other patterns it might find.
Deep learning algorithms are designed to learn hierarchical, nonlinear combinations of input data. They get around the typical machine learning requirements of designing custom features and produce state-of-the-art performance for classification, regression and sequence prediction tasks. While the core concepts were developed over three decades ago, the availability of large data, coupled with datacenter hardware resources and recent algorithmic innovations, have enabled companies like Google, Baidu and Facebook to make advances in image search and speech recognition problems.
Surprisingly, deep learning has so far not made similar inroads in scientific data analysis, largely because the algorithms have not been designed to work on high-end supercomputing systems such as those found at NERSC.
“One of the challenges is to translate what we know in deep learning from speech, images and text to domains such as atmospheric simulations and astronomical sky surveys,” said Amir Khosrowshahi, chief technology officer and co-founder of Nervana, a startup providing deep learning as a cloud service. The company has been beta testing Neon, its open source deep learning library, at NERSC for the past year. “Domains where data can be mapped naturally to image or video analysis problems, as with atmospheric simulations, are often the easiest to approach with the latest deep learning algorithms.”
Three Case Studies
Take climate data analysis, for example. Modern climate simulations produce massive amounts of data, requiring sophisticated pattern recognition on terabyte- to petabyte-sized datasets to identify, say, extreme weather events associated with climate change. NERSC's Data and Analytics Services Group has been working with research groups at Berkeley Lab to test how Neon’s deep learning libraries might help streamline this process. And so far the results have been very positive.
“We are finding that, in practice, we get state-of-the-art results using deep learning compared to other methods,” Prabhat said. “In climate simulation datasets, for example, we observe 95% accuracy in finding tropical cyclone patterns. We have had similar success with finding atmospheric rivers and weather fronts in satellite and reanalysis datasets.” Preliminary results from these investigations will be presented at the 2015 American Geophysical Union meeting, Dec. 14-18, in San Francisco
The Daya Bay neutrino experiment is also testing how deep learning algorithms might enhance its data analysis efforts. Daya Bay is an international neutrino-oscillation experiment designed to determine the last unknown neutrino mixing angle θ13using anti-neutrinos produced by the Daya Bay and Ling Ao Nuclear Power Plant reactors. Data collection began in 2011 and continues today.
“Like all particle physics, Daya Bay acquires a huge amount of data, and while it uses quite sophisticated analysis, it still requires a lot of hand tuning and physics knowledge,” said Wahid Bhimji, data architect in NERSC’s Data and Analytics Services Group who oversaw the implementation of a deep learning pipeline for Daya Bay at NERSC this past summer. “The idea here is to try and use deep learning to automatically reduce the features in the data. We have found that we can pick out the interesting physics without human involvement." This is the first time unsupervised deep learning has been done in particle physics, he noted.
Kris Bouchard, a computational systems neuroscientist in Berkeley Lab’s Biological Systems and Engineering Division, has been working with NERSC, UC Berkeley and the University of California, San Francisco to apply a deep learning library called Theano to analyze recordings of the human brain during speech production.
“The fundamental problem we are trying to address is to decode or translate brain signals into the action being performed while the signal is recorded,” he said. “The traditional tools that have been applied for this purpose are standard machine learning approaches, which are likely ill-suited for the ‘deep’ structure of brain data.”
Deep learning gives them a more powerful and flexible method for recognizing patterns in brain signals, he added. “Using deep learning, our team has decoded produced speech syllables with 39% accuracy, which is 23 times greater than chance, 200% better than traditional algorithms on the same data and is the new state-of-the-art in the field,” Bouchard said. Results from this collaboration are being presented this week at the 2015 Neural Information Processing Systems Conference in Montreal.
“Emerging deep learning and machine learning workloads have the potential to transform how scientists and enterprises gain the insights needed to solve some of their greatest challenges,” said Charles Wuischpard, vice president Data Center Group and general manager of the HPC Platform Group at Intel. “These workloads require high performance computing and big data manipulation techniques to accelerate research and remove the model training bottleneck. We’re excited to be working with researchers at Berkeley Lab and NERSC to enable a whole new level of deep neural network training on Intel® Xeon® processor and Intel Xeon Phi TM processor-based systems to enable truly iterative research. We look forward to the countless discoveries to come."
“NERSC is taking a leadership role in exploring deep learning for science,” Prabhat said. “The MANTISSA project has produced state-of-the-art results in conventional object recognition tasks, as well as climate science, high-energy physics and neuroscience. Through collaborations with Nervana Systems, we have deployed software tools for the broader scientific community. And we are actively pursuing collaborations with Intel and UC Berkeley to deploy massively parallel, high performance implementations of deep learning libraries on the Cori platform. Deep learning has revolutionized commercial applications; we are very excited to bring such capabilities to simulation, experimental and observational datasets.”
About Computing Sciences at Berkeley Lab
The Computing Sciences Area at Lawrence Berkeley National Laboratory(Berkeley Lab) provides the computing and networking resources and expertise critical to advancing Department of Energy Office of Science (DOE-SC) research missions: developing new energy sources, improving energy efficiency, developing new materials, and increasing our understanding of ourselves, our world, and our universe. ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 7,000-plus scientists at national laboratories and universities. NERSC and ESnet are both Department of Energy Office of Science National User Facilities. The Computational Research Division (CRD) conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation.
Berkeley Lab addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science. The DOE Office of Science is the United States' single largest supporter of basic research in the physical sciences and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.