Berkeley Lab Researchers Optimizing Spark for HPC
CRD group using Intel grant to ensure the successful adoption of this data analytics framework for HPC
November 13, 2015
Contact: Kathy Kincade, firstname.lastname@example.org, 510-495-2124
A team of scientists from Berkeley Lab’s Computational Research Division (CRD) has been awarded a two-year, $110,000 grant by Intel to support their goal of enabling data analytics software stacks, notably Spark, to scale out on next-generation high-performance computing (HPC) systems.
Functioning as an Intel Parallel Computing Center (IPCC), the new research effort will be led by Costin Iancu and Khaled Ibrahim, both computational scientists in CRD’s Computer Languages and Systems Software Group.
Spark is an open source computing framework for processing large datasets. It was developed in 2009 at the University of California, Berkeley’s AMPLab by then-Ph.D. student Matei Zaharia and went open source in 2010 before being donated to the Apache Software Foundation in 2013. Spark’s ability to cache datasets in memory makes it well suited for large data analysis, especially on systems with large memory capacity. Programmers can write programs for the Spark runtime environment in Java, Python or Scala, and these programs can be executed either in a standard batch execution mode or from an interactive shell. Spark’s speed and flexibility make it well suited to rapid, iterative processes such as machine learning.
“Spark evolved in the commercial sector and is run in data centers, where the hardware is very distributed and the compute nodes assume there is a local disk,” Iancu said. “In the data center, the I/O system is optimized for latency and the networks are optimized more for throughput (or bandwidth). But if you move Spark to HPC systems, the opposite is true: the I/O systems care more about bandwidth and the networks care more about latency.”
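The latency-versus-bandwidth tradeoff Iancu describes can be illustrated with a simple cost model, in which the time to move data is the per-operation latency times the number of operations, plus the data volume divided by the bandwidth. The sketch below is only an illustration: the function name, the two system profiles and all of the numbers are hypothetical and are not measurements of any data center or HPC system.

```python
def transfer_time(total_bytes, num_ops, latency_s, bandwidth_bytes_per_s):
    """Estimate seconds needed to move total_bytes using num_ops operations.

    Simple model: every operation pays a fixed latency, and the payload
    as a whole is limited by the available bandwidth.
    """
    return num_ops * latency_s + total_bytes / bandwidth_bytes_per_s


GIB = 1024 ** 3
TOTAL_BYTES = 1 * GIB

# Hypothetical profiles, chosen only to make the contrast visible.
latency_optimized = {"latency_s": 0.0001, "bandwidth_bytes_per_s": 0.5 * GIB}
bandwidth_optimized = {"latency_s": 0.005, "bandwidth_bytes_per_s": 5 * GIB}

# Compare one large transfer against many small ones of the same total size.
for num_ops in (1, 100_000):
    t_lat = transfer_time(TOTAL_BYTES, num_ops, **latency_optimized)
    t_bw = transfer_time(TOTAL_BYTES, num_ops, **bandwidth_optimized)
    print(f"{num_ops:>7} ops: latency-optimized {t_lat:.2f} s, "
          f"bandwidth-optimized {t_bw:.2f} s")
```

Under these assumed numbers, a single large transfer finishes sooner on the bandwidth-optimized profile, while the same volume split into many small operations finishes sooner on the latency-optimized one. This suggests why software tuned for one kind of I/O system may need rework for the other.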
Through the new IPCC project, Iancu and Ibrahim will address the differences between the traditional data center architectures on which Spark evolved and the requirements of HPC platforms, with the aim of adapting Spark to the HPC ecosystem and making it highly scalable. In the first phase of the project, they will systematically redesign the Spark stack to accommodate the different performance characteristics of Lustre-based HPC systems. In the second phase, they will re-examine memory management in Spark to accommodate the deeper vertical memory hierarchies present in HPC systems.
“The challenging part of the data analytic framework is the data movement,” Ibrahim said. “Where does this movement come from, the filesystem or the compute node or from movement between the nodes? So we are looking at optimizing the data movement vertically from memory to disk and also horizontally between compute nodes. We will also look at how to optimize the computation within the compute nodes.”
All of this is intended to support the project’s overarching goal: scalability. In the first year, the researchers will focus on improving execution efficiency at the scale of 1,000 cores. But that is only the beginning, Iancu emphasized.
“When we deploy Spark on an HPC system,” Iancu said, quick to point out that this means not just Cray architectures, “we will be able to improve its scalability to tens of thousands of cores by adapting it to the system architecture(s). Our goal is to improve Spark performance in the software stack and figure out how to make it evolve with the technology.”
The team will also test Spark on NERSC’s new Cori system as part of the center’s Burst Buffer Early Users program.
“We want to extend the use of Spark from data analytics to include more scientific computing,” Ibrahim said. “Typical applications for these frameworks are graph analytics and distributed databases, but we would like to include more scientific computing applications.”
About Computing Sciences at Berkeley Lab
The Computing Sciences Area at Lawrence Berkeley National Laboratory (Berkeley Lab) provides the computing and networking resources and expertise critical to advancing Department of Energy Office of Science (DOE-SC) research missions: developing new energy sources, improving energy efficiency, developing new materials, and increasing our understanding of ourselves, our world, and our universe. ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 7,000-plus scientists at national laboratories and universities. NERSC and ESnet are both Department of Energy Office of Science National User Facilities. The Computational Research Division (CRD) conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation.
Berkeley Lab addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab has had its scientific expertise recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science. The DOE Office of Science is the United States' single largest supporter of basic research in the physical sciences and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.