NERSC and HDF Group Optimize HDF5 Library to Improve I/O Performance

June 28, 2010

Contact: John Hules, JAHules@lbl.gov , 510-486-6008

This plot illustrates performance improvements that were derived from the tuning work performed at NERSC in collaboration with the HDF Group and others. GCRM is the Global Cloud Resolving Climate Model from Colorado State University, and Chombo is the adaptive mesh refinement framework from Berkeley Lab—two very demanding, I/O intensive codes. The baseline (gray) is the original performance, and colored bars on top of the baseline show the performance benefits derived from the optimization methods. More recent tests have achieved ~10,000k MB/s write bandwidth for certain configurations of both codes.

A common complaint among air travelers on short trips is that the time it takes to get in and out of the airplane and airports can be as long as the flight itself. In computer terms, that's a classic input/output (I/O) problem. Supercomputer users sometimes face a similar problem: the computer tears through the calculations with amazing speed, but the time it takes to write the resulting data to disk ends up slowing down the whole job.

There are several layers of software that deal with I/O on high performance computing (HPC) systems. The filesystem software, such as Lustre or GPFS, is closest to the hardware and deals with the physical access and storage issues. The I/O library, such as HDF5 or netCDF, is closest to the scientific application, and maps the application's data abstractions onto storage abstractions. I/O middleware, such as MPI-IO, coordinates activities between the other layers.

Getting these three layers of software to work together efficiently can have a big impact on a scientific code's performance. That's why the U.S. Department of Energy's (DOE's) National Energy Research Scientific Computing Center (NERSC) has partnered with the nonprofit Hierarchical Data Format (HDF) Group to optimize the performance of the HDF5 library on modern HPC platforms.

The HDF5 library is the third most commonly used software library package at NERSC and the DOE Scientific Discovery through Advanced Computing (SciDAC) program, according to recent surveys, and is the most commonly used I/O library across DOE computing platforms. HDF5 is also a critical part of the NetCDF4 I/O library, whose users include the CCSM4 climate modeling code, which provides major input to the Intergovernmental Panel on Climate Change's assessment reports.

Because parallel performance of HDF5 had been trailing on newer HPC platforms, especially those using the Lustre filesystem, NERSC has worked with the HDF Group to identify and fix performance bottlenecks that affect key codes in the DOE workload, and to incorporate those optimizations into the mainstream HDF5 code release so that the broader scientific and academic community can benefit from the work.

The HDF5 tuning effort began in January 2009, when NERSC sponsored a workshop to assess HDF5 performance issues. The meeting brought together NERSC staff, DOE Office of Science application scientists, Cray developers, and MPI-IO developers to sketch out a strategy for performance tuning of HDF5. NERSC then initiated a collaborative effort to implement that strategy. The result has been up to 33 times improvement in parallel I/O performance—running close to the achievable peak performance of the underlying file system—with scaling up to 40,960 processors. The illustration shows the performance improvements on two scientific codes, with the bottom gray bars representing the original, pre-tuned performance.

Tuning I/O for high levels of parallelism

Why the big performance difference between untuned and tuned I/O? Mark Howison, a computer scientist in the NERSC Analytics Team who has played a major role in this project, explains that parallel processing requires writing data in parallel, and that can get complicated.

"If you imagine a parallel filesystem, it's really large, and so you don't want to have just one server trying to keep track of where everything is," Howison says. "You want to spread it out over a bunch of servers so you don't create bottlenecks when people try to access the filesystem. On Franklin [NERSC's Cray XT-4 system] we have 48 file servers, so every 48th chunk of a file belongs to a different server. This process of breaking a file into pieces and distributing them over a bunch of servers is called striping. As the file gets bigger, you're reusing the same 48 servers over and over again in a cyclical pattern."

"I/O is especially important as scientists start needing to write out larger data sets because they can run bigger, better, faster simulations,"

—Mark Howison, a Computer Scientist in NERSC's Analytics Team

While the stripe count is determined by the number of servers, the stripe size can be changed. "You can make it 64 KB or 1 MB or 100 MB or 1 GB, depending on how big a file you're trying to write out," Howison explains. "It's really important to pick the right stripe size relative to how big your file is." At the filesystem level—Lustre in this case—selecting the correct stripe size and aligning I/O operations to stripe boundaries improves parallel performance.

Striping is easier when I/O is regular, that is, when each processor is writing the same amount of data in a nice, even pattern. But many scientific codes are irregular, with processors writing different amounts of data. And three-dimensional data adds further complications. "Any type of 3D I/O is challenging because eventually it has to be turned into a 1D list of bytes," Howison says. "That process of flattening a 3D data set so it can be written as a file can be complicated."

In these cases, efficient I/O requires aggregating data into regular-sized pieces before writing it out. "That's what collective buffering does," Howison says. "It takes all the irregular data and tries to break it into regular size pieces, then writes them. So it's basically a way of turning an irregular pattern into a regular pattern. The key thing we did was, when collective buffering creates these regular pieces, make sure they're the same size as a stripe." This optimization happens at the MPI-IO middleware level. A similar optimization called chunking, which divides data into stripe-sized pieces, can be done closer to the data, in HDF5 itself.

Another optimization in HDF5 is removing serialization points. Howison explains: "Serialization points are points where HDF5 has to stop and do some operation that is really critical, and so it has to stop everybody from writing or reading and just do this one operation. It's also called a synchronization point or barrier. We want as few of those as possible. Imagine if you're running with 32,000 cores on Franklin, and all of a sudden all those have to stop so that this one operation can happen. That's really expensive to stop everybody."

One common example of such an operation is file truncation, where HDF5 goes back and trims the size of a file that is larger than required by the data. In some scenarios, where disk or tape space is precious, this is a useful operation. "But Franklin has terabytes of disk space," Howison points out, "so if a file is 10 percent larger than it needs to be, that's not really a big deal. It's more important that you don't waste a big simulation's time. Wasting time is worse than wasting space. So we went into HDF5 and made it possible to remove some of these operations."

Aggregating small HDF5 operations such as writing metadata also speeds up I/O. Metadata is often defined as "data about data," such as a library card catalog. The catalog card is much smaller than the book it describes, and in the same way, the metadata for a data file is usually tiny. But if each of 10,000 processors wants to write a small piece of metadata, that will slow down the I/O significantly. So aggregating the metadata—having a subset of processors collect the metadata and write it to disk when it reaches a certain size—is much more efficient.

"I/O is especially important as scientists start needing to write out larger data sets because they can run bigger, better, faster simulations—they need to have those I/O resources," Howison says. He is currently writing an HDF5 I/O tutorial describing Lustre optimizations and how and when to use them.

Collaborators in the HDF5 tuning project have included John Shalf, Prabhat, Wes Bethel, Andrew Uselton, Katie Antypas, Shane Canon, David Skinner, and Nick Wright from NERSC; Noel Keen, and Hongzhang Shan from Berkeley Lab's Computational Research Division; Quincey Koziol and John Mainzer from the HDF Group; Rob Latham and Rob Ross from Argonne National Laboratory; David Knaak from Cray Inc.; and the Lustre Center for Excellence at Oak Ridge National Laboratory.

Experimental results

In a paper being prepared for publication,"Tuning HDF5 for Lustre File Systems," Howison, Koziol, Knaak, Mainzer, and Shalf describe how they tested their optimization strategy. They selected three HPC applications— GCRM, VORPAL, and Chombo—to represent three common I/O patterns found in the DOE Office of Science computing workload.

"Scientists should not be burdened with investigating the details of each parallel file system they run their codes on. Instead, I/O middleware should insulate them from those details and adapt itself to the underlying file system in order to provide high-performing default behavior while also providing high-level abstractions for tuning performance to reflect application behavior,"

—Prabhat, a Computer Scientist in NERSC's Analytics Team

The Global Cloud Resolving Model (GCRM) is a climate simulation developed at Colorado State University that runs at resolutions fine enough to accurately simulate cloud formation and dynamics. VORPAL is a particle-in-cell plasma simulation code developed by Tech-X Corporation that can predict the dynamics of electromagnetic systems, plasmas, and rarefied as well as dense gases. Chombo is an adaptive mesh refinement package from Berkeley Lab used to implement finite difference methods for solving partial differential equations on block-structured grids for a broad variety of scientific disciplines, including combustion and astrophysics.

For each of these applications, the researchers implemented a standalone benchmark to model the I/O pattern of the application. They ran the benchmarks in a variety of configurations on three supercomputers that use the Lustre file system: JaguarPF, a Cray XT5 system at Oak Ridge National Laboratory; Franklin, a Cray XT4 system at NERSC; and Hopper Phase 1, a Cray XT5 system at NERSC.

Test results showed that their optimizations outperformed the baseline for all three benchmarks, on all three test systems, and across a range of parallelism, from 640 to 40,960 processor cores. Write bandwidth, a key criterion of I/O performance, increased by a range of 1.4x to 33x over the original approaches used by the applications. These optimizations are now part of the standard HDF5 library and are available to the worldwide scientific community.

Future research

Further optimizations of HDF5 are planned in a three-year project that is just beginning. Called "Bringing Exascale I/O Within Science's Reach: Middleware for Enabling and Simplifying Scientific Access to Extreme Scale Parallel I/O Infrastructure," the project is a collaboration between Berkeley Lab, the HDF Group, and Pacific Northwest National Laboratory (PNNL), with Prabhat from Berkeley Lab and NERSC as principal investigator. They are partnering with several science applications from the areas of climate, groundwater, and accelerator modeling.

One of three thrust areas in the Exascale I/O project is extending the scalability of I/O middleware, especially HDF5, to make effective use of current and future computational platforms. Changes to the HDF5 library will fall into three categories:

optimizations that improve performance in all circumstances
low-level options for knowledgeable developers desiring maximum performance
performance tuning choices that developers can use to indicate high-level application behavior

One of the project's objectives is to add an autotuning capability to HDF5, which would automate the process of probing a storage system for operating parameters with a set of storage analysis tools that can be run on any HPC system. These I/O analysis tools will analyze the storage system, gathering relevant operating parameters that can be used by the HDF5 library to tune an application's I/O requests to better match the underlying storage system.

"Scientists should not be burdened with investigating the details of each parallel file system they run their codes on," says Prabhat. "Instead, I/O middleware should insulate them from those details and adapt itself to the underlying file system in order to provide high-performing default behavior while also providing high-level abstractions for tuning performance to reflect application behavior."

"That said," he adds, "fine-grained tuning options must also be available to application developers who want to squeeze the last bit of performance from the underlying system, at the possible expense of a more detailed knowledge of HDF5 and the layers below it."

Collaborators on the Exascale I/O project include Prabhat, Wes Bethel, Mark Howison, and Kesheng (John) Wu from Berkeley Lab; Quincey Koziol from the HDF Group; and Karen Schuchardt and Bruce Palmer from PNNL.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.