FastBit: An Efficient Indexing Technology for Billions of Objects
January 1, 2005
Three members of the Scientific Data Management Group - John Wu, Arie Shoshani and Ekow Otoo - have been granted a patent for their "Word Aligned Bitmap Compression Method and Data Structure." This technology is currently used in a software called FastBit to compress bitmap indices. When answering a user query, FastBit is often 10 times as fast the searching method used by one of the leading commercial database management systems.
One of the first applications that employs FastBit is in the STAR high-energy physics experiment at Brookhaven National Laboratory. FastBit, which uses the patented Word- Aligned Hybrid (WAH) compression method, has also been applied to spatio-temporal data, such as the Direct Numerical Simulation of hydrogen-air mixture (Jacqueline Chen, Sandia National Laboratories) and Terascale Supernova Initiative's simulation of supernova explosions (Anthony Mezzacappa, Oak Ridge National Laboratory).
"Our tests on a set of high-energy physics data show that WAH-compressed bitmap indices are often an order of magnitude faster than the best known bitmap indexing schemes in commercial systems," according to John Wu, the lead developer of FastBit.
Indexing methods such as FastBit and B+tree are used extensively by database management systems to provide fast processing time for user queries. To answer queries on read-only data, such as those from many scientific applications, bitmap indices are known to be more efficient than B+trees. However, without compression, the bitmap indices may require too much space to be useful. While compression reduces the storage requirement, it usually makes query processing slower. To make query processing more efficient, a number of specialized compression schemes have been proposed, with the best-known one called the Byte-aligned Bitmap Code (BBC).
By using a compressed index that can be search without its full decompression, Fastbit uses significantly less space than conventional indexing methods, but is still very efficient.
As can be seen from the table below, FastBit's WAH indexing method creates an index that is slightly larger than other compressed bitmap indexing methods, but the time needed to process a query is less, often much less. "It's really a space-time tradeoff," Wu said.
FastBit is currently being used by several DOE research projects and has yielded several success stories:
- GridCollector, the software module for the STAR analysis framework, uses FastBit's searching capability to provide STAR analysts with a new way of accessing collision data called “events.” Instead of naming the data files as was previously done, analysts can now select events based on physically meaningful attributes known as tags. Through GridCollector, analysis programs only read the selected events, instead of every event in the selected data files. Since most analysis jobs use only a small fraction of the events in data files, the GridCollector can significantly improve the turnaround time.
- Tracking features in the analysis of combustion simulation data is more efficient. By using FastBit and compressed bitmaps, the FastBit team was able to significantly speed up the problem of tracking ignition kernels in a combustion simulation of a hydrogen-air mixture. This approach addressed the difficult problem of identifying data indicating the ignition kernel from the rest of the simulation data and tracking the progression of flames over time.
- DEX, or Dexterous Data Explorer, is a collaboration between the Scientific Data Management Group and the Visualization Group in CRD. DEX uses FastBit to provide query-based visualization of large scientific datasets. A preliminary version of the software was demonstrated at the Supercomputing 2004 conference on both combustion datasets and supernova simulation datasets.
- Compressed bitmaps are also used in a view-dependent isosurface software. At Supercomputing 2004, a preliminary version of the software was demonstrated to display in real-time the isosurfaces of large complex data produced from a simulation of Richtmyer-Meshkov Instability, which is the impulsive-acceleration limit of the Rayleigh-Taylor instability in computational fluid dynamics. In this application, compressed bitmaps are used record what data items are visible from a particular viewing angle. This allows the software to extract the minimal amount of data items required for visualization and make it very efficient to render very large complex isosurfaces as the user changes viewing angles.
The effectiveness of FastBit has also attracted the attention of other institutions as well. The ROOT developers at CERN have started working on incorporating FastBit into their software. Since ROOT software is used by all major high-energy physics projects around the world, fully integrating FastBit into it would far extend the user community in the scientific realm. There is also interest from the commercial sector.
"We have even learned that at least one foreign telecommunications providers has implemented WAH compression for their data analysis," Wu said
About Computing Sciences at Berkeley Lab
The Lawrence Berkeley National Laboratory (Berkeley Lab) Computing Sciences organization provides the computing and networking resources and expertise critical to advancing the Department of Energy's research missions: developing new energy sources, improving energy efficiency, developing new materials and increasing our understanding of ourselves, our world and our universe.
ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 6,000 scientists at national laboratories and universities, including those at Berkeley Lab's Computational Research Division (CRD). CRD conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation. NERSC and ESnet are DOE Office of Science User Facilities.
Lawrence Berkeley National Laboratory addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.