Technology for Speeding Up Searches of Large Databases Wins R&D 100 Award
August 1, 2008
FastBit, an indexing technology that allows users to search massive datasets up to 40 times faster than the best commercially available tools, has been recognized with a 2008 R&D 100 Award.
R&D Magazine will present the award to FastBit developers Kesheng “John” Wu, Arie Shoshani, Ekow Otoo, and Kurt Stockinger of the Scientific Data Management Group in Berkeley Lab’s Computational Research Division, at a special October ceremony in Chicago, Ill. (Stockinger has since left the group to accept a position in Switzerland.
According to Wu, who led the project’s development, FastBit was originally designed for particle physicists who need to sort through billions of data records just to find tens or hundreds of key pieces of information. For example, the technology has played a crucial role in a high energy physics experiment called STAR, where colliding particles generate billions of subatomic events, but only a few hundred of these events have the most distinctive signatures of the new state of matter called quark-gluon plasma (QGP). FastBit helped the researchers quickly identify evidence of QGPs, which was a key objective of the STAR experiment.
Wu notes that the key to FastBit’s success is an innovative bitmap compression method called the Word Aligned Hybrid (WAH) compression. Computer scientists have long known that bitmap indexing methods are the most efficient for accelerating searching operations on large datasets that do not change over time, which is the case for most science experiments.
Commonly used commercial indexing methods, such as B-tree and similar tree-based indexes, are inefficient for scientific research because they cannot locate thousands of records in analyses of large datasets. Instead, these technologies are optimized for finding a small number of specific records, such as one’s bank account information.
Additionally, B-tree indexes inflate the volume of the data by a factor of three or four, which can be extremely cumbersome when dealing with terabyte-size datasets. Meanwhile, many tree-based indexing techniques are specialized to the task at hand – meaning they are tailored for searching regions on a single variable, such as finding all hot regions in a climate database – but do not work well when multiple variables are searched together, such as finding hot regions where wind velocity is high.
Although bitmap indexes proved to be ings. These indexes could only search limited types of data, primarily those vari- ables with a relatively small number of possible values, such as the sex of a customer or the state in which the customer lives. Another limitation was size – bitmap indexes were too large to be stored in computer memory, so to answer a query, the relevant part of the index had to be read into the computer’s memory before the necessary computation could be performed. This significantly slowed down a search because the time needed to read parts of the index into the computer’s memory tends to be much greater than the time needed to perform the actual query.
The FastBit team overcame these issues by inventing a compression method that was demonstrated to be 10 times faster than its widely used commercial counterpart. In addition to being compute- efficient, it is also effective in reducing the index size. The size of a FastBit index is on average one-third the size of the original data, which is about one-tenth the size of the most widely used B-tree index. FastBit also significantly improved the speed of searching operations with a number of techniques, including a vertical data organization, an innovative bitmap compression technique, and several new bitmap encoding methods.
“Since we began developing FastBit to help physicists search massive datasets resulting from large-scale experiments, we’ve seen an explosion of data in many other fields, too. We released FastBit under an open-source license in 2007, [and] it has attracted a lot of interest in new areas, such as network traffic data analysis, and business applications, drug discovery and web content delivery,” said Wu.
“In short, FastBit contains significant innovations that are well recognized and have wide impacts in science, technology and education.”
The R&D100 Award for the public release of FastBit is the latest in a series of honors from the research community. FastBit received an award from the 2005 International Supercomputing Conference for enabling Grid-based analysis of high energy physics data. In the High Performance Analytics Challenge at the Supercomputing 2005 conference, Fastbit developers’ work also received an honorable mention for its contributions to network traffic analysis. In addition, FastBit publications have appeared in journals and conferences spanning both the database research field and visualization field, including ACM Transactions on Database Systems, Very Large Database conference and IEEE Visualization conference. FastBit has played crucial roles in two Ph.D. theses from the University of California Berkeley and the University of Illinois Urbana-Champaign. The project was supported by the Department of Energy’s Scientific Discovery through Advanced Computing program, which develops software for accelerating computational science.
For more information about FastBit, go to: http://sdm.lbl.gov/fastbit/.
About Computing Sciences at Berkeley Lab
The Lawrence Berkeley National Laboratory (Berkeley Lab) Computing Sciences organization provides the computing and networking resources and expertise critical to advancing the Department of Energy's research missions: developing new energy sources, improving energy efficiency, developing new materials and increasing our understanding of ourselves, our world and our universe.
ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 6,000 scientists at national laboratories and universities, including those at Berkeley Lab's Computational Research Division (CRD). CRD conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation. NERSC and ESnet are DOE Office of Science User Facilities.
Lawrence Berkeley National Laboratory addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.