SDSC, UC San Diego, LBNL Team Wins SC09 'Storage Challenge' Award
November 25, 2009
Palomar Transient Factory/ Peter Nugent (Berkeley Lab)
GALAXY NEXT DOOR: This false-color image of our glowing galactic neighbor, the Andromeda Galaxy, was created by layering 400 individual images captured by the PTF camera in February 2009. In one pointing, the camera has a seven square degree field of view, equivalent to approximately 25 full moons.
A research team from the San Diego Supercomputer Center (SDSC) at UC San Diego and the University of California's Lawrence Berkeley National Laboratory (Berkeley Lab) has won the Storage Challenge competition at SC09, the leading international conference on high-performance computing, networking, storage and analysis held Nov. 14-20 in Portland, Oregon.
The research team based its Storage Challenge submission for the annual conference on the architecture of SDSC's recently announced Dash high-performance computer system, a "super-sized" version of flash memory-based devices such as laptops, digital cameras and thumb drives that also employs vSMP Foundation software from ScaleMP, Inc. to provide virtual symmetric multiprocessing capabilities.
Dash is the prototype for a much larger flash-memory HPC system called Gordon, which is scheduled to go online in mid-2011 after SDSC was awarded a five-year, $20 million grant from the National Science Foundation (NSF) this summer to build and operate the powerful system. Both Dash and Gordon are designed to accelerate investigation of a wide range of data-intensive science problems by providing cost-effective data performance more than 10 times faster than most other HPC systems in use today.
The hypothesis of the team's Challenge, called "Data Intensive Science: Solving Scientific Unknowns by Solving Storage Problems," was that solid state drives (SSDs) based on NAND (Not AND) flash technology are "ready for prime time" in that they are reliable and cheap enough to improve input/output density (I/O rate) by more than 10 times, or greater than one order of magnitude.
“The current data storage I/O rate is far lower than the ever-increasing rate of enthusiasm among researchers and scientists, who are now drowning in a sea of data because of this differential,” said SDSC’s Arun Jagatheesan, team leader for this year's competition. "With the SC09 Storage Challenge, our team demonstrated the prototype of a data-intensive supercomputer that can bridge this gap. Our mission in this challenge was to design, build, deploy and commission a prototype of such a supercomputer at challenging construction and operational costs, without compromising the data-intensive performance of the scientific applications.”
Berkeley Lab's Peter Nugent and Janet Jacobsen provided the project with a production database to use for the data challenge, and did so in a short time.
“We re-created a snapshot of the Palomar Transient Factory (PTF) database that we have operating at NERSC. This database contains the results of processing images from the PTF sky survey, as well as the results of analyzing the images for transient objects,” Jacobsen said. "One of the keys to being able to identify transient objects for follow-up telescope time is to search rapidly through the 100 million 'candidate' transient objects in the database. For supernovae, it is particularly important to detect the supernova before it reaches its peak brightness. This is why being able to test the speed of the query execution on the Dash compute system was of interest to us and the PTF collaboration."
An additional test on Dash measured the speed of co-adding 720 images from PTF in two different optical filters as a means to test capabilities critical for future projects. "Our goal with this was to see how faint we could detect objects in order to gauge how well PTF could be used for target selection for future generation surveys such as BigBOSS," Nugent said.
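Why co-adding helps detect fainter objects can be sketched with a standard rule of thumb. The example below is a minimal illustration, not the team's pipeline: it assumes the images are equally deep and their noise is uncorrelated, in which case the signal-to-noise ratio of the stack grows as the square root of the number of images. The function names and the example image count are hypothetical.

```python
import math


def coadd_snr_gain(n_images: int) -> float:
    """Idealized SNR gain from stacking n equally deep images.

    Assumes uncorrelated noise, so the gain scales as sqrt(n).
    """
    if n_images < 1:
        raise ValueError("n_images must be at least 1")
    return math.sqrt(n_images)


def coadd_depth_gain_mag(n_images: int) -> float:
    """Extra depth in astronomical magnitudes for the same idealized stack.

    A factor f in SNR corresponds to 2.5 * log10(f) magnitudes.
    """
    return 2.5 * math.log10(coadd_snr_gain(n_images))


# Hypothetical example: a stack of 100 images in one filter.
example_gain = coadd_snr_gain(100)
example_depth = coadd_depth_gain_mag(100)
print(f"SNR gain: {example_gain:.1f}x, depth gain: {example_depth:.2f} mag")
```

Real stacks fall short of this ideal because of varying seeing, sky background and image alignment, so the figure is best read as an upper bound.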
The database was installed on the flash SSD file system of Dash, and LBNL provided queries for the SDSC team to run to help tune their system. Additionally, the Berkeley Lab team ran queries on Dash and at NERSC to compare the performance of the database queries. They found that queries run on the Dash system ran significantly faster than on the database server at NERSC.
"A major challenge for the scientific user community is to deal with storage latency issues in our systems," said SDSC Interim Director Mike Norman, who is also the principal investigator for the center’s upcoming Gordon supercomputer. "Even though not all scientific problems are data-intensive, many of them are, and this challenge illustrated that we can overcome latency issues with innovative approaches and partnerships. We’re looking forward to helping the NSF and others meet the needs of this new generation of data-intensive science."
"Moving a physical disk-head to accomplish random I/O is so last-century," said Allan Snavely, associate director of SDSC, co-principal investigator for SDSC’s Gordon system and project leader for Dash. "Indeed, Charles Babbage designed a computer based on moving mechanical parts almost two centuries ago. With respect to I/O, it's time to stop trying to move protons and just move electrons. With the aid of flash SSDs, we can do latency-bound file reads more than 10 times faster and more efficiently than anything being done today."
To achieve its goal, the Storage Challenge team considered several ideas that led to changes in the traditional storage architecture, in both hardware and software. The changes included combining a large (750 GB) RAMFS (random-access memory file system) with a 1 TB (terabyte) flash SSD file system to dramatically accelerate scientific database searches such as those run against the Palomar Transient Factory database. PTF is a fully automated, wide-field survey aimed at a systematic exploration of the optical transient sky using a new 8.1 square degree camera installed on the 48-inch Samuel Oschin telescope at the Palomar Observatory in southern California.
According to Nugent, co-leader of Berkeley Lab's Computational Cosmology Center, astrophysics is transforming from a data-starved to a data-swamped discipline, fundamentally changing the nature of scientific inquiry and discovery. New technologies are enabling the detection, transmission, and storage of data of hitherto unimaginable quantity and quality and these data volumes are rapidly overtaking the computational resources required to make sense of the data within the current frameworks. Time-variable "transient" phenomena, which in many cases are driving new observational efforts, add additional complexity and urgency to knowledge extraction: to maximize science returns, additional follow-up resources must be selectively brought to bear after transients are discovered while the events are still ongoing.
The difficulty in deriving new knowledge from this avalanche of data is amplified by the immediacy with which other resources must be marshaled: simply obtaining well-sampled data does not ensure discovery nor insight into the nature of transients. In most cases, initial classification and discovery is only the trigger to more observations with different telescopes and at different wavelengths. The totality (discovery and follow-up) of the observations becomes the gateway to gaining a deeper understanding of the physical phenomena.
Transient surveys like the PTF and the La Silla Supernova Search (each generating about 100 gigabytes per night) are paving the way for future surveys like LSST (16 TB/night) to assess their effectiveness and maximize their scientific potential. Two of the major bottlenecks are I/O issues related to image processing and performing large, random queries across multiple databases. PTF typically identifies on the order of 100 new transients every minute it is on-sky (along with 1,000 spurious detections). These objects must be vetted and preliminarily classified in order to assign the appropriate follow-up resources to them in less than 24 hours, if not in real time. This often requires performing more than 100 queries through eight different and very large (~100 GB to 1 TB) databases every minute.
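The figures in the preceding paragraph imply a tight time budget per query. The short calculation below is a back-of-the-envelope sketch using only the numbers quoted there. It assumes, for illustration, that queries run one at a time and that 1 TB equals 1,000 GB; the constant and function names are hypothetical.

```python
# Figures quoted in the text above.
QUERIES_PER_MINUTE = 100       # "more than 100 queries ... every minute"
PTF_GB_PER_NIGHT = 100         # "about 100 gigabytes per night"
LSST_TB_PER_NIGHT = 16         # LSST estimate of 16 TB per night
REAL_PER_MINUTE = 100          # new transients per minute on-sky
SPURIOUS_PER_MINUTE = 1000     # spurious detections per minute


def serial_query_budget_seconds(queries_per_minute: float) -> float:
    """Average time available per query if queries run strictly one at a time."""
    return 60.0 / queries_per_minute


def data_rate_ratio(lsst_tb: float, ptf_gb: float, gb_per_tb: float = 1000.0) -> float:
    """How many times larger the LSST nightly volume is than the PTF volume."""
    return lsst_tb * gb_per_tb / ptf_gb


def real_fraction(real: float, spurious: float) -> float:
    """Share of detections that are genuine transients."""
    return real / (real + spurious)


budget = serial_query_budget_seconds(QUERIES_PER_MINUTE)
ratio = data_rate_ratio(LSST_TB_PER_NIGHT, PTF_GB_PER_NIGHT)
fraction = real_fraction(REAL_PER_MINUTE, SPURIOUS_PER_MINUTE)

print(f"Time budget per query (serial): {budget:.2f} s")
print(f"LSST / PTF nightly data volume: {ratio:.0f}x")
print(f"Fraction of detections that are real: {fraction:.1%}")
```

Under these assumptions each query has well under a second to complete, which is why latency-bound random reads against databases of this size become the limiting factor.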
"In the SC09 Storage Challenge, we presented the architecture of our Dash prototype and the excellent results we have already obtained," noted Jagatheesan. "We believe our approach provides cost-effective data performance, and we hope that other academic and non-academic data centers throughout the HPC community can benefit from our approach and experiences."
In addition to Jagatheesan and Norman, SDSC team members included Jiahua He, Allan Snavely, Maya Sedova, Sandeep Gupta, Mahidhar Tatineni, Jeffrey Bennett, Eva Hocks, Larry Diegel and Thomas Hutton from SDSC; Steven Swanson (UC San Diego); Peter Nugent and Janet Jacobsen (Lawrence Berkeley National Laboratory); and Lonnie Heidtke, of Instrumental Inc., a Bloomington, Minn.-based provider of professional services focused on advanced and high-performance computing.
The Storage Challenge is a competition showcasing applications and environments that effectively use the storage subsystem in high-performance computing, which is often the limiting system component. Submissions were based upon tried and true production systems as well as research or proof-of-concept projects not yet in production. Judging was based on present measurements of performance, scalability and storage subsystem utilization as well as innovation and effectiveness.
About Computing Sciences at Berkeley Lab
The Computing Sciences Area at Lawrence Berkeley National Laboratory (Berkeley Lab) provides the computing and networking resources and expertise critical to advancing Department of Energy Office of Science (DOE-SC) research missions: developing new energy sources, improving energy efficiency, developing new materials, and increasing our understanding of ourselves, our world, and our universe. ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 7,000-plus scientists at national laboratories and universities. NERSC and ESnet are both Department of Energy Office of Science National User Facilities. The Computational Research Division (CRD) conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation.
Berkeley Lab addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science. The DOE Office of Science is the United States' single largest supporter of basic research in the physical sciences and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.