NetLogger Helps Supernova Factory Improve Data Analysis
May 1, 2005
The Nearby Supernova Factory (SNfactory) project, established at Berkeley Lab in 2002, aims to dramatically increase the rate of discovery of nearby Type Ia supernovae by applying assembly-line efficiencies to the collection, analysis and retrieval of large amounts of astronomical data.
To date, the program has resulted in the discovery of about 150 Type Ia supernovae, roughly three times the number reported before the project began. Type Ia supernovae are important to astronomers because they are used as “standard candles” for gauging the expansion of the universe.
Contributing to the SNfactory's remarkable discovery rate is its custom-developed “data pipeline” software. The pipeline fills with up to 50 gigabytes (billion bytes) of data per night from wide-field cameras built and operated by the Jet Propulsion Laboratory's Near Earth Asteroid Tracking (NEAT) program. NEAT uses remote telescopes in Southern California and Hawaii.
Around 25,000 new images are captured each day, and the goal is to complete all processing before the next day’s images arrive. Image data is copied in real time from the Mt. Palomar Observatory in Southern California to a mass storage system at NERSC. Then the image data is copied to a large shared disk array on a 344-node cluster called PDSF. Each image is 8 MB (uncompressed), and the processing of each image requires between 5 and 25 reference images, for a total disk space requirement of about 0.5 TB each day.
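The daily data volume implied by these figures can be checked with a quick back-of-the-envelope calculation (the ~0.5 TB total quoted above also covers the reference images, many of which are shared across jobs, so only the new-image volume is computed here):

```python
# Daily volume of newly captured images, from the figures quoted above.
images_per_day = 25_000
image_size_mb = 8  # uncompressed, per image

new_data_gb = images_per_day * image_size_mb / 1024
print(f"New image data per day: {new_data_gb:.0f} GB")  # ~195 GB

# With 5-25 reference images needed per new image (heavily shared
# between jobs), the working set grows to the ~0.5 TB/day cited above.
```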
Supernovae are found by comparing recently acquired telescope images with older reference images. Subtracting the reference image from the new image identifies new light sources: if a source of light appears in the new image that did not exist in the old one, it could be a supernova. This process is quite delicate: aligning the images, matching the point-spread functions, and matching the photometry and bias all require precise calibration.
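The subtraction step can be illustrated with a minimal NumPy sketch. This is not the SNfactory's actual code; it assumes the images are already aligned, PSF-matched, and photometrically calibrated, which is precisely the delicate part the pipeline handles:

```python
import numpy as np

def find_candidates(new_img, ref_img, threshold=5.0):
    """Flag pixels significantly brighter in the new image than in the
    (already aligned, PSF-matched, calibrated) reference image.
    A toy illustration of difference imaging, not production code."""
    diff = new_img - ref_img
    noise = np.std(diff)  # crude noise estimate from the residual
    return np.argwhere(diff > threshold * noise)

# Toy example: a reference frame plus one bright new point source.
rng = np.random.default_rng(0)
ref = rng.normal(100.0, 1.0, size=(64, 64))
new = ref + rng.normal(0.0, 0.5, size=(64, 64))
new[30, 40] += 50.0  # an artificial "new light source"
print(find_candidates(new, ref))  # -> [[30 40]]
```

Only the injected source stands out above the noise of the difference image; everything common to both frames cancels.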
Because of the high demand put on all the resources in the pipeline, making sure that the data flow smoothly and can be analyzed quickly and correctly is critical to the overall success. While there are a number of tools for evaluating the performance of single systems, identifying the workflow bottlenecks in a distributed system such as the SNfactory requires a different type of application.
For the past 10 years, Brian Tierney and others in the Collaborative Computing Technologies Group have been developing the NetLogger toolkit as part of the Distributed Monitoring Framework project.
NetLogger is a set of libraries and tools to support end-to-end monitoring of distributed applications. During the past few months, the team has been working closely with the SNfactory project to help debug and tune their application. “NetLogger has been extremely useful in the debugging and commissioning of our data processing pipeline,” said Stephen Bailey, one of the lead developers on the SNfactory project. “It has helped us identify bugs and processing bottlenecks in order to improve our efficiency and data quality. It additionally has allowed real-time monitoring of the data processing to quickly identify problems that need immediate attention. This debugging, commissioning, and monitoring would have taken much longer without NetLogger.”
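The style of end-to-end instrumentation described here can be sketched as follows. This is an illustrative pattern, not NetLogger's actual API: each pipeline step emits timestamped start/end events that a later analysis can correlate across hosts and processes:

```python
import json
import time

def log_event(event, **fields):
    """Emit one timestamped, structured event record (illustrative
    stand-in for a monitoring toolkit's logging call)."""
    record = {"ts": time.time(), "event": event, **fields}
    print(json.dumps(record))
    return record

def process_image(image_id):
    # Hypothetical pipeline step, bracketed by start/end events.
    log_event("subtract.start", image=image_id)
    # ... image subtraction would run here ...
    log_event("subtract.end", image=image_id, status="ok")

process_image("neat-2005-05-01-0001")  # image ID is made up
```

Pairing each step's start and end events is what makes the "lifeline" visualizations and bottleneck analysis described below possible.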
Tierney and Bailey, along with Dan Gunter of the Collaborative Computing Technologies Group, have written a paper entitled “Scalable Analysis of Distributed Workflow Traces,” which will be presented at the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'05) to be held June 27-30 in Las Vegas. The paper can be found at <http://dsd.lbl.gov/publications/NetLogger-SNFactory.pdf>.
“The first problem the SNfactory scientists asked us to solve was to figure out why some of their workflows were failing without any error messages as to the cause,” Tierney said. “Even when error messages were generated, the SNfactory application produced thousands of log files, and it was very difficult to locate the log messages related to failed workflows. NetLogger was very useful for easily characterizing where the failures were occurring so they would know where to focus debugging efforts.” The figure below shows a typical workflow for the SNfactory application on a single cluster node. CPU and network data is shown at the bottom.
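The kind of failure analysis Tierney describes amounts to grouping events per workflow and spotting the ones that never reach a terminal event. A minimal sketch, with invented workflow IDs and event names:

```python
from collections import defaultdict

# Hypothetical (workflow_id, event_name) pairs, in arrival order.
# The names and IDs are invented for illustration.
events = [
    ("wf-001", "uncompress.start"), ("wf-001", "uncompress.end"),
    ("wf-001", "subtract.start"),   ("wf-001", "subtract.end"),
    ("wf-002", "uncompress.start"), ("wf-002", "uncompress.end"),
    ("wf-002", "subtract.start"),   # never finished: a silent failure
]

by_workflow = defaultdict(list)
for wf, name in events:
    by_workflow[wf].append(name)

# A workflow whose last recorded event is a ".start" died mid-step.
failed = [wf for wf, names in by_workflow.items()
          if names[-1].endswith(".start")]
print(failed)  # -> ['wf-002']
```

Even with thousands of log files, this reduction immediately localizes which workflows, and which step within them, failed silently.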
The figure captures a bug in the SNfactory processing that went undetected for several months until NetLogger analysis revealed it.
The SNfactory application processes a group of images together, starting with uncompressing the images, and then doing image calibration and subtraction. The next step is to generate a skyflat image, a calibration image formed from a median combination of several other images. The skyflat is used to correct other images for the sky brightness on a given night, which can vary due to humidity, cloud cover, and so on. The skyflat calibration image is then applied to all images within the job. Under some conditions the pipeline erroneously determined that the skyflat calibration was not necessary. All lifelines except the two nearly vertical ones near the beginning should have converged at the setskyflat event.
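The median-combination step behind the skyflat can be sketched with NumPy. This is a toy illustration of the principle only, assuming bias-corrected input frames: the pixel-wise median rejects stars and other sources that move between frames, leaving the smooth sky response:

```python
import numpy as np

def make_skyflat(images):
    """Pixel-wise median of a stack of frames, normalized to unit mean.
    A toy sketch of skyflat construction, not the SNfactory's code."""
    flat = np.median(np.stack(images), axis=0)
    return flat / np.mean(flat)

def apply_skyflat(image, skyflat):
    # Divide out the sky response to flatten the frame.
    return image / skyflat

# Toy stack: uniform 200-count sky with a "star" at a different
# position in each frame.
frames = []
for i in range(5):
    f = np.full((32, 32), 200.0)
    f[i, i] += 1000.0  # transient source, different pixel each frame
    frames.append(f)

flat = make_skyflat(frames)
print(np.allclose(flat, 1.0))  # -> True: the median rejected the stars
```

Because no star lands on the same pixel in a majority of frames, the median sees only sky at every pixel, which is exactly why a median combination works here.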
About Computing Sciences at Berkeley Lab
The Lawrence Berkeley National Laboratory (Berkeley Lab) Computing Sciences organization provides the computing and networking resources and expertise critical to advancing the Department of Energy's research missions: developing new energy sources, improving energy efficiency, developing new materials and increasing our understanding of ourselves, our world and our universe.
ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 6,000 scientists at national laboratories and universities, including those at Berkeley Lab's Computational Research Division (CRD). CRD conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation. NERSC and ESnet are DOE Office of Science User Facilities.
Lawrence Berkeley National Laboratory addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.