
New Metadata Organizer Streamlines JGI Data Management

JAMO poster to be presented December 11 at the Bay Area Scientific Computing Day

December 5, 2013

Kathy Kincade, +1 510 495 2124, kkincade@lbl.gov

A new centralized data management system now running at the Department of Energy’s National Energy Research Scientific Computing Center (NERSC) is helping researchers at the Joint Genome Institute (JGI) deal more efficiently with the vast amounts of data yielded by their increasingly complex bioinformatics projects.

Before JAMO: Groups shared data in many different ways across the file system, making data and files hard to share.

 

The JGI Archive and Metadata Organizer (JAMO), which came online in August after the Data ‘n Archive (DnA) file system went live, started as a collaboration among the Sequence Data Management (SDM), Quality Assurance and Quality Control, and Genome Assembly groups at JGI. It has been a cross-program effort at JGI and a cross-divisional effort between JGI and NERSC.

Kjiersten Fagnan, bioinformatics and HPC consultant, NERSC User Services Group.

“These groups recognized the need for a centralized data management system for the vast quantities of data and analysis generated at JGI,” said Kjiersten Fagnan, bioinformatics and HPC consultant in the User Services Group at NERSC and a driving force behind the JAMO project. “The retirement of the old house file system afforded a unique opportunity to fundamentally alter the way they manage their data usage.”

“It quickly became apparent that we could not simply copy everything onto the new filesystem and that we would need to put much of our legacy data onto the NERSC tape system,” noted Alex Copeland, Genome Assembly group lead. “I began talking with Chris Beecroft and Alex Boyd about what tools they were using for archiving our current production data to tape. They agreed that it was possible and that it made sense to use the same system for all JGI data, including our legacy data, and this effort led to creating what is now called JAMO.”

JAMO is JGI’s first implementation of a hierarchical data management system. Users can register important data with JAMO and set an expiration policy. JAMO then migrates the data to the archive and copies it to the DnA file system, where it can be read but not modified. To modify a file, a user must register a new file with the system, which treats it as a new version of the same file.

JAMO also tracks how long the data has been on disk. Once the expiration policy is reached, the file is deleted from spinning disk and exists only in the HPSS archive. If a user requests a file that exists only in the archive, JAMO restores that file to the DnA file system.
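The register, expire and restore cycle described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not JAMO's actual code or interface: the names Record, Organizer, register, expire and fetch are hypothetical, the archive and disk copies are represented by simple flags, and restarting the disk clock when a file is restored is an assumption the article does not confirm.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Record:
    """One registered version of a file (hypothetical structure)."""
    name: str
    version: int
    keep_on_disk: timedelta   # the user's expiration policy
    on_disk_since: datetime   # when the current disk copy was created
    in_archive: bool = True   # archive copy is made at registration
    on_disk: bool = True      # read-only copy on the disk file system


class Organizer:
    """Toy model of the register / expire / restore cycle."""

    def __init__(self) -> None:
        self.records: Dict[str, List[Record]] = {}

    def register(self, name: str, keep_on_disk: timedelta,
                 now: datetime) -> Record:
        # Registering an existing name never overwrites anything;
        # it appends a new version of the same file.
        versions = self.records.setdefault(name, [])
        rec = Record(name, len(versions) + 1, keep_on_disk, now)
        versions.append(rec)
        return rec

    def expire(self, now: datetime) -> int:
        # Drop the disk copy of every record whose policy has run out.
        # The archive copy is untouched. Returns how many were purged.
        purged = 0
        for versions in self.records.values():
            for rec in versions:
                if rec.on_disk and now - rec.on_disk_since >= rec.keep_on_disk:
                    rec.on_disk = False
                    purged += 1
        return purged

    def fetch(self, name: str, now: datetime,
              version: Optional[int] = None) -> Record:
        # Return the requested version (default: the latest), restoring
        # it from the archive if its disk copy has expired.
        versions = self.records[name]
        rec = versions[-1] if version is None else versions[version - 1]
        if not rec.on_disk:
            rec.on_disk = True
            rec.on_disk_since = now  # assumption: the clock restarts
        return rec
```

In this model, a file registered twice yields versions 1 and 2, a call to expire removes only the disk copies whose policy has lapsed, and fetch transparently brings an archive-only version back to disk.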

After JAMO: With the data location saved in a centralized repository, all groups can query this resource in a consistent way, saving time and resources.

 

“By taking advantage of HPSS through JAMO, this will allow us to optimize usage of the more expensive spinning disk,” Fagnan said.

In making the shift from the house file system to JAMO, both the data and the scripts that use the data had to change, she added.

“We took advantage of this opportunity to develop a system that, if used, would mean that JGI would not have to go through this painful process again because the scripts could be file-system agnostic and data management would be done by JAMO,” she explained.
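The idea of file-system-agnostic scripts can be sketched as follows: a script asks a central catalog for files by their metadata and never hard-codes a path, so a storage migration only changes catalog entries. Everything here is illustrative and assumed rather than taken from JAMO: the CATALOG contents, the metadata keys, the paths and the functions find_paths and relocate are all hypothetical.

```python
from typing import Dict, List

# Hypothetical catalog: each entry pairs descriptive metadata with the
# file's current location. Only the catalog knows about paths.
CATALOG: List[Dict[str, str]] = [
    {"project": "1001", "kind": "fastq", "path": "/old_fs/run7/reads.fastq"},
    {"project": "1001", "kind": "assembly", "path": "/old_fs/run7/contigs.fa"},
    {"project": "1002", "kind": "fastq", "path": "/old_fs/run9/reads.fastq"},
]


def find_paths(catalog: List[Dict[str, str]], **criteria: str) -> List[str]:
    """Return the path of every entry whose metadata matches all criteria."""
    return [
        entry["path"]
        for entry in catalog
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


def relocate(catalog: List[Dict[str, str]],
             old_prefix: str, new_prefix: str) -> None:
    """Simulate a storage migration: only catalog entries are rewritten."""
    for entry in catalog:
        if entry["path"].startswith(old_prefix):
            entry["path"] = new_prefix + entry["path"][len(old_prefix):]
```

A script that calls find_paths(CATALOG, project="1001", kind="fastq") keeps working unchanged after relocate(CATALOG, "/old_fs", "/new_fs"), because the query is expressed in metadata rather than in file-system locations.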

In addition to SDM, Fagnan and her colleagues at NERSC had the opportunity to work with almost every group at the JGI to help them move data from the old house file system to JAMO and DnA.

Jason Hick, Group Leader, NERSC Storage Systems Group.

The feedback so far has been overwhelmingly positive, Fagnan added.

“The JGI is excited about being able to find old data in minutes as opposed to hours,” she said. “We are also looking forward to using the JAMO system to generate reproducible pipelines and workflows by storing detailed metadata with each project analysts complete. We have made this project feel collaborative, which has provided momentum at JGI for doing software development projects right.”

SDM worked closely with the NERSC Archival Storage team to ensure its software has an optimal interface to NERSC’s High Performance Storage System (HPSS). In particular, Jason Hick, Wayne Hurlbert and Nick Balthaser provided information and reviewed the processes SDM uses, to ensure efficient access to and use of the archive, Fagnan noted.

“NERSC has a strong interest in supporting our users’ data management needs, and the Archival Storage Team was pleased to help the SDM group innovate in their development of software to improve the management of JGI data,” Hick said.

On December 11, SDM’s Boyd will present a poster about JAMO at the Bay Area Scientific Computing Day (BASCD), which is being hosted by Berkeley Lab. BASCD is an annual one-day meeting designed to foster interactions and collaborations between researchers in scientific computing and computational science and engineering from the San Francisco Bay Area.


About Computing Sciences at Berkeley Lab

The Lawrence Berkeley National Laboratory (Berkeley Lab) Computing Sciences organization provides the computing and networking resources and expertise critical to advancing the Department of Energy's research missions: developing new energy sources, improving energy efficiency, developing new materials and increasing our understanding of ourselves, our world and our universe.

ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 6,000 scientists at national laboratories and universities, including those at Berkeley Lab's Computational Research Division (CRD). CRD conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation. NERSC and ESnet are DOE Office of Science User Facilities.

Lawrence Berkeley National Laboratory addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.