Center’s Microbial Data Management System to be Featured at Conference

May 1, 2005

The Integrated Microbial Genomes (IMG) System developed by CRD’s Biological Data Management and Technology Center (BDMTC) will be a featured demonstration at the 13th annual international conference on Intelligent Systems for Molecular Biology. The conference, sponsored by the International Society for Computational Biology, will be held June 25-29 in Detroit.

Biological Data Management and Technology Center Marks First Year

The Biological Data Management and Technology Center (BDMTC) at Lawrence Berkeley National Laboratory marked its first anniversary with the release of the Integrated Microbial Genomes (IMG) system, a complex biological data management system BDMTC developed in collaboration with the Microbial Genome Analysis Program (MGAP) at the Joint Genome Institute (JGI).

As a community resource, IMG integrates JGI’s microbial genome data with publicly available microbial genome data, providing a powerful comparative context for microbial genome analysis. At JGI, an enhanced version of IMG provides support for advanced data curation and annotation carried out by MGAP scientists.

While IMG is the first academic “product” BDMTC undertook, its success “demonstrates the viability of the center’s rationale,” said BDMTC head Victor Markowitz, who launched the center in January 2004. “BDMTC is based on the premise that addressing effectively biological data management challenges requires extensive data management and system development experience and expertise consolidated in a central core,” according to Markowitz. Particularly encouraging has been the response of MGAP scientists to the systematic approach followed in developing IMG, from gathering and analysis of user requirements through the public release of the system.

“Although I was closely involved for years in the development of another microbial genome data system at Integrated Genomics, this is the first time I have experienced a well organized process in which requirements are documented, clarified, and continuously refined, and development follows a strict yet clear and predictable schedule,” said Nikos Kyrpides, head of MGAP and IMG’s scientific lead. “I fully appreciate the value of a disciplined development process which I now consider as critical to ensuring that the system addresses the needs of the scientific users.”

In presentations about IMG, Kyrpides emphasizes the benefits biologists gain from system documentation, the development process, and even data modeling abstractions, a view he hopes more biologists will start to share.

Rationale for BDMTC

Biological data management involves data generation and acquisition, data modeling, data integration and data analysis. Data management poses challenges on several fronts. First, there are the increasing amounts of experimental data generated by life science applications. Next is the difficulty of qualifying data generated using inherently imprecise tools and techniques. Finally, there is the complexity of integrating data residing in diverse and poorly correlated repositories.

At research institutions such as LBNL and the University of California, San Francisco (UCSF), biological data management systems have typically been developed with an eye toward rapid development and low cost. This often meant that minimal consideration was given to requirements analysis, system development practices, system evolution, maintenance and scalability. While such an approach was perceived as less expensive because it could be achieved without experienced data management professionals and software engineers, the savings came at the expense of overall system quality, including reliability, maintenance and evolution.

The problems associated with academic systems and software have been recognized and addressed in NIH reports, in particular “The Biomedical Information Science and Technology Initiative” (BISTI) report prepared by the Working Group on Biomedical Computing Advisory Committee to the NIH Director and the NIH Roadmap for Accelerating Medical Discovery to Improve Health. Both documents recommend employing advanced data management technologies for developing interoperable biomedical databases and software engineering practices for delivering robust and reliable systems and tools.

Following NIH’s recommendations requires expertise in several areas, such as data modeling, data integration, database administration, data sharing and security, software engineering, and quality control for software and data management. Due to the complexity and cost involved, few public institutions can afford to acquire such expertise. Therefore, a central core such as BDMTC could provide an effective solution to this problem. BDMTC’s premise is also consistent with DOE’s Genomes to Life (GTL) program, which envisions consolidated computing infrastructure facilities in the form of software, biocomputing and data centers. In particular, a “seamless and effectively centralized capability to deal with data” in the form of data centers that effectively collect and integrate large-scale biological data is seen as key to GTL’s success.

Exploring Partnership Possibilities

Over the course of its first year, members of BDMTC approached a number of academic organizations in the Bay Area, both to assess their data management needs and to identify potential areas for collaboration. The organizations included the Berkeley Structural Genomics Center (BSGC), the Joint Genome Institute (JGI), the P50 Integrative Cancer Biology Program (ICBP) in the Life Sciences Division at LBNL, and the Immune Tolerance Network (ITN) at UCSF.

The analysis of JGI’s data management goals subsequently led to the development of IMG. BSGC’s data management needs, in particular in the area of experimental data tracking and work scheduling, were examined in order to prepare the Laboratory Information Management Systems (LIMS) and data management component of BSGC’s PSI-II application for a Large Scale Structural Genomics Center. ICBP’s data management core provided a concrete framework for exploring caBIG-related opportunities to provide better data management support for NCI-sponsored programs and centers.

BDMTC has also pursued a collaboration with ITN at UCSF and was part of UCB’s proposal for a National Center for Biomedical Computing (NCBC): the former was not finalized because of budget cuts, and the latter was not selected for funding. However, these initiatives provide additional evidence for BDMTC’s potential to establish collaborations. In particular, NIH’s call for establishing NCBC envisions software development and data management cores similar in scope to BDMTC.

Challenges and Plans

While there is clearly a growing need for enhanced data management tools, there is also a preference among many academic life science groups for a “do it yourself” approach, rather than collaborating with other groups or centers. An additional challenge is the emphasis these groups place on experimental results over data management: the relatively low budgets they assign to data management may need to be reconciled with the cost of outside collaborations.

In 2005, BDMTC will continue to pursue collaboration opportunities, primarily at LBNL, UC Berkeley and UCSF. Part of BDMTC’s outreach will be helping research groups realize that collaborations could lead to higher quality data management results, less duplicated effort, and savings from shared resources and expertise.

Developing productive collaborations will require a change in the way groups tend to operate, Markowitz said. Life science groups would benefit from a higher level of collaboration in the area of biological data management system and bioinformatics tool development. Data management and software engineering groups need to improve their ability to support life science applications through enhanced understanding of these applications. Prerequisites for such an endeavor include finding incentives to encourage collaborations and raising the awareness of the critical role played by biological data management in competing for large-scale projects or centers such as those envisioned by the GTL and NIH programs.
