After 10 Years, IMG Still Revolutionizing Genomics
October 5, 2015
Contact: Linda Vu, +1 510 495 2402, [email protected]
In 2005, the Integrated Microbial Genomes (IMG) data management system was launched to support comparative analysis of genomes sequenced at the Department of Energy’s Joint Genome Institute (JGI). At that time, the system had only a few registered users and contained about 3,000 genomes.
Today, IMG is one of the largest publicly available data management and analysis systems for microbial genome and metagenome datasets, containing about 50,000 datasets. The system also has more than 13,500 registered users from 93 countries across six continents, has contributed to thousands of published papers and has served as a tool for teaching genome and metagenome comparative analysis at numerous universities and colleges around the globe.
On this milestone anniversary, the two researchers from Lawrence Berkeley National Laboratory (Berkeley Lab) who have led the development of IMG—Victor M Markowitz, leader of the Lab’s Biosciences Computing Group, and Nikos C Kyrpides, head of JGI’s Prokaryotic Super Program—reflect on the development, evolution and impact of this system.
Both are co-authors of the paper “Ten years of maintaining and expanding a microbial genome and metagenome analysis system,” which was published in the October 1 issue of Trends in Microbiology.
Question: Looking back, to what do you credit the success of IMG?
VICTOR: When we started developing the IMG system 10 years ago, there were several similar data management and analysis systems for microbial genomics but no systems for the emerging area of metagenomics. Over the last decade, some of those systems survived, some ceased to exist and new systems appeared. What made IMG unique was the tight partnership between Nikos and his colleagues at JGI and the computer scientists and software engineers in Berkeley Lab’s Biosciences Computing Group (formerly known as the Biological Data Management and Technology Center).
From day one, IMG’s development was anchored in the scientists’ view of how the system would serve their analysis needs. They provided us with very detailed requirements and use cases, which drove our engineering decisions. Supporting the scientific users of the system was and remains our main focus. The steadfast support we got from JGI, which stood by us in good and bad times, was also essential to our success.
NIKOS: I agree with Victor that the source of IMG’s success has been the very close collaboration between JGI’s scientists and Berkeley Lab’s Biosciences Computing Group. I cannot emphasize enough how important this collaboration is. We have seen time and time again that computer scientists and software engineers believe they know what the biologists want and then build a system that we can’t use. Or a biologist with some computer programming experience tries to create a system that doesn’t come close to what the computer scientists can build.
The first years were very difficult in terms of communication. Computer scientists and biologists speak very different languages, so often it’s very hard to bridge the communication gap between them. But the persistence, understanding and confidence that we had in each other—trusting that the other team knew what to do to make this endeavor successful—ultimately made this collaboration so impactful.
Question: From a scientific perspective, what makes IMG unique?
NIKOS: In my view, IMG provides one of the best, if not the best, integrations of genomics data that I have seen. Everyone agrees that we need to integrate data, but there are various ways to do this. At the very minimum, you can collect data of different kinds (genes, pathways, various types of omics data or even metadata), put this information into a database and stop there. Users will not be able to compare the different types of data. So the data is collected, but there is no real data integration.
The strength of IMG is that its integration allows scientists to compare across different data types. We have to work with diverse types of data from many different datasets. For example, if we have a lot of metagenome datasets from the human body but we don’t know where they came from, which body site or the exact conditions of the individual from which those samples were collected, the datasets are of little use. Yes, we can compare the datasets, we can identify commonalities or differences between datasets, but we cannot explain them. The explanations emerge from integrating different types of data and metadata. With integration, the meaning of comparing data is revealed and we are able to interpret the results in a comprehensive global data context.
The strength of IMG’s integration is directly related to our strong partnership with NERSC (National Energy Research Scientific Computing Center). The reason is that before datasets are integrated into IMG, they require a lot of processing, which involves massive computations. Today we have more than 35 billion genes in IMG, and supporting gene-based comparisons between datasets wouldn’t have been possible without the computing infrastructure at NERSC.
Question: What have been some of your biggest challenges over the last 10 years?
VICTOR: The rapid growth in the number of sequence datasets generated by new sequencing technology platforms and the massive increase in the number of genes in individual metagenome datasets have been the toughest challenges we’ve encountered.
Since IMG serves as a vehicle for processing and distributing the datasets sequenced at JGI, the top priority for IMG is to serve JGI scientists and their collaborators. Increasing the number of datasets included in IMG improves the efficiency of the analysis supported by the system, so we try to get as many genome and metagenome datasets into the system as possible. In addition to all the microbial genomes and metagenomes sequenced at JGI that are included in IMG, we integrate all genome datasets available from public sequence data archives. We also encourage users to submit their own datasets (sequenced elsewhere) for processing and integration into IMG.
As the cost of sequencing dropped, we saw an explosion of data being generated around 2010. As expected, the data management technology that we were using at the time just could not sustain the growth. The 2010-2012 period was tough, as we had to scramble for solutions to sustain the data growth. The computational challenge of integrating new genome and metagenome datasets into IMG was eased by the supercomputing infrastructure at NERSC. To address the data management challenge, we extended IMG’s database infrastructure by using open source data stores and developing custom access methods. But this solution is difficult to scale and ideally would be replaced by gaining access to high performance data management resources and new database technology platforms, hopefully also at NERSC.
NIKOS: Ten years ago, the scale of genomics data was completely different than today. When IMG was launched, we had only a few thousand genomes in the system. The amount of data included in the database on a quarterly basis was significantly less than what is included in IMG on a weekly basis today.
When we started IMG in 2005, scientists who were submitting proposals to JGI had one or maybe a handful of genomes that they wanted to sequence and analyze. Today, scientists are using IMG to compare several hundred to thousands of genomes and metagenomes. For instance, if you want to understand a single organism like E. coli, we now have thousands of genomes for this organism in the IMG database. So scale has been one of our biggest challenges, and will continue to be one of our greatest challenges moving forward.
Question: What are some of the goals for the next 10 years of IMG?
NIKOS: The key is integration of data, especially metagenomic data. In the beginning, our goal was just to provide a system that would integrate all of the microbial genomes sequenced at JGI. But it quickly became obvious that we needed to integrate the genomes sequenced at JGI with those sequenced elsewhere. So we began integrating all publicly available microbial genomic data from resources like NCBI (National Center for Biotechnology Information) into IMG. But we haven’t been as aggressive in integrating publicly available metagenomes into IMG because this is a tougher challenge that will require substantially more resources.
Right now, there is no centralized resource that integrates all publicly available assembled metagenomic datasets. This is one of the biggest challenges and one of our biggest goals for the future. It is also one area where our partnership with NERSC is very valuable. Our relationship with NERSC gives us access to high performance computing and storage, which allows us to process larger datasets than ever before. As we improve our pipelines, we will be able to start processing the increased number of datasets that are being submitted to IMG from other centers.
And as JGI moves into the field of functional genomics, we now have the capabilities to do more than just sequencing, including DNA synthesis and metabolomics. All of these technologies are generating new types of data. Eventually, we would like to integrate these new data types into IMG. This will be a huge step forward in facilitating discovery.
VICTOR: In sustaining IMG as a leading genome and metagenome data management and analysis system, we are facing two major challenges: maintaining a rapidly growing production system in an academic environment with inherently tight funding, and a lack of access to advanced data management technology.
The limitations of the data management tools we have access to were and continue to be the main hurdle for IMG’s future. We have partially overcome these limitations by experimenting with and deploying various open source data stores wrapped with custom data access methods. But these solutions have their own limitations, are difficult to scale and need to be replaced within the next two to three years, ideally by gaining access to cutting-edge high performance data management resources and new database technology. Such resources are available commercially at a high cost, but unfortunately are not currently provided by academic computing centers like NERSC. We would like to work closely with NERSC to overcome this challenge.
For more information on IMG:
About Computing Sciences at Berkeley Lab
High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.
Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 13 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.