DOE User Facilities Join Forces to Tackle Biology’s Big Data

Inaugural Collaborative Science Call Yields Six Proposals Melding Genomics, Supercomputing

July 25, 2017

JGI Contact: Massie Ballon
NERSC Contact: Linda Vu
Email: cscomms@lbl.gov

Photo of the Cori Supercomputer at NERSC

Users can look for patterns across data sets in the DOE JGI’s Integrated Microbial Genomes and Microbiomes (IMG/M) database with the help of NERSC’s supercomputer Cori. (Roy Kaltschmidt, Berkeley Lab)

Six proposals have been selected to participate in a new partnership between two U.S. Department of Energy (DOE) user facilities through the “Facilities Integrating Collaborations for User Science” (FICUS) initiative. The expertise and capabilities available at the DOE Joint Genome Institute (JGI) and the National Energy Research Scientific Computing Center (NERSC), both at the Lawrence Berkeley National Laboratory (Berkeley Lab), will give researchers access to supercomputing resources and computational science experts, helping them explore the wealth of genomic and metagenomic data generated worldwide and accelerate discoveries.

“As we bring researchers into the FICUS program, we are introducing a new user community to the power of supercomputers. Scientists will use whatever tools are readily available to investigate a hypothesis and to date, only a small set of biological tools have needed a supercomputer, but this is changing quickly,” says Kjiersten Fagnan, who serves a dual role as the DOE JGI’s Chief Informatics Officer and NERSC’s Data Science Engagement Group Lead.

The JGI-NERSC FICUS call is the latest partnership since the collaborative science initiative was formed in 2014 by the Office of Biological and Environmental Research (BER) to harness the combined expertise and resources of two of the national user facilities stewarded by the DOE Office of Science in support of DOE’s energy, environment, and basic research missions. NERSC is now the latest DOE User Facility to participate in FICUS, with prospects growing for the inclusion of others in the future.

Through the JGI-NERSC FICUS call, users can query all available data to look for patterns across data sets in the DOE JGI’s Integrated Microbial Genomes and Microbiomes (IMG/M) database with the help of NERSC’s supercomputer Cori, resulting in a more powerful analysis with increased capacity for novel discoveries. As many of these researchers are new to computing, a member of NERSC’s Data Science Engagement Team will be assigned to work with each FICUS project, and DOE JGI staff will assess the researchers’ needs and help them develop tools and workflows. Ultimately, these tools and scientific findings will be made publicly available via a NERSC science gateway.

The accepted proposals include:

Patricia (Patsy) Babbitt of the University of California (UC), San Francisco aims to develop tools to mine the IMG/M database for enzyme superfamilies—functionally diverse collections of enzymes that share a common ancestor and fold, as well as active site architectures and reaction mechanism or other chemical capability. By profiling certain enzyme families from different environments, it may be possible to identify associations that will suggest functions for otherwise unknown proteins. As test cases, the Babbitt group is working with enzyme superfamilies involved in the biodegradation of insecticides, heavy metals and explosives.

David Baker at the University of Washington will access the metagenomic and metatranscriptomic data sets available in the IMG/M database to expand the structural universe of eukaryotic proteins. By mining the raw and annotated genome sequences, the team hopes to find more homologs within protein families, which can then be used to develop computational methods that build accurate models of how the proteins fold, providing testable clues to potential functions. The proposal builds upon a previous collaboration in which Baker’s lab used the sequence data in the IMG database to determine accurate 3D models of structures for 614 protein families (12 percent of which had not yet been structurally characterized).

Phillip Brooks of UC Davis proposes to speed up comparative genome sequence analysis by first calculating “signatures” of more than 5,000 private microbial genomes and then tackling all of the metagenomes in the IMG/M database using a technique called MinHash. The resulting signature indexes would be a step toward developing technologies that could lead to faster and more accurate taxonomic organization of genomes contained in metagenomes, enabling more informative comparative analyses of metagenomics datasets.
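For readers unfamiliar with MinHash, the idea can be illustrated with a minimal Python sketch. This is not the project's implementation; the function names, the k-mer length, the signature size, and the choice of SHA-1 as the hash are all assumptions made for illustration. Each sequence is reduced to a small "signature" made of its smallest k-mer hash values, and two signatures can then be compared to estimate how similar the underlying sequences are without comparing them in full.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer (substring of length k) of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minhash_signature(seq, k=5, num_hashes=64):
    """Bottom-k MinHash: keep the num_hashes smallest distinct k-mer hashes.

    k and num_hashes are illustrative defaults, not values from the article.
    """
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:num_hashes]


def jaccard_estimate(sig_a, sig_b, num_hashes=64):
    """Estimate the Jaccard similarity of two k-mer sets from their signatures."""
    # Take the smallest hashes of the combined signatures, then count how
    # many of those appear in both signatures.
    merged = sorted(set(sig_a) | set(sig_b))[:num_hashes]
    if not merged:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    genome_a = "ACGTACGTACGGATTACAGATTACA"
    genome_b = "ACGTACGTACGGATTACAGATTTTT"
    sig_a = minhash_signature(genome_a)
    sig_b = minhash_signature(genome_b)
    print(f"Estimated similarity: {jaccard_estimate(sig_a, sig_b):.2f}")
```

Because each signature has a fixed maximum size regardless of genome length, comparisons stay cheap even across very large collections, which is what makes the approach attractive for database-scale searches.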

Ed DeLong of the University of Hawaii at Manoa aims to develop a global catalog of microbial small RNAs (sRNAs)—highly structured, non-coding RNA molecules, 50 to 500 bases in length—from the publicly available metatranscriptomes and metagenomes in IMG/M, as well as data sets generated from a two-year time-series study by his own lab. Microbial sRNAs function as regulators of metabolic processes, and many are currently known to be involved in environmentally significant processes.

Steve Hallam of Canada’s University of British Columbia aims to reconstruct modular pathways mediating core biogeochemical cycles such as carbon, nitrogen, sulfur and iron. His team has already been able to map out a subset of phylogenetic reference trees for carbon and nitrogen metabolic pathways on their own, but they want to develop a scalable process for charting global biogeochemical cycles using fast phylogenetic mapping of functional anchor genes from the publicly available metatranscriptomes and metagenomes in IMG/M. This work will provide a community-driven framework in which to reconstruct the interconnected network of microbially mediated biogeochemical cycles with quantitative taxonomic resolution and inform modeling efforts to predict microbial community responses to environmental perturbation.

Kostas Konstantinidis of Georgia Institute of Technology wants to develop new approaches to analyze soil microbial communities at the individual population level. The team will start by sequencing, assembling and binning genome populations from permafrost soil metagenomes, using samples collected at the Carbon in Permafrost Experimental Heating Research (CiPEHR) site near Alaska’s Denali National Park, and then compare the resulting population data with metagenomes in the IMG/M database. They hope to assess which populations are widespread within specific soil ecosystems, as well as which gene functions are differentially abundant and thus selected by different ecosystems, and to establish the widespread populations as model organisms for studying carbon cycling within the corresponding ecosystems.

According to Fagnan, the diversity of NERSC’s science workload, which ranges from cosmology to nanoscience, makes this facility an ideal partner for this FICUS initiative. Additionally, NERSC has a long history of collaborating with the DOE JGI, so the staff is very familiar with the DOE JGI’s data and computational needs. Through a memorandum of understanding revised in 2011, NERSC has been providing high-performance and high-throughput computing support for the DOE JGI, and currently stores all of the Institute’s data. The two facilities have also worked closely over the years to get JGI’s sequence data processing and integration pipelines (as well as data processing tasks such as sequencing quality control for base calling, detection of contamination, sequence alignment and assembly, and gene prediction) to run efficiently on the center’s supercomputers. Most recently, the DOE JGI has been able to dramatically improve the performance of metagenome assembly by leveraging Cori’s burst buffer resource.

“I really believe that the future of computing is going to be dominated by biology. The volumes of biological data that need to be synthesized, aggregated and interrogated will require supercomputers,” said Fagnan. “If you look at the data sets being generated and the questions that people have, you can see that researchers are going to have to combine different datasets—like genomics, metabolomics, protein crystal structures and potentially even brain scans and more—to find answers. This work cannot be done on a laptop or small cluster.”

The full list of approved projects is available from JGI.

The U.S. Department of Energy Joint Genome Institute (DOE JGI) works hand in hand with the National Energy Research Scientific Computing Center (NERSC) and the Energy Sciences Network (ESnet) to advance the frontiers of science. These three unique resources reside at Lawrence Berkeley National Laboratory and are known as National User Facilities.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.