How do we know who’s in our world? Genomics researchers study and map the genomes of all kinds of genetic entities, from the blue whale all the way down to mobile genetic elements (MGEs), which have genetic material but are not considered living. But many organisms and MGEs can’t be isolated and cultured in the lab, so researchers use metagenomics to gain access to their genomes.
Metagenomics is a method of identifying all the genetic entities in a particular environment, whether it’s the human gut or a corner of the Amazon rainforest. Researchers take comprehensive environmental samples and analyze the genetic profile of the entire sample at once—everything found in the soil, the water, and whatever else makes up the picture. (“Most of the samples come from poop,” one researcher noted.) Reference databases of known genomes help them identify and sort what they find.
Researchers in Berkeley Lab’s Applied Mathematics and Computational Research Division (AMCR) and at the Joint Genome Institute (JGI), with the help of supercomputers at the National Energy Research Scientific Computing Center (NERSC) and support from the Department of Energy’s Exabiome Exascale Computing Project, have developed new tools to advance the field of metagenomics and expand scientists’ understanding of our world’s biodiversity.
Shining Light on Microbial Dark Matter
One tool created by Berkeley Lab researchers poses the question: How do we know what we don’t know?
In metagenomics research, scientists compare data from environmental samples to protein families with which they’re already familiar. But only a small portion of the data actually matches up with those known protein families. So what about the rest – the so-called microbial functional dark matter, which has no matches in any database and comprises as much as 93% of proteins in the average experimental sample?
Previously, this microbial functional dark matter was simply discarded from the data as unknown unknowns. But researchers at JGI used AI clustering algorithms developed by AMCR researchers and running on NERSC’s Cori supercomputer to group and identify billions of protein sequences, vastly increasing the number of protein sequences recognized by science. Their work was published in Nature this month.
The project analyzed unknown proteins from 26,931 metagenomic datasets, having already subtracted the genes with matches to known proteins in a database of over 100,000 reference genomes. The researchers identified 1.17 billion protein sequences longer than 35 amino acids with no previously known match and organized them into 106,198 sequence clusters of more than 100 members each. They then annotated the families according to their taxonomic, habitat, geographic, and gene neighborhood distributions and, in some cases, predicted their 3D structures. Organized this way, the data lets researchers begin to understand what each protein family’s function might be across species.
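As a rough illustration of the cutoffs described above – and nothing more – the sketch below applies the two thresholds mentioned in the text, with the similarity-based clustering step replaced by a trivial stand-in. It is not the actual Exabiome pipeline; the real clustering step is the subject of the next paragraphs.

```python
# Toy sketch of the thresholds described above, not the actual Exabiome
# pipeline: drop sequences of 35 amino acids or shorter, cluster the rest
# (here with a trivial stand-in), and keep clusters with more than 100 members.

from collections import defaultdict

MIN_LENGTH = 36         # "longer than 35 amino acids"
MIN_CLUSTER_SIZE = 101  # "more than 100 members each"

def stand_in_cluster(sequences):
    """Placeholder for a real similarity-based clustering step (see HipMCL
    below); here sequences are simply grouped by their first five residues."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[seq[:5]].append(seq)
    return list(groups.values())

def candidate_families(unknown_proteins):
    """Apply the length and cluster-size cutoffs described in the text."""
    long_enough = [p for p in unknown_proteins if len(p) >= MIN_LENGTH]
    clusters = stand_in_cluster(long_enough)
    return [c for c in clusters if len(c) >= MIN_CLUSTER_SIZE]
```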
“The vision here is to expand the known set of proteins and their functionality,” said Aydin Buluç, a senior scientist in AMCR, who worked on the project. “There’s this whole set of new data that has a lot of members across the universe of life, and it must be doing something important because it exists in hundreds of different species. And then an experimentalist can go and try to see what it might be doing.”
Previously, no tools existed to analyze datasets of this size. To make it happen, Buluç, AMCR research scientist Oguz Selvitopi, and Berkeley Lab affiliate research scientist Ariful Azad used HipMCL – a massively parallel implementation of the Markov Clustering (MCL) algorithm – which the team had previously developed for the Exascale Computing Project’s Exabiome program. Using 2,500 compute nodes – about a quarter of the Cori system – they took advantage of HipMCL’s ability to run on distributed-memory computers and crunch these previously uncrunchable numbers.
“Usually high-quality clustering algorithms like Markov clustering are used to discover these protein families,” said Buluç. “But when we started this project, there was no way to use this Markov cluster algorithm for a dataset of this size – we’re talking about a billion proteins and tens of billions of edges. So we developed a high-performance version that can run at NERSC scale.”
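For readers curious what Markov Clustering actually does, here is a minimal, single-node NumPy sketch of the standard MCL loop (expansion, inflation, renormalization) on a toy graph. It only illustrates the algorithm HipMCL parallelizes, under simplifying assumptions (a small dense matrix and a naive cluster-extraction step); it is not HipMCL’s code.

```python
# Minimal single-node sketch of Markov Clustering (MCL) using NumPy.
# HipMCL is a distributed-memory, sparse-matrix version of this idea that
# scales to billions of proteins and tens of billions of edges; this toy
# version only illustrates the expansion/inflation loop it parallelizes.

import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, iterations=100, tol=1e-6):
    """Cluster a graph given by a symmetric adjacency/similarity matrix."""
    m = adjacency.astype(float) + np.eye(len(adjacency))  # add self-loops
    m /= m.sum(axis=0)                                     # column-stochastic

    for _ in range(iterations):
        prev = m.copy()
        m = np.linalg.matrix_power(m, expansion)  # expansion: flow spreads
        m = m ** inflation                        # inflation: strong flow wins
        m /= m.sum(axis=0)                        # re-normalize columns
        if np.abs(m - prev).max() < tol:
            break

    # Read clusters off the converged matrix: each surviving ("attractor")
    # row lists the nodes it has attracted, which form one cluster.
    clusters = []
    for row in m:
        members = set(np.flatnonzero(row > 0.01))
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Two triangles joined by a single edge separate into two clusters.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
a = np.zeros((6, 6))
for i, j in edges:
    a[i, j] = a[j, i] = 1
print(mcl(a))
```

HipMCL’s contribution is making this iteration tractable at scale by distributing the underlying sparse matrix operations across thousands of compute nodes.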
NERSC-scale performance, it turns out, marries scientists’ previously competing needs for speed and quality, said Selvitopi, offering them the best of both worlds.
“These kinds of analyses require immense computational resources,” said Selvitopi. “In such situations, biologists often rely on fast but low-quality methods or wait weeks to get results of high-quality analyses, both of which are undesirable. With our tool, both of these problems are solved: a high-quality clustering approach that can generate results within a few hours for the largest datasets.”
Identifying MGEs
Another new metagenomics tool developed using NERSC supercomputers is geNomad, a computational framework for identifying and classifying MGEs in metagenomics data sets. MGEs are “selfish” genetic entities—that is, their only goal is to replicate themselves—and yet they are unable to self-replicate without host cells. They are found in virtually all ecosystems and include viruses and small genetic molecules called plasmids, which are found within cells separate from chromosomal DNA.
In recent years, the number of viruses identified through metagenomic analysis has exploded, but JGI project scientist Antonio Pedro Camargo said plasmid identification has moved more slowly. Plasmids are hard to pick out of the genetic soup of environmental samples, and the tools to identify them haven’t been as accurate. Using geNomad, researchers analyzed over 2.7 trillion base pairs of sequencing data, leading to the discovery of millions of MGEs – a leap forward especially for research on plasmids. A paper on the subject was published in Nature in September 2023.
“I hope this will start an era where people will look into plasmids in metagenomic data because we know the importance of plasmids in some contexts,” said Camargo of the work. “Plasmids spread antibiotic resistance genes and virulence genes that make a bacteria that is non-pathogenic into a pathogenic bacteria. So we know some stuff that plasmids can do in some types of bacteria, but we don’t know the full diversity of plasmids in nature and the other things they might be involved in.”
Camargo used two systems at NERSC, Perlmutter and Cori, to develop geNomad. He took advantage of Cori’s many CPUs to process large amounts of sequence data and to perform protein clustering – taking proteins from many different organisms and sorting similar ones into groups. He also used a small GPU testbed that had been incorporated into Cori to train some models – but when it came time to train geNomad’s neural network models, Camargo switched to Perlmutter and used its many powerful GPUs to do the primary training work.
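To make that division of labor concrete, here is a deliberately toy sketch – not geNomad’s actual features or models – in which nucleotide sequences are reduced to k-mer frequency vectors and a tiny plain-NumPy logistic-regression classifier is trained on them. The batched matrix arithmetic in the training loop is the kind of work that moves naturally from CPUs to GPUs like Perlmutter’s; the labels, features, and model here are invented purely for illustration.

```python
# Hypothetical illustration of the GPU-friendly stage: turn DNA sequences
# into fixed-length k-mer frequency vectors and train a tiny classifier.
# geNomad's real models are far more sophisticated; this only shows why
# model training maps well to batched matrix math.

import itertools
import numpy as np

K = 3
KMERS = {"".join(k): i for i, k in enumerate(itertools.product("ACGT", repeat=K))}

def kmer_features(seq):
    """Count 3-mer frequencies in a DNA sequence (normalized)."""
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMERS.get(seq[i:i + K])
        if idx is not None:
            v[idx] += 1
    return v / max(v.sum(), 1)

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Plain-NumPy logistic regression: the batched matrix products here are
    exactly the kind of work GPUs accelerate at scale."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))   # predicted probabilities
        grad = features.T @ (p - labels) / len(labels)  # gradient of the loss
        w -= lr * grad
        b -= lr * np.mean(p - labels)
    return w, b

# Made-up example: label 1 for "plasmid-like", 0 for "chromosome-like".
seqs = ["ACGTACGTAC", "ACGTTACGTA", "GGGGCCCCGG", "GGCCGGCCGG"]
labels = np.array([1, 1, 0, 0])
X = np.vstack([kmer_features(s) for s in seqs])
w, b = train_logistic(X, labels)
print((1.0 / (1.0 + np.exp(-(X @ w + b)))).round(2))
```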
Intensive computing resources went into the production of geNomad – but when it came time to release it to the world, everyday use was top of mind. Camargo intentionally built the tool to be accessible to a broad range of researchers: it runs on personal computers, and a companion web app makes it reachable from anywhere, even for scientists who prefer not to use the command line. That ease of use, he said, was a key part of his process.
“One thing that makes me very excited is developing tools that are easy for people to use,” said Camargo. “We have a web app for people who don’t want to use the command line, but even with the command line version, I put a lot of work into making it as friendly as it could be. And it’s interesting to me that something people can download onto their computers isn’t even that heavy. To do the protein clustering, I had to use hundreds of nodes – but now people can run that on their laptops. I think it’s really cool.”
Overall, both geNomad and the illumination of so much microbial dark matter represent a significant advancement in using HPC as a tool for biologists, said Selvitopi:
“In the bigger computational picture, I think HPC is starting to show itself to be an effective medium for tackling computationally challenging problems and enabling discovery in life sciences.”
About Computing Sciences at Berkeley Lab
High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab's Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.