NERSC Summer Student Enhances Genomic Sequencing Alignment on GPU Architectures

August 29, 2022

By Kathy Kincade
Contact: cscomms@lbl.gov

This was LeAnn Lindsey’s second summer interning at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (Berkeley Lab) through the Computing Sciences Summer Student program, and it was a productive one. The third-year computer science Ph.D. student from the University of Utah leveraged lessons learned from last summer to develop and apply a GPU-accelerated computer code that addresses a challenging problem in biological sequencing alignment.

Sequence alignment is used in many bioinformatics software applications. It is a method of aligning biological sequences (DNA, RNA, or protein) to understand their function and structure. This reveals important information about these biological units that allows us to develop new treatments, study climate change, and optimize crop production, among many other vital applications.

In preparation for pre-exascale and exascale supercomputing systems, NERSC developed a GPU-accelerated sequence alignment library, and Lindsey’s task this summer was to add a “traceback” capability to this library to enhance performance. Typically in a sequence alignment algorithm, all the possible alignments are computed and then the optimal alignment is “traced back” in the reverse direction, an operation that is difficult to run efficiently on GPUs.

“Traceback in sequence alignment is quite complicated, and this is further compounded when we consider limitations and challenges of GPU computing,” said Muaaz Gul Awan, an application performance specialist at NERSC and Lindsey’s mentor both this summer and last. “The fundamental sequence alignment algorithm [originally written in the 1980s] is everything a GPU is not optimized for, so developing a performant traceback algorithm for GPUs was a real challenge.”
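To make the idea of traceback concrete, here is a minimal CPU-only sketch in Python of a classic local alignment in the style of Smith-Waterman. It is an illustration only, not the NERSC GPU kernel or Lindsey's implementation. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example. The first loop fills in the scores for every possible alignment; the second walks backward from the best-scoring cell to recover the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with traceback (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Forward pass: score every cell of the alignment matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: walk in reverse from the best cell until the score hits 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    # The walk collected characters end-to-start, so reverse them.
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("ACGT", "ACGT"))
```

Each step of the backward walk depends on the cell chosen in the previous step, which is one reason this kind of sequential, data-dependent traversal maps poorly onto massively parallel hardware.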

Integrating Code into MetaHipMer

During her first internship at NERSC, Lindsey wrote the traceback kernel, focusing on making sure it was correct before turning to its performance. Her focus this year was to integrate it into MetaHipMer, a widely used software package for constructing genomes from a collection of short fragments of DNA where the correct order of the fragments is not known ahead of time and the fragments belong to many different genomes. MetaHipMer was developed by the ExaBiome team (led by Berkeley Lab), and it is one of the fastest algorithms that exists for metagenomic sequence alignment. Metagenomics refers to the study of genomes from environmental samples containing many organisms.

“Our goal was to make MetaHipMer run faster on the Perlmutter GPU architecture because traditional assemblers on small-scale systems can take as long as two weeks to finish,” Lindsey said. “There is so much data out there, if you want to study really large datasets like a cancer dataset or an environmental dataset that is terabytes large, you are going to need something that runs very, very fast.”

What traceback provides is the exact sequence alignment between two sequences, and that can be used to create multi-sequence alignment for many different applications, she added.

“Over the two summers that LeAnn has worked with us, she not only learned GPU programming but also came up with several novel ideas for tackling this problem,” Awan said. “Her current implementation is outperforming state-of-the-art traceback codes by about 10x!”

Now that her code has been adopted into MetaHipMer, it will help the wider bioinformatics community process ever-larger datasets on modern supercomputers by making the sequencing software run faster and more efficiently on GPUs. In the meantime, Lindsey is putting the finishing touches on a paper about this research and looking forward to continuing her collaboration with Awan over the coming year.

“Muaaz and I have a plan for two different projects that are related to this work,” she said. “For example, the work we did for this paper is for short reads, and we want to modify it for long-read aligners because they are becoming more utilized and they have a lot of memory challenges.”

The highlight of this summer, she added, “was presenting my work to the ExaBiome group. It was so exciting to have integrated my kernel into their software and present to them about this.”

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.