Berkeley Lab’s ENDURABLE: An Aggregate Data Standard for AI Modeling

August 31, 2020

Berkeley Lab's Computational Biosciences Group. Back Row, Left to Right: Chris Mungall, Oliver Rübel, Hans Johansen, Peter Zwart, and Andrew Tritt. | Front Row, Left to Right: Héctor García Martin, Ben Brown, Talita Perciano, Kristofer Bouchard, and Aydın Buluç. (Photo by Margie Wylie, Berkeley Lab)

This month, the Department of Energy (DOE) announced $8.5 million for projects to make artificial intelligence (AI) models and data more accessible and reusable to accelerate exploration in AI research and development. One of these newly funded projects is “ENDURABLE: Benchmark datasets for AI with queryable metadata,” spearheaded by Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) Computational Research Division (CRD).

The focus of this DOE funding is to apply the Findable, Accessible, Interoperable, and Reusable (FAIR) Data Principles so that science data can drive innovations in AI. The FAIR principles were originally proposed and endorsed in 2016 by an international collaboration of universities, industry, funding agencies, and scholarly publishers.

“With ENDURABLE, we aim to provide the scientific community with tools to aggregate data robustly and train our deep learning models,” said Kristofer Bouchard, Berkeley Lab research scientist and acting Computational Biosciences Group Lead in CRD.

According to Bouchard, much of the data currently aggregated in science, or combined from a diverse collection of scientific studies, is in a form that machine learning algorithms can’t effectively use. Because there is no standard way of describing the data for these algorithms, researchers cannot see what’s driving the machine learning results.

“When you are doing any sort of data science, the first battle is formatting data, structuring data, and cleaning it up. This can be a complex process, and at this stage everyone could be doing something different. This means that the input into machine learning algorithms is going to be different depending on how the data is initially processed and packaged,” said Andrew Tritt, Berkeley Lab Data Engineer. “Our hope for ENDURABLE is to give researchers a common starting-off point when they put their data into a machine learning algorithm.”

Tritt notes that, as part of the project, the Computational Biosciences Group will design a data standard so that different types of complex data from various sources can be aligned and served to neural network training programs and machine learning models. The design will also build on Berkeley Lab’s efforts for other DOE-funded projects, like the Exabiome project, as well as the National Institutes of Health (NIH)-funded BRAIN Initiative project Neurodata Without Borders (NWB), which won an R&D 100 award in 2019. Initially, the team will work with genomic data from the Genome Taxonomy Database, but their ultimate goal is to build tools that will be useful for a variety of aggregated datasets, like those in the National Microbiome Data Collaborative.

“Berkeley Lab’s Computational Biosciences Group truly lives at the intersection of advanced computing techniques and tools to solve the real problems that bioscientists face every day. That may sound easy, but it’s actually really hard for a variety of both technical and sociological reasons,” said Bouchard. “What our group does is get the right people together so that when opportunities to take on some of these challenges arise, we are uniquely positioned to do so. ENDURABLE is a great example of that.”

The projects were chosen by competitive peer review under a DOE Funding Opportunity Announcement and a companion announcement for DOE laboratories, sponsored by the Office of Advanced Scientific Computing Research (ASCR) within DOE’s Office of Science.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.