
Data Management

We are engaged in research that improves our ability to collect, store, and present the scientific data needed to enable discoveries. The sources of these data include large-scale user facilities, distributed scientific experiments, and supercomputer simulations spanning many science domains. The users of the data need to process and analyze it in many settings: alongside their massive simulations, at major data centers dedicated to science, on their personal computers, or on cloud resources.

To serve this range of needs, we are developing a wide variety of high-performance algorithms and tools for general use, and we collaborate with science domains to develop specific tools for individual projects. Our work includes building data transformation and processing pipelines and developing data storage tools and techniques, advanced scientific workflow tools, I/O technologies, data indexing and searching, in situ feature-extraction algorithms, and software platforms. Building strong user-facing components for entering data into a pipeline and producing final data products, as well as tools for data movement and cybersecurity, is an essential element of our work.



ExaIO

The ExaIO project, part of the Exascale Computing Project (ECP), delivers the Hierarchical Data Format version 5 (HDF5) library and the UnifyFS file system to efficiently address exascale storage and I/O challenges. HDF5 improvements include asynchronous I/O, caching, and subfiling. The ExaIO team supports numerous ECP applications in achieving superior I/O performance with HDF5. Contact: Suren Byna (Byna on the Web)


FasTensor

FasTensor allows users to express their analysis as operations carried out on one element of a data array at a time. This functional approach frees users from the data management tasks that commonly take up a majority of the source code in an MPI-based data analysis program. FasTensor exploits the structural locality of multidimensional arrays to automate file operations, data partitioning, communication, parallel execution, and common data management operations. Our implementation takes advantage of the C++ programming language to execute complex user analysis operations efficiently on large computers. Contact: John Wu
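The element-at-a-time programming model can be sketched as follows. This is an illustrative Python sketch of the idea, not the real FasTensor C++ API: the user writes a function over one element's neighborhood, and the framework applies it everywhere, hiding partitioning, I/O, and communication.

```python
def apply_stencil(data, radius, user_fn):
    """Apply user_fn to each element's neighborhood of the given radius."""
    n = len(data)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(user_fn(data[lo:hi]))
    return out

# User-defined operation: a moving average over the neighborhood.
def moving_average(window):
    return sum(window) / len(window)

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = apply_stencil(signal, radius=1, user_fn=moving_average)
```

The user never touches file layout or communication; in FasTensor proper, the framework performs the equivalent partitioning and data movement across MPI processes.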

Data Reduction Scheme Based on Locally Exchangeable Measures

Scientific simulations and online streaming applications, such as power grid monitoring, generate data so quickly that compression is essential to reduce storage or transmission requirements. To achieve better compression, one is often willing to discard some information. Such lossy compression methods are primarily designed to minimize the Euclidean distance between the original data and the compressed data, but this measure of distance severely limits either reconstruction quality or compression performance. We propose a new class of compression methods that redefines the distance measure using a statistical concept known as exchangeability. This approach reduces the storage requirement while capturing essential features of the data. Contact: Alex Sim
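A toy sketch of the underlying idea (hypothetical, not the published algorithm): if the values inside a block are treated as exchangeable, their ordering carries no information, so it suffices to store a summary of the block's value distribution, here a few order statistics, rather than every value.

```python
def compress_block(block, n_quantiles=4):
    """Summarize a block by a few evenly spaced order statistics."""
    s = sorted(block)
    idx = [round(k * (len(s) - 1) / (n_quantiles - 1)) for k in range(n_quantiles)]
    return [s[i] for i in idx]

def decompress_block(quantiles, length):
    """Reconstruct a block of the requested length from stored quantiles."""
    return [quantiles[round(k * (len(quantiles) - 1) / (length - 1))]
            for k in range(length)]

block = [3, 1, 4, 1, 5, 9, 2, 6]
summary = compress_block(block)              # 4 numbers instead of 8
approx = decompress_block(summary, len(block))
```

The reconstruction preserves the distributional shape of the block rather than the element-by-element values, which is exactly what a Euclidean-distance objective cannot exploit.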

In-network Data Caching Strategies

The volume of data moving through a network increases with new scientific experiments and simulations, and network bandwidth requirements increase proportionally to deliver the data within a given timeframe. We observe that a significant portion of popular datasets is transferred multiple times, both to different users and to the same user. In-network caching of such shared data has been shown to reduce redundant transfers and consequently save network traffic volume. In addition, overall application performance is expected to improve with in-network caching, because access to locally cached data results in lower latency. Contact: Alex Sim
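The effect described above can be illustrated with a minimal sketch: an in-network LRU cache absorbs repeat requests for popular datasets, so only cache misses consume wide-area bandwidth. Dataset names and sizes here are invented for illustration.

```python
from collections import OrderedDict

class InNetworkCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()          # dataset name -> size in GB
        self.hits = self.misses = 0

    def fetch(self, name, size_gb):
        if name in self.store:
            self.store.move_to_end(name)    # refresh LRU position
            self.hits += 1
            return 0.0                      # served locally, no WAN traffic
        self.misses += 1
        self.store[name] = size_gb
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        return size_gb                      # WAN transfer required

cache = InNetworkCache(capacity=2)
requests = [("climate", 10.0), ("genome", 5.0), ("climate", 10.0), ("climate", 10.0)]
wan_traffic = sum(cache.fetch(n, s) for n, s in requests)
```

Without the cache these four requests would move 35 GB over the wide-area network; with it, only the two misses (15 GB) do, and the repeat accesses are also served at lower latency.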

Proactive Data Containers

Proactive Data Containers (PDC) software provides an object-centric API and a runtime system with a set of data object management services. These services automate data placement across the memory and storage hierarchy, perform data movement asynchronously, and provide scalable metadata operations for querying data and metadata. Contact: Suren Byna (Byna on the Web)
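To convey the flavor of an object-centric interface, here is a hypothetical Python sketch (the names are illustrative, not the actual PDC API): objects carry user-defined metadata tags, and a query service locates objects by tag so the application never tracks file paths itself.

```python
class ObjectStore:
    def __init__(self):
        self._objects = {}   # object name -> (metadata dict, data)

    def put(self, name, data, **metadata):
        """Store a data object together with user-defined metadata tags."""
        self._objects[name] = (metadata, data)

    def get(self, name):
        return self._objects[name][1]

    def query(self, **criteria):
        """Return names of objects whose metadata matches all criteria."""
        return [name for name, (meta, _) in self._objects.items()
                if all(meta.get(k) == v for k, v in criteria.items())]

store = ObjectStore()
store.put("run42/density", [0.1, 0.2], timestep=42, variable="density")
store.put("run42/energy", [7.5, 7.9], timestep=42, variable="energy")
matches = store.query(variable="density")
```

In PDC itself, placement across the memory and storage hierarchy and the actual data movement happen asynchronously behind such an interface.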


Modern scientific and engineering simulations track the time evolution of billions of elements. For such large runs, storing most time steps for later analysis is not a viable strategy; it is far more efficient to analyze the simulation data while it is still in memory. We present a novel design for running multiple codes in situ: using coroutines and position-independent executables, we enable cooperative multitasking between simulation and analysis, allowing the same executables to post-process simulation output as well as to process it on the fly, both in situ and in transit. Contact: Dmitriy Morozov
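The coroutine idea can be sketched with Python generators (a simplified analogy, not the project's C++ implementation): the simulation yields control and its in-memory data after each time step, and the analysis runs on the live data before the simulation resumes, so no intermediate files are written.

```python
def simulation(n_steps):
    """Yield control (and current state) to the analysis after each step."""
    state = 0.0
    for step in range(n_steps):
        state += 1.5                 # stand-in for a real solver update
        yield step, state            # hand the in-memory data to the analysis

def run_in_situ(n_steps):
    observed = []
    for step, state in simulation(n_steps):
        observed.append(state)       # analysis sees data still in memory
    return observed

results = run_in_situ(3)
```

Coroutines generalize this cooperative handoff to independently built executables, which is what lets the same analysis code run in situ, in transit, or as a post-processor.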


ENDURABLE: Benchmark Datasets and AI Models with Queryable Metadata

The scientific goal of ENDURABLE is to enable easily accessible, standardized data and metadata for building AI models to solve microbiome science problems. We are building on our Hierarchical Data Modeling Framework (HDMF), a state-of-the-art data standardization framework, to build benchmark datasets for microbiome data problems, and we are storing curated data from National Microbiome Data Collaborative (NMDC) datasets for AI research. These tools have been used in training deep learning models for taxonomic classification. The impact is to enable AI researchers to use massive data repositories for developing AI models that solve problems in microbiome science. Contact: Kristofer Bouchard

LinkML: Linked Data Modeling Language

LinkML is a flexible data modeling language that allows you to describe the structure of rich interconnected data and metadata, making use of ontologies and vocabularies. LinkML can be used in conjunction with JSON documents, tabular/spreadsheet data, relational databases, and linked data formats. It is in use across a number of projects in the life and environmental sciences, including the National Microbiome Data Collaborative (NMDC). Contact: Chris Mungall

Knowledge Graph Hub

KG-Hub is a platform that provides software development patterns for the standardized construction, exchange, and reuse of knowledge graphs. It also includes libraries, such as NEAT, for computing embeddings of knowledge graphs. Contact: Justin Reese

Neurodata Without Borders (NWB)

Neurodata Without Borders (NWB) is a data standard for neurophysiology, providing neuroscientists with a common mechanism to share, archive, and use neurophysiology data and to build common analysis tools for it. NWB is designed to store a variety of neurophysiology data, including data from intracellular and extracellular electrophysiology experiments, data from optical physiology experiments, and tracking and stimulus data. The project includes not only the NWB format but also a broad range of software for data standardization, e.g., the Hierarchical Data Modeling Framework (HDMF); application programming interfaces (PyNWB, MatNWB, NWBInspector) for reading, writing, and validating the data; tools for developing and sharing extensions to the NWB data standard (NDXCatalog); as well as high-value data sets that have been translated into the NWB data standard. Contact: Oliver Ruebel


RESTing: REST Interface Generator

RESTing is a REST interface generator: a Python-based module for simplifying client access to web servers built with the Django REST framework and PostgreSQL. The repository has four main components: (1) the "webserver" directory, the context for a Docker image containing the "apache" directory (Apache 2 configuration files), the "ssl" directory, and the "website" directory (Django website source code); (2) the "postgres" directory, the context for a Docker image containing the PostgreSQL configuration; (3) the "resting" directory, with the Python 3 module for simplifying client access to the web server; and (4) the "doc" directory, with Sphinx documentation for the code in this repository. This tool was used to deploy the covidscreen.lbl.gov database as part of LBNL's COVID-19 emergency response. Contacts: Dani Ushizima, Kenny Higa, Cory Snavely

The High Touch Project

High Touch (HT) is an effort to provide a high-fidelity packet capture service -- much like using a microscope to understand the details of network traffic. HT pushes the boundaries of programmable hardware and software to accelerate packet processing, allowing engineers to specify the traffic they want to bring into focus. We use data science not only to help guide the project's architectural decisions and software improvements but also to create data processing pipelines that provide analytical feedback to network and security engineers, or to researchers interested in the details of a data transfer (or set of transfers) or in applying machine learning and analysis to network data. If you ever wanted to use Python and Pandas to process packet data, HT provides the path. Contact: Michael Haberman
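As a sketch of the Python-and-Pandas workflow mentioned above, captured packet records can be loaded into a pandas DataFrame for analysis. The field names and values here are invented for illustration; a real pipeline would decode them from captured packet data.

```python
import pandas as pd

# Hypothetical per-packet records, as a pipeline might emit after decoding.
packets = pd.DataFrame([
    {"src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 1500, "proto": "tcp"},
    {"src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 1500, "proto": "tcp"},
    {"src": "10.0.0.2", "dst": "10.0.0.9", "bytes": 60,   "proto": "udp"},
])

# Total bytes per source host: a typical first question about a transfer.
per_source = packets.groupby("src")["bytes"].sum()
```

From here, the full pandas toolkit (filtering, time-window resampling, joins against flow logs) applies directly to the packet data.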


Berkeley Lab’s FasTensor Provides Pain-Free Big Data Analysis

January 26, 2022

With FasTensor, researchers in Berkeley Lab’s Scientific Data Management Group developed an open source tool to help its users efficiently process and analyze their massive datasets. Read More »

Perlmutter-Powered Deep-Learning Model Speeds Extreme Weather Predictions

November 29, 2021

Researchers from Berkeley Lab, Caltech, and NVIDIA trained the Fourier Neural Operator deep learning model to emulate atmospheric dynamics and provide high-fidelity extreme weather predictions across the globe a full five days in advance. Read More »

UniviStor: Next-Generation Data Storage for Heterogeneous HPC

April 1, 2019

The Proactive Data Containers project team at Berkeley Lab is developing object-oriented I/O services designed to simplify data movement, data management, and data reading services on next-generation HPC architectures. Read More »

Neurodata Without Borders Project Wins 2019 R&D100 Award

October 29, 2019

Led by Berkeley Lab in collaboration with the Allen Institute for Brain Science and multiple neuroscience labs, the NWB:N project has created a data standard and software ecosystem that is transforming neurophysiology research. Read More »

CRD's Ushizima to Discuss Using ML Algorithms to Screen Lung Images for COVID-19

October 15, 2020

On Friday, Oct. 16, Berkeley Lab scientist Daniela Ushizima will discuss early results of using computer vision algorithms to scan medical images of lungs and automatically identify lesions that could indicate COVID-19 at the 2020 Annual Meeting of the Academic Data Science Association. Read More »

Accelerating COVID-19 Triage and Screening

September 17, 2020

Berkeley Lab researcher Dani Ushizima and her team are working to amplify COVID-19 testing and surveillance by exploring a diverse set of data, including vitals from patients with suspected COVID-19 infection. Read More »