Data Repositories and Sharing
Data sharing is a critical component of Berkeley Lab’s commitment to supporting open science. Over the long term, data must be preserved and made generally available in a sustainable manner to support reproducible research. Research data must also be broadly accessible: user-friendly web interfaces make it possible for a much broader class of users to access and interact with scientific data.
Computing Sciences at Berkeley Lab develops and supports a number of data portals and repositories, along with supporting infrastructure, that serve open research data based on the FAIR (Findable, Accessible, Interoperable, Reusable) principles. Well-managed, persistent data publication platforms are vitally important to research data: they support long-term stewardship of the data, broadly enable data reuse, validate existing conclusions, improve data standardization, and support reproducible research. We work with the scientific user community to enable Science Gateways - online interfaces that give researchers, educators, and students easy access to shared resources specific to a science discipline.
Berkeley Lab operates NERSC, one of the key data centers of the DOE community, as well as the Energy Sciences Network, which together enable a robust and scalable backbone for deploying these services through flexible service platforms, large-scale data storage solutions, and high-performance networking.
DOE’s Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) is a data repository for Earth and environmental science data. ESS-DIVE enables the scientific community to archive and manage critical environmental data around consistent standards and protocols, and it seeks to expand access to and use of data generated by DOE-funded research. ESS-DIVE emphasizes the FAIR data principles, supporting the Findability, Accessibility, Interoperability, and Reusability of data. ESS-DIVE allows data contributors to archive, manage, and share various types of data in consistent formats and to obtain digital object identifiers that can be used to cite and track usage of the data. ESS-DIVE users are able to find and obtain data generated by ESS researchers that is organized for better interpretation, analysis, and integration. ESS-DIVE is designed as a scalable framework that incentivizes data providers to contribute well-structured, high-quality data to the archive and that enables the user community to easily build data processing, synthesis, and analysis capabilities using those data. Contact: Shreyas Cholia
- AmeriFlux datasets provide the crucial linkage between organisms, ecosystems, and process-scale studies at climate-relevant scales of landscapes, regions, and continents, which can be incorporated into biogeochemical and climate models. When viewed as a whole, the network observations enable scaling of trace gas fluxes (CO2, water vapor) across a broad spectrum of times (hours, days, seasons, years, and decades) and space. AmeriFlux observations have been instrumental in defining the relationships between environmental drivers and the responses of whole ecosystems, which can be spatialized using machine learning methods like neural networks or genetic algorithms informed by remote sensing products. The AmeriFlux Network Management Project funds core AmeriFlux sites and establishes data management and data QA/QC processes for those sites.
- FLUXNET is a global network of over 400 carbon flux measurement sensor towers that provide a wealth of long-term carbon, water, and energy flux data and metadata. The data from these towers are critical to understanding climate change through cross-site, regional, ecosystem, and global-scale analyses. During this project, we developed a data server to support collaborative analysis of carbon-climate data. That fluxnet.fluxdata.org data server now contains global carbon flux data in two major releases. The La Thuile dataset, released in 2007, has been used by several hundred research teams. The FLUXNET2015 dataset was released at the end of 2015 and will continue to be updated incrementally over the course of 2016. These datasets support the global FLUXNET synthesis activity.
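As a sketch of the kind of cross-site analysis these flux datasets enable, the snippet below aggregates half-hourly CO2 flux observations into daily means. The records, field layout, and units here are illustrative only; the actual FLUXNET release formats use their own variable names and quality-control flags.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical half-hourly records: (timestamp, CO2 flux in umol m^-2 s^-1).
# Negative values indicate net uptake by the ecosystem.
records = [
    ("2015-07-01 00:00", 1.8),
    ("2015-07-01 12:00", -12.4),
    ("2015-07-02 00:30", 2.1),
    ("2015-07-02 13:00", -10.9),
]

def daily_mean_flux(rows):
    """Group flux observations by calendar day and average them."""
    by_day = defaultdict(list)
    for stamp, flux in rows:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        by_day[day].append(flux)
    return {day: mean(vals) for day, vals in sorted(by_day.items())}

for day, flux in daily_mean_flux(records).items():
    print(day, round(flux, 2))
```

Real synthesis analyses run the same grouping idea over decades of half-hourly data and hundreds of sites, which is why a shared, versioned data server matters.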
Harnessing the power of supercomputing and state-of-the-art electronic structure methods, the Materials Project provides open web-based access to computed information on known and predicted materials as well as powerful analysis tools to inspire and design novel materials. The Materials Project is a multi-institution, multi-national effort to compute the properties of all inorganic materials and provide the data and associated analysis algorithms for every materials researcher free of charge. The ultimate goal of the initiative is to drastically reduce the time needed to invent new materials by focusing costly and time-consuming experiments on compounds that show the most promise computationally. Contact: Shreyas Cholia
The long-term vision of the NMDC is to support microbiome data exploration through a sustainable data discovery platform that promotes open science and shared ownership across a broad and diverse community of researchers, funders, publishers, societies, and other collaborators. To support this shared long-term vision of microbiome science, the NMDC mission is to work with the community to iteratively develop an integrated, open source microbiome science gateway that leverages existing resources and enables comprehensive access to multidisciplinary microbiome data and standardized, reproducible data products. Contact: Shreyas Cholia
The MyNERSC Portal provides a user-friendly web interface to enable NERSC users to manage their compute jobs and data at NERSC. We highlight some examples here to illustrate the power of this approach. (Dashboard and Tools require NERSC login.)
Nothing ruins a NERSC user’s day like running out of file storage space. After waiting in a queue for hours or days, a job can fail simply because a shared storage area has filled up, and that space must be managed across multiple collaborators. Traditional tools report individual usage, but project teams need a higher-level view. The Data Dashboard enables them to visualize team usage of project storage and readily identify opportunities to delete or archive data, distilling billions of lines of filesystem information into sensible reports.
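A minimal sketch of the underlying idea (not the Data Dashboard's actual implementation, which aggregates filesystem scans at much larger scale): walk a project directory and total bytes per file owner, so a team can see at a glance who holds the space.

```python
import os
import pwd
from collections import defaultdict

def usage_by_owner(root):
    """Walk `root` and sum file sizes per owning user, in bytes."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: don't follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            owner = pwd.getpwuid(st.st_uid).pw_name
            totals[owner] += st.st_size
    return dict(totals)
```

A production tool would read pre-computed filesystem scan output rather than walking billions of files live, but the per-owner aggregation is the same.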
Managing file permissions across a large scientific collaboration can get hairy. As collaborators come and go, project files can become inaccessible or unusable by those who need them, and project leaders need staff assistance to rectify such issues. NERSC’s PI Toolbox provides controls for principal investigators to change permissions and ownership on files and directories without the delay of obtaining staff help.
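The kind of fix the PI Toolbox automates can be sketched as follows; this is an illustrative example, not the Toolbox's actual code, and note that changing file *ownership* (as opposed to permissions) generally requires elevated privileges, which is exactly why such tasks previously needed staff help.

```python
import os
import stat

def grant_group_read(root):
    """Recursively add group read access under `root`.

    Directories also get group execute, which is required
    for group members to traverse into them.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            p = os.path.join(dirpath, d)
            mode = os.stat(p).st_mode
            os.chmod(p, mode | stat.S_IRGRP | stat.S_IXGRP)
        for f in filenames:
            p = os.path.join(dirpath, f)
            mode = os.stat(p).st_mode
            os.chmod(p, mode | stat.S_IRGRP)
    # Fix the root directory itself as well.
    mode = os.stat(root).st_mode
    os.chmod(root, mode | stat.S_IRGRP | stat.S_IXGRP)
```

Run by a project member over a shared directory, this restores read access for teammates locked out by overly restrictive modes.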
Sharing large volumes of data with collaborators can be very challenging. NERSC uses Globus to provide Shared Collections, places where scientists can share large volumes of data with the wider world or with a select few. Data transfers from these endpoints use highly efficient methods to quickly transfer large volumes of data. Contact: Lisa Gerhardt
Spin is a container-based platform at NERSC designed to deploy your science gateways, workflow managers, databases, API endpoints, and other network services to support scientific projects. Services in Spin are built with Docker containers and can easily access NERSC systems and storage. NERSC users are able to share the data and tools that enable science through the center’s science gateways. Contacts: Cory Snavely, Annette Greiner, Daniel Fulton
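Because Spin services are built from ordinary container images, a small gateway service starts from a Dockerfile. The sketch below is illustrative, not Spin-specific; the file paths, port, and entry point are hypothetical.

```dockerfile
# Hypothetical image for a small Python-based gateway service.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ ./app/
# Run as an unprivileged user, a common requirement on shared platforms.
USER 1000
EXPOSE 8080
CMD ["python", "-m", "app.server"]
```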
CRIC-database is an instance of a software stack for constructing searchable image databases, or “searchable catalogs.” It includes a user interface for browsing cell collections with many classified examples and enables image classification and segmentation of new samples. Its most popular set of microscopic images, the CRIC CERVIX collection, contains over a thousand normal and abnormal images spanning the seven lesion types defined by the Bethesda System, part of an initiative to improve women's health and early cancer diagnosis. Contact: Dani Ushizima
As we face the threat of more frequent and more severe wildfires under climate change, Berkeley Lab researchers are building a managed data platform to help scientists better understand and predict wildland fire hazards, risks, and behaviors. Read More »
When the Dark Energy Science Collaboration needed a fast, secure, and convenient way to share large data sets with scientists outside of the collaboration, they turned to Berkeley Lab Computing Sciences for help, ultimately choosing an innovative data-sharing solution: the Modern Research Data Portal. Read More »
Developed by Berkeley Lab researchers, ESS-DIVE is a new digital archive that serves as a repository for hundreds of U.S. Department of Energy-funded research projects under the agency’s Environmental System Science umbrella. Read More »
Scientific American featured a Computing Sciences-powered project on its December cover as a top world-changing idea of 2013. The Materials Project aims to take the guesswork out of finding the best material for a job—be it a new battery electrode or a lightweight spacecraft body—by making the characteristics of every inorganic compound available to any interested scientist. Berkeley Lab Computing Sciences’ talent and resources combined to help grow this promising project into the world-changing idea it is today. Read More »
A computer code (Project Jupyter) co-developed by Berkeley Lab's Fernando Perez and embraced by the global science community over two decades has been hailed by Nature Magazine as one of “ten computer codes that transformed science.” Read More »