Berkeley Lab Cybersecurity Specialist Highlights Data Sharing Benefits, Challenges at NAS Meeting
December 4, 2018
There are many reasons why data isn’t widely shared by the organizations that collect it or among the scientists who analyze it in search of new insights. Those barriers, and ways to overcome them in a world of data-driven science, were the subject of a recent meeting of the Committee on Science, Engineering, Medicine, and Public Policy (COSEMPUP), a joint unit of the National Academy of Sciences, the National Academy of Engineering, and the National Academy of Medicine. The meeting took place on November 8, 2018, in Washington, D.C.
Science focuses on building knowledge about the universe through a combination of observation and experimentation. While some science can be done using mathematics and simulations, eventually science needs to be tested and confirmed in the real world. This requires data. And at a time when scientists are asking increasingly important and fundamental questions about the universe, it also requires unique and expensive instruments that generate very large amounts of data that need to be shared.
The COSEMPUP meeting was far from the first to address this essential activity, but it brought together an unusual mix of scientific experts to address the needs, barriers, and incentives involved in making progress.
The core topic of the meeting was data sharing in biomedical science, so presentations focused on current challenges in biomedical data sharing and on identifying potential solutions from other domains that could be applied, noted Sean Peisert, a leading cybersecurity researcher at Lawrence Berkeley National Laboratory and an invited speaker at the meeting. He discussed how strategically combining computer security and privacy-preserving techniques can overcome certain data-sharing barriers and can facilitate, enhance, and create incentives for increased data sharing in the sciences, thereby accelerating data-driven scientific discovery.
In particular, Peisert described how using varying combinations of current and future hardware and software techniques could help meet or exceed standards for data subject to government regulations, such as HIPAA or FISMA; address concerns regarding unregulated scientific data still containing individually private information; and provide solutions for proprietary data that might contain trade secrets.
At present, there are typically five solutions for data sharing that are used independently or in combinations, Peisert noted:
- We often don’t share data at all, which is a huge inhibitor to scientific research.
- We require people using data to come to the data, rather than letting them work with it in their own computing environment. This often doesn’t scale when analysis takes a long time and keeping many scientists at a particular remote facility for long periods is onerous.
- We put legal protections in place.
- We put elaborate security protections in place, such as “air gaps,” which disconnect computing systems from all computer networks so that data on those systems cannot leak accidentally or be stolen maliciously.
- We transform data, e.g., by redacting or “fuzzing” it, so that it no longer presents as significant a risk if it falls into the wrong hands.
“But all these solutions have downsides, for various reasons, ranging from data not being shared at all to data being very difficult to use, to data losing significant research utility,” he said.
Fortunately, advances in computing technology and techniques can reduce the barriers to developing trustworthy data-sharing solutions. These advances include:
- Hardware-trusted execution environments, something that all of the major chip-makers have now deployed
- Software solutions for computing over encrypted data, which are no longer a trillion times too slow, as they were just a few years ago
- Differential privacy, a statistical technique for provably providing privacy guarantees while explicitly balancing research utility
- “Smart contract” components in blockchain technologies.
These techniques also increase trust in security and privacy and create positive incentives for data sharing. In addition, they enable stronger tracking of data generation and use, which allows data providers to be compensated for sharing useful data, Peisert emphasized.
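To make the differential privacy item in the list above more concrete, here is a minimal sketch of one common approach, the Laplace mechanism applied to a counting query. It is an illustration only and is not drawn from Peisert's talk: the function names `laplace_noise` and `private_count`, the sample ages, and the chosen `epsilon` are all hypothetical. A smaller `epsilon` adds more noise (stronger privacy, less research utility), and a larger one adds less.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of study participants (illustrative only).
ages = [34, 51, 67, 45, 72, 29, 58, 63]

noisy = private_count(ages, lambda age: age >= 60, epsilon=0.5)
print(f"Noisy count of participants aged 60 or older: {noisy:.2f}")
```

Each call returns a different noisy answer, so an analyst sees a value close to the true count without being able to tell whether any one individual is in the data set. A real deployment would also need to track the total privacy budget spent across repeated queries, which this sketch leaves out.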
In addition to Peisert, presenters at the meeting included:
- Carrie Wolinetz, Director of the Office of Science Policy at the National Institutes of Health, who outlined many of the key challenges to data sharing in the biomedical sciences, along with the important ways in which scientific progress and human health stand to benefit from data sharing done well.
- J. Michael Gaziano, professor at Harvard Medical School and principal investigator of the Department of Veterans Affairs’ Million Veteran Program (MVP), who presented key details of the MVP’s architecture and goals and the opportunities behind it.
- Bradley Malin, professor at Vanderbilt, who discussed the National Institutes of Health “All of Us” program, along with his own experiences with both technical and policy-related data sharing incentives.
- Alexander Szalay, professor at Johns Hopkins University, architect for the Science Archive of the Sloan Digital Sky Survey and former director of the National Virtual Observatory, who spoke about the extensive and careful curation and integration of data and tools that his organization has done to adhere to the “FAIR” (“Findable, Accessible, Interoperable and Reusable”) principles of data sharing in the SDSS and NVO.
- Beth Willman, deputy director, National Center for Optical-Infrared Astronomy, Association of Universities for Research in Astronomy, who gave the concluding presentation, in which she described the extensive approach being used by the Large Synoptic Survey Telescope.
About Computing Sciences at Berkeley Lab
The Lawrence Berkeley National Laboratory (Berkeley Lab) Computing Sciences Area provides the computing and networking resources and expertise critical to advancing Department of Energy Office of Science (DOE-SC) research missions: developing new energy sources, improving energy efficiency, developing new materials, and increasing our understanding of ourselves, our world and our universe.
ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 7,000-plus scientists at national laboratories and universities, including those at Berkeley Lab's Computational Research Division (CRD). NERSC and ESnet are both Department of Energy Office of Science National User Facilities. CRD conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation.
Berkeley Lab addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab has had its scientific expertise recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science.
The DOE Office of Science is the United States' single largest supporter of basic research in the physical sciences and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.