Berkeley Lab Cybersecurity Specialist Highlights Data Sharing Benefits, Challenges at NAS Meeting
December 4, 2018
There are many reasons why data isn’t widely shared by the organizations that collect it or among the scientists who analyze it in search of new insights. Those barriers, and ways to overcome them in a world of data-driven science, were the subject of a recent meeting of the Committee on Science, Engineering, Medicine, and Public Policy (COSEMPUP), a joint unit of the National Academy of Sciences, the National Academy of Engineering, and the National Academy of Medicine. The meeting took place on November 8, 2018, in Washington, D.C.
Science focuses on building knowledge about the universe through a combination of observation and experimentation. While some science can be done using mathematics and simulations, eventually science needs to be tested and confirmed in the real world. This requires data. And at a time when scientists are asking increasingly important and fundamental questions about the universe, it also requires unique and expensive instruments that generate very large amounts of data that need to be shared.
The COSEMPUP meeting was far from the first to address this essential activity, but it brought together an unusual mix of scientific experts to examine the needs, barriers, and incentives involved in making progress.
The core topic of the meeting was data sharing in biomedical science, so presentations focused on current challenges in biomedical data sharing and on identifying solutions from other domains that could be applied, noted Sean Peisert, a leading cybersecurity researcher at Lawrence Berkeley National Laboratory and an invited speaker at the meeting. He discussed how strategically combining computer security and privacy-preserving techniques can overcome certain data-sharing barriers and help facilitate, enhance, and create incentives for increased data sharing in the sciences, thereby accelerating data-driven scientific discovery.
In particular, Peisert described how using varying combinations of current and future hardware and software techniques could help meet or exceed standards for data subject to government regulations, such as HIPAA or FISMA; address concerns regarding unregulated scientific data still containing individually private information; and provide solutions for proprietary data that might contain trade secrets.
At present, there are typically five solutions for data sharing that are used independently or in combinations, Peisert noted:
- We often don’t share data at all, which is a huge inhibitor to scientific research.
- We require people using data to come to the data rather than letting them work with it in their own computing environment. This often doesn’t scale when analysis takes a long time and it is onerous for many scientists to be present at a particular remote facility for long periods.
- We put legal protections in place.
- We put elaborate security protections in place, such as “air gaps,” which disconnect computing systems from all computer networks so that data on those systems cannot accidentally leak or be maliciously stolen.
- We transform data, e.g., by redacting or “fuzzing” it so that it no longer presents as significant a risk if it falls into the wrong hands.
“But all these solutions have downsides, for various reasons, ranging from data not being shared at all to data being very difficult to use, to data losing significant research utility,” he said.
Fortunately, numerous advances in computing technology and techniques can reduce barriers to developing trustworthy data-sharing solutions, including:
- Hardware-based trusted execution environments, something that all of the major chip-makers have now deployed
- Software solutions for computing over encrypted data, which are no longer a trillion times too slow, as they were just a few years ago
- Differential privacy, a statistical technique for provably providing privacy guarantees while explicitly balancing research utility
- “Smart contract” components in blockchain technologies
These techniques also increase trust in security and privacy and create positive incentives for data sharing as well, while enabling stronger tracking of data generation and use, thereby allowing data providers to be compensated for sharing useful data, Peisert emphasized.
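As a rough illustration of the differential privacy item in the list above, the sketch below applies the standard Laplace mechanism to a simple counting query. It is not drawn from Peisert's presentation; the function names, the sample data, and the choice of epsilon are all hypothetical and exist only to show how the privacy parameter trades off against research utility.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate`.

    Hypothetical helper for illustration. A counting query changes by
    at most 1 when one record is added or removed (sensitivity 1), so
    Laplace noise with scale 1/epsilon gives epsilon-differential
    privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Made-up ages standing in for a sensitive data set.
    ages = [34, 51, 67, 72, 45, 80]
    noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5)
    print(f"Noisy count of people aged 65 or over: {noisy:.2f}")
```

A smaller epsilon adds more noise, giving stronger privacy but a less accurate answer; a larger epsilon does the reverse. Repeated queries against the same data consume additional privacy budget, which this sketch does not track.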
In addition to Peisert, presenters at the meeting included:
- Carrie Wolinetz, Director of the Office of Science Policy at the National Institutes of Health, who outlined many of the key challenges to data sharing in biomedical sciences, along with many of the extremely important gains in scientific progress and human health that stand to result from data sharing done well.
- J. Michael Gaziano, professor at Harvard Medical School and PI of the Department of Veterans Affairs’ Million Veteran Program (MVP), who presented key details of the MVP’s architecture and goals and the opportunities behind it.
- Bradley Malin, professor at Vanderbilt, who discussed the National Institutes of Health “All of Us” program, along with his own experiences with both technical and policy-related data sharing incentives.
- Alexander Szalay, professor at Johns Hopkins University, architect for the Science Archive of the Sloan Digital Sky Survey and former director of the National Virtual Observatory, who spoke about the extensive and careful curation and integration of data and tools that his organization has done to adhere to the “FAIR” (“Findable, Accessible, Interoperable and Reusable”) principles of data sharing in the SDSS and NVO.
- Beth Willman, deputy director, National Center for Optical-Infrared Astronomy, Association of Universities for Research in Astronomy, who gave the concluding presentation, in which she described the extensive approach being used by the Large Synoptic Survey Telescope.
About Computing Sciences at Berkeley Lab
High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.