Lab Team Helping Smooth Flow of Water Data
March 16, 2007
A collaboration among Microsoft, Berkeley Lab and UC Berkeley is underway to make it easier for researchers to access and analyze collected data on water, with the goal of accelerating research in the increasingly important areas of water supply and climate change.
Called Microsoft e-Science, the project is part of the Berkeley Water Center’s effort to marshal expertise from public institutions and the private sector to enable researchers to easily access and work with water data. The year-old center is the brainchild of Berkeley Lab’s Computational Research Division (CRD), UC Berkeley’s College of Engineering and UC Berkeley’s College of Natural Resources.
Local, state and federal governments have long collected detailed information about water supplies, such as measuring river flows and water content in winter snows. They use the data to make allocation decisions for farms, businesses and residential consumers. However, these agencies use different methods to collect and archive data, posing a challenge for scientists who need to retrieve and integrate all those datasets in order to carry out comprehensive analyses.
The e-Science project strives to ease that headache for scientists. The project team has already developed a prototype data server, which runs on Microsoft SQL Server 2005. The team is now testing the system by loading data that various agencies have collected about Northern California’s Russian River watershed.
“Because of the differences in the data, the loading of each data file presents a new challenge, and matching data across different data sets is difficult,” said Deb Agarwal, head of CRD’s Distributed Systems Department and the Berkeley Water Center’s IT Advisor. “There is a perception that once the data is in an archive, science is enabled on a grand scale. But data availability is only the first step in the process.”
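To give a rough sense of the loading and matching problem Agarwal describes, here is a minimal Python sketch. It is not the project's code. The two agency files, their column names, date formats and flow values are all invented for illustration. Each file needs its own loader before the records can be matched by date.

```python
import csv
import io
from datetime import datetime

# Hypothetical exports from two agencies (invented values, not project data).
AGENCY_A = """date,flow_cfs
2007-01-01,350.0
2007-01-02,410.0
"""
AGENCY_B = """Day;Discharge (m3/s)
01/01/2007;9.8
02/01/2007;11.9
"""

CFS_TO_CMS = 0.028316846592  # cubic feet per second -> cubic meters per second


def load_agency_a(text):
    """Comma-separated, ISO dates, flow in cubic feet per second."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        datetime.strptime(r["date"], "%Y-%m-%d").date(): float(r["flow_cfs"]) * CFS_TO_CMS
        for r in rows
    }


def load_agency_b(text):
    """Semicolon-separated, day/month/year dates, flow already in m3/s."""
    rows = csv.DictReader(io.StringIO(text), delimiter=";")
    return {
        datetime.strptime(r["Day"], "%d/%m/%Y").date(): float(r["Discharge (m3/s)"])
        for r in rows
    }


def match_by_date(a, b):
    """Keep only the days present in both data sets."""
    return {day: (a[day], b[day]) for day in sorted(a.keys() & b.keys())}


merged = match_by_date(load_agency_a(AGENCY_A), load_agency_b(AGENCY_B))
for day, (flow_a, flow_b) in merged.items():
    print(day, round(flow_a, 2), round(flow_b, 2))
```

Even in this toy case, every new source needs its own parsing and unit-conversion rules, which is the per-file effort the quote refers to.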
The project team has already demonstrated the prototype server to the scientific community. For example, at the FLUXNET Synthesis Workshop in Italy last month, project team members Matt Rodriguez, a CRD scientist, and Catharine van Ingen from Microsoft showed scientists how they could use the server to find and plot cross-network data in minutes, rather than days. The data set analyzed comprised 600 site-years of data, most of which had not been used before in cross-site analysis. With the data server, scientists can spend their time exploring the data rather than collating it.
At the European Geosciences Union General Assembly 2007 in April, Agarwal, van Ingen and Dennis Baldocchi from UC Berkeley will discuss the server and their support of its users.
Microsoft’s support is critical because about 90 percent of the researchers accessing these data archives use Windows-based computers. Van Ingen brings expertise from her work as an engineering professor and software expert, and as a Microsoft insider she knows where to turn for help within the company.
Developing the prototype server was an important milestone for the project. To build it, the project team started with the data archive of the AmeriFlux network of 149 research towers located around the Americas.
Using arrays of sensors, the towers provide continuous observations of ecosystem-level exchanges of CO2, water and energy, essentially recording how the ecosystem “breathes.” The AmeriFlux archive currently contains 192 million data points stored as hundreds of files.
Researchers analyzing this data currently download their own copy for local analysis. Because the data is continually being updated and corrected, each researcher typically ends up with a different version.
Working with van Ingen and Stuart Ozer, another expert from Microsoft, Agarwal and her staff members Rodriguez and Monte Goode designed the server to make the AmeriFlux data easier to use. The approach incorporated a database and a “data cube,” a type of database structure optimized for data mining.
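The idea behind a data cube can be sketched in a few lines of Python. This is an illustrative toy, not the project's SQL Server implementation. The site names, variable names and values are invented. The cube stores pre-computed sums and counts for every combination of site and year, including "all sites" and "all years," so that a cross-site summary becomes a single lookup rather than a scan of the raw files.

```python
from collections import defaultdict

# Hypothetical observations: (site, year, variable, value). Invented values.
OBSERVATIONS = [
    ("site_a", 2005, "co2_flux", -1.0),
    ("site_a", 2005, "co2_flux", -3.0),
    ("site_a", 2006, "co2_flux", -2.0),
    ("site_b", 2005, "co2_flux", -4.0),
    ("site_b", 2005, "latent_heat", 80.0),
]

ALL = "*"  # marks a dimension that has been rolled up


def build_cube(observations):
    """Pre-aggregate [sum, count] per variable for each site/year roll-up."""
    cube = defaultdict(lambda: [0.0, 0])
    for site, year, variable, value in observations:
        for s in (site, ALL):
            for y in (year, ALL):
                cell = cube[(s, y, variable)]
                cell[0] += value
                cell[1] += 1
    return cube


def mean(cube, variable, site=ALL, year=ALL):
    """Look up a pre-aggregated mean; returns None if the cell is empty."""
    total, count = cube.get((site, year, variable), (0.0, 0))
    return total / count if count else None


cube = build_cube(OBSERVATIONS)
print(mean(cube, "co2_flux"))                            # all sites, all years
print(mean(cube, "co2_flux", site="site_a", year=2005))  # one site, one year
```

The trade-off is extra storage and load-time work in exchange for fast answers to the summary questions that analysts ask repeatedly.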
While developing the server is a major part of the project, the long-term goal is to develop a portable system that can be maintained by the researchers themselves.
“Right now we’re at the edge of computer science and research, where we are developing tools that we hope will make this data server a natural research tool, a kind of ‘collaborative data server in a box,’ for science,” Agarwal said.
Learn more about the Berkeley Water Center at http://www-esd.lbl.gov/BWC/.