
Lab Team Helping Smooth Flow of Water Data

March 16, 2007

A collaboration among Microsoft, Berkeley Lab and UC Berkeley is underway to make it easier for researchers to access and analyze collected data on water, with the goal of accelerating research in the increasingly important areas of water supply and climate change.

Called Microsoft e-Science, the project is part of the Berkeley Water Center’s effort to marshal expertise from public institutions and the private sector to enable researchers to easily access and work with water data. The year-old center is the brainchild of Berkeley Lab’s Computational Research Division (CRD), UC Berkeley’s College of Engineering and UC Berkeley’s College of Natural Resources.

Members of the Berkeley Water Center from UC Berkeley and Microsoft

Front row, Microsoft's Catharine van Ingen (left) and Stuart Ozer. Second row (l-r), Matt Rodriguez, Deb Agarwal and Monte Goode of the Computational Research Division.

Local, state and federal governments have long collected detailed information about water supplies, such as measuring river flows and water content in winter snows. They use the data to make allocation decisions for farms, businesses and residential consumers. However, these agencies use different methods to collect and archive data, posing a challenge for scientists who need to retrieve and integrate all those datasets in order to carry out comprehensive analyses.

The e-Science project strives to ease that headache for scientists. The project team already has developed a prototype data server, which runs on Microsoft SQL Server 2005. The team is now testing the system by loading data about Northern California’s Russian River watershed collected by various agencies.

“Because of the differences in the data, the loading of each data file presents a new challenge, and matching data across different data sets is difficult,” said Deb Agarwal, head of CRD’s Distributed Systems Department and the Berkeley Water Center’s IT Advisor. “There is a perception that once the data is in an archive, science is enabled on a grand scale. But data availability is only the first step in the process.”
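To make the loading problem concrete, here is a minimal sketch of reconciling river-flow records from two sources into one schema. The agencies, field names, units and values below are hypothetical and chosen only for illustration; they are not taken from the project's data or code.

```python
from datetime import datetime

# Hypothetical raw records. Field names, date formats, units and values
# are invented for illustration and do not come from any real agency.
agency_a = [{"site": "RR-01", "date": "2006-01-15", "flow_cfs": 353.1}]
agency_b = [{"station_id": "RR-01", "timestamp": "01/15/2006", "discharge_m3s": 10.0}]

CFS_PER_M3S = 35.3147  # cubic feet per second in one cubic metre per second


def normalize_a(rec):
    """Map an agency A record onto the common schema (flow in m^3/s)."""
    return {
        "site": rec["site"],
        "day": datetime.strptime(rec["date"], "%Y-%m-%d").date(),
        "flow_m3s": rec["flow_cfs"] / CFS_PER_M3S,
        "source": "agency_a",
    }


def normalize_b(rec):
    """Map an agency B record onto the common schema (already in m^3/s)."""
    return {
        "site": rec["station_id"],
        "day": datetime.strptime(rec["timestamp"], "%m/%d/%Y").date(),
        "flow_m3s": rec["discharge_m3s"],
        "source": "agency_b",
    }


# One list with a single schema, ready for cross-agency comparison.
combined = [normalize_a(r) for r in agency_a] + [normalize_b(r) for r in agency_b]
```

Each new source needs its own small mapping function, which is why every additional data file can present a fresh challenge even after an archive exists.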

The project team has already demonstrated the prototype server to the scientific community. For example, at the FLUXNET Synthesis Workshop in Italy last month, project team members Matt Rodriguez, a CRD scientist, and Catharine van Ingen from Microsoft showed scientists how they could use the server to find and plot cross-network data in minutes, rather than days. The data set comprised 600 site-years of measurements, most of which had not been used before in cross-site analysis. With the data server, scientists can spend their time exploring the data rather than collating it.

At the European Geosciences Union General Assembly 2007 in April, Agarwal, van Ingen and Dennis Baldocchi from UC Berkeley will discuss the server and their support of its users.

Microsoft’s support is critical because about 90 percent of the researchers accessing these data archives use Windows-based computers. Van Ingen brings expertise from her work as an engineering professor and software expert, and as a Microsoft insider she knows where to turn for help in the company.

Developing the prototype server was an important milestone for the project. To build it, the project team started with the data archive of the AmeriFlux network of 149 research towers located around the Americas.

Using arrays of sensors, the towers provide continuous observations of ecosystem-level exchanges of CO2, water and energy, essentially recording how the ecosystem “breathes.” The AmeriFlux archive currently contains 192 million data points stored as hundreds of files.

Researchers analyzing this data currently download their own copy for local analysis. Because the archive is continually being updated and corrected, each researcher typically ends up with a different version.

Photo of the Russian River in Northern California

Data on the Russian River was added to the prototype server

Working with van Ingen and Stuart Ozer, another expert from Microsoft, Agarwal and her staff members Rodriguez and Monte Goode designed the server to make the AmeriFlux data easier to use. The approach incorporated a database and a “data cube,” a type of database structure optimized for data mining.
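As a rough illustration of the data cube idea, the sketch below pre-aggregates measurements along three dimensions (site, year and variable) so that summary questions become simple lookups. It is not the project's implementation, which runs on Microsoft SQL Server 2005; the observations, dimension names and the `build_cube` and `mean` helpers are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical observations: (site, year, variable, value). Values are invented.
observations = [
    ("tower_A", 2005, "co2_flux", 1.0),
    ("tower_A", 2006, "co2_flux", 3.0),
    ("tower_B", 2005, "co2_flux", 2.0),
    ("tower_B", 2005, "water_flux", 4.0),
]

ALL = "*"  # wildcard meaning "aggregate over every value of this dimension"


def build_cube(rows):
    """Pre-compute [sum, count] for every combination of dimension values and ALL."""
    cube = defaultdict(lambda: [0.0, 0])
    for site, year, variable, value in rows:
        # Each observation contributes to 2 * 2 * 2 = 8 cells of the cube.
        for key in product((site, ALL), (year, ALL), (variable, ALL)):
            cell = cube[key]
            cell[0] += value
            cell[1] += 1
    return cube


def mean(cube, site=ALL, year=ALL, variable=ALL):
    """Answer an aggregate query with a single lookup instead of a scan."""
    total, count = cube[(site, year, variable)]
    return total / count if count else None


cube = build_cube(observations)
mean_all_sites = mean(cube, variable="co2_flux")
mean_2005 = mean(cube, year=2005, variable="co2_flux")
```

The trade-off is extra storage and load-time work in exchange for fast, repeatable summaries across sites and years, which suits the kind of cross-site exploration the article describes.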

While developing the server is a major part of the project, the long-term goal is to develop a portable system that can be maintained by the researchers themselves.

“Right now we’re at the edge of computer science and research, where we are developing tools that we hope will make this data server a natural research tool, a kind of ‘collaborative data server in a box,’ for science,” Agarwal said.

Learn more about the Berkeley Water Center.


About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.