New Tools Mobilize Local Data to Study Global Environmental Issues

February 4, 2009

Tracking Carbon Exchange: Since May 2001, the Flux tower at California's Tonzi Ranch has been tracking rates of carbon dioxide exchange between the atmosphere, plants and soil (image: Youngryel Ryu).

As they strive to develop effective strategies for guarding water supplies, protecting endangered species and curbing greenhouse gases, environmental scientists are turning to innovative cyber-infrastructures and data-mining tools developed by an ongoing collaboration between researchers at Lawrence Berkeley National Laboratory, Microsoft Research, and the University of California, Berkeley.

The Microsoft eScience program is the primary funder of this project, which is one of numerous ventures cultivated by the Berkeley Water Center (BWC). Launched approximately three years ago by researchers from the Berkeley Lab and UC Berkeley’s Colleges of Engineering and Natural Resources, the BWC marshals expertise from public institutions and the private sector in support of projects that enable science and public policy researchers to more easily access and work with water and environmental datasets.

“The most cost-efficient way to impact issues like global climate change and water management is to develop cyber-architectures that organize data and foster scientific collaboration,” says Susan Hubbard, staff scientist in the Berkeley Lab’s Earth Sciences Division and associate director of the BWC.

Environmental scientists typically collect data on a project-by-project basis, in campaigns targeted at very specific topics. One study may use NASA satellites to track annual rainfall of deserts around the globe, while another project sponsored by the National Science Foundation (NSF) might measure the annual water tables of the Sahara desert with commercial sensors. The data are then typically stored in local archive systems and accessed by researchers associated with that particular project. These sites are scattered across the country, tend to be aligned with specific campaigns, and are funded by a variety of organizations.

According to Catharine van Ingen, partner architect with Microsoft Research, this system can be cumbersome because observations are stored in data archives and access centers in the same format in which they were deposited and undergo only very simple checks and transformations, which makes the data difficult to share with other scientists. She notes that much of this information is not science-ready. To become science-ready, the data must be cataloged, checked, and processed to eliminate obvious problems caused by battery loss, transcription errors, or environmental factors such as freezing rain or birds.
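As a rough illustration of the kind of checking van Ingen describes, the sketch below flags readings that are missing or physically implausible. It is a minimal example under stated assumptions, not the project's actual code: the `-9999.0` missing-value sentinel, the `Reading` record, the `quality_check` function, and the plausible-range limits are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed sentinel a logger might write when a reading is lost
# (for example, after battery loss). Hypothetical value.
MISSING = -9999.0


@dataclass
class Reading:
    hour: int               # position of the reading in the series
    value: Optional[float]  # None when the raw value was rejected
    flag: str               # "ok", "missing", or "out_of_range"


def quality_check(raw: List[float], low: float, high: float) -> List[Reading]:
    """Flag missing and implausible values in a series of hourly readings."""
    checked = []
    for hour, value in enumerate(raw):
        if value == MISSING:
            checked.append(Reading(hour, None, "missing"))
        elif not (low <= value <= high):
            # Could be a transcription error or a fouled sensor.
            checked.append(Reading(hour, None, "out_of_range"))
        else:
            checked.append(Reading(hour, value, "ok"))
    return checked


if __name__ == "__main__":
    for reading in quality_check([1.0, -9999.0, 500.0, 2.5], low=-50.0, high=50.0):
        print(reading)
```

Keeping a flag beside each value, rather than silently dropping bad readings, lets a later user see why a gap exists in the record.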

Aerial photo of the Wohler ponds along the Russian River, Sonoma County, CA

Winter Floods: This aerial photo shows the Wohler ponds along the Russian River during winter, as indicated by the high flows and turbid water conditions. The Mirabel inflatable dam is only erected in summer, so it is not evident in this picture (image: Sonoma County Water Agency).

In most cases, scientists also cannot withdraw data from these centers during non-business hours, and so many researchers opt to retain their observations on their own personal desktop computers. If other researchers want to use this data, they have to contact the lead scientist and ask him or her to e-mail the information.

“One of the greatest challenges of the next century will be developing cyber-architectures that allow scientists to easily navigate their digital assets. Today, the internet has given environmental researchers instant access to a wealth of field data. Now, they need a scientific ‘safety deposit box’ system that will not only store this information, but also organize it so it is searchable and ready for analysis,” says van Ingen.

Designing an Environmental Database for the 21st Century

According to Deb Agarwal, member of the BWC and head of the Advanced Computing for Science Department in the Computational Research Division at Berkeley Lab, the computing needs of many eScience researchers fall into the gap between the typical supercomputer user and the desktop computer user.

“An environmental dataset is often 1 terabyte or smaller in size. These datasets can be stored easily on a desktop hard drive. This means that the hardware needed to create a centralized database is extremely inexpensive and is not the limiting factor. Instead, usability and longevity of the data is the issue,” she says.

Agarwal’s team initially worked with existing Microsoft tools to develop a prototype database for data collected by the AmeriFlux network. For more than 10 years, the AmeriFlux collaboration of field researchers has tracked, on an hourly basis, the exchange of carbon dioxide between plants and soil on the ground and the planet’s atmosphere at more than 120 sites across North, Central and South America. The sites represent a range of ecosystems, from the Arctic tundra to North American prairies and Amazonian rainforests. Since its inception almost two years ago, the database, called the Fluxdata Scientific Data Server, has grown to include data from Fluxnet, which incorporates AmeriFlux counterparts around the world, including Asia, Africa, Australia, and Europe.

Watching Water: A water reservoir at Tonzi Ranch in October 2008 (image: Youngryel Ryu).

The Fluxdata Scientific Data Server now includes semi-automated ingest tools that extract important aspects of incoming data; a database and schema to organize and archive information; data cubes that allow researchers to look at the data from multiple perspectives; and tools that automatically convert multiple data versions into one format. The new architecture also enables researchers to browse data and reports via the Internet and collaborate with each other. This means scientists no longer need to download and interpret the raw data from a data collection center before they can work with it. Instead, they can browse, mine, and analyze the data directly.
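The idea of a data cube, looking at the same observations from multiple perspectives, can be sketched in a few lines. This is a toy illustration only and does not reflect how the Fluxdata server is implemented: the observations, the dimension names (`site`, `year`, `variable`), and the `roll_up` function are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (site, year, variable, value).
OBSERVATIONS = [
    ("site_a", 2006, "co2_flux", 1.0),
    ("site_a", 2007, "co2_flux", 3.0),
    ("site_b", 2006, "co2_flux", 2.0),
    ("site_b", 2007, "co2_flux", 4.0),
]

# Names of the cube's dimensions, in the same order as the tuple fields.
DIMENSIONS = ("site", "year", "variable")


def roll_up(observations, keep):
    """Average the values, grouped by the dimensions listed in `keep`."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    groups = defaultdict(list)
    for row in observations:
        key = tuple(row[i] for i in indexes)
        groups[key].append(row[3])  # the value is the last field
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(roll_up(OBSERVATIONS, ("site",)))         # one view: per site
    print(roll_up(OBSERVATIONS, ("year",)))         # another view: per year
    print(roll_up(OBSERVATIONS, ("site", "year")))  # finest view
```

The same table answers a per-site question and a per-year question without any reformatting; only the choice of dimensions to keep changes.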

Once this server architecture proved to be successful, eScience team members applied this “cyber-blueprint” to create searchable central repositories for the variety of field data collected from California’s Russian and Pajaro Rivers. Currently, the team members are collaborating with the National Marine Fisheries Service to aid research involving fish recovery efforts in Northern California coastal streams, and will soon develop a server that encompasses observational information about all the watersheds in California.

“In the past, the computing needs of environmental researchers have often been overlooked because they are rarely on the leading edge of computational or scale requirements of the scientific community, and collectively are not a big enough customer to be commercially profitable. Despite this, their computing challenges are substantial and solving them is essential to their work helping us understand climate change and our surrounding environment,” says Agarwal.

CRD currently hosts seven, soon to be eight, BWC Data Servers in the Advanced Computing for Science Department’s machine room. The machines are supplied by CRD and Microsoft, while Agarwal’s team of CRD scientists provides hardware support, networking, rack space, electricity, and console servers.

Current team members include CRD scientists Keith Jackson and Monte Goode. Past members of the team include Robin Weber, a UCB employee, and Matt Rodriguez, a former CRD member. Kurt Spindler, a high school summer student from the Berkeley Lab’s Center for Science & Engineering Education program, also worked on the project in the summer of 2008.

Impacting Science

According to Jim Hunt, professor of civil engineering at UC Berkeley and co-director of the BWC, relatively basic questions such as how the annual water balance in the Russian River watershed changed in the past decade were not exactly impossible to answer before the eScience data-mining tools were developed. However, the tasks of gathering data from a variety of organizations, reformatting the data to make it consistent, sifting out the important pieces of information, and calculating the balances were so time-consuming and tedious that most scientists didn’t want to tackle the issue. He notes that the new eScience tools can produce this answer in minutes. In addition, the data cube architecture allows scientists to find many different relationships in the datasets.
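To make the reformat-then-calculate steps concrete, here is a minimal sketch that assumes the textbook simplification in which the annual change in stored water equals precipitation minus evapotranspiration minus runoff, each expressed as a depth over the watershed. The article does not say which terms or units the BWC tools use, so the function names, the unit table, and the sample figures are hypothetical.

```python
# Conversion factors to millimetres for the units this sketch accepts.
# Different organizations may report the same quantity in different units.
TO_MM = {"mm": 1.0, "cm": 10.0, "in": 25.4}


def to_mm(value, unit):
    """Convert a depth reported in `unit` to millimetres."""
    return value * TO_MM[unit]


def annual_water_balance(precipitation, evapotranspiration, runoff):
    """Return the change in stored water, in millimetres.

    Each argument is a (value, unit) pair holding an annual total.
    Assumes the simplified balance: storage change = P - ET - Q.
    """
    p = to_mm(*precipitation)
    et = to_mm(*evapotranspiration)
    q = to_mm(*runoff)
    return p - et - q


if __name__ == "__main__":
    # Made-up figures from three imaginary sources using three units.
    change = annual_water_balance((100.0, "cm"), (600.0, "mm"), (10.0, "in"))
    print(f"Change in storage: {change:.1f} mm")
```

The arithmetic itself is trivial; as Hunt's description suggests, the effort lies in gathering the inputs and making them consistent before the subtraction can be done.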

“Everything in an ecosystem is interconnected. Changes in one particular ecosystem could have global consequences, and tools like the data cube make it easier for us to see the big picture.… We can now inquire about more complex relationships like how do the changes in a watershed’s annual water balance affect the amount of carbon dioxide in its surrounding atmosphere,” says Dennis Baldocchi, professor of biometeorology at UC Berkeley.

“The answers to these types of questions will allow us to make accurate predictions about the future of such watersheds, and in-turn helps us develop more effective strategies for managing these resources,” adds Hunt.

Learn more about The Berkeley Water Center (BWC).

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.