New Tools Mobilize Local Data to Study Global Environmental Issues
February 4, 2009
Contact: Linda Vu, LVu@lbl.gov, 510.495.2402
As they strive to develop effective strategies for guarding water supplies, protecting endangered species and curbing greenhouse gases, environmental scientists are turning to innovative cyber-infrastructures and data-mining tools developed by an ongoing collaboration between researchers at Lawrence Berkeley National Laboratory, Microsoft Research, and the University of California, Berkeley.
The Microsoft eScience program is the primary funder of this project, which is one of numerous ventures cultivated by the Berkeley Water Center (BWC). Launched approximately three years ago by researchers from the Berkeley Lab and UC Berkeley’s Colleges of Engineering and Natural Resources, the BWC marshals expertise from public institutions and the private sector in support of projects that enable science and public policy researchers to more easily access and work with water and environmental datasets.
“The most cost-efficient way to impact issues like global climate change and water management is to develop cyber-architectures that organize data and foster scientific collaboration,” says Susan Hubbard, staff scientist in the Berkeley Lab’s Earth Sciences Division and associate director of the BWC.
Environmental scientists typically collect data on a project-by-project basis, in campaigns targeted at very specific topics. One study may use NASA satellites to track annual rainfall of deserts around the globe, while another project sponsored by the National Science Foundation (NSF) might measure the annual water tables of the Sahara desert with commercial sensors. The data are then typically stored in local archive systems and accessed by researchers associated with that particular project. These sites are scattered across the country, tend to be aligned with specific campaigns, and are funded by a variety of organizations.
According to Catharine van Ingen, partner architect with Microsoft Research, this system can be cumbersome because observations are stored in data archives and access centers in the same format in which they were deposited and undergo only very simple checks and transformations, making the data difficult to share with other scientists. She notes that much of this information is not science-ready. To become science-ready, the data must be cataloged, checked, and processed to eliminate obvious problems caused by battery loss, transcription errors, or environmental factors such as freezing rain or birds.
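As a minimal sketch of the kind of basic check that makes raw observations science-ready, the Python below flags missing and physically implausible sensor readings. The article does not describe the project's actual checks; the sentinel value, the function name quality_check, and the temperature bounds are all assumptions made for illustration.

```python
# Assumed placeholder that a logger writes when a reading is lost
# (for example after battery failure). Not taken from the article.
MISSING = -9999.0


def quality_check(readings, low, high):
    """Split readings into clean values and flagged problems.

    Returns (clean, flagged), where flagged holds
    (index, value, reason) tuples for every rejected reading.
    """
    clean, flagged = [], []
    for index, value in enumerate(readings):
        if value is None or value == MISSING:
            flagged.append((index, value, "missing"))
        elif not (low <= value <= high):
            flagged.append((index, value, "out of range"))
        else:
            clean.append(value)
    return clean, flagged


# Hypothetical hourly air temperatures in degrees Celsius.
readings = [12.1, -9999.0, 13.0, 250.0, None, 11.8]
clean, flagged = quality_check(readings, low=-40.0, high=60.0)
```

Keeping the flagged readings alongside a reason, rather than silently dropping them, lets a researcher audit what was removed and why.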
In most cases, scientists also cannot withdraw data from these centers during non-business hours, and so many researchers opt to retain their observations on their own personal desktop computers. If other researchers want to use this data, they have to contact the lead scientist and have him/her e-mail this information to them.
“One of the greatest challenges of the next century will be developing cyber-architectures that allow scientists to easily navigate their digital assets. Today, the internet has given environmental researchers instant access to a wealth of field data. Now, they need a scientific ‘safety deposit box’ system that will not only store this information, but also organize it so it is searchable and ready for analysis,” says van Ingen.
Designing an Environmental Database for the 21st Century
According to Deb Agarwal, member of the BWC and head of the Advanced Computing for Science Department in the Computational Research Division at Berkeley Lab, the computing needs of many eScience researchers fall into the gap between the typical supercomputer user and the desktop computer user.
“An environmental dataset is often 1 terabyte or smaller in size. These datasets can be stored easily on a desktop hard drive. This means that the hardware needed to create a centralized database is extremely inexpensive and is not the limiting factor. Instead, usability and longevity of the data is the issue,” she says.
Agarwal’s team initially worked with existing Microsoft tools to develop a prototype database for data collected by the AmeriFlux network. For over 10 years, the AmeriFlux collaboration of field researchers has tracked carbon dioxide exchange between plants and soil on the ground and the planet’s atmosphere, on an hourly basis, at more than 120 sites across North, Central and South America. The sites represent a range of ecosystems, from the Arctic tundra to North American prairies and Amazonian rainforests. Since its inception almost two years ago, the database, called the Fluxdata Scientific Data Server, has grown to include data from Fluxnet, which incorporates AmeriFlux counterparts around the world, including Asia, Africa, Australia, and Europe.
The Fluxdata Scientific Data Server now includes semi-automated ingest tools that extract important aspects of incoming data; a database and schema to organize and archive information; data cubes that allow researchers to look at the data from multiple perspectives; and tools that automatically convert multiple data versions into one format. The new architecture also enables researchers to browse data and reports via the Internet and collaborate with each other. This means scientists no longer need to download and interpret the raw data from a data collection center before they can browse, mine, and do research on it.
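To illustrate the data cube idea in general terms, the Python sketch below pre-aggregates a measurement over every combination of dimensions so that it can be summarized by site, by year, or by both. It is not the Fluxdata server's implementation, which the article does not detail; the observations, dimension names, and the functions build_cube and mean_flux are hypothetical.

```python
from itertools import combinations

ALL = "*"  # marker meaning "aggregate over this dimension"

# Made-up observations: (site, year, month, co2_flux).
OBSERVATIONS = [
    ("tundra", 2006, 1, 1.0),
    ("tundra", 2006, 2, 3.0),
    ("tundra", 2007, 1, 2.0),
    ("prairie", 2006, 1, 4.0),
    ("prairie", 2007, 1, 6.0),
]


def build_cube(rows, n_dims=3):
    """Store [sum, count] of the measure for every subset of dimensions."""
    cube = {}
    for row in rows:
        coords, value = row[:n_dims], row[n_dims]
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                # Keep the chosen dimensions, roll up the rest.
                key = tuple(
                    coords[i] if i in kept else ALL for i in range(n_dims)
                )
                cell = cube.setdefault(key, [0.0, 0])
                cell[0] += value
                cell[1] += 1
    return cube


def mean_flux(cube, site=ALL, year=ALL, month=ALL):
    """Look up the mean flux for one slice; None if the slice is empty."""
    cell = cube.get((site, year, month))
    if cell is None:
        return None
    total, count = cell
    return total / count


cube = build_cube(OBSERVATIONS)
tundra_mean = mean_flux(cube, site="tundra")
prairie_2007 = mean_flux(cube, site="prairie", year=2007)
```

Because every slice is computed once when the cube is built, switching perspectives (all tundra data, one site in one year, everything in a given month) becomes a dictionary lookup rather than a new pass over the raw data.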
Once this server architecture proved successful, eScience team members applied this “cyber-blueprint” to create searchable central repositories for a variety of field data collected from California’s Russian and Pajaro Rivers. Currently, the team members are collaborating with the National Marine Fisheries Service to aid research involving fish recovery efforts in Northern California coastal streams, and will soon develop a server that encompasses observational information about all the watersheds in California.
“In the past, the computing needs of environmental researchers have often been overlooked because they are rarely on the leading edge of computational or scale requirements of the scientific community, and collectively are not a big enough customer to be commercially profitable. Despite this, their computing challenges are substantial and solving them is essential to their work helping us understand climate change and our surrounding environment,” says Agarwal.
CRD currently hosts seven, soon to be eight, BWC Data Servers in the Advanced Computing for Science Department’s machine room. The machines are supplied by CRD and Microsoft, while Agarwal’s team of CRD scientists provides hardware support, networking, rack space, electricity, and console servers.
Current team members include CRD scientists Keith Jackson and Monte Goode. Past members of the team include Robin Weber, a UCB employee, and Matt Rodriguez, a former CRD member. Kurt Spindler, a high school summer student from the Berkeley Lab’s Center for Science & Engineering Education program, also worked on the project in the summer of 2008.
According to Jim Hunt, professor of civil engineering at UC Berkeley and co-director of the BWC, relatively basic questions, such as how the annual water balance in the Russian River watershed changed in the past decade, were not impossible to answer before the eScience data-mining tools were developed. However, the tasks of gathering data from a variety of organizations, reformatting the data to make it consistent, sifting out the important pieces of information, and calculating the balances were so time-consuming and tedious that most scientists didn’t want to tackle the issue. He notes that the new eScience tools can produce this answer in minutes. In addition, the data cube architecture allows scientists to find many different relationships in the datasets.
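For readers unfamiliar with the calculation, the Python sketch below shows one simple form of an annual water balance once the data are in a consistent format. It assumes the common simplification that the change in stored water equals precipitation minus evapotranspiration minus runoff; the article does not state which formulation the BWC tools use, and the records and the function annual_water_balance are hypothetical.

```python
from collections import defaultdict

# Made-up monthly records in millimetres:
# (year, precipitation, evapotranspiration, runoff).
RECORDS = [
    (2001, 120.0, 40.0, 60.0),
    (2001, 80.0, 50.0, 20.0),
    (2002, 100.0, 45.0, 30.0),
    (2002, 60.0, 55.0, 10.0),
]


def annual_water_balance(records):
    """Sum (precipitation - evapotranspiration - runoff) for each year.

    A positive total means the watershed gained stored water that year
    under the simplified balance assumed here.
    """
    totals = defaultdict(float)
    for year, precipitation, evapotranspiration, runoff in records:
        totals[year] += precipitation - evapotranspiration - runoff
    return dict(sorted(totals.items()))


balance = annual_water_balance(RECORDS)
```

The arithmetic is trivial; as the paragraph above notes, the effort lies in gathering and reconciling the inputs, which is the step the eScience tools are meant to shorten.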
“Everything in an ecosystem is interconnected. Changes in one particular ecosystem could have global consequences, and tools like the data cube make it easier for us to see the big picture.… We can now inquire about more complex relationships like how do the changes in a watershed’s annual water balance affect the amount of carbon dioxide in its surrounding atmosphere,” says Dennis Baldocchi, professor of biometeorology at UC Berkeley.
“The answers to these types of questions will allow us to make accurate predictions about the future of such watersheds, and in turn help us develop more effective strategies for managing these resources,” adds Hunt.
For more information on the BWC, please visit: http://bwc.berkeley.edu/home/thrust_areas/mstci.html
About Computing Sciences at Berkeley Lab
The Lawrence Berkeley National Laboratory (Berkeley Lab) Computing Sciences organization provides the computing and networking resources and expertise critical to advancing the Department of Energy's research missions: developing new energy sources, improving energy efficiency, developing new materials and increasing our understanding of ourselves, our world and our universe.
ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 7,000-plus scientists at national laboratories and universities, including those at Berkeley Lab's Computational Research Division (CRD). CRD conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation. NERSC and ESnet are Department of Energy Office of Science User Facilities.
Lawrence Berkeley National Laboratory addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.