China Clipper Project Aims to Improve Human-Data Interactions
September 30, 1998
A newly funded computer research program at Lawrence Berkeley National Laboratory could revolutionize the way scientific instruments, computers and humans work together to gather, analyze and use data.
The program, funded by the U.S. Department of Energy, will build on efforts over the past 10 years to gather, store and make information available over computer networks. The program is called “China Clipper,” in reference to the 1930s commercial air service which spanned the Pacific Ocean and opened the door to the reliable, global air service taken for granted today.
“I believe that our China Clipper project epitomizes the research environment we will see in the future,” says Bill Johnston, leader of the Imaging and Distributed Computing Group at Berkeley Lab. “It will provide an excellent model for on-line scientific instrumentation. Data are fundamental to analytical science, and one of my professional goals is to greatly improve the routine access to scientific data — especially very large datasets — by widely distributed collaborators, and to facilitate its routine computer analysis.”
The idea behind China Clipper, like the pioneering air service, is to bring diverse resources closer together. In this case, scientific instruments such as electron microscopes and accelerators would be linked by networks to data storage “caches” and computers. China Clipper will provide the “middleware” to allow these separate components, often located hundreds or thousands of miles apart, to function as a single system. Johnston is scheduled to discuss the work of the Lab in this area at an IEEE symposium on High Performance Distributed Computing next week.
Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data from around the world, as well as employing large-scale computation. The distributed systems that solve large-scale problems involve aggregating and scheduling many resources. Data must be located and staged, cache and network capacity must be available at the same time as computing capacity, etc.
Every aspect of such a system is dynamic: locating and scheduling resources, adapting running application systems to availability and congestion in the middleware and infrastructure, responding to human interaction, etc. The technologies, the middleware services, and the architectures that are used to build useful high-speed, wide area distributed systems, constitute the field of data intensive computing.
Enhancing data intensive computing will make research facilities and instruments at various DOE sites available to a wider group of users. Berkeley Lab scientists are developing China Clipper in collaboration with their counterparts at the Stanford Linear Accelerator Center, Argonne National Laboratory and the Department of Energy’s Energy Sciences Network, or ESnet.
“This will lead to a substantial increase in the capabilities of experimental facilities,” predicts Johnston.
As an example of the benefits, Johnston cites a Cooperative Research and Development Agreement project called “WALDO” (for Wide Area Large Data Object). In this project, Johnston's group, Pacific Bell, the NTON optical network testbed project at Lawrence Livermore National Lab and others worked with Kaiser Permanente to produce a prototype on-line, distributed, high-data-rate medical imaging system. The project allowed cardio-angiography data to be collected directly from a scanner in a San Francisco hospital. The system was connected to a high-speed Bay Area network and data was collected, processed, and stored at Berkeley Lab, and accessed by cardiologists at the Kaiser Oakland hospital.
One result of the Kaiser project was a demonstration that physicians could have immediate access to the numerous medical images from each patient. Currently, such images are processed and kept by a central office, and doctors at the referring hospitals see only one or two images after a couple of weeks. With the WALDO real-time acquisition and cataloguing approach, however, they had access in a few hours.
The vision guiding this work is that getting faster access to data will allow scientists to conduct their work more efficiently and gain new insights. Much research involves starting out with a scientific model of what’s supposed to occur, then conducting an experiment and comparing the actual results with what was expected. Figuring out the how and why of this difference is where the real science happens, Johnston says. China Clipper is expected to lead to better utilization of instrumentation for experiments and provide fast comparisons of actual experiments and computational models, thereby giving researchers better tools for testing scientific theories.
Because the test-and-compare procedure must often be conducted over and over to obtain reliable results, streamlining the process each time around could significantly increase the rate of scientific discovery.
Evolution of an idea
According to Johnston, China Clipper is the culmination of a decade of research and development of high-speed, wide area, data intensive computing. The first demonstration of the project’s potential came during 1989 hearings held by then-Senator Al Gore on his High Performance Computing and Communications legislation. Because the Senate room had no network connections at the time, a simulated transmission of images over a network at various speeds was put together. The successful effort introduced legislators to the implications of network bandwidth.
Johnston’s group continued its work, evolving from scientific visualization to the idea of operating scientific instruments on line. This work is led by Bahram Parvin in collaboration with the Lab’s Materials Sciences and Life Sciences. Last year, several group members patented their system, which provides automatic computerized control of microscopic experiments. The system collects video data, analyzes the data and then sends a signal to the instruments to carry out such delicate tasks as cleaving DNA molecules and controlling the shape of growing micro-crystals.
One key aspect of successful data-intensive computing — accessing data cached at various sites — was developed by Berkeley Lab for a DARPA-funded project. Called Distributed-Parallel Storage System, or DPSS, this technology successfully provided an economical, high performance and highly scalable design for caching large amounts of data for use by many different users. Brian Tierney continues this project with his team in NERSC’s Future Technologies Group.
In May, a team from Berkeley Lab and SLAC conducted an experiment using DPSS to support high energy physics data analysis. The team achieved a sustained data transfer rate of 57 MBytes per second, demonstrating that high-speed data storage systems could use distributed caches to make data available to systems running analysis codes.
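To put the reported rate in perspective, the arithmetic can be sketched as follows. This is only an illustration: it assumes decimal megabytes (1 MByte = 1,000,000 bytes), and the 100-gigabyte dataset size is a hypothetical figure chosen for the example, not one taken from the experiment.

```python
# Sustained DPSS transfer rate reported for the Berkeley Lab/SLAC experiment.
REPORTED_MBYTES_PER_SECOND = 57


def to_megabits_per_second(mbytes_per_second):
    """Convert a rate in MBytes/s to Mbits/s (8 bits per byte)."""
    return mbytes_per_second * 8


def transfer_seconds(dataset_bytes, mbytes_per_second):
    """Seconds needed to move dataset_bytes at the given sustained rate.

    Assumes decimal megabytes: 1 MByte = 1,000,000 bytes.
    """
    return dataset_bytes / (mbytes_per_second * 1_000_000)


if __name__ == "__main__":
    # Hypothetical dataset size, used only for illustration.
    hypothetical_dataset_bytes = 100 * 10**9  # 100 gigabytes

    rate_mbits = to_megabits_per_second(REPORTED_MBYTES_PER_SECOND)
    minutes = transfer_seconds(hypothetical_dataset_bytes,
                               REPORTED_MBYTES_PER_SECOND) / 60
    print(f"{REPORTED_MBYTES_PER_SECOND} MBytes/s = {rate_mbits} Mbits/s")
    print(f"100 GB would take about {minutes:.1f} minutes at that rate")
```

At 57 MBytes per second the link carries 456 megabits per second, so a dataset of the assumed size would move in roughly half an hour.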
Overcoming the hurdles
With the development of various components necessary for data intensive computing, the number of obstacles has dwindled. One of the last remaining issues, that of scheduling and allocating resources over networks, is being addressed by “differentiated services.” This technology, resulting from work by Van Jacobson’s Network Research Group, specially marks some data packets for priority service as they move across networks. A demonstration by Berkeley Lab in April showed that priority-marked packets arrived at eight times the speed of regular packets when sent through congested network connections. Differentiated services would ensure that designated projects could be conducted by reserving sufficient resources.
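As a rough illustration of what marking packets for priority service can look like from an application's side, the sketch below sets a differentiated-services code point (DSCP) on a UDP socket using Python's standard socket module. It is not the mechanism used in the April demonstration, which the article does not describe. The code point value 46 ("Expedited Forwarding") is an assumption borrowed from present-day DiffServ practice, and the function names are hypothetical.

```python
import socket

# Assumption: the Expedited Forwarding code point from present-day DiffServ
# practice. The article does not say which marking the demonstration used.
EF_DSCP = 46


def dscp_to_tos(dscp):
    """Place a 6-bit DSCP value in the upper six bits of the IP TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0-63)")
    return dscp << 2


def open_marked_udp_socket(dscp=EF_DSCP):
    """Return a UDP socket whose outgoing IPv4 packets carry the given DSCP.

    Hypothetical helper for illustration. The marking is only a request:
    routers along the path must be configured to honor it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

Marking alone does not reserve capacity. The priority treatment comes from network equipment that recognizes the mark and forwards those packets ahead of unmarked traffic on congested links.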
The next big step, says Johnston, is to integrate the various components and technologies into a cohesive and reliable package — a set of “middleware services” that let applications easily use these new capabilities.
“We see China Clipper not so much as a ‘system,’ but rather as a coordinated collection of services that may be flexibly used for a variety of applications,” says Johnston. “Once it takes off, we see it opening new routes and opportunities for scientific discovery.”