China Clipper Project Aims to Improve Human-Data Interactions

September 30, 1998

A newly funded computer research program at Lawrence Berkeley National Laboratory could revolutionize the way scientific instruments, computers and humans work together to gather, analyze and use data.

The program, funded by the U.S. Department of Energy, will build on efforts over the past 10 years to gather, store and make information available over computer networks. The program is called “China Clipper,” in reference to the 1930s commercial air service which spanned the Pacific Ocean and opened the door to the reliable, global air service taken for granted today.

“I believe that our China Clipper project epitomizes the research environment we will see in the future,” says Bill Johnston, leader of the Imaging and Distributed Computing Group at Berkeley Lab. “It will provide an excellent model for on-line scientific instrumentation. Data are fundamental to analytical science, and one of my professional goals is to greatly improve the routine access to scientific data — especially very large datasets — by widely distributed collaborators, and to facilitate its routine computer analysis.”

The idea behind China Clipper, like the pioneering air service, is to bring diverse resources closer together. In this case, scientific instruments such as electron microscopes and accelerators would be linked by networks to data storage “caches” and computers. China Clipper will provide the “middleware” to allow these separate components, often located hundreds or thousands of miles apart, to function as a single system. Johnston is scheduled to discuss the work of the Lab in this area at an IEEE symposium on High Performance Distributed Computing next week.

Data-intensive computing

Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data from around the world, as well as employing large-scale computation. The distributed systems that solve large-scale problems involve aggregating and scheduling many resources. Data must be located and staged, cache and network capacity must be available at the same time as computing capacity, etc.

Every aspect of such a system is dynamic: locating and scheduling resources, adapting running application systems to availability and congestion in the middleware and infrastructure, responding to human interaction, etc. The technologies, middleware services, and architectures used to build useful high-speed, wide-area distributed systems constitute the field of data-intensive computing.

Enhancing data-intensive computing will make research facilities and instruments at various DOE sites available to a wider group of users. Berkeley Lab scientists are developing China Clipper in collaboration with their counterparts at the Stanford Linear Accelerator Center, Argonne National Laboratory and the Department of Energy’s Energy Sciences Network, or ESnet.

“This will lead to a substantial increase in the capabilities of experimental facilities,” predicts Johnston.

Faster turnaround

As an example of the benefits, Johnston cites a Cooperative Research and Development Agreement project called “WALDO” (for Wide Area Large Data Object). In this project, Johnston's group, Pacific Bell, the NTON optical network testbed project at Lawrence Livermore National Lab and others worked with Kaiser Permanente to produce a prototype online, distributed, high-data-rate medical imaging system. The project allowed cardio-angiography data to be collected directly from a scanner in a San Francisco hospital. The system was connected to a high-speed Bay Area network and data was collected, processed, and stored at Berkeley Lab, and accessed by cardiologists at the Kaiser Oakland hospital.

One result of the Kaiser project was a demonstration that physicians could have immediate access to the numerous medical images from each patient. Currently, such images are processed and kept by a central office, and doctors at the referring hospitals see only one or two images after a couple of weeks. With the WALDO real-time acquisition and cataloging approach, they had access in a few hours.

Better research

The vision guiding this work is that getting faster access to data will allow scientists to conduct their work more efficiently and gain new insights. Much research involves starting out with a scientific model of what’s supposed to occur, then conducting an experiment and comparing the actual results with what was expected. Figuring out the how and why of this difference is where the real science happens, Johnston says. China Clipper is expected to lead to better utilization of instrumentation for experiments and provide fast comparisons of actual experiments and computational models, thereby giving researchers better tools for testing scientific theories.

Because the test-and-compare procedure must often be conducted over and over to obtain reliable results, streamlining the process each time around could significantly increase the rate of scientific discovery.

Evolution of an idea

According to Johnston, China Clipper is the culmination of a decade of research and development in high-speed, wide-area, data-intensive computing. The first demonstration of the project’s potential came during 1989 hearings held by then-Senator Al Gore on his High Performance Computing and Communications legislation. Because the Senate room had no network connections at the time, a simulated transmission of images over a network at various speeds was put together. The successful effort introduced legislators to the implications of network bandwidth.

Johnston’s group continued its work, evolving from scientific visualization to the idea of operating scientific instruments on line. This work is led by Bahram Parvin in collaboration with the Lab’s Materials Sciences and Life Sciences. Last year, several group members patented their system, which provides automatic computerized control of microscopic experiments. The system collects video data, analyzes the data and then sends a signal to the instruments to carry out such delicate tasks as cleaving DNA molecules and controlling the shape of growing micro-crystals.

One key aspect of successful data-intensive computing — accessing data cached at various sites — was developed by Berkeley Lab for a DARPA-funded project. Called Distributed-Parallel Storage System, or DPSS, this technology successfully provided an economical, high performance and highly scalable design for caching large amounts of data for use by many different users. Brian Tierney continues this project with his team in NERSC’s Future Technologies Group.

In May, a team from Berkeley Lab and SLAC conducted an experiment using DPSS to support high energy physics data analysis. The team achieved a sustained data transfer rate of 57 MBytes per second, demonstrating that high-speed data storage systems could use distributed caches to make data available to systems running analysis codes.
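To put the 57 MBytes per second figure in perspective, the short sketch below converts it to megabits per second and estimates how long a dataset would take to move at that sustained rate. The 1,000-gigabyte dataset size and the decimal unit convention (1 GByte = 1,000 MBytes) are assumptions chosen for illustration; they do not come from the experiment itself.

```python
# Back-of-the-envelope arithmetic for a sustained transfer rate.
# The rate comes from the article; the dataset size below is a
# hypothetical value chosen only for illustration.

RATE_MBYTES_PER_S = 57  # sustained rate reported for the DPSS experiment


def megabits_per_second(mbytes_per_s: float) -> float:
    """Convert megabytes per second to megabits per second (8 bits per byte)."""
    return mbytes_per_s * 8


def transfer_hours(dataset_gbytes: float, mbytes_per_s: float) -> float:
    """Hours needed to move a dataset, assuming 1 GByte = 1,000 MBytes."""
    seconds = dataset_gbytes * 1000 / mbytes_per_s
    return seconds / 3600


if __name__ == "__main__":
    print(f"{megabits_per_second(RATE_MBYTES_PER_S):.0f} Mbit/s")
    # Hypothetical 1,000-GByte dataset
    print(f"{transfer_hours(1000, RATE_MBYTES_PER_S):.1f} hours")
```

Under these assumptions, 57 MBytes per second corresponds to 456 megabits per second, and a 1,000-gigabyte dataset would take a little under five hours to transfer.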

Overcoming the hurdles

With the development of various components necessary for data-intensive computing, the number of obstacles has dwindled. One of the last remaining issues, that of scheduling and allocating resources over networks, is being addressed by “differentiated services.” This technology, resulting from work by Van Jacobson’s Network Research Group, specially marks some data packets for priority service as they move across networks. A demonstration by Berkeley Lab in April showed that priority-marked packets arrived at eight times the speed of regular packets when sent through congested network connections. Differentiated services would ensure that designated projects could be conducted by reserving sufficient resources.
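As a rough illustration of what marking packets for priority service can look like from an application's point of view, the sketch below sets the Differentiated Services code point (DSCP) on a UDP socket, using the form in which the idea was later standardized for IP. This is not the code used in the Berkeley Lab demonstration, whose marking scheme is not described here. The choice of the Expedited Forwarding code point (46) and the helper names are assumptions, and whether routers honor the mark depends entirely on how the network is configured.

```python
import socket

# Expedited Forwarding code point; chosen here only as an example (assumption).
EF_DSCP = 46


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper six bits of the IP TOS/DS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0-63)")
    return dscp << 2


def make_marked_socket(dscp: int = EF_DSCP) -> socket.socket:
    """Create a UDP socket whose outgoing packets carry the given DSCP mark.

    Hypothetical helper: the operating system may ignore or restrict the
    setting, and routers only prioritize marked packets if configured to.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


if __name__ == "__main__":
    sock = make_marked_socket()
    # 192.0.2.1 is a reserved documentation address, used as a placeholder.
    sock.sendto(b"priority payload", ("192.0.2.1", 9))
    sock.close()
```

The application only labels its traffic; the priority treatment itself happens in the routers along the path, which is why the approach depends on cooperation from the network.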

The next big step, says Johnston, is to integrate the various components and technologies into a cohesive and reliable package — a set of “middleware services” that let applications easily use these new capabilities.

“We see China Clipper not so much as a ‘system,’ but rather as a coordinated collection of services that may be flexibly used for a variety of applications,” says Johnston. “Once it takes off, we see it opening new routes and opportunities for scientific discovery.”
