LBNL’s DataMover Reaches Milestone with Automated Transfer of Nearly 18,000 Files in a Single Request
November 30, 2004
Amidst the hype and hoopla at the recent SC2004 conference in Pittsburgh, Lawrence Berkeley National Laboratory’s Scientific Data Management Research Group demonstrated the robustness of the group’s DataMover by putting the application through its workaday paces. In doing so, the group reached a milestone when, with a single request, 17,870 data files were moved seamlessly from Brookhaven National Lab in New York to LBNL, both of which are operated by the U.S. Department of Energy.
The transfer was significant for two reasons: LBNL’s Eric Hjort steered it from the conference in Pittsburgh, and the number of files moved in a single request was the highest ever. But it was just another day of moving data for Hjort, who oversees the transfer of files generated at Brookhaven’s STAR experiment to the High Performance Storage System (HPSS) at the National Energy Research Scientific Computing (NERSC) Center at LBNL. Once the user specifies the directory to be moved, along with all its files, and the destination, DataMover automates all aspects of the transfer.
Once the application starts, DataMover communicates with the source and target hierarchical resource managers (HRMs) at BNL and NERSC. The HRMs are Grid middleware components, developed by the Scientific Data Management Research Group, that manage the staging of files from HPSS and the archiving of files to it. Working through the HRMs, DataMover extracts the directory structure from the source HPSS, generates the corresponding directory structure at the target HPSS, and places the list of requested files in the target HRM. The target HRM then contacts the HRM at the data source to stage the files and uses GridFTP to transfer the data.
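The sequence described above can be sketched in a few lines of Python. This is only an illustrative model, not the DataMover or HRM code: the ToyHRM class, its methods, and the move_directory function are hypothetical names invented here, the storage systems are plain in-memory dictionaries, and a dictionary assignment stands in for the GridFTP transfer.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ToyHRM:
    """Hypothetical in-memory stand-in for an HRM in front of a mass storage system."""
    name: str
    files: dict = field(default_factory=dict)   # path -> file contents
    dirs: set = field(default_factory=set)      # directories created at this site
    queue: list = field(default_factory=list)   # source paths waiting to be pulled

    def list_tree(self, root):
        """Return (directories holding files, file paths) found under root."""
        root_path = PurePosixPath(root)
        paths = [p for p in self.files if root_path in PurePosixPath(p).parents]
        dirs = {str(PurePosixPath(p).parent) for p in paths}
        return sorted(dirs), sorted(paths)

    def make_dirs(self, dirs):
        self.dirs.update(dirs)

    def stage(self, path):
        """Pretend to bring a file from tape to disk and hand back its contents."""
        return self.files[path]

    def pull_all(self, source, src_root, dst_root):
        """Drain the queue: ask the source HRM to stage each file, then copy it."""
        while self.queue:
            src_path = self.queue.pop(0)
            data = source.stage(src_path)
            rel = PurePosixPath(src_path).relative_to(src_root)
            # This assignment stands in for the GridFTP transfer and archiving.
            self.files[str(PurePosixPath(dst_root) / rel)] = data


def move_directory(source, target, src_root, dst_root):
    """Mirror src_root from source to dst_root at target; return the file count."""
    dirs, paths = source.list_tree(src_root)                 # 1. extract structure
    target.make_dirs(                                        # 2. recreate it at target
        str(PurePosixPath(dst_root) / PurePosixPath(d).relative_to(src_root))
        for d in dirs
    )
    target.queue.extend(paths)                               # 3. hand file list to target
    target.pull_all(source, src_root, dst_root)              # 4. target pulls the files
    return len(paths)
```

In this sketch a single call such as move_directory(bnl, nersc, "/star", "/archive/star") plays the role of the one request the user issues; everything after that call runs without further input.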
Without DataMover, users would have to manually locate each of the files, then transfer them one by one. Because STAR generates about 400 terabytes of data each year, automating the transfer is critical. Using GridFTP with large TCP windows, together with staging, transferring, and archiving multiple files concurrently, enables effective end-to-end transfer rates.
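To illustrate why concurrency matters, the following sketch moves several files at once, so that one file can be staged while another is in transit and a third is being archived. It is a toy model under stated assumptions: the three steps are simulated with short sleeps, and move_one, move_all, and ActivityMeter are hypothetical names, not part of DataMover or GridFTP.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ActivityMeter:
    """Counts how many files are in flight at once (for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def _step(label, name, delay):
    """Simulate one slow step (staging, transfer, or archiving) with a sleep."""
    time.sleep(delay)
    return f"{label}:{name}"


def move_one(name, meter, delay=0.01):
    """Take a single file through stage -> transfer -> archive."""
    with meter:
        staged = _step("staged", name, delay)
        sent = _step("sent", staged, delay)
        return _step("archived", sent, delay)


def move_all(names, workers=4, delay=0.01):
    """Move many files, with up to `workers` files in flight at the same time."""
    meter = ActivityMeter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda n: move_one(n, meter, delay), names))
    return results, meter.peak
```

With one worker the files would be handled strictly one after another; raising the worker count lets the slow steps of different files overlap, which is the effect the paragraph above describes.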
“A nice milestone and one more step in the path to success for the LBNL/SRM team. I have to renew my thanks and gratitude for such a tool: it has made our lives so much easier in STAR for the past few years,” said Jerome Lauret of Brookhaven. “Thanks and congratulations for a successful demo.”
The DataMover also handles “transient failures,” such as failed network connections or problems in a storage system at either end, by automatically retrying until the connection is re-established. File tracking logs help users monitor file transfers and spot problems such as network slowdowns and bottlenecks. In addition, DataMover records the transferred files in a file catalog at the target NERSC site.
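The retry behavior can be pictured with a small sketch like the one below. It is not DataMover’s actual logic: the TransientError class and the retry function are hypothetical, and the attempt limit and doubling delay are assumptions made so that the example terminates, whereas the article describes retrying until the connection comes back.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a dropped connection."""


def retry(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait doubles after each failure (base_delay, 2*base_delay, ...).
    Both the attempt limit and the doubling are assumptions of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the failure
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing the sleep function as a parameter makes the helper easy to exercise without real delays, for example by recording the requested waits in a list.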
“In the past, users had to baby-sit all the transfers,” said Alex Sim, a member of the Scientific Data Management Research Group and one of the developers of the application. “This application automates all those processes.” Various components of this system were developed by Sim, Junmin Gu and Vijaya Natarajan under the leadership of Arie Shoshani, who is the PI for the Storage Resource Management project at LBNL.
The DataMover can work with any storage system that supports an SRM middleware interface. It is also being used by the Earth System Grid, a climate research collaboration among Berkeley Lab, Oak Ridge and Lawrence Livermore national labs, and the National Center for Atmospheric Research (NCAR). Recently DataMover transferred 4,224 files containing 770 gigabytes from the HPSS at Oak Ridge to NCAR’s specialized Mass Storage System (MSS). For this purpose, the HPSS-HRM was adapted to work with NCAR’s MSS.