Data Tracking Increases Scientific Productivity

July 20, 2011

HPSS Storage

New supercomputers and networks are contributing to record levels of scientific productivity. In fact, every new system installed at NERSC over the last 10 years has generated about 50 percent more data than its predecessor. To effectively meet the increasing scientific demand for storage systems and services, the center’s staff must first understand how data moves within the facility. Until recently, obtaining these insights was extremely tedious because the statistics came from multiple sources, including network router statistics, client and server transfer logs, and storage and accounting reports, all saved as very large, independently formatted text files.

Now a dynamic database created by the NERSC Storage Systems Group continually collects statistics from all of these sources and compiles them into a single, searchable repository. The system also automatically generates daily email reports and graphs that illustrate how data moves in and out of the facility’s HPSS archival storage system, which is the largest repository of scientific data at the center.
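As a rough illustration of what compiling independently formatted logs into a single, searchable repository can look like, here is a minimal Python sketch using SQLite. The two log formats, the `transfers` table, and all sample values are invented for this example; the article does not describe the actual formats or schema used by the Storage Systems Group.

```python
import sqlite3

# Hypothetical sample lines; the real NERSC log formats are not described in the article.
CLIENT_LOG = [
    "2011-05-01 hsi hopper 2048",     # date client host megabytes (space separated)
    "2011-05-01 htar franklin 512",
]
SERVER_LOG = [
    "dtn01|2011-05-01|ftp|1024",      # host|date|client|megabytes (pipe separated)
]

def parse_client(line):
    """Turn a space-separated client log line into a (day, host, client, megabytes) row."""
    day, client, host, megabytes = line.split()
    return (day, host, client, int(megabytes))

def parse_server(line):
    """Turn a pipe-separated server log line into the same row layout."""
    host, day, client, megabytes = line.split("|")
    return (day, host, client, int(megabytes))

def build_repository(client_lines, server_lines):
    """Load both sources into one in-memory table that can be queried uniformly."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transfers (day TEXT, host TEXT, client TEXT, megabytes INTEGER)"
    )
    rows = [parse_client(l) for l in client_lines]
    rows += [parse_server(l) for l in server_lines]
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

conn = build_repository(CLIENT_LOG, SERVER_LOG)
row_count = conn.execute("SELECT COUNT(*) FROM transfers").fetchone()[0]
total_mb = conn.execute("SELECT SUM(megabytes) FROM transfers").fetchone()[0]
print(row_count, total_mb)
```

The point of the sketch is the design choice: each source gets its own small parser, and everything downstream queries one common row layout.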

“The daily reports help us understand the frequency, amount, and method of data movement between the archive and other systems in the center,” says Jason Hick, who heads the Storage Systems Group. “Because 50 percent of all data movement activities within the center involve HPSS, this capability allows us to identify user bottlenecks and gives us an indication of how we should invest in storage solutions.”

Strategy for Planning Hardware and Software

“When all we had were text files of various daily logs and reports, identifying specific events was like looking for a needle in a haystack, and quantifying trends proved to be extremely time-consuming,” says Michael Welcome of the Storage Systems Group, who spent several years developing the database as a side project.

But thanks to the new 50-gigabyte database, which holds several years of historical information alongside current data, Welcome notes that a quick query lets any analyst find a historical event, such as the installation of a new piece of hardware, and quantify its usefulness within seconds.

For instance, a query performed by Hick showed that NERSC’s Cray XT4 (Franklin) and XE6 (Hopper) systems accounted for the largest amount of data movement within the storage archive last year, which was not surprising. But the tool also revealed that NERSC’s Data Transfer Nodes (DTNs) are major data movement systems for HPSS. DTNs are dedicated servers for performing transfers between local storage resources, like HPSS and the NERSC Global Filesystem, and wide area resources, like the Leadership Computing Facilities at Argonne and Oak Ridge. As a result, the Storage Systems Group determined that the center should invest in more DTNs to aid users. Currently, those systems are being configured for installation.

Another aspect of the database that has been especially useful for storage analysts is the client software statistics, which look at the software used to transfer data to HPSS over time. “These statistics have been critical in directing our efforts to improve software solutions in support of user data movement over time,” says Hick.

A query of the HPSS software used between May 1, 2010 and 2011 shows that the vast majority of data transfers to or from HPSS used the HSI client, which provides a UNIX-like interface into the system, or the HTAR client, which is similar to UNIX tar and is recommended for archiving small files. Because HSI was so heavily used on the Franklin and pre-production Hopper systems, the center made specific improvements in the way this client was deployed on Hopper. According to Hick, these improvements allowed users to achieve twice the bandwidth to HPSS for multi-gigabyte files.
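The two kinds of queries described above, ranking systems by data moved and ranking client software by usage, can be sketched as a single grouped aggregation. This is only an illustration: the `transfers` table, its columns, and every number in the sample rows are hypothetical and do not reflect NERSC’s actual schema or measurements.

```python
import sqlite3

# Columns an analyst may group by; restricting the choice keeps the SQL safe to build.
ALLOWED_COLUMNS = {"host", "client"}

def load(rows):
    """Create an in-memory table of (host, client, gigabytes) transfer records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transfers (host TEXT, client TEXT, gigabytes REAL)")
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?)", rows)
    return conn

def total_by(conn, column):
    """Return (value, total gigabytes) pairs, largest total first."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"cannot group by {column!r}")
    query = (
        f"SELECT {column}, SUM(gigabytes) AS total "
        f"FROM transfers GROUP BY {column} ORDER BY total DESC"
    )
    return conn.execute(query).fetchall()

# Invented sample values, for illustration only.
sample = [
    ("hopper", "hsi", 40.0),
    ("franklin", "hsi", 30.0),
    ("dtn", "htar", 20.0),
    ("hopper", "htar", 10.0),
    ("dtn", "hsi", 5.0),
]
conn = load(sample)
by_host = total_by(conn, "host")
by_client = total_by(conn, "client")
print(by_host)
print(by_client)
```

With all sources in one table, answering a new question is a matter of changing the grouping column rather than re-parsing text files.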

Speeding Up Data Access for Users

For many data centers, tape libraries represent a cost-, energy-, and space-efficient solution for storing ever-increasing amounts of scientific data. NERSC houses four SL8500 libraries, each composed of four library storage modules (LSMs), and each module contains about 2,500 tape cartridges and a collection of drives to read them. In total, the center has about 40,000 tapes for user data archives and HPSS backups.
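The total follows directly from the figures in the paragraph above. A short check, using the approximate per-module count quoted in the text (the constant names are chosen for this example):

```python
LIBRARIES = 4               # SL8500 libraries
LSMS_PER_LIBRARY = 4        # library storage modules per library
CARTRIDGES_PER_LSM = 2500   # "about 2,500" per module, so the result is approximate

total_lsms = LIBRARIES * LSMS_PER_LIBRARY
total_cartridges = total_lsms * CARTRIDGES_PER_LSM
print(total_lsms, total_cartridges)  # 16 40000
```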

When a scientist requests data from the tape library, a robot locates the tape where the requested information is stored, grabs it, and loads it into a tape drive, which reads the data back for the user. This entire process takes a few seconds, unless all of the drives in that particular LSM are in use. In that case, the robot will look for an available drive in another LSM.

“Movements between LSMs are relatively slow,” says Welcome, “because the cartridge either has to be deposited in an elevator and moved to another LSM within the same library, or go through a pass-through port into another library. The user will observe this as slower access time to their data.” This really becomes a problem when the user is requesting data spread across numerous cartridges in LSMs with unavailable tape drives.

According to Welcome, the new database makes such events easy to identify and monitor: the daily report includes a section that shows cartridge movements between LSMs. These reports allow storage analysts to track cartridge movements across LSMs daily and determine whether rearranging cartridges and drive locations would speed up data access for users.
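A report section of this kind could be produced by classifying each tape mount according to where the cartridge lives and where the drive that served it sits. The sketch below assumes a hypothetical mount record of (cartridge label, home location, drive location), with each location given as a (library, LSM) pair; the record layout, labels, and category names are illustrative and not taken from the actual NERSC reports.

```python
from collections import Counter

def classify_move(home, drive):
    """Classify a mount by how far the cartridge had to travel.

    home and drive are (library, lsm) pairs. A different library means the
    pass-through port was needed; a different LSM in the same library means
    the elevator was needed; otherwise the mount stayed within one LSM.
    """
    home_library, home_lsm = home
    drive_library, drive_lsm = drive
    if home_library != drive_library:
        return "pass-through"
    if home_lsm != drive_lsm:
        return "elevator"
    return "local"

def summarize(mounts):
    """Count mounts per movement category for one day's worth of records."""
    return Counter(classify_move(home, drive) for _label, home, drive in mounts)

# Invented sample records, for illustration only.
mounts = [
    ("T00001", (0, 1), (0, 1)),  # served by a drive in its own LSM
    ("T00002", (0, 1), (0, 3)),  # same library, different LSM
    ("T00003", (2, 0), (3, 0)),  # different library
    ("T00004", (1, 2), (1, 2)),  # served by a drive in its own LSM
]
summary = summarize(mounts)
print(dict(summary))
```

A persistently high count in the "elevator" or "pass-through" categories would be the signal that cartridges or drives are worth rearranging.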

Now that the usefulness of this database is becoming increasingly apparent, Hick notes that he would like to see it grow to include statistics from the center’s largest disk storage repository, the NERSC Global Filesystem (NGF). The team would also like to build a graphical user interface (GUI) and analytics framework on top of the database so that all NERSC staff can use this information on demand for troubleshooting and decision-making.
