HPSS Disk Cache Upgrade Caters to Capacity, Bandwidth
Analysis of NERSC Users’ Data-Access Habits Reveals Sweet Spot for Short-term Storage
October 16, 2015
Contact: Kathy Kincade, +1 510 495 2124, email@example.com
NERSC users today are benefiting from a business decision made three years ago by the center’s Storage Systems Group (SSG) as it was planning an upgrade of the High-Performance Storage System (HPSS) disk cache: rather than focus primarily on bandwidth challenges, why not improve capacity as well?
The question grew out of NERSC’s efforts to explore whether “the cloud” could serve as a viable—and possibly less expensive—platform for the center’s ever-increasing data storage needs. While that project ultimately determined that NERSC’s petabyte-scale data archives would be too much for a cloud-based system, it also yielded invaluable insights into NERSC users’ data utilization habits.
“We learned something about how the HPSS was being used that changed how we architected it going forward,” said Jason Hick, SSG group lead. “We found that the vast majority of the data that comes into NERSC is accessed within the first 90 days.”
Previously, five days of peak I/O was NERSC’s metric for sizing its disk cache, Hick explained.
“But we found it wasn’t a true reflection of how the cache was being used,” he said.
For archiving purposes, NERSC uses HPSS, which implements hierarchical storage management, comprising a disk farm front end for short-term storage and a tape back end for long-term storage. When new files come in, they are immediately copied to tape so that they exist in both places. When the disk cache starts to get full, an algorithm selects and removes files that have already been migrated to tape.
“The disk cache serves two functions,” said Wayne Hurlbert, a staff engineer in the SSG. “It buffers data that has to go to tape, and it holds files for a certain period of time so users can access them more quickly.”
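The migration-and-purge behavior described above can be sketched in a few lines of Python. This is an illustrative toy model only, with hypothetical class and method names; HPSS's actual purge algorithm is configurable and considerably more sophisticated than the oldest-access-first policy assumed here.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    name: str
    size: int           # bytes
    last_access: float  # timestamp
    on_tape: bool = False


class HSMCache:
    """Toy model of hierarchical storage management as described above:
    new files land on disk, are copied to tape right away, and are
    purged from disk (the tape copy remains) when the cache fills."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.files: dict[str, CachedFile] = {}

    def ingest(self, f: CachedFile) -> None:
        # Copy to tape immediately so the file exists in both places.
        f.on_tape = True
        self.files[f.name] = f
        self.used += f.size
        self._purge_if_needed()

    def _purge_if_needed(self) -> None:
        # Evict only files that already have a tape copy, oldest
        # access first -- one plausible selection policy, assumed
        # here for illustration.
        candidates = sorted(
            (f for f in self.files.values() if f.on_tape),
            key=lambda f: f.last_access,
        )
        while self.used > self.capacity and candidates:
            victim = candidates.pop(0)
            del self.files[victim.name]
            self.used -= victim.size
```

A larger `capacity_bytes` directly translates into files surviving longer on disk before eviction, which is the user-visible benefit the SSG engineers describe: recently used data stays retrievable at disk speed instead of tape speed.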
Big Cache Benefits
With the new insights into when users most often access their data, NERSC chose to go in a new direction with its hardware choices for the cache upgrade, which was rolled out in early 2015. Rather than buying SAS or Fibre Channel drives designed to support high-bandwidth workloads, the group opted for a more capacity-oriented technology: Nearline SAS drives.
“We had resisted going away from the high-end disks, both for performance and reliability,” said Hurlbert, who has been instrumental in deploying the disk cache upgrade. “But at this point we realized we couldn’t afford to do what we wanted to do capacity-wise without going to a second-tier technology. The benefit of second-tier is that you can get a lot of capacity, the performance is still reasonable and the reliability appears so far to be reasonable as well.”
With the new arrays, the HPSS disk cache can now retain 30 days’ worth of data—roughly 1.5 PB—thus reducing I/O bottlenecks and improving users’ productivity.
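The article's figures imply a simple sizing rule of thumb: cache capacity is roughly the average daily ingest rate times the desired retention window. A back-of-the-envelope calculation (illustrative only; the published numbers are nominal) shows what the 30-day, 1.5 PB configuration implies, and what longer windows would require at the same ingest rate.

```python
# Units (decimal, as storage vendors quote them)
PB = 10**15
TB = 10**12

cache_capacity = 1.5 * PB   # figure from the article
retention_days = 30         # figure from the article

# Implied average ingest rate: capacity / retention window
daily_ingest = cache_capacity / retention_days
print(f"~{daily_ingest / TB:.0f} TB/day of incoming data")

# Capacity needed to extend retention at the same ingest rate
for days in (30, 60, 90):
    needed = daily_ingest * days
    print(f"{days}-day cache: ~{needed / PB:.1f} PB")
```

By this rough arithmetic, the 60- and 90-day caches the group hopes to grow into would require on the order of 3 PB and 4.5 PB, respectively, assuming ingest rates stay flat.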
“The bigger your cache the better because users are going to get their data sooner,” Hurlbert said. “At the point when we chose the new arrays, the cache was getting too small and was ending up being simply a tape buffer. The data wasn’t staying there even two days. That is the benefit of having a big cache: the files will remain there longer so users don’t have to wait for them to come back from tape.”
Another key benefit is improved performance, Hick emphasized. “If we’d used the previous technology, optimized for bandwidth, we would have had a tiny disk cache,” he said. “We were excited to find that there is value in having a larger disk cache for our users.”
The upgrade offers advantages for NERSC staff as well, noted Nick Balthaser, a storage systems engineer in the SSG who has also been instrumental in deploying the cache upgrade.
“Because the arrays are new, they’re more stable,” he said. “The previous ones were getting old, which caused a lot of interrupts, especially after hours. It is definitely a benefit to reduce the staff effort required to maintain hardware.”
In fact, the SSG has seen a decrease in consult tickets since deploying the new disk cache, according to Hick. Given that these tickets can take up to a week to resolve in some cases, that is good news for users and staff alike.
“This was a major reconfiguration of the cache, a behind-the-scenes operation that Wayne pulled off pretty seamlessly,” Balthaser said. “Users may notice that they’re getting their data faster, but they won’t really know why.”
Looking ahead, the group is optimistic that the results of the upgrade will enable NERSC to eventually expand to a 60- or 90-day disk cache. They are also evaluating NERSC’s new sponsored storage model, which is currently deployed for the NERSC Global File System (NGF) but could be a good fit for the HPSS as well. In the sponsored storage model, users who need larger-than-average allocations can opt to pay a one-time fee for a five-year allocation of additional storage.
“The average working dataset these days is 50 TB, and we are starting to get requests for allocations on HPSS up to 100 TB,” said Hick. “With NGF, when the allocations get above order of tens of terabytes, we transfer them to sponsored storage. With the bigger storage requests coming in for HPSS, we are now considering the sponsored storage model for this system as well.”
About Computing Sciences at Berkeley Lab
The Lawrence Berkeley National Laboratory (Berkeley Lab) Computing Sciences organization provides the computing and networking resources and expertise critical to advancing the Department of Energy's research missions: developing new energy sources, improving energy efficiency, developing new materials and increasing our understanding of ourselves, our world and our universe.
ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 6,000 scientists at national laboratories and universities, including those at Berkeley Lab's Computational Research Division (CRD). CRD conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation. NERSC and ESnet are DOE Office of Science User Facilities.
Lawrence Berkeley National Laboratory addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel Prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.