HPSS Disk Cache Upgrade Caters to Capacity, Bandwidth

Analysis of NERSC Users’ Data-Access Habits Reveals Sweet Spot for Short-term Storage

October 16, 2015

Contact: Kathy Kincade, +1 510 495 2124, kkincade@lbl.gov

NERSC users today are benefiting from a business decision the center’s Storage Systems Group (SSG) made three years ago while planning an upgrade to the High-Performance Storage System (HPSS) disk cache: rather than focus primarily on bandwidth challenges, why not improve capacity as well?

The question grew out of NERSC’s efforts to explore whether “the cloud” could serve as a viable—and possibly less expensive—platform for the center’s ever-increasing data storage needs. While that project ultimately determined that NERSC’s petabyte-scale data archives would be too much for a cloud-based system, it also yielded invaluable insights into NERSC users’ data utilization habits.

“We learned something about how the HPSS was being used that changed how we architected it going forward,” said Jason Hick, SSG group lead. “We found that the vast majority of the data that comes into NERSC is accessed within the first 90 days.”

Previously, five days of peak I/O was NERSC’s metric for sizing its disk cache, Hick explained.

“But we found it wasn’t a true reflection of how the cache was being used,” he said.

For archiving, NERSC uses HPSS, which implements hierarchical storage management with a disk-farm front end for short-term storage and a tape back end for long-term storage. New files are copied to tape as soon as they arrive, so they exist in both places. When the disk cache starts to fill, an algorithm selects and removes files that have already been migrated to tape.
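The purge step described above can be illustrated with a short sketch. This is not HPSS's actual purge policy, which the article does not detail. It assumes, purely for illustration, that purging starts at a high-water mark, stops at a low-water mark, and removes the least recently accessed files first. The names CachedFile, select_for_purge, high_water and low_water are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    name: str
    size_tb: float
    last_access_day: int  # hypothetical field: day number of the most recent access
    on_tape: bool         # True once the file has been migrated to tape


def select_for_purge(files, capacity_tb, high_water=0.9, low_water=0.7):
    """Return the names of files to remove from the disk cache.

    Assumed policy (not taken from HPSS): start purging once usage exceeds
    high_water * capacity, and stop once usage is at or below
    low_water * capacity.
    """
    used = sum(f.size_tb for f in files)
    if used <= high_water * capacity_tb:
        return []  # cache is not full enough to need purging

    target = low_water * capacity_tb
    # Only files that already have a tape copy are safe to remove.
    candidates = sorted(
        (f for f in files if f.on_tape),
        key=lambda f: f.last_access_day,  # least recently accessed first
    )

    purged = []
    for f in candidates:
        if used <= target:
            break
        purged.append(f.name)
        used -= f.size_tb
    return purged


files = [
    CachedFile("a", 40.0, last_access_day=1, on_tape=True),
    CachedFile("b", 30.0, last_access_day=5, on_tape=False),  # not yet on tape
    CachedFile("c", 20.0, last_access_day=3, on_tape=True),
    CachedFile("d", 10.0, last_access_day=9, on_tape=True),
]
print(select_for_purge(files, capacity_tb=100.0))
```

In this sketch, a file without a tape copy is never selected, no matter how old it is. That reflects the one constraint the article does state: only files already migrated to tape are removed.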

“The disk cache serves two functions,” said Wayne Hurlbert, a staff engineer in the SSG. “It buffers data that has to go to tape, and it holds files for a certain period of time so users can access them more quickly.”

Big Cache Benefits

With the new insights into when users most often access their data, NERSC chose a new direction in hardware for the cache upgrade, which was rolled out in early 2015. Rather than buying SAS or Fibre Channel drives designed for high-bandwidth workloads, the group opted for a more capacity-oriented technology: Nearline-SAS drives.

“We had resisted going away from the high-end disks, both for performance and reliability,” said Hurlbert, who has been instrumental in deploying the disk cache upgrade. “But at this point we realized we couldn’t afford to do what we wanted to do capacity-wise without going to a second-tier technology. The benefit of second-tier is that you can get a lot of capacity, the performance is still reasonable and the reliability appears so far to be reasonable as well.”

With the new arrays, the HPSS disk cache can now retain 30 days’ worth of data—roughly 1.5 PB—thus reducing I/O bottlenecks and improving users’ productivity.
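For a sense of scale, the figures above imply an average of about 50 TB of new data per day. The sketch below works through that arithmetic. It assumes decimal units (1 PB = 1,000 TB) and a constant ingest rate, neither of which the article states, and the function names are hypothetical.

```python
def daily_ingest_tb(cache_pb, retention_days):
    """Average TB/day implied by a cache that holds `retention_days` of data."""
    return cache_pb * 1000 / retention_days  # assumes 1 PB = 1,000 TB


def cache_needed_pb(ingest_tb_per_day, retention_days):
    """Cache size in PB needed to retain `retention_days` at a constant rate."""
    return ingest_tb_per_day * retention_days / 1000


rate = daily_ingest_tb(cache_pb=1.5, retention_days=30)
print(rate)  # 50.0 TB per day

# Rough extrapolation under the constant-rate assumption.
for days in (30, 60, 90):
    print(days, cache_needed_pb(rate, days))  # 1.5, 3.0 and 4.5 PB
```

Under those assumptions, doubling or tripling the retention window would take roughly 3 PB or 4.5 PB of disk. Real requirements would shift with any growth in the ingest rate.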

“The bigger your cache the better because users are going to get their data sooner,” Hurlbert said. “At the point when we chose the new arrays, the cache was getting too small and was ending up being simply a tape buffer. The data wasn’t staying there even two days. That is the benefit of having a big cache: the files will remain there longer so users don’t have to wait for them to come back from tape.”

Fewer Interrupts

Another key benefit is improved performance, Hick emphasized. “If we’d used the previous technology, optimized for bandwidth, we would have had a tiny disk cache,” he said. “We were excited to find that there is value in having a larger disk cache for our users.”

The upgrade offers advantages for NERSC staff as well, noted Nick Balthaser, a storage systems engineer in the SSG who has also been instrumental in deploying the cache upgrade.

“Because the arrays are new, they’re more stable,” he said. “The previous ones were getting old, which caused a lot of interrupts, especially after hours. It is definitely a benefit to reduce the staff effort required to maintain hardware.”

In fact, the SSG has seen a decrease in consult tickets since deploying the new disk cache, according to Hick. Given that these tickets can take up to a week to resolve in some cases, that is good news for users and staff alike.

“This was a major reconfiguration of the cache, a behind-the-scenes operation that Wayne pulled off pretty seamlessly,” Balthaser said. “Users may notice that they’re getting their data faster, but they won’t really know why.”

Looking ahead, the group is optimistic that the results of the upgrade will enable NERSC to eventually expand to a 60- or 90-day disk cache. They are also evaluating NERSC’s new sponsored storage model, which is currently deployed for the NERSC Global File System (NGF) but could be a good fit for the HPSS as well. In the sponsored storage model, users who need larger-than-average allocations can opt to pay a one-time fee for a five-year allocation of additional storage.

“The average working dataset these days is 50 TB, and we are starting to get requests for allocations on HPSS up to 100 TB,” said Hick. “With NGF, when the allocations get above order of tens of terabytes, we transfer them to sponsored storage. With the bigger storage requests coming in for HPSS, we are now considering the sponsored storage model for this system as well.”
