But an HPC center is only as fast as its network, and in 2025, a storage network overhaul is enhancing the user experience and laying the groundwork for future systems.
A reliable and performant storage fabric – that is, a unified system of computing, storage, and networking – is integral to HPC for science. Users need a secure and easily accessible place to store their research data and a way to move it in and out of storage. At NERSC, improvements to the network fabric have reduced latency and significantly increased available bandwidth to the Community, Global Common, and Homes File Systems, known collectively as the NERSC Global Filesystem (NGF).
Since NERSC moved to Shyh Wang Hall at Berkeley Lab in 2015, its storage network has been based on 56 Gb/s InfiniBand FDR technology. After nearly a decade, those original switches needed replacing to support up-and-coming technologies. So in 2023, the NERSC networking and storage teams put their heads together to begin planning for what might come next. After assessing the available technology and the possible future direction of HPC, the teams settled on a high-bandwidth, low-latency protocol known as RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE, pronounced “rocky”).
RoCE works with a traditional Ethernet network and can be routed across different networks, offering increased flexibility with regard to both architecture and vendors. The open nature of Ethernet can result in earlier access to higher speeds and greater numbers of nodes; additionally, an Ethernet network can be upgraded organically according to the future needs of the center.
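One practical consequence is that RoCE exposes the same RDMA “verbs” programming interface as InfiniBand, so software written against that interface is largely insulated from the change of fabric. As a rough illustration – a minimal sketch, not NERSC’s actual software stack – a client that sends one message over an RDMA connection using the librdmacm helper API might look like the following; the peer address, port, and payload are hypothetical placeholders, and whether the bytes travel over InfiniBand or RoCE is determined by the adapter and network rather than by the code:

```c
/* Minimal RDMA client sketch using librdmacm's helper API.
 * Build: cc roce_client.c -lrdmacm -libverbs
 * The address, port, and message are illustrative placeholders.
 * The same verbs code runs unchanged over InfiniBand or RoCE; the
 * transport is chosen by the adapter and fabric, not the program. */
#include <string.h>
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>

int main(void)
{
    struct rdma_addrinfo hints, *res;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id;
    struct ibv_mr *mr;
    struct ibv_wc wc;
    char msg[64] = "hello over RDMA";

    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;          /* reliable-connection service */
    if (rdma_getaddrinfo("10.0.0.1", "7471", &hints, &res))  /* placeholder peer */
        return 1;

    memset(&attr, 0, sizeof(attr));
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;                        /* every send generates a completion */

    if (rdma_create_ep(&id, res, NULL, &attr))  /* create endpoint + queue pair */
        return 1;

    mr = rdma_reg_msgs(id, msg, sizeof(msg));   /* register the buffer for RDMA */
    if (!mr || rdma_connect(id, NULL))
        return 1;

    /* Post one send and wait for its completion. */
    if (rdma_post_send(id, NULL, msg, sizeof(msg), mr, 0) ||
        rdma_get_send_comp(id, &wc) <= 0)
        return 1;

    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}
```

Keeping the transport decision below the API is part of what gives the open-Ethernet approach its vendor flexibility.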
“We’re an open science lab. The open standard is Ethernet, and RoCE was built on the open standard,” said NERSC network engineer Ashwin Selvarajan, who worked on the project. “At that point we decided that we would be running an open network, which would be standards-compliant, and we would be able to choose solutions from many vendors.”
Working Together, Linking to the Future
The new setup – composed of 36 top-of-rack switches, four routers, and approximately 200 links running at 400 Gb/s each – was so new that it had not yet reached the open market. Getting it up and running required a safe testing environment and a roadmap to make sure everything would work well for users: a team effort drawing on support and expertise from across NERSC, including the Operations Technology, Computational Systems, Networking, and Storage Systems Groups. Together, they developed a testing plan and set up a separate parallel environment – including a dedicated and isolated file system, network, and set of the DVS (Data Virtualization Service) gateway nodes that deliver NGF to the compute nodes on Perlmutter – allowing large-scale testing on Perlmutter without disrupting users. Beyond the tests and troubleshooting that come with standing up any new system, they discovered an issue with those DVS nodes that would have caused slower performance for users. With help from the Computational Systems Group and the vendor, Hewlett Packard Enterprise (HPE), the team resolved the issue, clearing one more hurdle to regular use.
After thoroughly evaluating the new setup, staff began the process of building it out. Between January 2024 and March 2025, the teams worked together to build the new network alongside the existing one. Then, capping off years of preparation, they made the final switch.
“The final cutover took five minutes, but we spent a lot of time planning and rehearsing beforehand so that it would be seamless,” said NERSC storage engineer Ravi Cheema, who coordinated the storage team.
The transition from one network setup to another was the biggest network infrastructure project in about a decade, said Selvarajan: “This is the largest infrastructure upgrade the network team has handled since we moved into this space, in terms of the number of switches in the fabric, the fabric itself, and the number of links.”
Speedups Behind the Scenes
Network upgrades aren’t always visible to users, but keeping the connections between supercomputers and storage systems fast makes a huge difference to the user experience.
“We’ve increased the basic speeds by 100%,” said Selvarajan. “That’s going to be helpful for users. Whatever reads and writes they’re doing, whatever they’re doing to move data in and out of the system, analyzing, writing back to the file system, all those things are going to speed up by the same amount.”
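To make that arithmetic concrete: for a fixed amount of data, transfer time scales inversely with bandwidth, so doubling the usable rate halves the time spent moving data. The sketch below works through a hypothetical example; the 10 TB dataset and the 200 and 400 Gb/s rates are illustrative placeholders, not NERSC measurements.

```c
/* Back-of-envelope: how transfer time scales with link bandwidth.
 * All numbers are illustrative placeholders, not NERSC measurements. */
#include <stdio.h>

int main(void)
{
    const double dataset_tb   = 10.0;    /* hypothetical dataset size, terabytes */
    const double bits_per_tb  = 8e12;    /* 1 TB = 8e12 bits (decimal units)     */
    const double rates_gbps[] = { 200.0, 400.0 };  /* before/after usable rate   */

    for (int i = 0; i < 2; i++) {
        double seconds = dataset_tb * bits_per_tb / (rates_gbps[i] * 1e9);
        printf("%.0f Gb/s: %.0f TB moves in ~%.0f s (~%.1f min)\n",
               rates_gbps[i], dataset_tb, seconds, seconds / 60.0);
    }
    return 0;
}
```

At the doubled rate, the same hypothetical 10 TB dataset moves in roughly half the time – about 200 seconds instead of 400 – and the same proportional saving applies to every read and write that traverses the fabric.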
“This upgrade is a big win for users. In addition to increasing the available bandwidth, it also reduces some of the sources of instability due to network congestion. I expect that workloads that use the Community File System should have an improved experience,” said Lisa Gerhardt, the Acting Deputy of Operations.
Benefits to current users are just the beginning. The NERSC networking team will continue working to make the storage network even more efficient as long as Perlmutter is in service. They’re also working to ensure that the connection between the NERSC storage fabric and NERSC’s upcoming supercomputer, Doudna, will deliver next-generation performance. Ultimately, this collaborative effort opens the door to better scientific computing across the board.

“This upgrade is a great example of how NERSC works together to improve service for its users to enable and empower cutting-edge research,” said NERSC Data Center Department Head Charles Schwartz. “Through forward-looking thinking, technological know-how, and effective teamwork, the Operations Technology, Computational Systems, Networking, and Storage Systems groups all worked collaboratively to further open science by removing, reducing, and easing technological barriers.”
# # #
The National Energy Research Scientific Computing Center (NERSC) is the mission computing facility for the U.S. Department of Energy Office of Science, the nation’s single largest supporter of basic research in the physical sciences.
Located at Lawrence Berkeley National Laboratory (Berkeley Lab), NERSC serves 11,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials sciences, physics, chemistry, computational biology, and other disciplines. An average of 2,000 peer-reviewed science results a year rely on NERSC resources and expertise, which have also supported the work of seven Nobel Prize-winning scientists and teams.
About Computing Sciences at Berkeley Lab
High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab's Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.