The AI methods and applications driving transformative change in science require vast computing resources to train and optimize ever-larger models, and to couple those models with existing large-scale scientific simulations and data pipelines.
To do this, we can draw on supercomputing expertise and resources developed in the DOE open-science community over many years, but doing so also requires considerable new development in scalable algorithms, software, AI hardware design, system deployment, and benchmarking.
Berkeley Lab hosts NERSC, the mission supercomputing center for DOE open science. NERSC’s latest system, Perlmutter, is a world-leading AI supercomputer, with over 6,000 A100 GPUs, high-performance file systems and networking, and software optimized for AI in science. NERSC is also planning its next system, “NERSC-10”, and beyond, working with vendors to ensure these systems are optimized to build AI into complex scientific workflows.
Our researchers develop scalable AI libraries and software tooling to exploit these large-scale computing resources, and are heavily involved in industry-wide benchmarking. We also work to empower the science community for AI at supercomputing scale through outreach and training.
The Perlmutter system is a world-leading AI supercomputer consisting of over 6,000 Nvidia A100 GPUs, an all-flash filesystem, and a novel high-speed network. The National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab also works closely with vendors to ensure software is optimized for AI at large computing scale, and provides consulting, joint projects, and training to enable the community to exploit these resources. Contact: Steven Farrell
MLPerf HPC is a machine learning performance benchmark suite for scientific ML workloads on large supercomputers. It measures the time to train deep learning models on massive scientific datasets, as well as full-system throughput for training many models concurrently. MLPerf HPC has had two successful annual submission rounds featuring results on systems around the world, including the Perlmutter system at NERSC. Contact: Steven Farrell
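The core metric here, time-to-train, is the wall-clock time for a model to reach a target quality. A minimal sketch of that measurement loop is below; the `train_step` callback and decaying-loss stand-in are hypothetical, not part of the actual MLPerf HPC harness.

```python
import time

def time_to_train(train_step, target_loss, max_steps=1000):
    """Run training steps until the model reaches a target quality,
    returning (elapsed wall-clock seconds, steps taken) -- the kind of
    time-to-train metric MLPerf-style benchmarks report."""
    start = time.perf_counter()
    for step in range(max_steps):
        loss = train_step(step)
        if loss <= target_loss:
            return time.perf_counter() - start, step + 1
    return None, max_steps  # target quality never reached

# Hypothetical stand-in for a real training step: loss decays each step.
elapsed, steps = time_to_train(lambda s: 1.0 / (s + 1), target_loss=0.05)
```

In the real benchmark, the "step" is a distributed training epoch across thousands of GPUs, and quality is a task-specific validation metric rather than a raw loss value.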
The goal of this project is to investigate using machine learning techniques to generate automated metadata that will enable search on data. Contact: Lavanya Ramakrishnan
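To illustrate the idea of automated metadata for search, here is a toy keyword extractor based on simple term frequency. This is only a crude stand-in for the learned models the project investigates; the stopword list and example text are invented for illustration.

```python
import re
from collections import Counter

# A tiny stopword list -- real systems use much larger, curated lists.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "on", "for", "is", "with"}

def extract_keywords(text, k=5):
    """Rank words by frequency after dropping stopwords, yielding
    candidate metadata tags that could be indexed for search."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

keywords = extract_keywords(
    "Beam energy scan data from the 2021 run; beam current logs "
    "and detector calibration data for the energy calibration study."
)
```

A learned approach would go further, e.g. classifying file contents or inferring domain-specific fields, but the output is the same in spirit: tags attached to data that make it findable.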
FourCastNet, short for Fourier Forecasting Neural Network, is a global data-driven weather forecasting model that provides accurate short- to medium-range global predictions at high resolution. FourCastNet accurately predicts high-resolution, fast-timescale variables such as surface wind speed, precipitation, and atmospheric water vapor. It can generate forecasts with extreme computational savings compared to standard numerical weather prediction models. It has important implications for planning wind energy resources and for predicting extreme weather events such as tropical cyclones, extra-tropical cyclones, and atmospheric rivers. Contact: Shashank Subramanian
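The "Fourier" in the name refers to Fourier-layer architectures, which mix information globally by applying learned weights in frequency space. The 1-D NumPy sketch below shows that core idea only; the weights here are random placeholders, and FourCastNet itself operates on 2-D atmospheric fields with a far more elaborate architecture.

```python
import numpy as np

def spectral_layer(u, weights, modes):
    """Toy 1-D Fourier layer: transform the field to frequency space,
    apply learned weights to the lowest `modes` frequencies, and
    transform back to the grid."""
    u_hat = np.fft.rfft(u)                              # to frequency domain
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = u_hat[:modes] * weights[:modes]   # learned mode mixing
    return np.fft.irfft(out_hat, n=len(u))              # back to grid space

rng = np.random.default_rng(0)
u = rng.standard_normal(64)                  # a field sampled on a 1-D grid
w = rng.standard_normal(8) + 1j * rng.standard_normal(8)  # placeholder "learned" weights
v = spectral_layer(u, w, modes=8)
```

Because each frequency mode couples every grid point, a single such layer captures global structure that a local convolution would need many layers to reach, which is part of why these models are fast at planetary scale.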
As supercomputers become ever more capable in their march toward exascale levels of performance, scientists can run increasingly detailed and accurate simulations to study problems ranging from cleaner combustion to the nature of the universe. The challenge is that these powerful simulations are “computationally expensive,” consuming 10 to 50 million CPU hours for a single simulation. The ExaLearn project aims to develop new tools to help scientists overcome this challenge by applying machine learning to very large experimental datasets and simulations. Contact: Peter Nugent
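One common way ML reduces this cost is a surrogate model: run the expensive simulation at a few design points, fit a cheap model to those results, and query the cheap model everywhere else. A minimal sketch under invented assumptions (a one-input "simulation" and a polynomial surrogate, neither of which is ExaLearn's actual method):

```python
import numpy as np

# Pretend this is an expensive simulation (millions of CPU hours in practice).
def expensive_simulation(x):
    return np.sin(3 * x) + 0.5 * x**2

# Run the "simulation" at a handful of design points...
x_train = np.linspace(-1, 1, 20)
y_train = expensive_simulation(x_train)

# ...and fit a cheap polynomial surrogate to stand in for it.
coeffs = np.polyfit(x_train, y_train, deg=6)
surrogate = np.poly1d(coeffs)

# The surrogate can now be evaluated millions of times at negligible cost.
approx = surrogate(0.3)
exact = expensive_simulation(0.3)
```

Real surrogates replace the polynomial with a deep network over high-dimensional inputs, but the economics are the same: pay the simulation cost once per training point, then amortize it over every later query.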
Researchers from Berkeley Lab, Caltech, and NVIDIA trained the Fourier Neural Operator deep learning model to emulate atmospheric dynamics and provide high-fidelity extreme weather predictions across the globe a full five days in advance.
NERSC today formally unveiled the first phase of its next-generation supercomputer, Perlmutter, at a virtual event that included government dignitaries, industry leaders, and Dr. Perlmutter himself.
ExaLearn is a machine learning project supported by DOE’s Exascale Computing Project that is developing new tools to help scientists use machine learning on massive experimental datasets and simulations.