# Surrogate Models

Berkeley Lab researchers have long been heavily involved in ‘traditional’ physics-based numerical simulations. These are often computationally intensive, especially when run at ever higher resolutions to resolve finer structure.

More recently, there has been considerable progress and promise in the use of AI/ML as surrogate models. Some approaches approximate the full simulation with an entirely data-driven model; others run surrogate models alongside low-resolution traditional simulations to model finer-scale phenomena. These models usually require high-fidelity simulations for training but, once trained, can produce new simulation data orders of magnitude faster than traditional approaches, offering the transformative potential to run more complex scientific simulations.

Our researchers and engineers are actively tackling several challenges in producing effective AI/ML surrogate models for science, including incorporating physics-informed constraints into models, quantifying uncertainty, interoperating with existing scientific simulations, and scaling the training of large models on high performance computing (HPC) resources.

## Projects

### Surrogate Modeling for Biofuel and Bioproduct Production

This project uses complex process simulation models for advanced biofuel and bioproduct production to develop and train machine learning (ML)-based surrogate models. Researchers need flexibility to explore different scenarios and understand how their work may impact upstream and downstream processes, as well as cost and greenhouse gas emissions. To address this need, the team uses the Tree-Based Pipeline Optimization Tool (TPOT) to automatically identify the best ML pipelines for predicting cost and mass/energy flow outputs. This approach has been used with two promising bio-based jet fuel blendstocks: limonane and bisabolane. The results show that ML algorithms trained on simulation trials may serve as powerful surrogates for accurately approximating model outputs at a fraction of the computational expense. **Contact: Corrine Scown (Scown on the Web)**
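
As a rough illustration of automated pipeline selection on simulation trials, the sketch below scores a handful of candidate models by cross-validation and keeps the best one. It does not use TPOT: TPOT searches a much larger space of scikit-learn pipelines with genetic programming, whereas this example loops exhaustively over polynomial degrees. The `process_simulation` function, its coefficients, and the noise level are hypothetical stand-ins, not the project's process models.

```python
import numpy as np

def process_simulation(x):
    """Hypothetical stand-in for a process model: one input -> one cost output."""
    return 2.0 + 1.5 * x - 0.8 * x**2 + 0.3 * x**3

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 3.0, size=60)  # inputs of the simulation trials
y = process_simulation(x) + rng.normal(0.0, 0.05, size=x.size)

def cv_rmse(degree, x, y, n_folds=5):
    """Mean k-fold RMSE of a polynomial surrogate of the given degree."""
    indices = np.arange(x.size)
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        coeffs = np.polyfit(x[train], y[train], degree)
        residual = np.polyval(coeffs, x[held_out]) - y[held_out]
        errors.append(np.sqrt(np.mean(residual**2)))
    return float(np.mean(errors))

# Candidate "pipelines": here just polynomial degrees (an assumption).
candidate_degrees = [1, 2, 3, 4, 5, 6]
scores = {d: cv_rmse(d, x, y) for d in candidate_degrees}
best_degree = min(scores, key=scores.get)

print("cross-validated RMSE by degree:", scores)
print("selected degree:", best_degree)
```

Once a candidate is selected, evaluating it costs a polynomial evaluation rather than a full simulation run, which is the trade the project description refers to.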

### gpCAM for Domain-Aware Autonomous Experimentation

The gpCAM project consists of an API and software designed to make autonomous data acquisition and analysis for experiments and simulations faster, simpler, and more widely available by leveraging active learning. At its core, the tool uses flexible and powerful Gaussian process regression, which provides the ability to compute surrogate models and their associated uncertainties. The flexibility and agnosticism stem from the modular design of gpCAM, which allows users to implement and import their own Python functions to customize and control almost every aspect of the software. That makes it easy to tune the algorithm to account for various kinds of physics and other domain knowledge and constraints, and to find interesting features and function characteristics. A specialized function optimizer in gpCAM can take advantage of high performance computing (HPC) architectures for fast analysis and reactive autonomous data acquisition. **Contact: Marcus Noack**
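
The idea of a Gaussian process surrogate that drives data acquisition can be sketched in a few lines of NumPy. This is not gpCAM's API; it is a minimal illustration assuming a fixed squared-exponential kernel, a hypothetical `expensive_experiment` function, and the simplest possible acquisition rule (measure next where the posterior variance is largest).

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel (unit prior variance) between 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_query."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(x_train.size)
    K_s = rbf_kernel(x_train, x_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance is 1 for this kernel
    return mean, np.maximum(var, 0.0)

def expensive_experiment(x):
    """Hypothetical stand-in for an instrument measurement or simulation."""
    return np.sin(6.0 * x) + 0.5 * x

candidates = np.linspace(0.0, 1.0, 201)
x_obs = np.array([0.0, 1.0])
y_obs = expensive_experiment(x_obs)

# Active-learning loop: always measure where the surrogate is least certain.
for _ in range(10):
    mean, var = gp_posterior(x_obs, y_obs, candidates)
    x_next = candidates[np.argmax(var)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, expensive_experiment(x_next))

mean, var = gp_posterior(x_obs, y_obs, candidates)
max_error = np.max(np.abs(mean - expensive_experiment(candidates)))
print("measurements taken:", x_obs.size)
print("largest surrogate error on the candidate grid:", max_error)
```

Swapping the `np.argmax(var)` line for a different acquisition rule, or the kernel for one that encodes domain knowledge, is the kind of customization the description above attributes to user-supplied Python functions.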

### Python-based Surrogate Modeling Objects (PySMO)

PySMO is an open-source tool for generating accurate algebraic surrogates that are directly integrated with an equation-oriented (EO) optimization platform, specifically IDAES and its underlying optimization library, Pyomo. PySMO includes implementations of several sampling and surrogate methods (polynomial regression, Kriging, and RBFs), providing a breadth of capabilities suitable for a variety of engineering applications. PySMO surrogates have been demonstrated to be very useful for enabling the algebraic representation of external simulation codes, black-box models, and complex phenomena in IDAES and other related projects. **Contact: Oluwamayowa Amusat (Amusat on the Web)**
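
To make "algebraic surrogate" concrete, the sketch below fits a quadratic polynomial to samples of a black-box function and prints the result as a closed-form expression, which is the form an equation-oriented solver can work with directly. It uses plain NumPy least squares rather than PySMO's own classes, and the `black_box` function, sample count, and basis terms are illustrative assumptions.

```python
import numpy as np

def black_box(x1, x2):
    """Hypothetical external simulation that can only be evaluated pointwise."""
    return 3.0 + 2.0 * x1 - x2 + 0.5 * x1 * x2 + x2**2

# Sample the input space (plain uniform random sampling, for simplicity).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = black_box(X[:, 0], X[:, 1])

# Basis terms of a full quadratic in two variables.
terms = ["1", "x1", "x2", "x1*x2", "x1**2", "x2**2"]

def design_matrix(X):
    """Evaluate every basis term at every sample point."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# Polynomial regression: ordinary least squares on the basis terms.
coeffs, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)

# The surrogate is an explicit algebraic expression, not a function call.
expression = " + ".join(f"({c:.4f})*{t}" for c, t in zip(coeffs, terms))
print("surrogate:", expression)
```

Because the toy black box is itself quadratic, the fit recovers its coefficients almost exactly; a real simulation code would leave a residual that has to be checked on held-out samples.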

### Cosmic Inference: Constraining Parameters with Observations and a Highly Limited Number of Simulations

Cosmological probes pose an inverse problem: the measurement result is obtained through observations, and the objective is to infer, from these observations and theoretical forward-modeling, the values of the model parameters that characterize the underlying physical system, our universe. The only way to accurately forward-model physical behavior on small scales is via numerical simulations, which are expensive enough that they are in turn "emulated." Emulators are commonly built from a set of simulations covering the parameter space, with the aim of establishing an approximately constant prediction error across the hypercube. We describe a novel statistical framework for obtaining accurate parameter constraints. The proposed framework uses multi-output Gaussian process emulators that are adaptively constructed using Bayesian optimization methods, with the goal of maintaining a low emulation error in the region of the hypercube preferred by the observational data. We compare several approaches for constructing multi-output emulators that enable us to take possible inter-output correlations into account while maintaining the efficiency needed for inference. **Contacts: Dmitriy Morozov, Zarija Lukic**
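
The contrast between a space-filling design and one that concentrates simulations where the data prefer can be illustrated with a one-parameter, single-output toy problem. This is not the project's framework: the `simulator` function, the mock observation, and the acquisition rule (emulator variance multiplied by an unnormalized Gaussian misfit weight) are all assumptions chosen for brevity.

```python
import numpy as np

def kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel with unit prior variance."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def emulate(theta_train, f_train, theta_query, jitter=1e-8):
    """GP emulator: posterior mean and variance of the simulator output."""
    K = kernel(theta_train, theta_train) + jitter * np.eye(theta_train.size)
    K_s = kernel(theta_train, theta_query)
    mean = K_s.T @ np.linalg.solve(K, f_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.maximum(var, 0.0)

def simulator(theta):
    """Hypothetical expensive forward model of one summary statistic."""
    return theta**2 + 0.1 * np.sin(5.0 * theta)

theta_true, sigma = 0.7, 0.05            # mock truth and observational error
observed = simulator(np.array([theta_true]))[0]

grid = np.linspace(0.0, 1.0, 401)        # the 1-D "hypercube"
theta_train = np.array([0.0, 0.5, 1.0])  # small initial design
f_train = simulator(theta_train)

for _ in range(5):
    mean, var = emulate(theta_train, f_train, grid)
    # Misfit weight with emulator variance added to the observational variance
    # (illustrative and unnormalized; not the project's acquisition function).
    weight = np.exp(-0.5 * (observed - mean) ** 2 / (sigma**2 + var))
    theta_next = grid[np.argmax(var * weight)]
    theta_train = np.append(theta_train, theta_next)
    f_train = np.append(f_train, simulator(theta_next))

mean, var = emulate(theta_train, f_train, grid)
theta_best = grid[np.argmin(np.abs(mean - observed))]
print("simulations run:", theta_train.size)
print("best-fit parameter from the emulator:", theta_best)
```

New simulations are requested only where the emulator is both uncertain and plausibly consistent with the observation, so the emulation error is driven down near the preferred region rather than uniformly across the parameter range.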

### Surrogate Model for Simulating Hadronization Processes

We developed a neural network-based surrogate model for simulating hadronization, the process in high energy physics whereby partons are converted to hadrons. This work is a first step towards a fully data-driven neural network-based hadronization simulator. **Contact: Xiangyang Ju (Ju on the Web)**

### Cosmological Hydrodynamic Modeling with Deep Learning

Multi-physics cosmological simulations are powerful tools for studying the formation and evolution of structure in the universe but require extreme computational resources. In particular, modeling the hydrodynamic interactions of baryonic matter adds significant expense but is required to accurately capture small-scale phenomena and create realistic mock skies for key observables. This project uses deep neural networks to reconstruct important hydrodynamical quantities from coarse or N-body-only simulations, vastly reducing the compute resources required to generate high-fidelity realizations while still providing accurate estimates with realistic statistical properties. **Contact: Peter Harrington**