Berkeley Lab Researchers Lead Collaboration to Increase Reliability and Efficiency of DOE Scientific Workflows

Poseidon will use AI/ML-based techniques to simulate, model, and optimize scientific workflow performance on large, distributed DOE computing infrastructures.

September 9, 2021

This article was originally published by RENCI

This image shows the terrain height – an important factor in weather modeling – across almost all of North America at a spatial resolution of 4 km. Poseidon tools will help improve workflows and lead to more efficient weather forecasts through reliable execution of weather models. Credit: Jiali Wang, Argonne National Laboratory

The Department of Energy’s (DOE) advanced computational and data infrastructures (CDIs), such as supercomputers, edge systems at experimental facilities, massive data storage, and high-speed networks, are brought to bear on the nation’s most pressing scientific problems, including assisting in astrophysics research, delivering new materials, designing new drugs, creating more efficient engines and turbines, and making more accurate and timely weather forecasts and climate change predictions.

Increasingly, computational science campaigns are leveraging distributed, heterogeneous scientific infrastructures that span multiple locations connected by high-performance networks, resulting in scientific data being pulled from instruments to computing, storage, and visualization facilities.

However, since these federated services infrastructures tend to be complex and managed by different organizations, domains, and communities, both the operators of the infrastructures and the scientists who use them have limited global visibility, which results in an incomplete understanding of the behavior of the entire set of resources that science workflows span.

“Although scientific workflow systems like Pegasus increase scientists’ productivity to a great extent by managing and orchestrating computational campaigns, the intricate nature of the CDIs, including resource heterogeneity and the deployment of complex system software stacks, pose several challenges in predicting the behavior of the science workflows and in steering them past system and application anomalies,” said Ewa Deelman, research professor of computer science and research director at the University of Southern California’s Information Sciences Institute and lead principal investigator (PI). “Our new project, Poseidon, will provide an integrated platform consisting of algorithms, methods, tools, and services that will help DOE facility operators and scientists to address these challenges and improve the overall end-to-end science workflow.”

Under a new DOE grant, Poseidon aims to advance the knowledge of how simulation and machine learning (ML) methodologies can be harnessed and amplified to improve the DOE’s computational and data science. 

Research institutions collaborating on Poseidon include the University of Southern California, the Argonne National Laboratory, the Lawrence Berkeley National Laboratory, and the Renaissance Computing Institute (RENCI) at the University of North Carolina at Chapel Hill.

Poseidon will add three important capabilities to current scientific workflow systems: (1) predicting the performance of complex workflows; (2) detecting and classifying infrastructure and workflow anomalies and “explaining” the sources of these anomalies; and (3) suggesting performance optimizations. To accomplish these tasks, Poseidon will explore the use of novel simulation, ML, and hybrid methods to predict, understand, and optimize the behavior of complex DOE science workflows on DOE CDIs.

“Poseidon will explore hybrid solutions where data collected from DOE and NSF testbeds, as well as from an ML simulator, will be strategically inputted into an ML training system,” said Prasanna Balaprakash, Poseidon co-PI, R&D lead and computer scientist at the Mathematics and Computer Science division, Argonne National Laboratory. “ML-based prediction methods will be compared with simulation-based performance predictions in terms of accuracy, time to prediction, and amount of data and resources used. Using this knowledge as a stepping stone, computer scientists will be able to apply hybrid methodologies to understand and optimize the behavior of scientific workflows on a broad set of modern CDIs.”

Successful Poseidon solutions will be incorporated into a prototype system with a dashboard that will be used for evaluation by DOE scientists and CDI operators. Poseidon will enable scientists working on the frontier of DOE science to efficiently and reliably run complex workflows on a broad spectrum of DOE resources and accelerate time to discovery.

“In addition to creating a more efficient timeline for researchers, we would like to provide CDI operators with the tools to detect, pinpoint, and efficiently address anomalies as they occur in the complex DOE facilities landscape,” said Anirban Mandal, Poseidon co-PI, assistant director for network research and infrastructure at RENCI, the University of North Carolina at Chapel Hill. “To detect anomalies, Poseidon will explore real-time ML models that sense and classify anomalies by leveraging underlying spatial and temporal correlations and expert knowledge, combine heterogeneous information sources, and generate real-time predictions.”

Furthermore, Poseidon will develop ML methods that can self-learn corrective behaviors and optimize workflow performance.

“Poseidon will focus on aspects of explainability in all methods of optimization,” said Mariam Kiran, Poseidon co-PI, a research scientist in the scientific networking division (ESnet) at the Lawrence Berkeley National Laboratory. “For example, to understand how ML models predict workflow performance, Poseidon explanations will provide insights into the workflow and platform parameter values that affect its performance. These insights will then be cross-validated with CDI experts.”

Working together, the researchers behind Poseidon will break down the barriers between complex CDIs, accelerate the scientific discovery timeline, and transform the way that computational and data science are done.

Please visit the project website for more information.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.