Berkeley Lab scientists have developed a faster method for optimizing topological data analysis, a mathematical technique that helps researchers study the shape of data and uncover complex structures in high-dimensional datasets. Their approach extracts meaningful patterns from noisy data in significantly fewer steps than the standard procedure. This advancement, which has shown particular promise in machine learning, materials science, and biology, could have far-reaching impacts on data analysis and decision-making.
Their new method was detailed in a paper titled ‘Topological Optimization with Big Steps,’ published in Discrete & Computational Geometry.
“When people are trying to apply topological data analysis to machine learning, they quite often have to do this topological optimization. And so we’re giving them a much more efficient tool,” said Arnur Nigmetov, a computer systems engineer with the Machine Learning and Analytics Group of Berkeley Lab’s Scientific Data Division and lead author of the paper.
Everest, Kilimanjaro, and Topological Data Analysis
One can think of topological data analysis as akin to studying mountains: Everest and Kilimanjaro have smaller ‘bumps’ along their slopes, but those bumps aren’t counted as separate peaks. In topological data analysis, such ‘bumps’ are treated as noise.
Researchers use math to separate meaningful patterns from random noise. Features in the data emerge, or are ‘born,’ and later disappear, or ‘die.’ The longer a feature persists, the more significant it is. For example, in materials science, stable voids created by atoms tend to last a long time, while temporary, noisy pockets disappear quickly. This process helps researchers focus on the important features of complex data.
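To make the birth-and-death bookkeeping concrete, here is a minimal, self-contained Python sketch (an illustration, not the authors’ code) that computes 0-dimensional sublevel-set persistence for a 1D signal with a union-find structure. Each local minimum gives birth to a component; when two components meet, the ‘elder rule’ kills the younger one; long lifetimes flag significant features, short ones flag noise.

```python
def persistence_0d(values):
    """0-dimensional sublevel-set persistence of a 1D signal.

    Returns (birth, death) value pairs; the component of the global
    minimum never dies and is reported with death = None.
    """
    n = len(values)
    parent = [None] * n       # union-find parents; None = vertex not yet born
    birth = [None] * n        # birth value stored at each component root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    order = sorted(range(n), key=lambda i: values[i])
    pairs = []
    for v in order:                          # sweep values from low to high
        parent[v], birth[v] = v, values[v]   # a new component is born at v
        for u in (v - 1, v + 1):             # neighbors on the path graph
            if 0 <= u < n and parent[u] is not None:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                # Elder rule: the component born later dies at the merge.
                elder, younger = (ru, rv) if birth[ru] <= birth[rv] else (rv, ru)
                if values[v] > birth[younger]:          # skip zero-length pairs
                    pairs.append((birth[younger], values[v]))
                parent[younger] = elder
    root = find(order[0])
    pairs.append((birth[root], None))        # essential class: never dies
    return pairs

# A noisy valley: the deep minimum is significant, the shallow dip is noise.
signal = [5.0, 1.0, 4.0, 3.5, 4.5, 0.5, 5.0]
for b, d in sorted(persistence_0d(signal), key=lambda p: p[0]):
    life = "inf" if d is None else f"{d - b:.1f}"
    print(f"born at {b}, died at {d}, lifetime {life}")
```

On this toy signal, the deep minimum produces a pair with lifetime 3.5 (significant), while the shallow dip produces one with lifetime 0.5 (noise).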
Topological optimization turns this analysis into a tool for changing the data itself: smoothing out noise so unwanted “bumps” disappear, or adjusting the data to match a prescribed topological pattern, with techniques like backpropagation driving the updates.
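As an illustration of this style of optimization, the sketch below is a plain-Python caricature of the incremental, small-step procedure the new paper improves on, not the Big Steps algorithm itself. It repeatedly recomputes the persistence pairing of a 1D signal and nudges each short-lived pair’s birth and death values toward each other, flattening noisy dips while leaving prominent features untouched. The `persistence_pairs_idx` helper is a variant of the earlier sketch that returns vertex indices rather than values, so the signal can be edited in place.

```python
def persistence_pairs_idx(values):
    """0-dim sublevel-set persistence, returning (birth_vertex, death_vertex)
    index pairs so the signal can be edited at those positions.
    The essential class (the global minimum's component) is omitted."""
    n = len(values)
    parent = [None] * n       # union-find parents; None = not yet born
    birth_at = [None] * n     # index of the minimum that created each component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for v in sorted(range(n), key=lambda i: values[i]):
        parent[v], birth_at[v] = v, v
        for u in (v - 1, v + 1):
            if 0 <= u < n and parent[u] is not None:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                if values[birth_at[ru]] > values[birth_at[rv]]:
                    ru, rv = rv, ru                    # make ru the elder
                if values[v] > values[birth_at[rv]]:   # skip zero-length pairs
                    pairs.append((birth_at[rv], v))
                parent[rv] = ru
    return pairs


def simplify(values, threshold, step=0.5, iters=40):
    """Small-step topological simplification: nudge each short-lived pair's
    birth and death values toward their midpoint, then recompute the pairing.
    Each pass shrinks noisy features a little; many passes erase them."""
    vals = list(values)
    for _ in range(iters):
        moved = False
        for b, d in persistence_pairs_idx(vals):
            if 0 < vals[d] - vals[b] < threshold:
                mid = 0.5 * (vals[b] + vals[d])
                vals[b] += step * (mid - vals[b])  # raise the noisy minimum
                vals[d] += step * (mid - vals[d])  # lower the saddle killing it
                moved = True
        if not moved:
            break
    return vals

signal = [5.0, 1.0, 4.0, 3.5, 4.5, 0.5, 5.0]
print(simplify(signal, threshold=1.0))
# The shallow dip (indices 2-3) flattens toward ~3.75; the deep minima survive.
```

Note how the inner loop takes many tiny, repeated updates, each one followed by a full recomputation of the pairing. That repetition is precisely the inefficiency described later in this article.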
In machine learning, topological data analysis helps correct overfitting, which occurs when a model latches onto noise in the training data rather than the underlying pattern.
“When the network overfits, it’s carving out little pockets around noisy data. It’s like treating a small bump on a mountain as an important peak,” said Dmitriy Morozov, a staff scientist in the Machine Learning and Analytics Group of Berkeley Lab’s Scientific Data Division and co-author of the paper.
To fix this, researchers add a term to the model’s training objective that encourages it to focus on the more significant features—like real mountain peaks—helping it generalize better and avoid overfitting.
“By squashing noise and promoting meaningful features, we can make our models smarter and more efficient, much like a hiker learning to navigate the landscape by focusing only on the true peaks, not the distractions along the way,” Morozov said.
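In practice, such a term is often some variant of ‘total persistence’: the summed lifetimes of all but the most persistent features. The short Python sketch below computes this penalty from a list of (birth, death) pairs; it is an illustration under that assumption, not necessarily the exact term used in the paper. Adding the penalty, scaled by a weight, to a training loss pushes the model to erase short-lived features while sparing dominant ones.

```python
def topological_penalty(pairs, keep=1):
    """Sum the lifetimes of all but the `keep` most persistent features.

    `pairs` holds (birth, death) values; death may be None for essential
    features that never die, which are excluded from the penalty.
    """
    lifetimes = sorted((d - b for b, d in pairs if d is not None), reverse=True)
    return sum(lifetimes[keep:])

# Hypothetical diagram: one dominant feature and two noisy ones.
pairs = [(1.0, 4.5), (3.5, 4.0), (4.2, 4.3)]
print(topological_penalty(pairs, keep=1))   # 0.5 + 0.1 = 0.6
# During training, one would minimize: task_loss + lam * topological_penalty(...)
```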
‘A Whole New Way of Thinking’
The Berkeley Lab scientists’ new algorithm addresses a long-standing challenge in topological data analysis: the inefficiency of optimization. Previously, optimizing topology required numerous small, incremental steps, making the process slow and cumbersome.
“Everyone who was doing topological optimization before struggled with this inefficiency problem,” said Nigmetov, referring to the many optimization steps previously required. “It’s a problem that has both theoretical and visual appeal and is relevant to the community.”
The pair’s breakthrough was recognizing a more efficient way to approach the problem. Rather than relying on many small steps, their algorithm takes far larger ones, revealing a faster, more direct way to optimize persistent homology, a core technique in topological data analysis.
“The idea that you can open this black box and figure out what’s going on inside is very unexpected,” Morozov said. “Usually, this kind of analysis is impossible, which is why everybody was just taking these small, local steps. So the fact that there is this extra structure and it’s pretty straightforward to compute, and you can make use of it, I think that was a big surprise, not just to us, but to everybody who has heard of this work.”
Morozov and Nigmetov’s work was initiated under Laboratory Directed Research and Development (LDRD) funding from Berkeley Lab, provided by the U.S. Department of Energy (DOE). It was also supported by the Scientific Discovery through Advanced Computing (SciDAC) program and the Mathematical Multifaceted Integrated Capability Centers (MMICCs) program.
According to Morozov, this funding gave the researchers the time needed to thoroughly consider the problem. Their approach could have a significant impact on materials science, enabling researchers to manipulate and design materials with specific properties. “The ability to generate materials with prescribed topology could be a game-changer,” Morozov said.
The researchers also hope their work will advance the understanding of persistence diagrams, a core tool in topological data analysis. “Every new mathematical insight leads to fresh ways of thinking,” Morozov said.
About Computing Sciences at Berkeley Lab
High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab's Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.