Physicists and Machine Learning Experts Team Up to Tackle the TrackML Challenge
CRD, NERSC Staff Help Organize International Particle Tracking Competition; Runs May-October 2018
June 11, 2018
Machine learning experts and physicists from CERN have partnered with Kaggle, a Google-owned platform for predictive modeling and analytics competitions, on the TrackML Particle Tracking Challenge. The competition is designed to inspire the development of an algorithm that can quickly reconstruct particle tracks (the trajectories of electrically charged particles emanating from a collision) from the three-dimensional coordinates left in silicon particle detectors following millions of particle collisions.
Orchestrating particle collisions and observations at facilities like CERN, where groups of protons collide with one another 40 million times per second, is already a massive scientific accomplishment. Part of this process includes reconstructing tens of thousands of tracks per collision. The goal is to determine the most likely particle paths and decays, as well as anomalies from the expected behavior.
But analyzing the enormous amounts of data produced from particle physics experiments is becoming an overwhelming challenge. Large event rates mean physicists must sift through tens of petabytes of data per year. And as detector resolution improves, better software is needed to enable real-time pre-processing and filtering of the most promising events, which in turn produces even more data.
Three Years in the Making
In 2015, these trends prompted an international group of machine learning experts, computer scientists and physicists—including representatives from Berkeley Lab’s Computational Research Division (CRD) and National Energy Research Scientific Computing Center (NERSC)—to propose the TrackML challenge. The idea is to see if applying machine learning to the process can dramatically decrease the time it takes to process and reconstruct these datasets and extract the most relevant data.
“We spent the last three years trying to define what the problem is in a mathematically pure form that could be synthesized in one number: the fraction of particles correctly tracked through the detector in one collision event, averaged over all collision events in the data set. The tricky bit was to define what we mean by ‘correctly tracked,’” said Paolo Calafiura, a scientist in CRD’s Physics and X-Ray Science Computing Group and one of the TrackML organizers. “Another challenge was to define a dataset that makes sense to a programmer trying to create a machine learning algorithm.”
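To make the single-number metric Calafiura describes concrete, here is a minimal Python sketch. It is not the official competition scoring code. The function names, the input layout (dictionaries mapping hit IDs to particle or track IDs), and above all the definition of "correctly tracked" are assumptions made for illustration: a particle counts as correctly tracked when one reconstructed track holds more than half of the particle's hits and more than half of that track's hits come from the particle.

```python
from collections import Counter


def event_score(truth, reco):
    """Fraction of particles correctly tracked in one collision event.

    truth: dict mapping hit_id -> true particle_id (hypothetical layout)
    reco:  dict mapping hit_id -> reconstructed track_id (hypothetical layout)
    """
    particle_size = Counter(truth.values())  # hits per true particle
    track_size = Counter(reco.values())      # hits per reconstructed track
    # Number of hits shared by each (particle, track) pair.
    overlap = Counter((truth[h], reco[h]) for h in reco if h in truth)

    matched = set()
    for (particle_id, track_id), shared in overlap.items():
        # Assumed "double majority" rule; the real definition may differ.
        if 2 * shared > particle_size[particle_id] and 2 * shared > track_size[track_id]:
            matched.add(particle_id)

    if not particle_size:
        return 0.0
    return len(matched) / len(particle_size)


def dataset_score(events):
    """Average the per-event score over all (truth, reco) pairs."""
    scores = [event_score(truth, reco) for truth, reco in events]
    if not scores:
        raise ValueError("dataset_score needs at least one event")
    return sum(scores) / len(scores)
```

With this rule, an event in which one of two particles is fully recovered and the other is split across three single-hit tracks scores 0.5. Averaging per event rather than pooling all particles keeps busy events from dominating the final number.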
The TrackML competition is structured in two phases: the Accuracy phase, which began in early May and runs through July, and the Throughput phase, which begins in July and runs through October. During the Accuracy phase, data scientists from around the world who sign up to participate can download 400 gigabytes of simulated particle-collision data and train their algorithms to reconstruct the tracks. The simulated data used in the challenge were generated with CERN’s open-source accurate tracking simulator, which models a typical all-silicon Large Hadron Collider tracking detector with 10 layers of cylinders and disks.
A Much Harder Problem
This is not the first particle physics competition of this sort, noted Steven Farrell, a machine learning engineer at NERSC and TrackML organizer who is working in parallel on his own solution to the TrackML problem (as TrackML organizers, he and Calafiura cannot participate in the competition). A Higgs machine learning challenge held in 2014 gave participants the opportunity to find the Higgs boson in a set of simulated data. But the TrackML challenge is much harder, Calafiura and Farrell emphasized, in part because the Higgs challenge was inherently a binary classification problem that was very straightforward to define in mathematical terms.
“The Higgs challenge was very successful, and we have high hopes for this one,” Farrell said. “But in this case we have to try and figure out where the tracks are in this sea of particle ‘hits’ and connect all the dots. It is a very difficult problem because of the scale, the amount of stuff that’s in every sample, the amount of tracks you have to try and find, the complexity of the physics, the sources of noise from the detector… It is complex but also potentially interesting to the broader ML community because of its novel aspects.”
“The TrackML problem is much harder because it is down in the guts of the detector,” Calafiura added.
As of the second week of May, some 150 teams had signed up for the first phase of the TrackML challenge, with the numbers growing daily. Participants in this first phase will be evaluated on the accuracy with which their algorithms can reconstruct the particles’ tracks, and the top three performers will receive cash prizes of $12,000, $8,000 and $5,000. The Throughput phase will then evaluate the algorithms based on speed and accuracy.
While the competition is exciting for the particle physics and machine learning communities in terms of what it might yield, the last three years of groundwork have already had a lasting impact, according to Calafiura.
“Even if this challenge were to completely fail and somebody doesn’t provide a good solution, all the work that went into producing a dataset that is realistic but doesn’t belong to any of the experiments—plus the work that has gone into defining the problem in terms that are understandable to the work of a computer scientist—all of this would totally pay off because this dataset will be used for the next five to six years as a benchmark,” he said.
NERSC is a DOE Office of Science user facility.
About Computing Sciences at Berkeley Lab
The Computing Sciences Area at Lawrence Berkeley National Laboratory (Berkeley Lab) provides the computing and networking resources and expertise critical to advancing Department of Energy Office of Science (DOE-SC) research missions: developing new energy sources, improving energy efficiency, developing new materials, and increasing our understanding of ourselves, our world, and our universe. ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 7,000-plus scientists at national laboratories and universities. NERSC and ESnet are both Department of Energy Office of Science National User Facilities. The Computational Research Division (CRD) conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation.
Berkeley Lab addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science. The DOE Office of Science is the United States' single largest supporter of basic research in the physical sciences and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.