In today’s data-driven world, success in ventures from business to scientific discovery depends on the ability to gather large amounts of data, then quickly and efficiently process those massive datasets to extract meaningful insights.
With FasTensor, researchers in Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) Scientific Data Management Group developed an open source tool to help users efficiently process and analyze massive datasets. In their new book, User-Defined Tensor Data Analysis, the team provides an introduction and user guide for anyone interested in using their programming model to analyze complex simulation or experimental data.
In this Q+A, the FasTensor team, including Bin Dong, Kesheng “John” Wu, and Suren Byna, discuss what sets their tool apart from the competition.
What is FasTensor? How will it impact scientific discovery?
FasTensor is a generic programming model for big data analyses with user-defined operations. It exploits the structural locality in multidimensional arrays to automate file operations, data partitioning, communication, parallel execution, and other common data management operations.
This tool increases scientific productivity by allowing researchers to easily run custom data analysis operations on large arrays using the world’s most powerful supercomputers. FasTensor automatically handles the complex underlying data management and data movement tasks with efficient parallel I/O methods.
Running analysis operations on large multidimensional data structures is an essential part of the scientific process of extracting meaningful insights. Some of the significant scientific breakthroughs of the last decade, such as the first detection of the Higgs boson at CERN or the first observation of colliding neutron stars by the LIGO collaboration, were achieved with this kind of data analysis. Thus, as scientific discoveries become more data-intensive, FasTensor can help reduce the burden on scientists of handling low-level, complex data management tasks.
Arrays appear in nearly every computer program. But in science, arrays are often extremely large and multidimensional. Once scientists define their custom analysis operations, data management systems – like FasTensor, Apache Spark, or MapReduce – distribute these large arrays across high-performance computing (HPC) systems without user involvement, which greatly improves productivity. With FasTensor, user-defined functions can express data-analysis operations ranging from simple aggregations to advanced machine learning pipelines, as the sketch below illustrates.
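To give a flavor of the model, here is a minimal sketch of a FasTensor program, modeled on the tutorial examples in the book and the open source repository. A user-defined function computes a two-point moving average, and Transform applies it across an HDF5 array in parallel; the file paths and array sizes are placeholders, and exact headers and signatures may vary between releases.

```cpp
#include <vector>
#include "ft.h" // FasTensor's main header in the public repository

using namespace FT;

// User-defined function: average each cell with its right neighbor.
// iStencil(i, j) reads the value at relative offset (i, j) from the
// current cell, so the logic is written once, in array coordinates.
inline Stencil<float> udf_ma(const Stencil<float> &iStencil)
{
    Stencil<float> oStencil;
    oStencil = (iStencil(0, 0) + iStencil(0, 1)) / 2.0;
    return oStencil;
}

int main(int argc, char *argv[])
{
    FT_Init(argc, argv); // initializes MPI and the FasTensor runtime

    std::vector<int> chunk_size = {4, 16};  // how the array is partitioned across processes (illustrative)
    std::vector<int> overlap_size = {0, 1}; // ghost cells so each UDF call can see its right neighbor

    // "EP_HDF5:<file>:<dataset>" endpoint strings; these paths are placeholders
    Array<float> *A = new Array<float>("EP_HDF5:input.h5:/dat", chunk_size, overlap_size);
    Array<float> *B = new Array<float>("EP_HDF5:output.h5:/dat");

    A->Transform(udf_ma, B); // parallel execution, communication, and I/O are automatic

    delete A;
    delete B;
    FT_Finalize();
    return 0;
}
```

The UDF is the only code the user writes; chunking, ghost-cell exchange, parallel execution, and file access are handled by the runtime.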
What makes FasTensor different from other data analysis systems?
Many existing data management systems, like Apache Spark or MapReduce, share the data-parallel, functional approach that we use, but their design is not well suited to scientific applications. MapReduce and Spark are fundamentally built on the key-value (KV) data structure, which does not match the multidimensional arrays that dominate scientific data. Because of this mismatch, data analysis operations like convolution can take far longer to execute in these systems, as the sketch below illustrates.
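To make the contrast concrete, here is a hedged sketch of a 3×3 convolution-style operation in FasTensor’s stencil notation (the function name and sizes are ours): the neighborhood is addressed by relative array offsets, whereas a KV system would first have to flatten the array into keyed records and reassemble neighbors through joins or shuffles.

```cpp
// 3x3 neighborhood average -- a simple convolution -- written directly
// in array coordinates. A drop-in replacement for udf_ma in the sketch
// above; overlap_size must be {1, 1} so the window can cross chunk edges.
inline Stencil<float> udf_conv3x3(const Stencil<float> &iStencil)
{
    float sum = 0.0f;
    for (int i = -1; i <= 1; i++)
        for (int j = -1; j <= 1; j++)
            sum += iStencil(i, j); // relative offsets; no key-value re-encoding
    Stencil<float> oStencil;
    oStencil = sum / 9.0f;
    return oStencil;
}
```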
Because FasTensor is defined and executed directly on the multidimensional array, it can be orders of magnitude faster than these alternatives. Our tool is designed to run in parallel on a supercomputer and can handle terabyte-scale data-analysis tasks. When it comes to scientific data analysis, we compared a variety of tools and found that FasTensor is a thousand times faster than Apache Spark and 38% faster than the highly optimized code in TensorFlow. It also has wide applications in image analysis and in mesh data analysis tasks from physical simulations and other fields. In the book, we explain how FasTensor achieves these performance advantages over Spark and other systems, provide step-by-step instructions for readers to do the same in their own applications, and walk through detailed examples of running FasTensor analyses on Distributed Acoustic Sensing (DAS) and plasma physics datasets.
Why did you choose to publish your book about FasTensor now?
FasTensor is the next iteration of our ArrayUDF software, and it is timely now because the HPC field is on the verge of a big transformation.
Over the past decade, rapid advancements in machine learning contributed to the rise of heterogeneous HPC systems, which pair CPUs with GPU (graphics processing unit) accelerators, and of an associated programming model that relies on tensors. Tensors, which are essentially multidimensional arrays, are so fundamental to machine learning that Google even put the word in the name of its flagship machine learning library, TensorFlow. But while TensorFlow is a machine learning library, FasTensor has broader applications beyond machine learning.
As scientists increasingly rely on machine learning and HPC to analyze the vast amounts of data generated by experiments and simulations, they need a tool built specifically with their data-analysis needs in mind. That tool is FasTensor, and our book is an instruction and reference manual for using this tool.
We would add that FasTensor and our book are timely now because the U.S. Department of Energy (DOE) will be launching the nation’s first exascale supercomputer for science research this year. With a heterogeneous architecture of both CPUs and GPUs, the Frontier system will be able to perform 1.5 quintillion calculations per second, about 1,000 times faster than today’s petascale supercomputers. As we enter the exascale era of HPC and the golden age of machine learning for science, access to open-source analysis tools like FasTensor will become especially important for scientific discovery.
The FasTensor tool was developed with support from the U.S. Department of Energy’s Office of Advanced Scientific Computing Research.