Bin Yu, Statistics and Machine Learning Innovator, Joins Berkeley Lab

January 7, 2020

Contact: Keri Troutman, khtroutman@lbl.gov, 510-486-5071

Bin Yu, a renowned UC Berkeley statistician who has pioneered machine learning tools and their applications to numerous scientific fields, recently joined Berkeley Lab’s Biosciences Area as a faculty scientist. Yu views the appointment as a natural progression of her years of collaboration with Berkeley Lab scientists and as very timely, because she believes we are in the midst of a “golden era” for statistics and statistical machine learning. Yu works regularly with scientists in Berkeley Lab’s Computational Research Division (CRD) and Biosciences Area (BSA), and this appointment will deepen those ties.

Increasingly large and complex scientific data requires innovative approaches, and Yu’s research is at the forefront of practice and theory of statistical machine learning and causal inference. She regularly engages in interdisciplinary research with scientists from genomics, neuroscience, and precision medicine to extract useful information based on data and domain knowledge. “We are constantly producing data associated with new science and social science problems,” Yu says. “We need new computational and mathematical solutions.”

Yu is Chancellor's Professor in the Departments of Statistics and Electrical Engineering and Computer Science at UC Berkeley and a faculty member in the Center for Computational Biology (CCB) at UC Berkeley’s College of Engineering. She chaired the Statistics Department at Berkeley from 2009 to 2012. She has held faculty positions at the University of Wisconsin-Madison and Yale University and was a Member of Technical Staff at Bell Labs, Lucent. She was a founding co-director of the Microsoft Lab on Statistics and Information Technology at Peking University, China. Yu has published more than 100 scientific papers in leading journals and conference proceedings on a wide range of research—from mathematical statistics to information theory to signal processing to remote sensing, neuroscience, genomics, and medicine.

Yu’s latest work is focused on stability (or robustness), which she says is a natural requirement for interpretability and reproducibility in statistics and machine learning, and data science in general. Yu has formulated a framework for data science predicated on ensuring predictability, computability, and stability (PCS). Yu and Karl Kumbier (former PhD student of Yu and now a postdoc at UCSF) proposed the PCS framework for veridical data science to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. It uses predictability as a reality check and considers the importance of computation in data collection/storage, algorithm design, and data-inspired simulations. Their paper on PCS will soon be published in PNAS under the title “Veridical Data Science.”

Yu’s PCS framework consists of a PCS workflow and PCS documentation. The workflow assesses how human judgment calls impact data results through data and model/algorithm perturbations. It incorporates the stability principle, which expands significantly on statistical uncertainty considerations across the entire data science life cycle, including problem formulation, data cleaning, and post-hoc visualization and data conclusions. “What I hope will develop with this PCS framework is a workflow from problem to data collection to data cleaning,” she says. “We want to make the whole process very reproducible and transparent, helping researchers think clearly and rigorously about data analysis and establishing trust for users.”
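As a rough illustration of what data and model/algorithm perturbations can look like in practice, the following minimal Python sketch refits a slope estimate on bootstrap resamples of a dataset (a data perturbation) and under two different estimators (an algorithm perturbation), then reports how much the result moves. It is not taken from the PCS paper: the synthetic data, the choice of the bootstrap, the two estimators, and all function names are assumptions made for this example.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Median of all pairwise slopes, a more outlier-resistant estimator."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    return statistics.median(slopes)


def stability_report(xs, ys, estimators, n_perturbations=200, seed=0):
    """For each estimator, refit on bootstrap resamples of the data and
    return the (5th, 95th) percentile range of the resulting estimates."""
    rng = random.Random(seed)
    n = len(xs)
    report = {}
    for name, estimator in estimators.items():
        estimates = []
        for _ in range(n_perturbations):
            idx = [rng.randrange(n) for _ in range(n)]
            bx = [xs[i] for i in idx]
            by = [ys[i] for i in idx]
            if len(set(bx)) < 2:
                continue  # degenerate resample: slope is undefined
            estimates.append(estimator(bx, by))
        estimates.sort()
        lo = estimates[int(0.05 * len(estimates))]
        hi = estimates[int(0.95 * len(estimates)) - 1]
        report[name] = (lo, hi)
    return report


# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
data_rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 1.0) for x in xs]

# Two estimators stand in for the "model/algorithm" judgment call.
estimators = {"ols": ols_slope, "theil_sen": theil_sen_slope}
report = stability_report(xs, ys, estimators)

for name, (lo, hi) in report.items():
    print(f"{name}: slope range across perturbations = [{lo:.3f}, {hi:.3f}]")
```

If the ranges are narrow and agree across estimators, the conclusion about the slope is stable with respect to these particular perturbations; wide or conflicting ranges would signal that the result depends heavily on a judgment call.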

The six-step PCS documentation, written in R Markdown or a Jupyter Notebook, records the judgment calls, choices, and assumptions made throughout the data science life cycle using narrative and code. It builds the bridge from the symbols in statistical models and code to reality. “A well-reasoned and well-written PCS documentation builds trust in the data conclusions,” says Yu.

Yu plans to put the PCS documentation for her projects on GitHub so that others can contribute. The PNAS paper contains a case study with documentation on Zenodo. “The data problems are so complex now; I don’t think a single brain can solve such problems,” she says. “We need a team brain, and this is really something that is confronting all areas of science.”

Yu firmly believes that who you are as a person informs your approach to scientific research, hence her focus on a diverse perspective. Her personal experience of growing up in the midst of the Chinese Cultural Revolution shaped her career—watching members of her family persecuted for their beliefs gave her a unique comfort level with being the “odd one out.”

“I am not easily swayed and I’m not afraid to speak up with an alternate viewpoint,” says Yu. “Growing up during the Chinese Cultural Revolution, I know that what is seen as true is not always true.”

“I believe that to be a good scientist, not just a technician, you have to be a good philosopher,” says Yu. “You have to make good judgment calls and exercise good common sense, in addition to seeking factual evidence in data and domain knowledge.”

Bin Yu will be a keynote speaker at the Joint Genome Institute’s Genomics of Energy & Environment Meeting in March.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery. Researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.