Jupyter Community Workshop Showcases Open Source Success
Three-day event at Berkeley Lab and BIDS focused on Jupyter in HPC and science user facilities
July 12, 2019
The three-day Jupyter Community Workshop for Scientific User Facilities and High-Performance Computing, held June 11-13 at Berkeley Lab and the Berkeley Institute for Data Science (BIDS), brought together more than 40 Jupyter developers, engineers, and experimental and observational data (EOD) facilities staff to brainstorm on how to make this increasingly popular open-source tool the pre-eminent interface for managing EOD workflows and data analytics at high performance computing (HPC) centers. The event was jointly sponsored by the National Energy Research Scientific Computing Center (NERSC) and BIDS and was part of a series of Jupyter Community Workshops being funded by the media company Bloomberg.
Project Jupyter is an international collaboration of more than 1,500 contributors that develops tools for “interactive computing,” a process of human-computer interplay for scientific exploration and data analysis. These tools – which include the very popular Jupyter Notebook and JupyterHub – have become a de facto standard for data analysis in research, education, journalism, and industry and are becoming increasingly critical for scientific discovery. (By coincidence, the workshop overlapped with the release of the latest version of Jupyter’s user interface, JupyterLab.)
Advances in EOD technologies, high-bandwidth global networks, and HPC have resulted in an exponential growth of data to collect, manage, and understand. Interpreting these data streams requires computational and storage resources greatly exceeding those available on laptops, workstations, or university department clusters. Funding agencies are thus increasingly looking to HPC centers to address the growing and changing data needs of their scientists, and scientists are seeking new ways to seamlessly and transparently integrate HPC into their EOD workflows.
This is where Jupyter comes in.
“Over the past four years, we have seen Jupyter at NERSC evolve from a novel science gateway application used by just a few Python enthusiasts into a principal means for many of our users to interact with our systems and services,” said Rollin Thomas, NERSC Data Architect and chair of the workshop. In 2016, Thomas and the NERSC Jupyter team engineered a way for users to launch notebooks on a single shared node on NERSC’s Cori supercomputer; since then, demand has increased to the point that two additional nodes have been allocated for running Jupyter notebooks. On any given day, 200 users have notebooks running on these nodes, a level of usage comparable to that of the more traditional shared “login” nodes. Recently, the team has expanded access to Cori’s compute nodes through Jupyter as well.
In addition, NERSC has joined forces with the Usable Software Systems group in Berkeley Lab’s Computational Research Division to enhance Jupyter so that it can serve as a key interface for EOD workflows that run at NERSC under the Superfacility initiative. “Scientists at major DOE-supported user facilities have told us that they want to manage and manipulate their data and compute through Jupyter, so we need to develop tools that work with our infrastructure to make that happen,” said Debbie Bard, leader of the Data Science Engagement Group and NERSC Superfacility Team lead. “This is a problem faced by all big science facilities: streaming their data through high-performance compute resources to accelerate science in a seamless way.”
‘A Great Deal of Excitement’
These and related challenges are what prompted the June workshop, which featured dozens of talks and breakout sessions focused on “pain points” and best practices in Jupyter deployment, infrastructure, and user support; securing Jupyter in multi-tenant environments; sharing notebooks; HPC/EOD-focused Jupyter extensions; and strategies for communicating with stakeholders.
“Scientists love Jupyter because it combines visualization, data analytics, text, and code into a document they can share, modify, and even publish,” Thomas said. “But what about using Jupyter to control experiments in real time, steer complex simulations on a supercomputer, or connect experiments to HPC for real-time feedback and decision making? How can users reach outside the notebook to corral external data and computational resources in a seamless, Jupyter-friendly manner?”
“We began talking about this event two years ago,” said Fernando Perez during an address to the group on the first morning. Perez, an assistant professor of statistics at UC Berkeley, a Senior Fellow at BIDS, and a staff scientist in Berkeley Lab’s Computational Research Division, is credited with developing IPython, an interactive add-on to Python that served as the foundation for Jupyter. “There is a great deal of excitement from the Jupyter community about how HPC will use Jupyter,” he said. “We see Jupyter as the heart of the human/machine connection, enabling and supporting interactive scientific computing.”
Perez is a founding member of BIDS, which hosted the third day of the workshop on the UC Berkeley campus. The Jupyter team at Berkeley focuses on tools for interactive interfaces for data science and education (Jupyter Notebooks, JupyterLab, and Jupyter Book), shared infrastructure for interactive computing (JupyterHub and JupyterHub distributions), and reproducible, sharable computational environments (through the Binder Project). Each project is run in partnership with researchers and educators at BIDS and UC Berkeley's Division of Data Science.
During the workshop, Michael Milligan from the Minnesota Supercomputing Institute echoed Perez’ sentiments in his keynote, “Jupyter is a One-Stop Shop for Interactive HPC Services.” Milligan is the creator of BatchSpawner and WrapSpawner, JupyterHub extensions that let HPC users run notebooks on compute nodes supporting a variety of batch queue systems. In addition, contributors to both extensions met in an afternoon-long breakout to build consensus around some technical issues and start managing development and support collaboratively.
“In the past, most computational tasks fit into one of two buckets: ‘local, interactive, informally managed compute’ or ‘remote, scheduled, professionally managed compute,’” Milligan said. “Now users are accustomed to compute that is remote and interactive by default – Google Docs, not Microsoft Word. We hope our users are going to be doing something fundamentally new, so we need to give them general-purpose tools.”
Other highlights of the workshop included:
- Jupyter security. Thomas Mendoza from Lawrence Livermore National Laboratory talked about his work to enable end-to-end SSL in JupyterHub and best practices for securing Jupyter, while two breakout sessions on security yielded a number of next steps, including a plan to more prominently document security best practices and to convene a future meeting focused specifically on security in Jupyter.
- Jupyter implementation at national labs and user facilities. Speakers from Lawrence Livermore and Oak Ridge National Laboratories and the European Space Agency showed off a variety of JupyterLab extensions, integrations, and plug-ins for climate science, complex physical simulations, astronomical images and catalogs, and atmospheric monitoring. People at these and other facilities are finding ways to adapt Jupyter to meet the specific needs of their scientists.
- Berkeley Lab’s Shreyas Cholia gave a lightning talk on Interactive Distributed Computing with Jupyter and Friends, stitching together ipyparallel, QGrid, BQPlot, and Kale to deliver interactive, distributed deep learning. Other lightning talks covered topics ranging from LFORTRAN and KBase to dashboards, Slurm, CharlieCloud, and GPUs.
Looking ahead, plans are in motion for a security-focused meeting to be held in the fall; in addition, a panel at PEARC19 will include a retrospective discussion of the workshop in Berkeley.
“Many facilities have figured out how to deploy, manage, and customize Jupyter, but they’ve done it while focused on their unique requirements and capabilities,” Thomas said. “Still others are just taking their first steps and want to avoid reinventing the wheel. With some initial critical mass, we can start contributing what we’ve learned separately into a shared body of knowledge, patterns, tools, and best practices.”
About Computing Sciences at Berkeley Lab
High performance computing plays a critical role in scientific discovery, and researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.
Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 13 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.