Leverage Spark Connect for interactive data analysis in Jupyter Notebooks

Description

SWAN (Service for Web-based ANalysis) is a platform for interactive data analysis in a web browser. Scientists and engineers, both at CERN and at partner institutes, use SWAN daily to develop the algorithms they need for their data analyses. The SWAN service builds on top of the widely adopted Jupyter Notebooks and, more recently, the new JupyterLab interface. It integrates access to CERN software libraries, storage solutions and compute resources; notably, it leverages the storage synchronization and sharing capabilities of CERNBox and the computational power of Spark/Hadoop clusters for scaling out.

Currently, SWAN uses the Apache Spark Python API to connect Python notebooks to Spark clusters. This works by allocating a Spark Session object that is private to the Python process (the user’s notebook session), which becomes the driver of the distributed computation. The Spark Session on the driver machine can then request worker processes, called executors, from the cluster manager and schedule Spark jobs to be run in parallel utilizing the executors’ resources.
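As a rough illustration of the driver-in-notebook model described above, the following sketch builds a Spark Session inside the notebook's Python process. The helper names (`driver_session_conf`, `create_driver_session`), the `"yarn"` master and the resource values are assumptions made for this example and do not reflect SWAN's actual configuration; only the `SparkSession.builder` calls are standard PySpark API.

```python
from typing import Dict


def driver_session_conf(app_name: str, executors: int,
                        executor_memory: str = "2g") -> Dict[str, str]:
    """Collect the settings a notebook-local driver would pass to Spark.

    The keys are standard Spark configuration properties; the values are
    illustrative only.
    """
    return {
        "spark.app.name": app_name,
        "spark.executor.instances": str(executors),
        "spark.executor.memory": executor_memory,
    }


def create_driver_session(master: str, conf: Dict[str, str]):
    """Create a SparkSession whose driver lives in this Python process."""
    # Imported lazily so the configuration helper works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master)
    for key, value in conf.items():
        builder = builder.config(key, value)
    # The resulting session is private to this process: every notebook
    # that wants Spark has to repeat this step and start its own driver.
    return builder.getOrCreate()


if __name__ == "__main__":
    conf = driver_session_conf("swan-notebook", executors=4)
    print(conf)
    # spark = create_driver_session("yarn", conf)  # needs a reachable cluster
```

Because the session object lives in the notebook process, closing the notebook ends the driver and releases its executors.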

This architecture has proven to work well and provides a scale-out solution for SWAN users. However, a few important limitations have become apparent when using Spark in notebooks, due to the tightly coupled Spark driver architecture. The lack of built-in client-server connectivity in Spark (up to version 3.3.x) means, for example, that users need to spawn a new Spark Session for each of their notebooks, an operation that is resource-intensive and has high latency. These and other limitations are addressed by the latest development from the Apache Spark community: the Spark Connect component (SPARK-39375). Spark Connect is a major improvement in Apache Spark and brings more flexibility to interactive data analysis with Jupyter notebooks. It is expected to improve the experience of SWAN users who offload computations to Spark clusters, since it allows Spark to be used in client-server mode, so that a connection to a Spark cluster can be shared across multiple notebooks, which also improves resource utilization.
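For comparison, a minimal sketch of the client-server mode: with Spark Connect, a notebook acts as a thin client that attaches to a remote Spark Connect server through an `sc://host:port` connection string, using `SparkSession.builder.remote(...)` (available in PySpark 3.4 and later, with the Spark Connect client dependencies installed). The host name below and the helper functions are hypothetical; 15002 is the default Spark Connect port.

```python
def spark_connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string (sc://host:port)."""
    return f"sc://{host}:{port}"


def connect(url: str):
    """Attach this Python process to a running Spark Connect server."""
    # Imported lazily so the URL helper works without pyspark installed.
    from pyspark.sql import SparkSession

    # No driver is started locally; the client talks to the server,
    # which runs the Spark driver on behalf of its clients.
    return SparkSession.builder.remote(url).getOrCreate()


if __name__ == "__main__":
    # "spark-connect.example.org" is a placeholder, not a real endpoint.
    url = spark_connect_url("spark-connect.example.org")
    print(url)
    # spark = connect(url)       # needs a reachable Spark Connect server
    # spark.range(10).show()
```

Several notebooks can call `connect` with the same URL and reach the same server, instead of each one starting its own driver.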

Therefore, this project proposes to develop a JupyterLab extension that makes it easy to establish a connection to a Spark cluster and share it among multiple notebooks, by exploiting Spark Connect under the hood. Potentially, this extension could be used not only in SWAN but also in other JupyterLab deployments.

Task ideas

Expected results

A working and easy-to-use extension, installable both in SWAN and in vanilla JupyterLab, which lets users leverage Spark Connect to connect to Spark clusters and do interactive data analysis.

Desirable Skills

Mentors

Additional Information

Corresponding Project

Participating Organizations