Creation and use of disposable Spark on Kubernetes clusters from the notebook service (SWAN) for distributed physics analysis

Description

The Hadoop Service is expanding its user base to analysts who want to perform analysis with big data technologies, namely Apache Spark, with its main users coming from accelerator operations and infrastructure monitoring. The Hadoop Service's integration with the Jupyter notebook service (SWAN) offers scalable interactive data analysis and visualization in Jupyter notebooks, with Spark computations offloaded to compute clusters: on-premise YARN clusters and, more recently, cloud-native Kubernetes clusters.
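For illustration, the sketch below shows how a notebook's Spark session can be pointed at a Kubernetes cluster rather than YARN. The API server URL, namespace, and container image are placeholders, not SWAN's actual configuration, and a real client-mode setup from a notebook would also need driver networking options.

    from pyspark.sql import SparkSession

    # Build a Spark session whose executors run as pods on a Kubernetes
    # cluster. All cluster-specific values here are placeholders.
    spark = (
        SparkSession.builder
        .master("k8s://https://example-k8s-api:6443")             # hypothetical API server
        .appName("swan-interactive-analysis")
        .config("spark.executor.instances", "4")
        .config("spark.kubernetes.namespace", "spark-jobs")       # hypothetical namespace
        .config("spark.kubernetes.container.image", "spark:3.5.0")  # placeholder image
        .getOrCreate()
    )

    # A trivial computation to verify that executors on the cluster respond.
    print(spark.sparkContext.parallelize(range(1000)).sum())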

With recent developments in the ROOT framework, notably distributed RDataFrame, a growing number of physicists are performing analysis with Apache Spark and ROOT RDataFrame, increasingly on clusters they create and manage themselves. This project will develop the integrations needed to use such Spark on Kubernetes clusters from the Jupyter notebook service (SWAN).
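As a rough sketch, assuming the ROOT 6.24-era experimental API (the entry point has moved between ROOT releases), a Spark-backed distributed RDataFrame analysis might look like the following; the tree name, file name, and partition count are illustrative.

    import pyspark
    import ROOT

    sc = pyspark.SparkContext.getOrCreate()  # e.g. the context SWAN attaches

    # Spark backend of distributed RDataFrame: the computation graph is
    # built lazily in the notebook; the event loop runs on the executors.
    RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
    df = RDataFrame("Events", "data.root", npartitions=8, sparkcontext=sc)

    hist = df.Define("pt2", "pt * pt") \
             .Histo1D(("pt2", "p_{T}^{2}", 100, 0.0, 500.0), "pt2")
    # Accessing the result triggers the distributed event loop and merges
    # the per-partition histograms back in the notebook.
    print(hist.GetEntries())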

Task ideas

Expected results

A Jupyter plugin to create, initialize, and attach to a Kubernetes cluster from the notebook, as sketched below.
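One possible shape for the plugin's attach step, sketched with the kubernetes and pyspark Python packages: wait until the freshly created cluster reports a Ready node, then hand back a Spark session bound to it. attach_spark_cluster and its parameters are hypothetical names for illustration, not an existing SWAN or JupyterLab API.

    from kubernetes import client, config
    from pyspark.sql import SparkSession

    def attach_spark_cluster(kubeconfig_path, master_url, namespace="spark"):
        """Block until the new cluster has a Ready node, then attach Spark to it."""
        config.load_kube_config(config_file=kubeconfig_path)
        nodes = client.CoreV1Api().list_node().items
        if not any(cond.type == "Ready" and cond.status == "True"
                   for node in nodes for cond in node.status.conditions):
            raise RuntimeError("cluster has no Ready nodes yet")
        return (SparkSession.builder
                .master(f"k8s://{master_url}")                    # URL handed over by the plugin
                .config("spark.kubernetes.namespace", namespace)
                .getOrCreate())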

Requirements

  1. Python
  2. JavaScript
  3. Spark

Mentors

Corresponding Project

Participating Organizations