Next generation Big Data Analysis monitoring tools with ROOT

Description

The ROOT Software Framework is the cornerstone of all software stacks used by High Energy Physics (HEP) experiments. It provides components which are fundamental for the entire data processing chain, from particle collisions to final publications, including final user data analysis and modern machine learning techniques.

ROOT features a declarative analysis sub-system, RDataFrame, which has proven able to scale in-process parallel HEP data analysis to ~100 cores with a simple and intuitive programming model. Moreover, developments in previous GSoC editions have extended RDataFrame to offload heavy computations to external clusters using the Spark task distribution layer. Preliminary tests on ~5TB of real analysis data showed that this interface can reduce the computation time from about a dozen hours to less than 5 minutes when running on more than 500 external workers in parallel.

To allow scientists to perform massive analyses on much bigger datasets with interactive or quasi-interactive response times, it is crucial to have relevant performance metrics at the application level in order to identify and remove bottlenecks, thus making the most of the available resources.

The main goal of this project is to design a high-level display that collects and shows monitoring information from a Spark distributed application in a way that feels intuitive and helpful for the user. While a monitoring system for Spark and Jupyter notebooks already exists, it only covers Spark metrics, such as the number of tasks. This project will extend that monitoring system with relevant information coming from a ROOT application.
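To make the idea of combining Spark-level and ROOT-level information more concrete, the following is a minimal sketch in plain Python. It is not part of any existing monitoring system: the `TaskRecord` fields, the `summarize` function, and the choice of metrics (entries processed, bytes read) are all hypothetical and only illustrate the kind of per-task data a display could aggregate.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    """Hypothetical per-task record merging Spark and ROOT-level metrics."""
    task_id: int
    run_seconds: float      # Spark-level: wall time spent in the task
    entries_processed: int  # ROOT-level (assumed): entries read by the task
    bytes_read: int         # ROOT-level (assumed): bytes read from input files


def summarize(records):
    """Aggregate per-task records into figures a display could show."""
    if not records:
        return {"tasks": 0}
    total_entries = sum(r.entries_processed for r in records)
    total_time = sum(r.run_seconds for r in records)
    slowest = max(records, key=lambda r: r.run_seconds)
    return {
        "tasks": len(records),
        "total_entries": total_entries,
        "total_bytes": sum(r.bytes_read for r in records),
        "mean_task_seconds": mean(r.run_seconds for r in records),
        # Entries divided by the summed task time, not by the job wall time.
        "entries_per_task_second": (
            total_entries / total_time if total_time > 0 else 0.0
        ),
        "slowest_task": slowest.task_id,
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    demo = [TaskRecord(0, 2.0, 1000, 4096), TaskRecord(1, 6.0, 3000, 8192)]
    for key, value in summarize(demo).items():
        print(f"{key}: {value}")
```

A summary like this could highlight, for instance, a single slow task that dominates the job, which task counts alone would not reveal. How such records would actually be collected from the workers is left open here.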

Task ideas

Expected results

Requirements

Mentors

Corresponding Project

Participating Organizations