Apache Parquet support for ROOT's RDataFrame

Description

RDataFrame is a ROOT-based data frame library offering a high-level declarative interface for the analysis of tabular and hierarchical data. Transformations and filters are expressed as a chain of lazily applied operations on the data frame itself, using a syntax similar to that of other popular packages such as Pandas and Apache Spark.

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

RDataFrame provides an abstract interface, RDataSource, to ingest data from various backends, including ROOT TTrees, CSV files and Apache Arrow.
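To illustrate the idea of a backend-agnostic data source, here is a minimal sketch in standard C++. It is not ROOT's actual RDataSource API: the names ColumnSource, InMemorySource and CountIf are hypothetical, and the in-memory backend merely stands in for a real file reader such as a Parquet one. The point is only that the consumer code depends on the abstract interface, so a new backend can be added without touching it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical abstract columnar source (an assumption for illustration,
// not the real ROOT::RDF::RDataSource interface).
class ColumnSource {
public:
    virtual ~ColumnSource() = default;
    virtual std::vector<std::string> ColumnNames() const = 0;
    virtual std::size_t NumEntries() const = 0;
    virtual double Value(const std::string &column, std::size_t entry) const = 0;
};

// Hypothetical backend holding columns in memory; a Parquet-backed
// implementation would read them from a file instead.
class InMemorySource : public ColumnSource {
    std::map<std::string, std::vector<double>> fColumns;

public:
    explicit InMemorySource(std::map<std::string, std::vector<double>> columns)
        : fColumns(std::move(columns))
    {
        // All columns must describe the same number of entries.
        for (const auto &col : fColumns)
            assert(col.second.size() == fColumns.begin()->second.size());
    }

    std::vector<std::string> ColumnNames() const override
    {
        std::vector<std::string> names;
        for (const auto &col : fColumns)
            names.push_back(col.first);
        return names;
    }

    std::size_t NumEntries() const override
    {
        return fColumns.empty() ? 0 : fColumns.begin()->second.size();
    }

    double Value(const std::string &column, std::size_t entry) const override
    {
        // at() throws std::out_of_range for unknown columns or entries.
        return fColumns.at(column).at(entry);
    }
};

// Consumer that only knows the abstract interface: counts the entries
// whose value in the given column passes the cut.
std::size_t CountIf(const ColumnSource &src, const std::string &column,
                    const std::function<bool(double)> &cut)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < src.NumEntries(); ++i)
        if (cut(src.Value(column, i)))
            ++n;
    return n;
}
```

In this sketch, CountIf works unchanged for any class deriving from ColumnSource. This is the same separation that lets RDataFrame support a new format by adding a data source rather than modifying the analysis interface.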

The goal of this project is to allow RDataFrame to process Apache Parquet files.

Task ideas

Expected results

A working RDataSource implementation, tentatively named RParquetDS, that reads Parquet files and makes them available for processing in ROOT via RDataFrame.

Desirable Skills

Mentors

Corresponding Project

Participating Organizations