RDataFrame is a ROOT-based data frame library offering a high-level declarative interface for the analysis of tabular and hierarchical data. Transformations and filtering of the data are expressed as a set of lazily applied, chained operations on the data frame itself, using a syntax similar to that of other popular packages like Pandas and Apache Spark.
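To make "lazily applied, chained operations" concrete, here is a minimal toy sketch in plain Python. It is not the RDataFrame API: `LazyFrame` and its `define`, `filter`, `count` and `take` methods are hypothetical names invented for illustration, and the sample rows are made up. The point is only that chaining records operations, and the data is traversed when a result is requested.

```python
# Toy model of lazily applied, chained operations. This is NOT the RDataFrame
# API: LazyFrame and its methods are hypothetical names used for illustration.

class LazyFrame:
    def __init__(self, rows, steps=()):
        self._rows = rows    # list of dicts, one per entry
        self._steps = steps  # recorded operations, not yet executed

    def define(self, name, func):
        """Record a new column computed from each row."""
        return LazyFrame(self._rows, self._steps + (("define", name, func),))

    def filter(self, predicate):
        """Record a selection; rows failing the predicate are dropped."""
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def _run(self):
        # Only here is the data actually traversed.
        for row in self._rows:
            row = dict(row)  # work on a copy, leave the input untouched
            keep = True
            for step in self._steps:
                if step[0] == "define":
                    row[step[1]] = step[2](row)
                elif not step[1](row):
                    keep = False
                    break
            if keep:
                yield row

    def count(self):
        """Trigger the loop and return the number of surviving rows."""
        return sum(1 for _ in self._run())

    def take(self, column):
        """Trigger the loop and return the values of one column."""
        return [row[column] for row in self._run()]


# Made-up sample data for the sketch.
rows = [
    {"pt": 12.0, "eta": 0.5},
    {"pt": 3.0, "eta": 2.9},
    {"pt": 25.0, "eta": -1.1},
]

selected = (
    LazyFrame(rows)
    .filter(lambda r: abs(r["eta"]) < 2.5)
    .define("pt2", lambda r: r["pt"] ** 2)
)  # nothing has been computed at this point

n_selected = selected.count()     # first traversal of the data
pt2_values = selected.take("pt2")  # second traversal of the data
```

In this toy version every result triggers its own pass over the rows; a real implementation would be expected to do better, but that detail is outside the scope of the sketch.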
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
RDataFrame provides an abstract interface, RDataSource, to ingest data from various backends, including ROOT TTrees, CSV files and Apache Arrow.
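The idea of one abstract interface with several backends can be sketched as follows. This is a simplified Python illustration, not ROOT's C++ RDataSource interface: `ToyDataSource`, `ToyCsvSource`, `ToyColumnSource` and `mean_of` are hypothetical names, and the two-method interface is an assumption chosen to keep the example short.

```python
import csv
import io
from abc import ABC, abstractmethod


class ToyDataSource(ABC):
    """Hypothetical stand-in for the role RDataSource plays; not ROOT's interface."""

    @abstractmethod
    def column_names(self):
        """Return the names of the available columns."""

    @abstractmethod
    def entries(self):
        """Yield one dict per entry, mapping column name to value."""


class ToyCsvSource(ToyDataSource):
    """Backend that reads comma-separated text with a header row."""

    def __init__(self, text):
        self._text = text

    def column_names(self):
        reader = csv.reader(io.StringIO(self._text))
        return next(reader)  # the header row

    def entries(self):
        # Assumption for this sketch: every column holds floating-point values.
        for record in csv.DictReader(io.StringIO(self._text)):
            yield {name: float(value) for name, value in record.items()}


class ToyColumnSource(ToyDataSource):
    """Backend over in-memory columns, the layout a columnar format exposes."""

    def __init__(self, columns):
        self._columns = columns  # dict: column name -> list of values

    def column_names(self):
        return list(self._columns)

    def entries(self):
        names = self.column_names()
        for values in zip(*(self._columns[n] for n in names)):
            yield dict(zip(names, values))


def mean_of(source, column):
    """Analysis code written once against the interface, for any backend."""
    values = [entry[column] for entry in source.entries()]
    return sum(values) / len(values)


# The same analysis function runs unchanged on both backends.
csv_source = ToyCsvSource("pt,eta\n10.0,0.5\n20.0,-1.0\n")
col_source = ToyColumnSource({"pt": [10.0, 20.0], "eta": [0.5, -1.0]})
csv_mean = mean_of(csv_source, "pt")
col_mean = mean_of(col_source, "pt")
```

Supporting a new format then amounts to adding one more class that implements the interface, while the analysis code stays untouched.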
The goal of this project is to enable RDataFrame to process Apache Parquet files.
Related existing components are RArrowDS, which supports Apache Arrow, and the RDataSource backends for ROOT TFiles or CSV.
The expected result is a working RDataSource, tentatively named RParquetDS, that can read Parquet files and have them processed by ROOT via RDataFrame.