The LHCb experiment software stack is based on the Gaudi framework, which is designed to provide a common environment for simulation, filtering, reconstruction, and analysis applications for High Energy Physics (HEP) experiments. Gaudi stores its data in the format of ROOT, a data analysis framework widely used in the HEP community, because that format allows flexible and efficient storage of many types of objects, including complex data structures.
User-level data analysis typically deals with much simpler data structures, namely arrays of values corresponding to particle properties, with one event (or particle candidate) per row, which the framework stores as ROOT TTrees. Traditionally, these trees have been analysed using the ROOT tools, either in C++ or through their Python bindings. Lately, however, the advent of high-performance, open-source Python scientific computing tools, such as NumPy and pandas, and of powerful machine learning frameworks, such as scikit-learn and TensorFlow, has shifted the analysis paradigm. This creates the need for an intermediate step in which analysis data are converted to the HDF format for use with these tools, making analysis workflows unnecessarily complex.
The GSoC participant will integrate the HDF file format into the Gaudi framework as a possible output format for these user-level data analysis tools, thus helping to bring the Python scientific computing tools into the day-to-day arsenal of HEP physicists and opening the way for more streamlined analysis workflows. Several compression algorithms for the HDF file format (such as blosc and gzip) can be benchmarked and compared with the original ROOT format to choose the most suitable one. If time allows, support for the Apache Parquet data format, which is also compatible with pandas, could be implemented as well and benchmarked against HDF.
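
The compression comparison mentioned above could be organised around a small timing harness like the sketch below. It is only an illustration under stated assumptions: it does not touch HDF, ROOT, or Gaudi. It uses compressors from the Python standard library as stand-ins (zlib implements the DEFLATE algorithm used by gzip; blosc is not included), and the function names `make_column` and `benchmark`, as well as the generated data, are hypothetical.

```python
import array
import bz2
import lzma
import random
import time
import zlib


def make_column(n_rows, seed=0):
    """Build a hypothetical column of float64 values, serialized to bytes."""
    rng = random.Random(seed)
    # Rounded Gaussian values; purely illustrative, not real detector data.
    values = array.array(
        "d", (round(rng.gauss(5000.0, 300.0), 1) for _ in range(n_rows))
    )
    return values.tobytes()


def benchmark(payload, codecs):
    """Return {name: (compression ratio, seconds spent compressing)}."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(payload)
        elapsed = time.perf_counter() - start
        # A codec is only usable if it restores the input exactly.
        assert decompress(packed) == payload
        results[name] = (len(payload) / len(packed), elapsed)
    return results


# Stand-in codecs from the standard library (an assumption of this sketch).
CODECS = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

if __name__ == "__main__":
    column = make_column(100_000)
    for name, (ratio, seconds) in benchmark(column, CODECS).items():
        print(f"{name:8s} ratio={ratio:5.2f} time={seconds * 1e3:7.1f} ms")
```

A real benchmark would replace the stand-in codecs with the filters actually available for the chosen output format, use representative analysis data, and measure read times and file sizes alongside write times.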