YODA is a statistical toolkit for binned data, mainly used by the Rivet collider-event analysis package. Rivet analyses simulated events from new models of physics processes, and statistically compares them to LHC data via YODA: this is a key way in which we test LHC data against ever-improving theory models, including proposals of new physics beyond the Standard Model.
YODA was initially designed with a focus on 1D histograms, for which it was feasible to use a structured-text data format that could be easily read and edited by hand. But the increasing precision and detail of LHC data and modelling have led to growth in both uncertainty calculations and multi-dimensional histogramming.
A new C++ structure was added to YODA in GSoC 2020, generalising the data types and in particular allowing binned storage and manipulation of arbitrary data types. But even with gzipping, the text data format is now straining under the number and size of data objects that need to be stored. We also currently have no way to connect the arbitrary stored types to the I/O system. A better binary format is needed, both for performance and flexibility.
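As a rough illustration of why text storage strains at scale, the sketch below compares a text encoding with a packed binary encoding of the same floating-point bin contents. It uses only the Python standard library and is neither the YODA text format nor HDF5: the function names and the sample values are hypothetical, chosen only to show the size difference and the exact round-trip of binary doubles.

```python
import struct


def encode_text(values):
    """Serialise values as text, one repr-formatted float per line."""
    return "".join(f"{v!r}\n" for v in values).encode("ascii")


def encode_binary(values):
    """Serialise values as packed little-endian 64-bit floats."""
    return struct.pack(f"<{len(values)}d", *values)


def decode_binary(blob):
    """Recover the list of floats written by encode_binary."""
    count = len(blob) // 8  # each double occupies 8 bytes
    return list(struct.unpack(f"<{count}d", blob))


# Hypothetical bin contents (an assumption for illustration only).
bin_contents = [(i + 1) / 7.0 for i in range(10_000)]

text_blob = encode_text(bin_contents)
binary_blob = encode_binary(bin_contents)

print("text bytes:  ", len(text_blob))
print("binary bytes:", len(binary_blob))
print("round-trip exact:", decode_binary(binary_blob) == bin_contents)
```

The binary encoding has a fixed cost of 8 bytes per value, whereas the text encoding needs one character per printed digit plus a separator. A real format such as HDF5 adds metadata, typing and chunked layout on top of this basic idea.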
This project will connect the YODA statistics objects to an efficient, parallel-writeable I/O format.
To meet the needs of data simulation and analysis on large parallel-computing (HPC) facilities, we have chosen the HDF5 data standard as the basis for the new YODA format. HDF5 is a binary format with strong support in data science, programming interfaces in C, C++ and Python, and support for parallel writing from multiple processes.
Familiarity with C++ and Git is essential; HDF5 and Cython (for connecting C++ to Python) can be learned in situ.