In this blog post, I’ll go over the work I have done as a contributor to the GSoC 2022 program with CERN-HSF under the project “ROOT - Machine Learning Developments : Batch Generator for training machine learning models”.
I’m a recent Computer Science and Engineering graduate who has always had a keen sense of amazement for Physics and Science in general. That’s why it was like a dream come true when I got this opportunity, via Google Summer of Code, to contribute to High Energy Physics experiments and the study of the universe while using the skills and knowledge of my own domain, Computer Science.
The Toolkit for Multivariate Analysis (TMVA) is a multi-purpose machine learning toolkit integrated into the ROOT scientific software framework and used in many particle physics data analyses and applications. Since it is part of the ROOT data analysis framework, it comes with an automatically generated Python interface, which closely follows the C++ interface. The goal of this project is to develop a generator in C++ and Python that reads data from ROOT I/O and feeds it to Python machine learning tools such as TensorFlow/Keras and PyTorch. The main aim of the generator is to feed data from the ROOT I/O system into model training efficiently, keeping in memory only the data required for the current batch of events rather than the whole data set.
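To make the idea concrete, here is a minimal Python sketch of the batching pattern described above. It is not the project's actual implementation: `load_chunk` is a hypothetical callable standing in for the ROOT I/O read, the names `chunk_size` and `batch_size` are my own, and an in-memory NumPy array plays the role of the file on disk.

```python
import numpy as np


def batch_generator(load_chunk, n_entries, chunk_size, batch_size):
    """Yield batches of rows while holding only one chunk (plus leftovers) in memory.

    load_chunk(start, stop) is a hypothetical callable returning a 2D NumPy
    array with rows [start, stop); a real version would wrap the ROOT I/O read.
    """
    leftover = None
    for start in range(0, n_entries, chunk_size):
        stop = min(start + chunk_size, n_entries)
        chunk = load_chunk(start, stop)
        # Prepend rows that did not fill a whole batch in the previous chunk.
        if leftover is not None and len(leftover) > 0:
            chunk = np.concatenate([leftover, chunk])
        n_full = (len(chunk) // batch_size) * batch_size
        for i in range(0, n_full, batch_size):
            yield chunk[i:i + batch_size]
        leftover = chunk[n_full:]
    # Emit the final, possibly smaller, batch.
    if leftover is not None and len(leftover) > 0:
        yield leftover


# Stand-in for a data set on disk: 10 events with 5 features each (assumption).
data = np.arange(50, dtype=np.float32).reshape(10, 5)


def load_chunk(start, stop):
    return data[start:stop]


batches = list(batch_generator(load_chunk, n_entries=10, chunk_size=4, batch_size=3))
```

Because the function is a generator, each batch is produced only when the training loop asks for it, so a loop such as `for batch in batch_generator(...)` never needs the full data set in memory at once.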
During the community bonding period, I planned out the to-dos for the project and discussed, modified and iterated on my project ideas. I familiarized myself with ROOT, TMVA, RDataFrame, RTensor, TTree and other required data structures and tools via tutorials, to get a head start before the coding period. I then presented the revised project goals and timeline to my mentors in a PowerPoint presentation, and we agreed to begin coding.
While working, it became evident that the Batch Generator project is experimental and requires researching different approaches before implementing it in, or packaging it directly into, TMVA. Thus, I have been working in a stand-alone repository over here, where I made pull requests for the experimental prototype.
It has been a great learning experience working on the batch generator project. I plan to continue contributing to its improvement in the future as well; some of the points under future scope are listed below:
I would like to express my gratitude to my mentors Lorenzo Moneta, Omar Zapata, Sanjiban Sengupta and Sitong An, who have been extremely supportive since the beginning of the program and whose guidance helped me direct my efforts in the right direction.
You can find my final GSoC submission here. I have also been documenting my GSoC 2022 journey with CERN-HSF in blog posts, which can be found over here.
- Sanchi