In this blog post, I go over the work I have done up to the midpoint of the program as a contributor to GSoC 2022 with CERN-HSF, under the project “ROOT - Machine Learning Developments : Batch Generator for training machine learning models”.
Google Summer of Code 2022
I’m a recent Computer Science and Engineering graduate who has always had a keen sense of amazement for Physics and Science in general. That is why I was elated when Google Summer of Code gave me the opportunity to contribute to High Energy Physics experiments and the study of the universe while using the skills and knowledge of my own domain, Computer Science.
Toolkit for Multivariate Analysis (TMVA) is a multi-purpose machine learning toolkit integrated into the ROOT scientific software framework, and it is used in many particle physics data analyses and applications. Since it is part of the ROOT data analysis framework, it comes with an automatically generated Python interface that closely follows the C++ interface. The goal of this project is to develop a generator in C++ and Python that reads data from ROOT I/O and feeds it to Python machine learning tools such as TensorFlow/Keras and PyTorch. The main aim of the generator is to load data from the ROOT I/O system efficiently for training machine learning models, keeping in memory only the data required to train on one batch of events rather than the whole dataset.
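To make the "only one batch in memory" idea concrete, here is a minimal sketch in plain Python. It is not the project's code and does not use ROOT: `fake_event_source` is a hypothetical stand-in for reading events from ROOT I/O, and `batch_generator` is an illustrative name. The point is only that the generator pulls rows lazily and holds at most `batch_size` of them at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, float]


def fake_event_source(n_events: int) -> Iterator[Event]:
    """Hypothetical stand-in for ROOT I/O: yields one event (row) at a time."""
    for i in range(n_events):
        yield (float(i), float(i) * 2.0)


def batch_generator(rows: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Yield lists of at most `batch_size` rows, reading the source lazily."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # Only the rows of the current batch are materialised here.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Ten events in batches of four: the last batch is smaller.
batches = list(batch_generator(fake_event_source(10), batch_size=4))
batch_sizes = [len(b) for b in batches]
```

In a real training loop the batches would be consumed one by one (for example by a Keras or PyTorch training step) instead of being collected into a list as done here for inspection.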
During the community bonding period, I planned out the to-dos for the project and discussed, modified and iterated on my project ideas. I familiarized myself with ROOT, TMVA, RDataFrame, RTensor, TTree and other required data structures and tools via tutorials, to get a head start before the coding period. I then presented the revised project goals and timeline to my mentors in a PowerPoint presentation, and we agreed to begin coding.
While working on it, it became evident that the Batch Generator is an experimental project that requires researching different approaches before implementing it in, or packaging it directly into, TMVA. I have therefore been working in a stand-alone repository over here, where I make pull requests for the experimental prototype.
In the third approach, I have replaced this slice method, which removes the extra memory copies. I discovered that initializing the RTensor directly as a ‘view’ over the existing data achieves this goal. - Link to Code
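The difference between copying a slice and taking a view can be illustrated with a small Python analogy that uses only the standard library. This is not the RTensor code: the flat `array` buffer stands in for a chunk of loaded data, and `batch_copy` and `batch_view` are hypothetical helper names. Slicing an `array` allocates new memory, while a `memoryview` slice refers to the same underlying buffer.

```python
from array import array

# A flat buffer holding n_rows x n_cols doubles, stored row by row.
n_rows, n_cols = 6, 3
chunk = array("d", (float(i) for i in range(n_rows * n_cols)))


def batch_copy(buffer: array, start_row: int, batch_size: int, n_cols: int) -> array:
    """Return the rows of one batch as a new array (slicing an array copies)."""
    return buffer[start_row * n_cols:(start_row + batch_size) * n_cols]


def batch_view(buffer: array, start_row: int, batch_size: int, n_cols: int) -> memoryview:
    """Return the rows of one batch as a view that shares the buffer's memory."""
    return memoryview(buffer)[start_row * n_cols:(start_row + batch_size) * n_cols]


copied = batch_copy(chunk, start_row=2, batch_size=2, n_cols=n_cols)
viewed = batch_view(chunk, start_row=2, batch_size=2, n_cols=n_cols)

# Change the first element of row 2 in the original buffer.
chunk[2 * n_cols] = -1.0

copy_first = copied[0]  # the copy keeps the old value
view_first = viewed[0]  # the view sees the change, so no data was duplicated
```

The trade-off is the usual one for views: no extra memory traffic per batch, but the view is only valid while the underlying buffer is alive and unchanged in size.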
I would like to express my gratitude to my mentors Lorenzo Moneta, Omar Zapata, Sanjiban Sengupta and Sitong An, who have been extremely supportive since the beginning of the program, guiding me and helping me direct my efforts in the right direction.
I would like to highlight the goals of the project for the remaining half of the program as follows:
I have also been documenting my GSoC 2022 journey with CERN-HSF in blog posts, which can be found over here.