Hello peeps, this is Kush Kothari, a CS student from Mumbai, India. This is going to be a short report on my work on Uproot. The primary project goal is to upgrade Uproot to use AwkwardArrays v2 and to create the
uproot.dask function. This function is a reimplementation of
uproot.lazy, and now uses Dask’s ability to delay a task’s computation. This project is a major revamp of the structure and codebase of Uproot and the changes will result in a new major version of Uproot, i.e. Uproot v5.
The work I did over the past 6 weeks is majorly split over 6 Pull Requests into Uproot.
This PR begun as the evaluation task for GSoC and was continued into the coding period. It introduces the
uproot.dask function using the Dask Array collection. Setting
library=np makes the function return a Python dict of dask-arrays, each representing a single TBranch of the root file. These dask-arrays are computed into Numpy arrays on calling
.compute(). This also implements some features previously present in
uproot.lazy like the
step_size parameter that can be used to control the size of chunks in the dask arrays.
This PR solves the feature request of delaying the opening of the ROOT files. Sometimes, reading the metadata from ROOT files is itself quite an expensive operation. We may want to delay this using dask. However, to build the dict of dask arrays, we need to know the keys. Now, when
uproot.dask([filename1, filename2...], open_files=False) is called, Uproot only opens the first file to read the key names. Assuming the same key names in all files and making use of dask’s “unknown chunk sizes” feature, the opening of the rest of the files is delayed through a dask delayed node.
Uproot 3 had a
numentries function which was not ported to Uproot 4. This feature was requested here and introduced in PR 609. This function skips reading a lot of the metadata in the ROOT file, thus quickly providing the value of
fEntries in the TTree metadata.
At this point, Uproot had transitioned to Uproot 5 on the main branch, and Uproot 4 in a secondary branch. A small PR just removed
uproot.lazy’s implementation from Uproot 5, while keeping the docstring intact (since Uproot 4 and 5 share online documentation). Instead, calling
uproot.lazy in Uproot 5 raises a
This was a major one. All instances of awkward usage in Uproot were upgraded to use Awkward v2. This is part of the change from Uproot 4 to Uproot 5. This PR involved a lot of debugging and running of tests. I am really thankful for all the support I received from my mentor Dr Jim Pivarski during this time.
Currently, a work-in-progress, this PR extends the
uproot.dask function to use the newly developed dask_awkward collection. While a basic working model is ready, I am currently working on optimizing the Dask graph with the help of Douglas Davis, the maintainer of dask_awkward.