Hello peeps, this is Kush Kothari, a CS student from Mumbai, India. This is a short report on my work on Uproot. The primary goals of the project are to upgrade Uproot to use Awkward Array v2 and to create the uproot.dask function. This function is a reimplementation of uproot.lazy that uses Dask’s ability to delay a task’s computation. The project is a major revamp of the structure and codebase of Uproot, and the changes will result in a new major version, Uproot v5.
The work I did over the past 12 weeks is mainly split across 11 pull requests into Uproot.
This PR began as the evaluation task for GSoC and continued into the coding period. It introduces the uproot.dask function using the Dask Array collection. Setting library='np' makes the function return a Python dict of dask arrays, each representing a single TBranch of the ROOT file. These dask arrays are computed into NumPy arrays on calling .compute(). The PR also implements some features previously present in uproot.lazy, like the step_size parameter, which controls the size of chunks in the dask arrays.
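As a rough illustration of what step_size controls, here is a minimal sketch. The entry_chunks helper is hypothetical and only mimics the idea of cutting a range of entries into chunks of at most step_size entries; it is not Uproot's internal code. The load_lazily wrapper is also hypothetical and simply forwards to uproot.dask with the parameters described above, so it needs Uproot 5 and dask installed to be called.

```python
def entry_chunks(num_entries, step_size):
    """Split [0, num_entries) into consecutive (start, stop) ranges
    holding at most step_size entries each (illustrative helper only)."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    return [
        (start, min(start + step_size, num_entries))
        for start in range(0, num_entries, step_size)
    ]


def load_lazily(files, step_size):
    """Hypothetical wrapper around uproot.dask; requires uproot 5 and dask."""
    import uproot  # imported here so entry_chunks works without uproot

    # With library="np" the result is a dict of dask arrays, one per TBranch.
    return uproot.dask(files, library="np", step_size=step_size)


print(entry_chunks(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

With a real file, something like load_lazily("events.root:Events", 100_000)["pt"].compute() would then give a NumPy array, where the file, tree, and branch names are made up for this example.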
This PR addresses a feature request to delay the opening of ROOT files. Reading the metadata from ROOT files can itself be quite expensive, so we may want to delay it using dask. However, to build the dict of dask arrays, we need to know the keys. Now, when uproot.dask([filename1, filename2...], open_files=False) is called, Uproot opens only the first file to read the key names. Assuming all files have the same key names, and making use of dask’s “unknown chunk sizes” feature, the opening of the remaining files is deferred through a dask delayed node.
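The idea can be sketched in plain Python, without dask. Everything here is a stand-in: open_file fakes an expensive ROOT file open, the file names and branch contents are invented, and a closure plays the role of the dask delayed node. The point is only that the keys come from the first file, while the other files (and therefore the length of each array) stay unknown until something is computed.

```python
opened = []  # records which (made-up) files were actually opened


def open_file(name):
    """Stand-in for an expensive ROOT file open; returns {branch: values}."""
    opened.append(name)
    fake_contents = {
        "a.root": {"pt": [1.0, 2.0], "eta": [0.1, 0.2]},
        "b.root": {"pt": [3.0], "eta": [0.3]},
    }
    return fake_contents[name]


def lazy_branches(filenames):
    """Open only the first file to learn the keys; defer everything else."""
    keys = list(open_file(filenames[0]))

    def make_loader(key):
        # Nothing is opened until load() is called, similar in spirit
        # to a delayed node in a dask graph.
        def load():
            values = []
            for name in filenames:
                values.extend(open_file(name)[key])
            return values

        return load

    return {key: make_loader(key) for key in keys}


branches = lazy_branches(["a.root", "b.root"])
print(sorted(branches))  # ['eta', 'pt']
print(opened)            # ['a.root']
print(branches["pt"]())  # [1.0, 2.0, 3.0]
```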
Uproot 3 had a numentries function that was not ported to Uproot 4. This feature was requested here and introduced in PR 609. The function skips reading much of the metadata in the ROOT file, and so quickly provides the value of fEntries in the TTree metadata.
At this point, Uproot had transitioned to Uproot 5 on the main branch, with Uproot 4 on a secondary branch. A small PR removed the implementation of uproot.lazy from Uproot 5 while keeping the docstring intact (since Uproot 4 and 5 share online documentation). Calling uproot.lazy in Uproot 5 now raises a NotImplementedError.
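The pattern is simple enough to show in a few lines. This is a sketch of the general approach, not the actual Uproot source: the docstring text and the error message are placeholders I made up.

```python
def lazy(*args, **kwargs):
    """
    Placeholder for the original uproot.lazy docstring, kept so that
    the shared online documentation still has something to render.
    """
    # The body is gone; only the docstring and the error remain.
    raise NotImplementedError(
        "lazy arrays are not available in this version; use uproot.dask instead"
    )


try:
    lazy("some-file.root")
except NotImplementedError as err:
    print(err)
```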
This was a major one. All uses of Awkward Array in Uproot were upgraded to Awkward v2, as part of the change from Uproot 4 to Uproot 5. This PR involved a lot of debugging and running of tests. I am really thankful for all the support I received from my mentor Dr Jim Pivarski during this time.
Currently a work in progress, this PR extends the uproot.dask function to use the newly developed dask_awkward collection. While a basic working model is ready, I am currently working on optimizing the Dask graph with the help of Douglas Davis, the maintainer of dask_awkward.
After referring to code from dask_awkward and some internal helper functions in dask, the from_map optimization was implemented for library='ak'. During this time, after some discussion with the Uproot and Dask-Awkward teams, it was decided that a similar Blockwise optimization would be needed for dask NumPy arrays too.
#679 introduces a new Blockwise implementation for the dask array collection. This involved implementing a from_map function, which was not yet present in the dask.array module. The from_map function takes a callable object that bijectively maps function calls to chunks in the arrays.
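To make the "one call, one chunk" idea concrete, here is a small plain-Python sketch. It is not the code from #679 and it does not use dask's Blockwise layer: ReadChunk, from_map, and compute below are all hypothetical stand-ins, with a dict of (name, index) keys loosely imitating the shape of a dask task graph.

```python
class ReadChunk:
    """Hypothetical callable: each call produces exactly one chunk."""

    def __init__(self, data):
        self.data = data

    def __call__(self, entry_range):
        start, stop = entry_range
        return self.data[start:stop]


def from_map(func, inputs, name="from-map"):
    """Build a graph-like dict with exactly one task per input."""
    return {(name, i): (func, arg) for i, arg in enumerate(inputs)}


def compute(graph):
    """Tiny stand-in for a scheduler: run every task in key order."""
    results = []
    for key in sorted(graph):
        func, arg = graph[key]
        results.append(func(arg))
    return results


reader = ReadChunk(list(range(10)))
graph = from_map(reader, [(0, 4), (4, 8), (8, 10)])
print(len(graph))      # 3
print(compute(graph))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because every chunk comes from the same callable applied to a different input, a graph built this way has a regular structure, which is what makes it a good fit for a Blockwise-style optimization.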
#703 then applied the same optimization to all code paths.
Issue #697 showed that the existing code failed when TBranches were empty. The issue turned out to be in the code that calculated the typetracer array, which was then used by dask-awkward.
#700 fixed this and added tests for it.
Documentation for this work is in progress in #702. This PR may not be merged until December 2022, the target release date for Uproot v5. The documentation covers the uproot.dask docstring and the Getting Started Guide.
The developments with uproot.dask will be demonstrated in a PyHEP lightning talk. The code for this talk will be uploaded here.
Making library='ak' read only the TBranches that are used in the dask graph. This is currently underway and will be done over the next few weeks.
Supporting library='pd'. This will have to wait for some progress in the corresponding awkward-pandas project.