Kush

Uproot + Dask

Introduction

Hello peeps, this is Kush Kothari, a CS student from Mumbai, India. This is a short report on my work on Uproot. The primary goal of the project is to upgrade Uproot to use Awkward Array v2 and to create the uproot.dask function. This function reimplements uproot.lazy using Dask’s ability to delay a task’s computation. The project is a major revamp of Uproot's structure and codebase, and the changes will result in a new major version, Uproot v5.

The work I did over the past 12 weeks is split mainly across 11 pull requests to Uproot.

Dask Arrays

This PR began as the evaluation task for GSoC and continued into the coding period. It introduces the uproot.dask function using the Dask Array collection. Setting library='np' makes the function return a Python dict of dask arrays, each representing a single TBranch of the ROOT file. These dask arrays are computed into NumPy arrays on calling .compute(). The PR also implements some features previously present in uproot.lazy, such as the step_size parameter, which controls the size of the chunks in the dask arrays.

Delay in opening files

This PR addresses a feature request to delay the opening of ROOT files. Sometimes, reading the metadata from ROOT files is itself quite an expensive operation, and we may want to delay it using dask. However, to build the dict of dask arrays, we need to know the keys. Now, when uproot.dask([filename1, filename2...], open_files=False) is called, Uproot only opens the first file to read the key names. Assuming that all files have the same key names, and making use of dask’s “unknown chunk sizes” feature, the opening of the remaining files is delayed through a dask delayed node.
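The idea of combining delayed nodes with unknown chunk sizes can be sketched with plain dask, without Uproot. This is not the code from the PR: the open_and_read helper and the in-memory FAKE_FILES dict are hypothetical stand-ins for opening a ROOT file and reading one TBranch from it.

```python
import dask
import dask.array as da
import numpy as np

# Hypothetical stand-in for the contents of three ROOT files.
FAKE_FILES = {
    "a.root": np.array([1.0, 2.0]),
    "b.root": np.array([3.0]),
    "c.root": np.array([4.0, 5.0, 6.0]),
}
opened = []  # records which "files" have been opened so far


def open_and_read(filename):
    """Pretend to open a file and read one branch (hypothetical helper)."""
    opened.append(filename)
    return FAKE_FILES[filename]


# One delayed node per file. The length of each piece is not known until
# the file is opened, so the shape is declared as NaN (unknown chunk size).
pieces = [
    da.from_delayed(
        dask.delayed(open_and_read)(name), shape=(np.nan,), dtype=np.float64
    )
    for name in FAKE_FILES
]
lazy = da.concatenate(pieces)

opened_before_compute = list(opened)  # still empty: nothing was opened yet
values = lazy.compute()  # now every file is opened and read
```

The price of the delay is that the array's length is unknown until it is computed, so operations that need the exact shape up front are not available on it.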

Num_entries

Uproot 3 had a numentries function which was not ported to Uproot 4. This feature was requested here and introduced in PR 609. This function skips reading a lot of the metadata in the ROOT file, thus quickly providing the value of fEntries in the TTree metadata.

Removing uproot.lazy

At this point, Uproot had transitioned to Uproot 5 on the main branch, and Uproot 4 in a secondary branch. A small PR just removed uproot.lazy’s implementation from Uproot 5, while keeping the docstring intact (since Uproot 4 and 5 share online documentation). Instead, calling uproot.lazy in Uproot 5 raises a NotImplementedError.

Awkward v2 update

This was a major one. All instances of awkward usage in Uproot were upgraded to use Awkward v2. This is part of the change from Uproot 4 to Uproot 5. This PR involved a lot of debugging and running of tests. I am really thankful for all the support I received from my mentor Dr Jim Pivarski during this time.

Dask-Awkward Support

Currently a work in progress, this PR extends the uproot.dask function to use the newly developed dask_awkward collection. While a basic working model is ready, I am now optimizing the Dask graph with the help of Douglas Davis, the maintainer of dask_awkward.

Post Midterm Period

Post-midterm Dask-Awkward Optimization

After referring to code from dask_awkward and some internal helper functions in dask, the from_map optimization was implemented for library='ak'. During this time, after some discussion with the Uproot and Dask-Awkward teams, it was decided that a similar Blockwise optimization would be needed for dask NumPy arrays too.

Blockwise Optimization

#679 introduces a new Blockwise implementation for the dask array collection. This involved implementing a from_map function, which was not yet present in the dask.array module. The from_map function takes a callable object that bijectively maps function calls to chunks in the arrays.

#703 then applied the same optimization to all code paths.
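The "one call per chunk" idea behind from_map can be sketched as follows. This is a simplified illustration, not the implementation from #679: it builds a plain dict graph instead of a real Blockwise layer, and the ReadChunk class, the from_map_sketch function and the in-memory BRANCH_DATA array are hypothetical names invented for the example.

```python
import dask.array as da
import numpy as np

# Hypothetical stand-in for one TBranch holding 10 entries.
BRANCH_DATA = np.arange(10, dtype=np.float64)


class ReadChunk:
    """Callable that reads one entry range (hypothetical, for illustration)."""

    def __init__(self, source):
        self.source = source

    def __call__(self, start, stop):
        return self.source[start:stop]


def from_map_sketch(func, ranges, dtype, name):
    """Build a 1-D dask array with exactly one call to func per chunk."""
    # Each graph key (name, i) maps to a single task: func(start, stop).
    graph = {
        (name, i): (func, start, stop) for i, (start, stop) in enumerate(ranges)
    }
    chunks = (tuple(stop - start for start, stop in ranges),)
    return da.Array(graph, name, chunks, dtype=dtype)


ranges = [(0, 4), (4, 8), (8, 10)]
lazy = from_map_sketch(ReadChunk(BRANCH_DATA), ranges, np.float64, "read-branch")
result = lazy.compute()
```

A true Blockwise layer describes the same mapping symbolically rather than as a materialized dict, which is what lets dask fuse it with later blockwise operations.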

Empty TBranches

Issue #697 showed that the existing code failed when TBranches were empty. The issue turned out to be in the code that calculated the typetracer array, which was then used by dask-awkward.

#700 fixed this and added tests for it.

Documentation

Documentation for the work done is in progress in #702. This PR may not be merged until December 2022, the target release date for Uproot v5. The documentation covers the uproot.dask docstring and the Getting Started Guide.

PyHEP 2022 Lightning Talk

The developments with uproot.dask will be demonstrated in a PyHEP lightning talk. The code for this talk will be uploaded here.

Future Work

  • Having library='ak' read only the TBranches that are used in the Dask graph. This is currently underway and should be done over the next few weeks.
  • Implementing library='pd'. This will have to wait for some progress in the corresponding awkward-pandas project.