IRIS-HEP - Uproot-Dask

Description

Uproot is a pure Python reader and writer of the ROOT file format, the primary format for High Energy Physics data. It converts ROOT files on disk to and from NumPy arrays, Awkward Arrays, and Pandas DataFrames for analysis.

Uproot is primarily “eager,” in that a request for ROOT data to be read into memory is executed immediately, with the exception of one function, uproot.lazy. This function is currently only supported by Uproot’s Awkward Array backend, and it relies on the VirtualArray and PartitionedArray array types that are being deprecated in favor of a new dask-awkward collection type. Dask is an industry standard library for delayed and distributed computation in Python.

The goal of this project would be to reimplement uproot.lazy (now uproot.dask) using the new dask.awkward collection. In addition, the new function should also support the NumPy and Pandas backends, leveraging the dask.array and dask.dataframe collection types in Dask.

It is part of a larger project to update Uproot 4 to use Awkward Array version 2, of which the migration to Dask is one part. uproot.dask will be a key feature of Uproot version 5.

Project tasks

As written, these tasks are not sequential: implementation will likely be iterative, and your understanding should grow throughout the process.

Expected results

The new uproot.dask function should return a Dask DAG node (high level), suitable for further processing in dask.array, dask.dataframe, or the new dask.awkward. All modes of processing: synchronous, asynchronous, multithreaded, multiprocessing, and distributed, should work without unnecessary copying of data. This should be demonstrated with diagnostics.

Evaluation Task

Interested students please contact Jim (pivarski@princeton.edu) for an evaluation task.

Requirements

Mentors

Additional Information

Corresponding Project

Participating Organizations