Awkward Array Operations

Description

Most particle physics analysis today is performed by physicists writing programs to traverse nested data structures. These one-time analysis programs suffer from several issues:

In other academic fields and in data science, these issues are avoided by expressing analysis logic in SQL or a suite of array operations in MATLAB or Numpy. Particle physics, however, relies crucially on variable-sized, nested data structures that don’t fit neatly into tables or arrays. Every proton collision at the LHC produces a different number of electrons, gluons, and quarks with complex interrelationships.

We have been developing extensions to array programming concepts for nested, heterogeneous, and cross-linked data in a library called awkward-array. This library follows the syntax of Numpy, but for complex structures:

>>> import awkward

>>> array = awkward.fromiter(
...     [[1.1, 2.2, None, 3.3, None],
...      [4.4, [5.5]],
...      [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]])

>>> (array + 100).tolist()
[[101.1, 102.2, None, 103.3, None],
 [104.4, [105.5]],
 [{'x': 106, 'y': {'z': 107}}, None, {'x': 108, 'y': {'z': 109}}]]

Like Numpy, a single expression performs a calculation across a whole dataset, which alleviates the tradeoff between interactivity and performance. The data are stored contiguously by type (column-oriented), a layout that is fully portable to GPUs. The set of awkward-array operations is broader than what flat-array processing requires, and we are discovering new operations by translating traditional particle physics programs into array-centric scripts.
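To make the column-oriented idea concrete, here is a minimal sketch of one common way to lay out a jagged array: a flat content buffer plus an offsets buffer marking where each sublist starts and stops. The names content, offsets, and to_lists are chosen for illustration and are not taken from the awkward-array source; the actual library's internal layout may differ.

```python
import numpy as np

# Assumed columnar layout for the jagged array [[1.1, 2.2, 3.3], [], [4.4, 5.5]]:
# all numbers sit in one contiguous, single-typed buffer...
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
# ...and sublist i spans content[offsets[i]:offsets[i + 1]].
offsets = np.array([0, 3, 3, 5])

def to_lists(offsets, content):
    """Rebuild nested Python lists from the flat buffers (hypothetical helper)."""
    return [content[start:stop].tolist()
            for start, stop in zip(offsets[:-1], offsets[1:])]

# An elementwise operation touches only the content buffer, in one
# vectorized expression; the nesting structure (offsets) is reused as-is.
shifted = content + 100
result = to_lists(offsets, shifted)
print(result)
```

Because the structure lives entirely in the offsets, the arithmetic never has to walk the nesting, which is what lets a single expression cover the whole dataset.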

In this project, we would like you to create a library of precompiled awkward-array operations. Our current implementation of awkward-array is built from Numpy primitives, which is portable but not as efficient as dedicated, precompiled routines: each Numpy call makes a separate pass over memory, flushing the CPU cache. The project will focus on good software engineering principles to build a maintainable infrastructure. We don’t expect an optimized implementation of every operation by the end of the summer, only a clearly organized place to put new implementations as we need them.
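The difference between composing Numpy primitives and a dedicated routine can be sketched with a per-sublist sum over a jagged array stored as flat content plus offsets. Both function names below are hypothetical, and the layout is an assumption made for illustration. The second function is written in Python only to show the shape of the logic a precompiled routine would have; as plain Python it would be slow, and the point is that a compiled version of the same loop reads the data once.

```python
import numpy as np

def jagged_sum_primitives(offsets, content):
    """Per-sublist sum built from Numpy primitives.

    Each call below is a separate pass over memory: cumsum, concatenate,
    two gathers by index, and a subtraction, with temporaries in between.
    """
    cumulative = np.concatenate(([0.0], np.cumsum(content)))
    return cumulative[offsets[1:]] - cumulative[offsets[:-1]]

def jagged_sum_single_pass(offsets, content):
    """The same result from one pass over content.

    This is the loop a dedicated, precompiled routine would run.
    """
    out = np.zeros(len(offsets) - 1, dtype=content.dtype)
    for i in range(len(out)):
        total = 0.0
        for j in range(offsets[i], offsets[i + 1]):
            total += content[j]
        out[i] = total
    return out

# Jagged array [[1.0, 2.0, 3.0], [], [4.0, 5.0]] in the assumed layout.
content = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
offsets = np.array([0, 3, 3, 5])

via_primitives = jagged_sum_primitives(offsets, content)
via_single_pass = jagged_sum_single_pass(offsets, content)
```

Both versions give the sums 6.0, 0.0, and 9.0 for the three sublists, including the empty one.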

Task ideas

(Not all of the above are possible in one summer.)

Expected results

By the end of the summer, we would like to see a well-established library structure. Even if the set of implemented operations is incomplete, it should be clear how the library will grow and be maintained.

Desirable Skills

Mentors

Links

  1. awkward-array repository
  2. Introduction to array-centric analysis for physicists
  3. Presentation on array-centric analysis for physicists

Corresponding Project

Participating Organizations