Deep autoencoders for ATLAS data compression

Short description of the project

Storage is one of the main limiting factors to the recording of information from proton-proton collision events at the Large Hadron Collider, at CERN in Geneva. Hence, the ATLAS experiment at the LHC uses a so-called trigger system, which selects and sends interesting events to the data storage system while throwing away the rest.

However, if interesting events are buried in very large backgrounds and difficult to identify as signal by the trigger system, they will also be discarded together with the background. To alleviate this problem, we plan to reduce the size of the data that is recorded, and study compression algorithms that can be used directly within the trigger system. A variety of compression algorithms is already in use in high energy physics (see e.g. Zhe Zhang and Brian Bockelman. Exploring compression techniques for ROOT IO. CoRR (2017)).

A machine learning (ML) approach will be used to compress ATLAS data, specifically a method using autoencoder (AE) neural networks. In short an AE is a neural network which tries to implement an approximation to the identity, f(x) ≈ x, by using one or more hidden layers with smaller size than the input and output layers. If this is possible, it means that all the information necessary to reproduce the input, x, is contained in the hidden layer, and the data has been compressed. The idea is then to only save this smaller hidden layer representation instead of the current data format, along with the neural network that can recreate the original data.

Moreover, an AE can be used for anomaly detection. This is perhaps the most common use of AEs and works as follows. The AE is first trained on data which is known not to be anomalous. If then the network is presented with a new data point that differs in some significant way from the training data, the AE will not be able to provide a faithful reconstruction at the output layer and hence the data point is considered anomalous. In other words, if the reconstruction error of a data point is larger than some threshold, it is considered anomalous.

The aim of this project is to understand the standalone performance of an existing AE network, and whether this network can be used in combination with standard compression methods. Furthermore, a byproduct of this project is that the student will gain expertise in cutting-edge machine learning techniques, and learn to use them in the context of both data compression and detection of anomalous events.

It will also be interesting to treat this study as a proof-of-principle for future data compression techniques for the ATLAS experiment. For the planned experimental upgrades in 2026, the techniques used in this report may help solving the problem of needing much more storage space than in the past due to the increase of the size of the dataset.

Task ideas

Expected results

An improved autoencoder network with documentation and figures of merit that may include:

Requirements

Required: Good knowledge of UNIX, Python, MatPlotLib or ROOT, Pytorch, Pandas, fastai, ability to install software and run a Jupyter notebook locally

Evaluation tasks

Jupyter and pyTorch exercise of running the existing NN and describing the results.

Mentors

Corresponding Project

Participating Organizations