Development of an auto-tuning tool for the CLUEstering library

Description

CLUE is a fast and fully parallelizable density-based clustering algorithm, optimized for high-occupancy scenarios, where the number of clusters is much larger than the average number of hits in a cluster (Rovere et al. 2020). The algorithm uses a grid spatial index for fast querying of neighbors, and its timing scales linearly with the number of points within the range considered. It is currently used in the CMS and CLIC event reconstruction software for clustering calorimetric hits in two dimensions based on their energy. The CLUE algorithm has been generalized to an arbitrary number of dimensions and to a wider range of applications in CLUEstering, a general-purpose clustering library with a backend implemented in C++ and a Python interface for easier use. The C++ code can run on multiple backends (serial, TBB, GPUs, etc.) thanks to the Alpaka performance portability library. One feature currently missing from CLUEstering, which would be extremely useful for every user, is auto-tuning of the parameters: given the expected number of clusters, the tool would compute the combination of input parameters that results in the best clustering.
For this task, one of the options to be explored is “The Optimizer”, a Python library developed by the Patatrack group of the CMS experiment that provides a collection of optimization algorithms, in particular MOPSO (Multi-Objective Particle Swarm Optimization).
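
To make the auto-tuning idea concrete, the following is a minimal sketch under stated assumptions, not the intended design. It does not use the CLUEstering or The Optimizer APIs: `count_clusters` is a toy stand-in for a clustering run, `radius` is a hypothetical single input parameter, and `tune_radius` is a plain single-objective particle swarm (not MOPSO) that searches for a value yielding the expected number of clusters. All names are illustrative.

```python
import math
import random


def count_clusters(points, radius):
    """Toy stand-in for a clustering run (NOT CLUE): link points closer
    than `radius` and count the connected components with union-find."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < radius:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


def tune_radius(points, expected_clusters, bounds=(0.01, 5.0),
                n_particles=8, n_iterations=20, seed=0):
    """Hypothetical tuner: single-objective particle swarm over `radius`.
    Returns the best radius found and its cluster-count mismatch."""
    rng = random.Random(seed)
    lo, hi = bounds

    def cost(radius):
        # Objective: distance between obtained and expected cluster count.
        return abs(count_clusters(points, radius) - expected_clusters)

    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best_pos = pos[:]                       # per-particle best position
    best_cost = [cost(p) for p in pos]      # per-particle best cost
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g], best_cost[g]  # swarm-wide best

    for _ in range(n_iterations):
        for k in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Inertia + pull toward personal best + pull toward global best.
            vel[k] = (0.5 * vel[k]
                      + 1.5 * r1 * (best_pos[k] - pos[k])
                      + 1.5 * r2 * (g_pos - pos[k]))
            pos[k] = min(hi, max(lo, pos[k] + vel[k]))  # clamp to bounds
            c = cost(pos[k])
            if c < best_cost[k]:
                best_pos[k], best_cost[k] = pos[k], c
                if c < g_cost:
                    g_pos, g_cost = pos[k], c
    return g_pos, g_cost


def make_blobs(centers, n_per_blob=20, spread=0.1, seed=1):
    """Generate small Gaussian blobs around the given 2D centers."""
    rng = random.Random(seed)
    return [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
            for cx, cy in centers for _ in range(n_per_blob)]


points = make_blobs([(0, 0), (3, 0), (0, 3)])
radius, mismatch = tune_radius(points, expected_clusters=3)
print(f"radius={radius:.3f}, mismatch={mismatch}")
```

A real tool would replace `count_clusters` with a call into the clustering library, tune several parameters at once, and likely optimize more than one objective (for example cluster count and a quality score), which is where a multi-objective method such as MOPSO would come in.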

Expected results

Consider the best techniques and tools for the task
Develop an auto-tuning tool for the parameters of CLUEstering
Test it on a wide range of commonly used datasets
Benchmark and profile to identify the bottlenecks of the tool and optimize it

Evaluation Task

Interested students please contact simone.balducci@cern.ch

Technologies

C++, Python

Desirable skills

Experience with development in C++17/20
Experience with GPU computing
Experience with machine learning and optimization techniques
Experience with development of Python libraries

Mentors

Simone Balducci - CERN UNIBO
Felice Pantaleo - CERN

Additional Information

Difficulty level (low / medium / high): medium
Duration: 350 hours
Mentor availability: June-October

Corresponding Project

Patatrack

Participating Organizations

CERN