Optimizing Computing Operations with Machine Learning algorithms

Description

ATLAS Computing Operations requires on-shift manpower to manage its complex infrastructure. This includes, but is not limited to, the Data Management system, the Workload Management system, Conditions and Databases, and the interactive clients. Additionally, the participating computing centres are heterogeneous, providing a variety of software and hardware in different versions and configurations. We see an opportunity to support computing operations with machine learning algorithms that automate many of the manual activities and decision processes that have so far required human intervention.

With the ATLAS Open Analytics Platform (based on Elasticsearch, HDFS, Spark, Graphite and Jupyter) we now have in place a central system that collects critical metrics, which can form the basis of automated operational decisions. With this project we propose to design, implement, and deploy a framework that classifies and clusters these metrics according to operational needs. When a significant event occurs, the framework notifies the operator and proposes potential resolutions. Over time, the framework will learn from the operator's decisions through reinforcement learning algorithms, yielding better classification of events and better proposals for notification or automation.
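As a rough illustration of what "notifying the operator upon a significant event" could mean for a single metric, the sketch below flags samples that deviate strongly from recent history using a rolling z-score. This is only one simple technique chosen for illustration, not the method the project prescribes; the class name `RollingZScoreDetector`, the window size, the threshold, and the sample values are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreDetector:
    """Hypothetical detector: flags a sample that deviates strongly from recent history."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # most recent "normal" samples
        self.threshold = threshold           # z-score above which we call it an event
        self.min_samples = min_samples       # warm-up period before any flagging

    def observe(self, value):
        """Return True if `value` looks like a significant event."""
        is_event = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_event = True
        # Keep flagged samples out of the baseline so an outage
        # does not become the new "normal".
        if not is_event:
            self.history.append(value)
        return is_event


def detect_events(samples, window=20, threshold=3.0):
    """Return the indices of samples flagged as significant events."""
    detector = RollingZScoreDetector(window=window, threshold=threshold)
    return [i for i, v in enumerate(samples) if detector.observe(v)]


if __name__ == "__main__":
    # Made-up transfer throughput values with one sharp drop.
    throughput = [100, 102, 98, 101, 99, 100, 103, 97, 5, 100]
    print(detect_events(throughput))
```

A real deployment would more likely read metrics from the analytics platform and combine several signals; the single-metric detector above only shows the shape of the decision.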

For example, a broken network link between two data centres could trigger a ticket, the blacklisting of either data centre, or the rescheduling of data transfers. The system should propose such resolutions and learn them automatically over time from the decisions of the operators.
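The feedback loop in this example can be sketched as a small advisor that ranks candidate resolutions by how often operators have accepted them for a given event type. This is a deliberately simplified stand-in (an incremental average of accept/reject rewards, with no exploration) rather than a full reinforcement learning method; the class name `ResolutionAdvisor`, the action identifiers, and the event type `network_link_down` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Candidate resolutions taken from the example above; the identifiers are assumptions.
ACTIONS = ("open_ticket", "blacklist_site", "reschedule_transfers")


class ResolutionAdvisor:
    """Hypothetical advisor: ranks resolutions by estimated operator approval."""

    def __init__(self, actions=ACTIONS):
        self.actions = tuple(actions)
        # Per event type: running approval estimate and feedback count per action.
        self.value = defaultdict(lambda: {a: 0.0 for a in self.actions})
        self.count = defaultdict(lambda: {a: 0 for a in self.actions})

    def propose(self, event_type):
        """Return all actions, best estimated approval first (ties keep ACTIONS order)."""
        values = self.value[event_type]
        return sorted(self.actions, key=lambda a: values[a], reverse=True)

    def feedback(self, event_type, action, reward):
        """Record the operator's decision: reward 1.0 = accepted, 0.0 = rejected."""
        self.count[event_type][action] += 1
        n = self.count[event_type][action]
        old = self.value[event_type][action]
        # Incremental mean of all rewards seen so far for this (event, action) pair.
        self.value[event_type][action] = old + (reward - old) / n


if __name__ == "__main__":
    advisor = ResolutionAdvisor()
    # The operator twice prefers rescheduling and once rejects a ticket.
    advisor.feedback("network_link_down", "reschedule_transfers", 1.0)
    advisor.feedback("network_link_down", "reschedule_transfers", 1.0)
    advisor.feedback("network_link_down", "open_ticket", 0.0)
    print(advisor.propose("network_link_down"))
```

Keeping the estimates per event type means that what operators prefer for a broken link does not influence proposals for unrelated events; a fuller approach would also add exploration so that rarely tried resolutions still get proposed occasionally.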

Task ideas

Expected results

The ATLAS Data Management system (Rucio) will serve as the experimental platform for the first iteration of these studies. Care must be taken that the implementation is applicable to other ADC systems as well, since anomalous behaviour usually shows up across multiple systems. For many of the proposed tasks there are already systems in place, and they should be exploited for this project. In detail, we propose the following:

Milestone - Data aggregation

Milestone - Anomaly detection

Milestone - Streaming trigger

Milestone - Operator decision

Milestone - Automation

Requirements

Mentors

Corresponding Project

Participating Organizations