Support for Rucio Users with Natural Language Processing


Rucio is an open-source software framework that lets scientific collaborations organize, manage, monitor, and access their distributed data and dataflows across heterogeneous infrastructures. Data in Rucio is organized using Data Identifiers (DIDs), which have three levels of granularity: files, datasets, and containers. Datasets group sets of files and facilitate bulk operations such as transfers or deletions. Users are permitted to perform certain actions on DIDs, such as downloads, uploads, or transfers.

Different levels of expert support are available to users in case of problems. When a satisfactory answer is not found at the lower support levels, a request from a user or a group of users can be escalated to the Rucio support experts. Because of the large number of support requests, we are looking into methods to assist the support team in answering them. Ideally, support would be provided by an intelligent bot able to process and understand users' requests and then trigger the appropriate action. Natural Language Processing (NLP) is a fairly well-developed technique used in many fields that require the analysis of large amounts of text and the automation of the resulting actions. The objective is to process questions from users and provide satisfactory answers. This could be achieved by prototyping a bot capable of handling user requests up to a certain level of complexity and forwarding only the remaining, most difficult ones to the experts.
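The triage idea described above (answer what the bot can handle, forward the rest to the experts) can be illustrated with a minimal sketch. It is not part of Rucio: the intent names, the keyword sets, the `triage` function, and the 0.3 threshold are all hypothetical and chosen only for illustration, and simple keyword overlap stands in for a real NLP model.

```python
import re
from dataclasses import dataclass

# Hypothetical intents and keywords, invented for illustration only.
INTENT_KEYWORDS = {
    "download": {"download", "get", "fetch"},
    "upload": {"upload", "register"},
    "transfer": {"transfer", "rule", "replica", "replicate"},
}


@dataclass
class Decision:
    intent: str     # best-matching intent, or "unknown"
    score: float    # fraction of that intent's keywords found in the request
    escalate: bool  # True when the request should be forwarded to the experts


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def triage(request, threshold=0.3):
    """Pick the intent with the highest keyword overlap; escalate weak matches."""
    tokens = tokenize(request)
    best_intent, best_score = "unknown", 0.0
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(tokens & keywords) / len(keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return Decision(best_intent, best_score, escalate=best_score < threshold)
```

In this sketch a request such as "How do I download a dataset?" matches the hypothetical `download` intent and is handled by the bot, while a request that matches no keyword is marked for escalation. A real bot would replace the keyword overlap with a trained classifier and tune the threshold on validation data.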

Milestones for GSoC Student

Ideas for Extension

Expected Results

Note that the Rucio Bot project is strategic beyond the GSoC timescale. The essential requirements for declaring the GSoC student's work a success are stated explicitly in the section ‘Milestones’ above. In particular, the second bullet requires deep expertise with NLTK, including creating both the training set and the validation set and running the developed bot on them. The validation set provides the statistics needed to demonstrate the reliability of the bot.
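The role of the training and validation sets can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the labelled requests are invented, and a small hand-written naive Bayes classifier using only the Python standard library stands in for NLTK, which ships its own classifiers that a real implementation would more likely use. The function names `split`, `train`, `classify`, and `accuracy` are hypothetical.

```python
import math
import random
import re
from collections import Counter, defaultdict

# Hypothetical labelled support requests, invented for illustration only.
LABELLED = [
    ("how do I download a dataset", "download"),
    ("download fails with a timeout", "download"),
    ("cannot download files from the container", "download"),
    ("download of my dataset is stuck", "download"),
    ("how can I upload a file", "upload"),
    ("upload to the storage element fails", "upload"),
    ("upload of new files is rejected", "upload"),
    ("permission denied during upload", "upload"),
]


def tokenize(text):
    """Lower-case the text and return its alphabetic words as a list."""
    return re.findall(r"[a-z]+", text.lower())


def split(data, validation_fraction=0.25, seed=0):
    """Shuffle reproducibly and hold out a fraction of the data for validation."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def train(examples):
    """Count labels and per-label word frequencies (a bag-of-words model)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the most probable label under naive Bayes with add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, count in label_counts.items():
        logp = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)


train_set, validation_set = split(LABELLED)
model = train(train_set)
print(f"validation accuracy: {accuracy(model, validation_set):.2f}")
```

The important design point is that the validation examples are never seen during training, so the reported accuracy estimates how the bot would behave on new requests. With a data set this small the number is not meaningful; a realistic evaluation would need far more labelled requests.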

Developments that are achievable but not mandatory within GSoC are listed in the section ‘Ideas for Extension’.


Good Python programming skills. Previous experience with NLTK, ML, and data analysis is a plus.


  1. Rucio GitHub
  2. Rucio web
  3. Journal article
