The High-Luminosity Large Hadron Collider (HL-LHC) will produce roughly an exabyte of data per year for CMS. We are trying to understand data access patterns in Run 2, i.e., the most recent couple of years of CMS operation on the current accelerator, in order to provide input on the design of a more cost-effective data access infrastructure for the HL-LHC. This involves understanding access patterns at the level of tasks, files, blocks, datasets, and objects within files. Open questions include how predictable these patterns are and which replication schemes are optimal.
The raw data describing the access patterns is currently being collected and stored, but we lack the tools to properly curate and analyze it. Moreover, CMS currently lacks a comprehensive modeling framework that would allow access patterns to be predicted for an alternative data access infrastructure.
This project will focus on developing the needed frameworks and tools. Nevertheless, actual analysis of current usage data and modeling of potential novel infrastructures will be required to validate the usefulness of these frameworks and tools.
The analysis will include both global usage information and local access patterns at the distributed XRootD cache spanning Caltech and UCSD.
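
As a rough illustration of the kind of analysis such tools would support, the sketch below counts accesses at a chosen level (file, block, or dataset) and computes the fraction of accesses that are repeat reads of the same unit. The record layout, field names, and sample values are hypothetical and chosen only for illustration; they do not reflect the actual format of the CMS access logs.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Access(NamedTuple):
    # Hypothetical record layout; real access logs will differ.
    task: str
    dataset: str
    block: str
    file: str


def access_counts(records: Iterable[Access], level: str) -> Counter:
    """Count accesses grouped by one field: 'task', 'dataset', 'block' or 'file'."""
    return Counter(getattr(r, level) for r in records)


def reuse_fraction(counts: Counter) -> float:
    """Fraction of accesses that are repeat reads of an already-seen unit."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct unit contributes exactly one first-time access.
    return (total - len(counts)) / total


# Made-up sample records, for illustration only.
records = [
    Access("t1", "/A/RunX/AOD", "/A/RunX/AOD#b1", "f1.root"),
    Access("t2", "/A/RunX/AOD", "/A/RunX/AOD#b1", "f1.root"),
    Access("t2", "/A/RunX/AOD", "/A/RunX/AOD#b2", "f2.root"),
    Access("t3", "/B/RunX/AOD", "/B/RunX/AOD#b1", "f3.root"),
]

for level in ("file", "block", "dataset"):
    counts = access_counts(records, level)
    print(level, dict(counts), reuse_fraction(counts))
```

A high reuse fraction at the dataset level combined with a low one at the file level would suggest that caching or replication decisions are better made on coarser units, which is one of the questions the analysis is meant to inform.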