Acronyms
Grid jargon:
SE = storage element; ours is called gfe02.grid.hep.ph.ic.ac.uk and runs dCache.

HEP (= High Energy Physics) jargon:
Skimming: In particle physics, selecting a small(ish) number of events from a large dataset according to specific criteria and writing them to a new file that is later analysed in detail.
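In practice a skim can be as simple as copying the events that pass a selection into a new file. The sketch below does this with PyROOT; the file names, the tree name "Events" and the "nMuon >= 2" cut are purely illustrative (real CMS skims normally run inside the CMSSW framework):

    import ROOT

    # open the large input dataset and get its event tree (names are examples)
    fin = ROOT.TFile.Open("big_dataset.root")
    tree = fin.Get("Events")

    # write the selected events to a new, much smaller file
    fout = ROOT.TFile("skim.root", "RECREATE")
    skim = tree.CopyTree("nMuon >= 2")   # keep only events passing the cut
    skim.Write()
    fout.Close()
    fin.Close()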

What we do

The CMS experiment, based at the Large Hadron Collider, studies the properties of elementary particles. The data collected amount to tens of petabytes a year. To first order, the data are processed centrally and grouped into datasets according to physics criteria; these datasets are then distributed to and analysed by physicists at the universities. Imperial College, as a major contributor to the CMS experiment, hosts several of these datasets. The smallest unit within a dataset is an 'event', i.e. a collision of particles in the detector that can be analysed independently. To understand these data, a comparable number of simulated ('Monte Carlo') events is also needed: a particle physics process (the production and decay of elementary particles) is simulated according to our best knowledge and the detector response to it is then calculated. This process is generally referred to as "Monte Carlo production".

From a computing point of view there are three classes of jobs:
a) Monte Carlo production:
These jobs are compute intensive (though rarely compute bound - see attached plot) and involve relatively small input data (kilobytes?) and output files of around 5 GB. The code is contained within the CMS software framework (CMSSW). At Imperial these jobs currently run on CentOS 5 based compute nodes and do not use more than 2 GB of memory. These jobs are typically highly standardized and easily scalable, i.e. an ideal case for cloud computing. Having said this, Monte Carlo production is often planned a considerable time in advance, so the 'get up and go' availability of cloud computing is less of an advantage here.
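For orientation, a CMSSW job is steered by a Python configuration file. The skeleton below is a heavily stripped-down illustration of what a Monte Carlo production configuration looks like; the actual generator, simulation and digitisation modules are omitted and all names are examples only:

    import FWCore.ParameterSet.Config as cms

    process = cms.Process("MCPROD")

    # input is tiny: essentially just the number of events to generate
    process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(1000))
    process.source = cms.Source("EmptySource")

    # output is a single ROOT file, up to a few GB for a full production job
    process.output = cms.OutputModule("PoolOutputModule",
        fileName=cms.untracked.string("mc_events.root"))
    process.end = cms.EndPath(process.output)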

b) Standardized Data Analysis:
The CMS collaboration has developed sophisticated (*) software to analyse the data collected by the detector or the corresponding Monte Carlo data. Within the framework of this software (CMSSW), users insert their own code suited to their analysis. Analysis activity tends to come in bursts, i.e. there is a considerable increase in demand when new data become available or shortly before conferences. Here the cloud could act as an overflow to avoid capacity sitting idle during quiet periods. The size of the input datasets (0.5-50 TB) will be the biggest challenge when it comes to porting this application to the cloud. Output data (per job) are expected to be in the tens-of-MB range and are considered less of an issue.
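As an illustration of what 'inserting their own code' means in practice, a user's compiled analysis module is attached to the framework via the job configuration, roughly as sketched below; "MyAnalyzer", its parameter and the file name are hypothetical:

    import FWCore.ParameterSet.Config as cms

    process = cms.Process("ANALYSIS")

    # the input dataset: in reality a long list of files hosted on the SE
    process.source = cms.Source("PoolSource",
        fileNames=cms.untracked.vstring("file:dataset_part_001.root"))
    process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(-1))  # -1 = all events

    # the user's own analyzer, compiled against CMSSW and configured here
    process.myAnalysis = cms.EDAnalyzer("MyAnalyzer",
        muonPtCut=cms.double(20.0))   # hypothetical analysis parameter
    process.p = cms.Path(process.myAnalysis)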

c) User analysis:
The last step in a physics analysis typically involves analysing a small, highly customized dataset, usually in a ROOT (http://root.cern.ch) based data format, a large number of times to extract and verify physics results. While cloud computing could provide a considerable benefit in certain cases (e.g. memory requirements that exceed those of the machines typically provided), the specifics vary too greatly to be studied conclusively within the scope of this project (**).
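As a rough sketch of this step (the file, tree, branch and cut names below are invented), the same small ROOT ntuple is read over and over with varying selections:

    import ROOT

    f = ROOT.TFile.Open("my_ntuple.root")   # small, highly customized dataset
    tree = f.Get("ntuple")

    # scan a selection threshold: the same file is processed many times
    for cut in (20.0, 25.0, 30.0):
        hist = ROOT.TH1F("mass_%g" % cut, "invariant mass", 100, 0.0, 200.0)
        tree.Draw("mass>>%s" % hist.GetName(), "pt > %g" % cut)
        print(cut, hist.GetEntries())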

(*) In the French meaning of the word.
(**) In fact, users usually sulk if they have to use anything but a local batch system.