         PanDA for COMPASS: processing data via Grid
                          A. Sh. Petrosyana, E. V. Zemlyanichkina
Joint Institute for Nuclear Research, 6 Joliot-Curie, 141980, Dubna, Russia
                                       E-mail: a artem.petrosyan@jinr.ru


      The development of PanDA, the Production and Distributed Analysis system, as a workload management system
for ATLAS started in 2005. Since then the system has grown considerably, and in 2013 the BigPanDA project started,
aiming to extend the scope of the system to non-LHC experiments. One of the experiments to which PanDA is being
applied as a production management system is COMPASS at CERN. The workflow of the experiment has to be
changed to enable the Grid for production and user jobs. A large amount of infrastructure work is being performed
behind the scenes. The PanDA job definition replaces the native batch system job definition, automatic submission
to Condor computing elements replaces console job submission, Grid user certificates identify job submitters
instead of AFS user names, and Grid storage elements substitute for local directories on AFS and EOS. The
production software moves from a private directory of the production account to CVMFS. A virtual organisation
with role management has also been established for the experiment, central monitoring has been enabled, and the
experiment has started to use several computing elements instead of local batch queues. This paper presents how
COMPASS data are processed via the Grid.

     Keywords: COMPASS, PanDA, workload management system, distributed data management, Grid


                                                                    © 2016 Artem Sh. Petrosyan, Elena V. Zemlyanichkina




Introduction
      Historically, data of the COMPASS experiment [Abbon et al., 2007] have been processed locally: jobs are
sent to the CERN cluster (LSF) by a production manager logged in to lxplus.cern.ch with the experiment's
production account. Since 2003 COMPASS has collected 1.5 to 3 PB of data every year, which is already a
substantial volume of data. In the case of production and reproduction, it takes a long time to process these
data on the only computing element still available to the experiment.
      Given the widely distributed physics community of the experiment, using several computing
resources to speed up the processing of the experiment's data is a natural choice. Usually, workload man-
agement systems (WMS) are used to process data on several sites. A WMS has several key features
and, in general, allows many heterogeneous computing sites to be used as one local queue. The common
features of a WMS may be presented in a short list (a toy sketch of the central-queue idea follows the list):
          • Provide a central queue for users, similar to local batch systems;
          • Build a pilot job system;
          • Hide middleware while supporting diversity and evolution;
          • Hide variations of infrastructure;
          • Use the same system for simulation, data processing and user analysis.
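      To make the first feature concrete, the following toy sketch shows a single central queue fanning jobs out to several sites, each with its own slot limit. This is an illustration of the idea only, not PanDA code; all site names and numbers are invented for the example.

```python
import collections
import itertools

# Toy illustration (not PanDA code): one central queue fans jobs out to
# several heterogeneous sites, each with its own slot limit.
Job = collections.namedtuple("Job", "job_id payload")

class CentralQueue:
    def __init__(self, site_slots):
        self.site_slots = dict(site_slots)   # site name -> free slots
        self.pending = collections.deque()

    def submit(self, job):
        """Users see a single queue, regardless of how many sites exist."""
        self.pending.append(job)

    def dispatch(self):
        """Send each pending job to any site that still has a free slot."""
        assignments = []
        sites = itertools.cycle(self.site_slots)
        while self.pending and any(s > 0 for s in self.site_slots.values()):
            site = next(sites)
            if self.site_slots[site] > 0:
                job = self.pending.popleft()
                self.site_slots[site] -= 1
                assignments.append((job.job_id, site))
        return assignments

queue = CentralQueue({"CERN-Condor-1": 2, "CERN-Condor-2": 1})
for i in range(4):
    queue.submit(Job(i, "reconstruct chunk %d" % i))
print(queue.dispatch())  # [(0, 'CERN-Condor-1'), (1, 'CERN-Condor-2'), (2, 'CERN-Condor-1')]
```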
      The other side of the coin is that the installation and configuration of such systems, and the definition
of resources in them, is quite complicated.
      One of the systems widely used in HEP and beyond is PanDA [Maeno et al., 2014], the Production
and Distributed Analysis system. Started in 2005 as a WMS for ATLAS, PanDA has since grown into the
BigPanDA [Klimentov et al., 2015] platform, which is a crucial part of several projects, including sev-
eral HPC utilisation optimisation projects, the Cloud Distributed Operating System and the Large Synoptic
Survey Telescope.
      The difference between the ATLAS Grid and the COMPASS Grid is that for ATLAS, PanDA daily runs
hundreds of thousands of jobs on more than 200 sites, while COMPASS at the moment has access to a single
site. In such a situation, many components of the system have to work under limited conditions, and several
have to be stopped and disabled. To prove that such a complicated system can be configured to run
COMPASS production jobs, the following list of action items was prepared:
          1. PanDA instance installation;
          2. Preparation of production chain management software;
          3. Grid environment setup;
          4. Validation that COMPASS software can work in Grid;
          5. Production jobs execution;
          6. Physics validation of the job execution results.
      The definition of the COMPASS data flow and items 1 and 2 of the list above are described in detail in
[Petrosyan, 2016]. In this article we concentrate on the remaining items.


Grid environment
     PanDA is designed to distribute jobs to many Grid sites. But to enable data processing via the Grid,
several steps had to be performed:
          • A virtual organisation (VO) has to be established;
          • A Grid computing element comes in place of the local batch queue;
          • An EOS storage element replaces the local directory on EOS;
          • COMPASS software has to be installed on CVMFS, or in any other location which can be ac-
      cessed from the Condor computing elements.
     Together, these steps allow COMPASS jobs to run on any site in the Grid; a sketch of the resulting site description follows.
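     The following minimal sketch shows how the pieces listed above could be described for one site. The VO name, endpoints, quotas and paths are illustrative placeholders, not the actual COMPASS configuration.

```python
# Illustrative description of one Grid site for COMPASS; all concrete
# values below are assumptions made for the example.
COMPASS_SITE = {
    "vo": "vo.compass.cern.ch",                  # assumed VO name
    "computing_element": {
        "type": "condor-ce",
        "endpoint": "ce501.cern.ch:9619",        # hypothetical Condor CE
        "max_running_jobs": 150,                 # quota granted per queue
    },
    "storage_element": {
        "type": "eos",
        "endpoint": "root://eoscompass.cern.ch", # hypothetical SE endpoint
        "base_path": "/eos/experiment/compass",
    },
    "software": {
        # production software must be visible from the CE's worker nodes,
        # e.g. via CVMFS or a world-readable AFS area
        "location": "/cvmfs/compass.cern.ch/prod",
    },
}

def software_visible_from_worker_nodes(site):
    """Check the minimal condition a worker node needs to run a job."""
    location = site["software"]["location"]
    return location.startswith("/cvmfs") or location.startswith("/afs")

print(software_visible_from_worker_nodes(COMPASS_SITE))  # True
```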




      Also, to send jobs to the computing elements, the AutoPyFactory (APF) component of PanDA has to
be configured. APF interacts directly with the local batch queue management software on the computing sites
and allows sending as many jobs as a site is ready to accept and handle. All a user has to do is declare
jobs in PanDA; APF then takes care of sending them to the sites.
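      A much-simplified sketch of the pilot-factory idea follows. Both helper functions are stand-ins based on the description above, not real APF or PanDA client calls, and the queue names, endpoints and numbers are invented.

```python
# Much-simplified sketch: keep each site's batch queue fed with pilots
# in proportion to the work waiting in PanDA, without exceeding the
# number of jobs the site is ready to accept.
def count_activated_jobs(panda_queue):
    """Stand-in: ask the PanDA server how many jobs await pilots."""
    return {"CERN-Condor-1": 120, "CERN-Condor-2": 40}.get(panda_queue, 0)

def submit_pilots(ce_endpoint, n):
    """Stand-in: hand n pilot jobs to the site's Condor CE."""
    print("submitting %d pilots to %s" % (n, ce_endpoint))

SITES = {  # PanDA queue -> (CE endpoint, slots the site agreed to give us)
    "CERN-Condor-1": ("ce501.cern.ch:9619", 150),
    "CERN-Condor-2": ("ce502.cern.ch:9619", 150),
}

def factory_cycle():
    for queue, (endpoint, slot_limit) in SITES.items():
        waiting = count_activated_jobs(queue)
        # never submit more pilots than the site is ready to accept
        submit_pilots(endpoint, min(waiting, slot_limit))

factory_cycle()  # in APF this runs periodically, not just once
```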
      These steps were applied to the COMPASS case. A virtual organisation was created on the CERN VOMS
server (https://lcg-voms2.cern.ch:8443/). Several users were registered, including the production account.
Quotas on the CERN Condor CEs were requested and received. With the support of the CERN EOS team, ac-
cess to the experiment's directories on the service was granted to users from the COMPASS VO with the produc-
tion role. The production software was moved to a new location on AFS so that it became visible and acces-
sible from the Condor CEs; after that, the installation had to be validated.


Validation
     In order to confirm that COMPASS software can work in the Grid, the desired version of the software had
to be moved to a location which the Condor CE pseudo-user could access. During testing, permis-
sions were changed in several places to allow read and execute access so that a Grid user could suc-
cessfully run the software.
     Several hundred jobs were executed to make the chain work from start to finish: the software
version, access rights, production software, EOS storage element and the local directories on AFS used to
store logs have to work as one system. Once this was done, we were ready for the next step: the
execution of real production jobs. A sketch of the kind of pre-flight check used here follows.
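     The sketch below illustrates the permission check described above: can an unprivileged Grid user read and traverse the production software installation? The path is a placeholder, not the real installation location.

```python
import os

# Pre-flight check sketch: verify that the production software area is
# readable and traversable by the current (unprivileged) user.
SOFTWARE_DIR = "/afs/cern.ch/compass/prod/software"   # hypothetical path

def check_access(path):
    """Return a list of permission problems found under `path`."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(path):
        if not os.access(dirpath, os.R_OK | os.X_OK):
            problems.append("cannot enter %s" % dirpath)
        for name in filenames:
            if not os.access(os.path.join(dirpath, name), os.R_OK):
                problems.append("cannot read %s" % os.path.join(dirpath, name))
    return problems

if __name__ == "__main__":
    for problem in check_access(SOFTWARE_DIR):
        print(problem)
```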


Real production execution
      To test a real production, a run from 2014 was chosen. It consists of 2804 raw data files, one per job,
and, after processing, each job produces three result files: a histogram, a data summary tree, and an event dump.
Each must be stored on EOS in the directory reserved for files of its type. In addition, each job produces
logs (stderr and stdout), which must be stored separately in a directory of the COMPASS production
account so that they can easily be analysed in case of errors. When the results are ready, they are
usually merged to save space on EOS and later copied to long-term storage on CASTOR.
This part of the work also had to be performed during the test production. Such logic of storing each out-
put file in its own specific directory is unusual for PanDA payload behaviour in other applications and re-
quired changes on the PanDA pilot side; the sketch below illustrates the per-type stage-out rule.
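      The following sketch shows the per-type stage-out mapping: each of the three result files goes to its own EOS directory, while the job logs go to the production account's area. All paths and file names are illustrative, and "mDST" is used here as an assumed short name for the data summary tree.

```python
import os.path

# Per-type stage-out sketch: route each output file of a job to the
# directory reserved for its type.  All paths are placeholders.
EOS_BASE = "/eos/experiment/compass/prod/2014"        # hypothetical
LOG_BASE = "/afs/cern.ch/compass/prod/logs"           # hypothetical

DESTINATIONS = {
    "histogram":  os.path.join(EOS_BASE, "histos"),
    "mDST":       os.path.join(EOS_BASE, "mDST"),     # data summary tree
    "event_dump": os.path.join(EOS_BASE, "evtdump"),
    "log":        LOG_BASE,                           # stderr/stdout
}

def stage_out_plan(job_outputs):
    """Map each (kind, filename) output of a job to its destination."""
    return [(name, DESTINATIONS[kind]) for kind, name in job_outputs]

plan = stage_out_plan([
    ("histogram",  "run2014-chunk042.histos.root"),
    ("mDST",       "run2014-chunk042.mDST.root"),
    ("event_dump", "run2014-chunk042.evtdump.raw"),
    ("log",        "run2014-chunk042.log.tgz"),
])
for filename, destination in plan:
    print(filename, "->", destination)
```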
      Processing started on one queue at a Condor CE at CERN. The queue al-
lowed the experiment to run 150 jobs at the same time. During the processing, a second computing element
became available. Running on two CEs, with up to 300 simultaneously running jobs, all files were pro-
cessed within one week. The average execution time of each job is approximately 6 hours. The average failure
rate was 2%, caused mostly by network instability between elements of the computing infrastructure of
CERN and JINR. One job was not processed correctly due to network failures between the computing element
and the EOS storage. No other problems appeared during the processing.
      Merging was done after all jobs of the run had finished and produced their results. Due to the large
volume of data, merging cannot be executed as a single job; therefore, the size of each job's result must
be tracked, and merge jobs are then sent with several input files each. The merge job results are stored in
their own directory on EOS. A sketch of this size-driven grouping follows.
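      The sketch below illustrates the size-driven grouping: result files are packed greedily into merge jobs so that no single merge job exceeds a size cap. The 10 GB cap and the per-chunk size are invented numbers for the example.

```python
# Size-driven merge planning sketch: group result files into merge jobs
# so that no job's combined input exceeds the cap (an invented value).
MAX_MERGE_INPUT_BYTES = 10 * 1024**3

def plan_merge_jobs(files_with_sizes):
    """Group (filename, size) pairs into merge jobs under the size cap."""
    jobs, current, current_size = [], [], 0
    for name, size in files_with_sizes:
        if current and current_size + size > MAX_MERGE_INPUT_BYTES:
            jobs.append(current)          # flush the full group
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        jobs.append(current)
    return jobs

# Example: 2804 chunks of ~1.5 GB each end up as merge jobs of 6 inputs
chunks = [("chunk%04d.mDST.root" % i, int(1.5 * 1024**3)) for i in range(2804)]
merge_jobs = plan_merge_jobs(chunks)
print(len(merge_jobs), "merge jobs, first has", len(merge_jobs[0]), "inputs")
```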


Physics validation
     Results were validated by the COMPASS physics coordinators. Several minor errors appeared, and the
affected jobs were reproduced. The cause of most of the problems was connection failures between the jobs and
the PanDA server; the obvious way to solve this problem is to move the PanDA server from the JINR cloud
service infrastructure [Baranov, Balashov, Kutovskiy, Semenov, 2016], which is used mostly for development
and testing, to a production-quality infrastructure with better network connectivity, for example at CERN.


Summary
      The results of the performed work show that PanDA can be used to run COMPASS production jobs
and, even more importantly, that COMPASS software works correctly in the Grid environment. The follow-
ing goals were achieved during the previous phases of the project:
          • The Grid environment was prepared for COMPASS, and all elements of the chain work as a coher-
       ent system, making it possible to send jobs to any Grid site willing to participate in data pro-
       cessing;
          • New production management software was built;
          • More than 5000 jobs were executed, including 2800 jobs of a real run of 2014.
      The PanDA server, installed on a single virtual machine in the JINR cloud service, behaves well, shows
impressive productivity and reliability while running thousands of COMPASS production jobs in the Grid
environment, and maintains a high load on the available computing elements.
      Grid processing is now enabled for the COMPASS community. From this point, the experiment
management may connect as many Grid sites as necessary to handle the job flow, and PanDA will
distribute the load among them. A large amount of work still awaits in several areas: a central file catalogue,
distributed data management and careful adaptation of the production software for the Grid.
      At the moment, documentation with step-by-step instructions is being prepared so that the next sub-
missions can be done by COMPASS production managers.
      The next steps of the project imply:
          • Better handling of COMPASS production software exit codes so that jobs are restarted
       automatically in case of problems (see the sketch after this list);
          • Enabling CVMFS for easy turning on of new computing sites;
          • Adding several new computing sites and starting to distribute jobs among them;
          • Development of a more intelligent production management system, which will take care of the
       whole chain of job submission, management and monitoring steps.
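      As a sketch of the planned exit-code handling in the first item, the snippet below decides from the production software's exit code whether a job should be retried automatically. The specific code values and the retry limit are invented for the example.

```python
# Exit-code-based retry sketch: only errors known to be transient (e.g.
# stage-in/stage-out failures) trigger an automatic resubmission; the
# concrete code values and the attempt limit are assumptions.
TRANSIENT_EXIT_CODES = {65, 75}
MAX_ATTEMPTS = 3

def should_retry(exit_code, attempt):
    if exit_code == 0:
        return False                      # success, nothing to do
    if attempt >= MAX_ATTEMPTS:
        return False                      # give up, flag for an operator
    return exit_code in TRANSIENT_EXIT_CODES

print(should_retry(75, attempt=1))  # True: transient error, retry
print(should_retry(1,  attempt=1))  # False: real failure, needs a human
```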


References
Abbon P. et al. The COMPASS experiment at CERN // Nuclear Instruments and Methods in Physics
      Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment. —
      2007. — Vol. 577, Issue 3. — P. 455-518.
Maeno T. et al. Evolution of the ATLAS PanDA workload management system for exascale computa-
      tional science // Journal of Physics: Conference Series. — 2014. — Vol. 513. —
      http://inspirehep.net/record/1302031/.
Klimentov A. et al. Next generation workload management system for big data on heterogeneous dis-
      tributed computing // Journal of Physics: Conference Series. — 2015. — Vol. 608. —
      http://inspirehep.net/record/1372988/.
Petrosyan A. Sh. PanDA for COMPASS at JINR // Physics of Particles and Nuclei Letters. — 2016.
      — Vol. 13, Issue 5. — P. 708-710.
Baranov A.V., Balashov N.A., Kutovskiy N.A., Semenov R.N. JINR cloud infrastructure evolution //
      Physics of Particles and Nuclei Letters. — 2016. — Vol. 13, Issue 5. — P. 672-675.



