Evaluation of the Impact of Various Local Data Caching Configurations on Tier2/Tier3 WLCG Sites

Aleksandr Alekseev6,7,8, Stephane Jezequel9, Andrey Kiryanov1, Alexei Klimentov3, Tatiana Korchuganova6,7,8, Valery Mitsyn4, Danila Oleynik4, Serge Smirnov5, Andrey Zarochentsev2

1 NRC “Kurchatov Institute” – PNPI, Gatchina, Russia
2 Saint Petersburg State University, Saint Petersburg, Russia
3 Brookhaven National Laboratory, Upton, NY, USA
4 Joint Institute for Nuclear Research, Dubna, Russia
5 National Research Nuclear University MEPhI, Moscow, Russia
6 Institute for System Programming RAS, Moscow, Russia
7 University Andres Bello, Santiago, Chile
8 Plekhanov Russian University of Economics, Moscow, Russia
9 Laboratoire d’Annecy de Physique des Particules, Annecy, France

kiryanov@cern.ch

Abstract. In this paper, we describe the test implementation of various data caching scenarios and the lessons learned. In particular, we show how local data caches can be configured, deployed, and tested. In our studies, we use xCache, which is a special type of XRootD server set up to cache input data for physics analysis. A relatively large Tier2 storage is used as the primary data source, together with several geographically distributed smaller WLCG sites configured specifically for this test. All sites are connected to the LHCONE network. The testbed configuration is evaluated using both synthetic tests and real ATLAS computational jobs submitted via the HammerCloud toolkit. The impact and realistic applicability of different local cache configurations are explained, covering both the network infrastructure and the configuration of computing nodes.

Keywords: Federated Storage, xCache, WLCG, DOMA.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

HENP experiments are preparing for the HL-LHC era, which will bring an unprecedented volume of scientific data. This data will need to be stored and processed by the collaborations, but the expected resource growth is nowhere near the extrapolated requirements of the existing models, both in storage volume and in compute power. It is well understood that computing models need to evolve. Such evolution includes multiple aspects:

• Optimized data processing: squeezing the maximum out of the available CPU/GPGPU/FPGA resources.
• Optimized data storage: reduction of the number of copies, different data access methods, full utilization of network resources.
• Cost optimization: no high-end expensive RAID setups, no underutilized CPUs on storage servers, no HDDs with 90% free space on the worker nodes.
• Deployment optimization: scalability and containerization with on-demand expansion into the cloud (both community and commercial).
• Operational cost optimization: more standardized solutions, lower requirements on unique Grid expertise.

2 Ongoing R&D Projects

WLCG and the experiments have launched multiple R&D projects to address the HL-LHC challenges:

• Data Lake. The aim is to consolidate geographically distributed data storage systems connected by a fast network with low latency. The Data Lake model is an evolution of the current infrastructure that reduces storage and operational costs.
• Intelligent Data Delivery Service (iDDS). The intelligent data delivery system will deliver events as opposed to delivering bytes.
This allows an edge service to prepare data for production consumption, lets the on-disk data format evolve independently of applications, and decreases the latency between the application and the storage. The first implementation, in April-May 2020, targeted Data Carousel and active ML workflows.
• Hot/Cold storage. Data placement and data migration between “hot” and “cold” storage using data popularity information.
• Data format and I/O. Evaluating new formats (e.g. Parquet) and I/O performance for HENP data.
• Third Party Copy. Improve bulk data transfers between sites and find a viable replacement for the GridFTP protocol.
• Operations Intelligence. Reduce the computing operations effort of the HEP experiments by exploiting anomaly detection, time series and classification techniques to help the operators in their daily routines and to improve the overall system efficiency and resource utilization.
• Data Carousel. Use tape more effectively and actively in a distributed computing context.

3 Objectives of this work

This research is conducted in collaboration with the European Data Lake project, which is part of the WLCG DOMA initiative [1]. We will show a few possible ways of optimizing remote data access from the worker nodes in relatively small T2/T3 setups or in dynamically scaled containerized deployments for physics analysis payloads. This kind of deployment implies heavy site-remote, read-biased data I/O, and the time slot (t) allocated for an analysis job is normally split into three phases (disregarding some overhead): input read (t1), compute (t2) and output write (t3). Sometimes analysis payloads can read and write data while performing computation, which makes it hard to separate t1 from t2 and t2 from t3, but in any case at least some data needs to be preloaded before computation can start. Here we focus on optimizing t1 and thus improving the CPU utilization of a compute resource.

In order to optimize the read time (t1) in cases where hardware and network performance cannot be easily improved, various caching systems are commonly used. However, any kind of caching is only effective with a sufficient cache hit ratio. The very first thing we need to check is the actual repeatability of read requests in a standard physics workflow. Let us try to evaluate the typical number of read requests (K) to a single file of ATLAS experiment data suitable for user analysis. Figure 1 shows the ATLAS derivation data sample popularity (number of uses) in users’ analysis tasks [2].

Fig. 1. Derivation dataset usage in ATLAS physics analysis.

There is at least one dataset that was accessed 1170 times. On average, ATLAS DAOD datasets consist of 50 files, which means that each file in this dataset was accessed at least 20 times if the data popularity is evenly split between them. We take K = 20 as the basis of our tests. In the end, we have to build a distributed data processing system where the computing element (CE) is distinct and distant from the primary storage system; computing tasks are submitted by users, and these tasks can eventually request access to the same input file up to 20 times. The input file is located on a primary storage and its size varies from 1 to 5 GB (an overview of ATLAS data files is given in the HEPiX talk [3]). At the same time, the infrastructure and building blocks of the CEs can vary significantly between the sites.
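To make the expected effect of caching concrete before describing the testbed, the following back-of-the-envelope sketch (Python) estimates the mean input-read time t1 when a file is requested K = 20 times and only the first access has to go to the primary storage. The timings are assumptions chosen to roughly match the PNPI measurements reported in Section 5, so the sketch is an illustration rather than a result.

```python
# Back-of-the-envelope model of the mean input-read time t1 per job when the
# same input file is requested k times and only the first request misses the
# cache. The timings below are illustrative assumptions, chosen to roughly
# match the PNPI numbers reported in Section 5; they are not measurements.

def mean_t1(t1_miss_s: float, t1_hit_s: float, k: int) -> float:
    """Average t1 over k accesses: one cold (miss) read plus k-1 warm (hit) reads."""
    return (t1_miss_s + (k - 1) * t1_hit_s) / k

K = 20             # typical number of accesses per DAOD file (see above)
t1_direct = 800.0  # s, assumed read time straight from the primary storage
t1_cached = 50.0   # s, assumed read time from a warm cache

print(f"mean t1 with cache:    {mean_t1(t1_direct, t1_cached, K):.0f} s")  # ~88 s
print(f"mean t1 without cache: {t1_direct:.0f} s")                         # 800 s
```

Even with a single cold read amortized over 20 accesses, the average t1 stays within a factor of two of the pure cache-hit time, which is why a sufficient hit ratio is the key prerequisite discussed above.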
4 The testbed

For the scheme outlined above, it is necessary to fix some specific details, such as the data access protocol and the caching system. In our tests we use the XRootD protocol, which is widely employed by the LHC experiments and has the important property of supporting redirects. The latter feature matters when building distributed storage systems, including distributed caching systems. As the caching software we use xCache, which is essentially a standard XRootD server configured in a special way; a minimal configuration sketch is given at the end of this section.

We decided to exercise the three data caching schemes shown in Fig. 2. There is no universal solution due to the hardware (especially network) differences between sites. With these schemes we tested three fairly obvious scenarios:

1. A single dedicated cache server for sites with modest external connectivity (~1 Gbps) and a relatively good internal network for the worker nodes (>=10 Gbps).
2. A local isolated cache on each worker node for sites with good external connectivity (>=10 Gbps) but a modest internal network for the worker nodes (~1 Gbps).
3. A cache shared between worker nodes for sites whose external and internal networks are of the same, relatively high quality (>=10 Gbps); this approach requires some sort of service discovery.

Fig. 2. Three data caching schemes.

At the first stage of testing, scenarios 1 and 2 were implemented using the resources of JINR, PNPI and MEPhI (Fig. 3). JINR was used as the primary storage with a 10 Gbps uplink, while also providing a local CE with a 1 Gbps internal network. This CE was used as a reference and no caching system was deployed there. Tests with the JINR CE were only carried out at the very beginning; later, such tests lost their value. The PNPI CE, located 520 km (~11 ms latency) from JINR, has a 10 Gbps internal network and a 10 Gbps uplink to the primary storage. The MEPhI CE, located 120 km (~1 ms latency) from JINR, has a 1 Gbps internal network and a 10 Gbps uplink to the primary storage.

Fig. 3. Testbed on the map.
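For reference, the sketch below shows what the configuration of such a dedicated xCache instance can look like, following the XRootD 4.x proxy file cache documentation. The host name, export path, cache location and sizes are placeholders rather than the actual settings used on the testbed.

```
# Sketch of a dedicated xCache (XRootD proxy file cache) configuration.
# Host names, paths and sizes are placeholders, not the actual testbed values.

all.export /atlas
xrd.port 1094

# Run this server as a caching proxy in front of the primary storage
ofs.osslib   libXrdPss.so
pss.cachelib libXrdFileCache.so
pss.origin   primary-storage.example.org:1094

# Where and how cached data is kept on the cache node
oss.localroot /xcache/data
pfc.ram       8g                 # RAM used for block buffering
pfc.blocksize 512k               # cache block size
pfc.diskusage 0.90 0.95          # purge between the low and high watermarks
```

Worker nodes then read their input through the cache host (e.g. root://cache.example.org//atlas/<file>) instead of contacting the primary storage directly; on a cache miss the proxy fetches the data from the origin given by pss.origin and keeps a copy on its local disk for subsequent requests.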
5 Tests and results

In order to obtain useful performance metrics, we needed appropriate tests. The authors already had experience in testing distributed storage systems with both synthetic tests [4] and the HammerCloud toolkit used by the ATLAS experiment [5]; both approaches were used for testing the EOS-based distributed storage [6].

• As the synthetic test, a simple file copy with the xrdcp tool was used.
• As the HammerCloud payload, a real-life Athena analysis task was submitted to the CEs.

The first tests, reported at the HEPiX Workshop [3], were conducted at the PNPI and JINR sites, using only the dedicated xCache. Figures 4, 5 and 6 show the results of HammerCloud tests with and without xCache (JINR is the primary storage, so xCache was not used there). As can be seen from the graphs:

• Wallclock time at PNPI (t):
  o Direct mean time = 2698 ± 577 s
  o xCache mean time = 1934 ± 139 s
  o Difference ~770 s, ~30%
• Download input files time at PNPI (t1):
  o Direct mean time = 811 ± 574 s
  o xCache mean time = 53 ± 137 s
  o Difference ~770 s, ~95%
• Download input files time at JINR (t1):
  o Direct mean time = 117 ± 17 s

These results give an idea of the fundamental benefits of using xCache.

Fig. 4. Wallclock time at PNPI using HammerCloud test N20146370 from Template 1099 (copy2scratch). Direct read on the left, dedicated xCache on the right.

Fig. 5. Download input files time at PNPI using HammerCloud test N20146370 from Template 1099 (copy2scratch). Direct read on the left, dedicated xCache on the right.

We have made several improvements to our testbed configuration and software over the last five months:

• A new site was added: MEPhI (Moscow).
• A new Torque batch system with task affinity patches was installed.
• A new ARC CE was deployed at MEPhI.
• New monitoring (ELK) was designed and implemented.
• New node-local tests were added.
• The network backbone at MEPhI was improved.

Fig. 6. Download input files time at JINR using HammerCloud test N20146370 from Template 1099 (copy2scratch). Direct read only.

The following synthetic tests were carried out taking into account the average number of requests to a single file (20) and the fact that a task can land on a random worker node, which is important in the case of a local cache (case 2 in Fig. 2). Tests were carried out in batches, since the load and available bandwidth of the external network are variable and it was necessary to compare different caching schemes under the same external network conditions. A sketch of one such test batch is given below.
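The minimal sketch below (Python) shows how one such batch can be driven; it simply wraps xrdcp, as in the synthetic tests described above. The cache endpoints and the file path are placeholders, and only the node-local cache scheme (case 2 in Fig. 2) is illustrated.

```python
#!/usr/bin/env python3
# Sketch of one synthetic test batch: copy the same remote file K times,
# each time through the cache endpoint of a randomly chosen worker node
# (node-local cache scheme, case 2 in Fig. 2). Endpoints and the file path
# are placeholders, not the real testbed values.

import random
import subprocess
import time

K = 20  # average number of requests to a single file (see Sec. 3)
CACHE_ENDPOINTS = [
    "root://wn01.example.org:1094",
    "root://wn02.example.org:1094",
]
REMOTE_FILE = "/atlas/daod/sample.root"  # placeholder DAOD file (~1-5 GB)

for i in range(K):
    endpoint = random.choice(CACHE_ENDPOINTS)  # task lands on a random node
    start = time.monotonic()
    subprocess.run(
        ["xrdcp", "-f", endpoint + "/" + REMOTE_FILE, "/tmp/input.root"],
        check=True,
    )
    print(f"read {i + 1}/{K} via {endpoint}: {time.monotonic() - start:.1f} s")
```

For the dedicated-cache scheme the list of endpoints collapses to a single cache server, while for direct access the copies go straight to the primary storage endpoint.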
Figures 7, 8 and 9 show the results of these tests. PNPI was tested with both the dedicated and the local cache, while MEPhI only with the dedicated cache because of the shortage of local disk resources on its worker nodes. The results clearly show the benefit of using a dedicated cache at both sites, which is somewhat unexpected for MEPhI, since the local network there is worse than the external one and no improvement from using the cache was expected. At the same time, we see only minimal benefit from using the local cache, within the margin of error.

HammerCloud tests were carried out in two scenarios only, direct read and dedicated cache, as there were technical problems registering a site with a local cache in the ATLAS information system (AGIS). The tests themselves have also changed since 2019; in particular, the template for test jobs was changed from HITS (digitization and reconstruction) to Derivation (AOD and DAOD), which is more I/O-intensive and has a larger input file size per event than HITS. Figures 10 and 11 show the results of comparative tests using the HammerCloud copy2scratch template (the input file is entirely downloaded to the worker node before execution) for PNPI and MEPhI, respectively. It can be seen that in all cases the gain from using the cache is obvious, which is expected, since in these tests there was no limit on the number of reads of a single file. Also, the gain in the download input files time accurately matches the overall gain in the total task execution time.

Fig. 7. Synthetic tests at PNPI. Direct access on the left, local cache on the right.

Fig. 8. Synthetic tests at PNPI. Direct access on the left, dedicated cache on the right.

Fig. 9. Synthetic tests at MEPhI. Direct access on the left, dedicated cache on the right.

Fig. 10. HammerCloud tests (copy2scratch) at PNPI. Top row: download input files time (t1) with direct read (left) and dedicated cache (right). Bottom row: total time (t) with direct read (left) and dedicated cache (right).

Fig. 11. HammerCloud tests (copy2scratch) at MEPhI. Top row: download input files time (t1) with direct read (left) and dedicated cache (right). Bottom row: total time (t) with direct read (left) and dedicated cache (right).

6 Conclusions and future work

We have successfully completed the pilot project phase (proof of concept, PoC). During the PoC, we configured and tested two types of xCache setup: a dedicated cache and a local cache. We have shown the performance benefits of using xCache at smaller sites with synthetic and real-life ATLAS analysis workloads.

Together with the WLCG community, we need to address the Data Lake challenge in a global context. The DOMA ACCESS initiative is the first step in this direction. We will work closely with DOMA and ATLAS to define the next steps; in particular, we are interested in testing our setup for other HL-LHC R&D activities, such as Data Carousel, QoS and hot/cold storage.

As a result of this work, we have observed a clear benefit of a dedicated cache even for a limited number of requests to a single file, while for the local cache the benefit is highly doubtful. A dedicated cache, on the other hand, implies additional operational and hardware costs that might not be justified by the expected performance benefits. The idea of a distributed cache on the local nodes (case 3 in Fig. 2), which the authors consider very promising, still needs to be understood. Our near-term plans include the implementation and further evaluation of this idea.

7 Acknowledgements

This is a collaborative effort within ATLAS and WLCG (Operations, Distributed Computing, software developers) and the participating sites (JINR, MEPhI, SPbSU and PNPI). We thank them all. We thank Simone Campana and Xavier Espinal for discussions and for their contribution to the test methodology.

In Russia, this project is supported by the Russian Science Foundation, project No. 19-71-30008 (the research is conducted at the Plekhanov Russian University of Economics).

References

1. Data Organization, Management and Access (DOMA), https://iris-hep.org/doma.html, last accessed 2020/06/25.
2. ATLAS HEP-Google R&D project, Technical Interchange Meeting, https://indico.cern.ch/event/921179/contributions/3870250/subcontributions/307490/attachments/2042093/3420405/RnD_HEPGCP.pdf, last accessed 2020/06/25.
3. A. Kiryanov, A. Klimentov, A. Zarochentsev et al.: ATLAS Data Carousel Project. HEPiX Autumn Workshop, 14-19 Oct. 2019, Amsterdam, Netherlands.
4. A. Kiryanov, A. Klimentov, D. Krasnopevtsev, E. Ryabinkin, A. Zarochentsev: Federated data storage system prototype for LHC experiments and data intensive science. CEUR Workshop Proceedings, vol. 1787, pp. 40-47.
5. J. Schovancova, A. Di Girolamo, A. Fkiaras, V. Mancinelli: Evolution of HammerCloud to commission CERN Compute resources. To appear in Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics, Sofia, 2018.
6. X. Espinal, A. Kiryanov, A. Klimentov, J. Schovancova, A. Zarochentsev: Federated data storage evolution in HENP: data lakes and beyond. ACAT, 10-15 Mar. 2019, Saas Fee, Switzerland.