                   UTFSM/CCTVal Data Center (10 Years of Experience)

                              Yu. P. Ivanov 1,2,a, L. Salinas 1,b

                  1 Universidad Técnica Federico Santa María,
                    Avenida España 1680, Casilla 110-V, Valparaíso, Chile

                  2 Joint Institute for Nuclear Research,
                    Joliot-Curie 6, Dubna, Moscow region, 141980, Russia

                   E-mail: a yuri.ivanov@usm.cl, b lsalinas@inf.utfsm.cl


       During the last 10 years the data center of Universidad Técnica Federico Santa María
(UTFSM, Valparaíso, Chile) has been providing its computational resources to users from UTFSM
and other Chilean universities. Started as the Chilean part of the international project EELA on the
creation of a computational infrastructure distributed between Europe and Latin America, the
cluster was significantly extended after the creation of the Scientific and Technological Center of
Valparaíso (CCTVal) in 2009 and became the UTFSM/CCTVal Data Center. Local users have direct
access to the cluster's computational resources via the batch system. The cluster facilities are also
available in the frame of Grid Computing: Grid users have access to part of the cluster resources
within the EGI (European Grid Infrastructure) and WLCG (Worldwide LHC Computing Grid)
infrastructures. These resources are provided to users from certain "Virtual Organizations" (VO).
The main supported VO is "ATLAS", which joins researchers of the ATLAS experiment at the Large
Hadron Collider (LHC, CERN). The data center was the first computer center in Latin America
working for the ATLAS Collaboration. Another supported VO is "EELA Production", which includes
users from Latin America, Italy, Portugal and Spain. In addition to ordinary processors, the cluster
also allows powerful Graphics Processing Units to be used for calculations. The data center
participates in the big Chilean project NLHPC (National Laboratory for High Performance
Computing); computational facilities provided by this project are integrated into the data center
infrastructure. The range of problems being solved at the cluster is broad, from fundamental and
applied problems in different branches of Physics, Chemistry, Astronomy and Computer Science to
educational and training purposes. The paper presents the history and the current status of the
cluster, including its configuration and some usage statistics.
     Keywords: distributed computing, high performance computing, grid computing




                                                                  © 2016 Yuri P. Ivanov, Luís Salinas




1. Introduction
       In this paper we present the history and the current status of the UTFSM/CCTVal data center,
beginning with just a few words about Universidad Técnica Federico Santa María [UTFSM] and Centro
Científico Tecnológico de Valparaíso [CCTVal].
       UTFSM is a private, not-for-profit university and is one of the 25 traditional universities of the
Chilean Council of Rectors (CRUCH). The beginnings of the University go back to the altruistic
dream of Federico Santa María, who laid the foundation of the Institution in his will and testament
in 1920, stating to his executors his wish to contribute to the progress of Chile and to broaden its
cultural horizon. In the mid-1930s his dream of a world-class engineering university came true.
UTFSM is one of the most academically selective universities in the country. The University
specializes in the following areas: Electronics, Mining, Mechanical Engineering, Metallurgy, Electrical
Engineering, Industrial Engineering, Informatics, Business, Basic Sciences (Physics, Chemistry, and
Mathematics), Aeronautics, Architecture, Construction, and Environmental Sciences. It has four
campuses in Chile and one in Ecuador. The University has PhD and Master's Degree Programs in
around 20 areas, including Physics, Chemistry, Informatics, Electronics Engineering, and Biotechnology.
UTFSM also hosts a number of research centers, including CCTVal.
       The idea of merging expertise in the Particle Physics, Computing, and Electronics research
areas led to the foundation of CCTVal in 2009. The Center was created and acknowledged by the
National Commission for Scientific and Technological Research (CONICYT). CCTVal has several
research groups: Theoretical Elementary Particle Physics, Experimental High Energy Physics,
Informatics and Computing, Power Electronics, and Systems and Signals. One of the main objectives
of CCTVal is to fulfill the commitment to global collaboration, strengthening the links between Chile
and world-renowned laboratories such as CERN, Jefferson Lab, and Fermilab.
       Since the first days of its creation, the UTFSM/CCTVal data center has been actively used by
researchers and students from UTFSM and other Chilean universities. The cluster facilities are also
available in the frame of Grid Computing: the data center is a Tier-2 level site within the European
Grid Infrastructure [EGI] and the Worldwide LHC Computing Grid [WLCG].
       A short history of the cluster development is presented in the second Section. The third Section
describes the current cluster layout for Grid and High Performance Computing (HPC). Conclusions
are presented in the last Section.

2. Short history of the Data Center
       The Data Center started as a small Chilean part of the international project "E-infrastructure
shared between Europe and Latin America" [EELA]. UTFSM joined this project on the initiative
of the Informatics and Physics Departments, which were badly in need of modern computational
facilities. At the end of 2006 UTFSM received a dozen servers in a quite reasonable configuration for
that time (dual 1.6 GHz CPUs, 4 GB RAM and a 140 GB HDD per server). One machine even had
three SAS hard drives with a total capacity of around 1 TB of raw disk space. This equipment allowed
the first computational cluster at UTFSM to be launched.
       The first operating system (OS) used at the cluster in 2006 was Scientific Linux 3 (SL3). A year
later it was replaced by Scientific Linux 4 (SL4). This system was used approximately until 2010,
even though Scientific Linux 5 (SL5) had been available since 2007. Such delays in OS upgrades are
related to the requirement to keep the OS compatible with the systems used in big scientific research
centers like CERN and Fermilab.
       During the first few years the cluster worked mostly for local users. All Grid-related activities
during that period were at the level of simple tests. One should mention here that the Grid in Latin
America is rather specific: there are just a few computational centers working in this area, most
of which are dedicated to only one experiment or project. So, instead of coordinating Grid activities
at the national level via National Grid Infrastructures (NGI), it was necessary to create the "Latin
American Regional Operation Centre" (ROC-LA). This was done only at the end of 2008, in the
frame of the second stage of the EELA project (EELA-2). In March 2009 the UTFSM cluster
successfully passed all Grid tests and was certified as a Tier-3 Grid Resource Center with the site
name "EELA-UTFSM". One of the first supported VOs was "EELA Production", which included
users from Latin America, Italy, Portugal, Spain and some other countries, originally through the
project EELA and, since 2010, through the project "Grid Initiatives for e-Science virtual communities
in Europe and Latin America" [GISELA].
       In 2009 the cluster still had only around 40 CPU cores, while the number of local users alone
had already grown to more than a hundred. The situation became worse when Grid job processing
started: it was clear that the existing computational facilities were not enough. Fortunately, that year
the CCTVal center was founded. Funds provided by the center allowed the computational facilities
and storage capacities to be significantly increased during the following years. In 2010 the cluster
got additional computational servers with 256 CPU cores at 2.8 GHz and a storage server with 20 TB
of disk space; in 2012 it received more servers (around 200 additional 3.1 GHz CPU cores) together
with storage servers with a total capacity of around 180 TB. Computers with only 40 CPU cores can
fit in a small rack, but several hundred cores require not only several racks but also a proper
environment, i.e. air conditioning systems, an adequate electric power supply, etc. All this was done
with CCTVal funding, and, as a result, the UTFSM computer cluster became the UTFSM/CCTVal
Data Center.
       The noticeable increase in computational power brought another problem to the fore: the
bandwidth of the external network connection. In most cases, processing of Grid jobs also requires
data transfer from external Grid storage. When the cluster started processing Grid jobs in 2009, it
had only 10 Mbps of external bandwidth, shared with the Informatics Department, a situation that
was typical of universities and research centers in most Latin American countries. Fortunately, after
2010 the situation with network connectivity started to improve: new optical lines (including
submarine cables) to Europe and North America appeared. Finally, in 2011 UTFSM got a direct
network link from the National University Network provider REUNA [REUNA], and the
UTFSM/CCTVal cluster obtained a total external bandwidth of 54 Mbps (40 Mbps dedicated and
14 Mbps shared). This bandwidth extension allowed the cluster to start processing the full range of
Grid jobs: with the low bandwidth it had been suitable only for jobs without big data transfers, such
as ATLAS Monte Carlo simulations. With the ability to process hundreds of jobs, the cluster began
to meet the requirements of local users, and it also became possible to include the center in ATLAS
data processing. In 2012 the cluster passed the ATLAS certification and was included in ATLAS
Production and Analysis as a Tier-3 Grid Ready site. The UTFSM/CCTVal data center was the first
computer center in Latin America working for the ATLAS Collaboration.
       The number of CPU cores provided by the cluster for Grid computing at that time allowed it
to be considered close to the Tier-2 level, but the technical characteristics of the site were not the
only issue. The most important difference between Tier-3 (i.e. the entry Grid level) and Tier-2 is that
the resources provided by a Tier-2 site are "pledged". These "pledges" are stipulated in a special
agreement between the resource provider and the WLCG. In 2013, the coordinated efforts of the
Latin American research centers led to the creation of the "Tier-2 Latin-American Federation". The
Memorandum of Understanding between Centro Latinoamericano de Física [CLAF] and the WLCG
was signed in September 2013. CLAF is an international organization aimed at promoting and
coordinating efforts in the development of physics in Latin America (Argentina, Brazil, Colombia,
Chile and others); it works as an "umbrella" for all ROC-LA sites. The financing comes from the
hosting institutions, not from central sources. As a result of all those efforts, in November 2013 the
UTFSM/CCTVal cluster was officially presented as an ATLAS Tier-2 site at the ATLAS International
Computing Board (ICB).
       In addition to computations with ordinary CPUs, the cluster hardware allows data processing
with powerful GPU cards. The first servers with nVidia Tesla C1060 cards became available to
cluster users in 2010. The next year a couple of servers with nVidia Tesla M2050 cards were added,
and in 2014 the cluster received five more servers with powerful nVidia Kepler K20m GPU cards in
the frame of the Chilean project "Tsunami".
       In 2014 the cluster was considerably extended in the frame of the big Chilean project "National
Laboratory for High Performance Computing" [NLHPC]. The main computational facilities of this
project are located in the Center for Mathematical Modeling [CMM] of Universidad de Chile,
Santiago, but the other universities participating in the project also obtained some computational
equipment. As a result, the UTFSM/CCTVal data center increased its computing power by around
240 CPU cores at 2.9 GHz and its disk storage by around 60 TB.
       The cluster computational facilities grew with time. Figure 1 shows the amount of CPU time
used each year since 2007 by local users with direct access for High Performance Computing (HPC)
and by remote users via the EGI/WLCG Grid infrastructures.

         [Figure 1: yearly CPU time in 10^6 hours, 2007-2015, for HPC (UTFSM, UFRO, USACH etc.)
         and Grid (WLCG, EELA/GISELA) computing.]

                        Fig. 1. CPU time used yearly for HPC and Grid computing

       In the last two years the whole cluster infrastructure has been substantially updated: the
noticeable extension of the computational facilities required a major upgrade of the air conditioning
system and a reconstruction of the electric power supply. The internal network infrastructure has
also been changed: all storage servers and most of the computational servers now have 10 Gigabit
Ethernet connections.

3. Cluster Layout
       The present-day cluster configuration includes around 800 CPU cores and around 300 TB of
total disk space on the storage servers. Figure 2 presents the current cluster layout. Local users have
direct access to the cluster computational resources via the User Interface (UI) servers; LDAP and
Kerberos services are used for user authentication and authorization. Using the development tools
and packages installed on the UI servers, users can submit jobs to the Portable Batch System (PBS)
server for execution via the batch system. The PBS server (the "Torque" PBS version with the
"Maui" scheduler is used) distributes jobs among the Worker Nodes (WN). All computational nodes
have processors with frequencies from 2.8 GHz to 3.1 GHz and at least 3 GB of RAM per core.
In addition to the ordinary computational facilities provided by CPUs, the cluster allows special jobs
to be processed with powerful GPU cards on the GPU Worker Nodes (nodes gp01..gp09 in Figure 2).
Disk storage for local users is provided by the File Servers (FS), with a total capacity of around
100 TB, via the NFS and GlusterFS [GlusterFS] distributed file systems. Local and Grid users also
have direct access to the programs and libraries provided by CERN via the CernVM File System
[CernVM-FS].
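
       Since CernVM-FS repositories are mounted as ordinary read-only directories under /cvmfs,
both local batch jobs and Grid jobs can reach the CERN-provided software with standard POSIX
calls. The following minimal C sketch only illustrates this access pattern; the repository name
atlas.cern.ch is an illustrative assumption, not a statement about the site configuration.

    /* cvmfs_check.c: a minimal sketch showing that a CernVM-FS repository
     * appears as an ordinary POSIX directory under /cvmfs.
     * The repository name below is an illustrative assumption. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *repo = "/cvmfs/atlas.cern.ch";  /* hypothetical example repository */
        struct stat st;

        if (stat(repo, &st) == 0 && S_ISDIR(st.st_mode)) {
            printf("%s is mounted and readable\n", repo);
            return 0;
        }
        perror(repo);  /* prints the reason if the repository is not available */
        return 1;
    }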



   [Figure 2: block diagram of the cluster. Local Users access the system via UI servers (ui01..ui04).
   Batch Computing: PBS server and Worker Nodes, CPU (wn01..wn34) and GPU (gp01..gp09).
   Grid Services: ARGUS, APEL, CE (ce01, ce02), SE (dCache), Site BDII.
   Storage: FS fs01..fs08 (GlusterFS, NFS) and SE sp01..sp05 (dCache pools).
   Infrastructure Services: DNS, Kerberos, LDAP, Mail, Proxy, SVN, Web, perfSONAR.]

Fig. 2. UTFSM/CCTVal cluster layout. Local users have direct access via the UI servers to the batch system.
Access for remote users is provided via the Grid services. See text for more detailed explanations.

       Remote users from the supported VOs have access to the cluster resources via the EGI/WLCG
Grid infrastructure. Integration into the cluster computational structure is provided by the Grid
services: the Computing Element (CE) servers deliver Grid jobs to the local batch system, and the
Storage Element (SE) server (a dCache system) gives access via the Grid access protocols to the SE
dCache disk pools with a total size of around 200 TB. The "Site BDII" server informs the Grid
infrastructure about the current status of the computational and storage resources of the cluster. The
ARGUS server [ARGUS] is used for the authorization of Grid users. The APEL client [APEL]
collects accounting data on the cluster operation and transfers this information to the EGI/WLCG
servers.
       The data center also has infrastructure-level services such as the Domain Name Server (DNS)
and the Mail, Proxy, and Web servers. Two "perfSONAR" nodes [perfSONAR] are used for
monitoring the quality of the external network connectivity (one for bandwidth and one for latency).
All computational and storage servers run on "bare" hardware, while some services (like APEL,
ARGUS, CE, etc.) use different virtualization techniques (the Kernel Virtual Machine [KVM] and
container virtualization provided by OpenVZ [OpenVZ]).
       Currently most of the cluster servers have Scientific Linux 6 (SL6) installed in 64-bit mode.
Users have access to compilers (C, C++, FORTRAN) and other programming languages (Perl, Python).
The cluster has many specialized software packages and libraries installed (GEANT, OpenFOAM,
Pythia, ROOT, etc.). For parallel programming, the Message Passing Interface (MPI) with the MPICH
or OpenMPI packages can be used. For GPU programming, the nVidia CUDA compiler is available.
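
       As a minimal sketch of how the MPI installation might be used (assuming the usual mpicc
compiler wrapper and mpirun launcher provided by MPICH or OpenMPI), the following C program
prints a greeting from every MPI process. It is an illustration only, not a site-specific recipe; in
practice such a program would be wrapped in a batch job and submitted through the Torque/Maui
batch system described above.

    /* mpi_hello.c: a minimal MPI example in C.
     * Build (assumption): mpicc -o mpi_hello mpi_hello.c
     * Run   (assumption): mpirun -np 4 ./mpi_hello
     * The exact wrapper and launcher names depend on the MPICH/OpenMPI setup. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes  */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down the MPI runtime  */
        return 0;
    }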

4. Conclusions
       The UTFSM/CCTVal Data Center is used by researchers from UTFSM, Universidad de Chile,
Universidad de La Frontera (UFRO), Universidad de Santiago de Chile (USACH) and other Chilean
universities. Being also a Grid Resource Center of the Tier-2 level, the cluster is under permanent
monitoring by different control systems, ranging from local cluster monitoring to external monitoring
by the Grid infrastructures, including systems provided by each VO. This control imposes serious
restrictions on the time the cluster may be unavailable to users. The Tier-2 level requires high values
for such cluster parameters as "Availability" and "Reliability": their values should be not lower
than 95%. The high reliability of the data center enables effective cluster usage in a wide range of
research projects:
   • Computations in high energy and particle physics, including ATLAS analysis
   • Biomedical image processing (e.g., digital pathology for breast cancer)
   • Satellite image processing for environment protection
   • Modeling of mechanical structures (turbulent flow around bridge piers, etc.)
   • Project "Tsunami" (modeling of tsunami hydrodynamics and implementation of an operational
     database integrated with the National Tsunami System)
       Creation and maintenance of this center require serious efforts from the cluster support team.
Note that here the term "maintenance" includes not only the regular replacement and repair of hard
drives and UPS batteries and hardware and software updates, but also proper user support via web
and online tutorials. Over all these years the cluster has been working as a reliable computational
tool. The cluster support team works hard to meet all the needs of the data center users, and all its
members hope to continue this work in the future, reliably satisfying the new requirements arising
from ongoing and upcoming projects.

References
APEL accounting tool [Electronic resource]: https://wiki.egi.eu/wiki/APEL
Argus authorization service [Electronic resource]: https://www.gridpp.ac.uk/wiki/Argus_Server
CernVM File System [Electronic resource]: https://cernvm.cern.ch/portal/filesystem
CCTVal Research Center [Electronic resource]: http://cctval.cl
Centro Latino-Americano de Física [Electronic resource]: http://www.claffisica.org.br
Center for Mathematical Modeling [Electronic resource]: http://www.cmm.uchile.cl
Project EELA [Electronic resource]: http://www.eu-eela.eu
European Grid Infrastructure [Electronic resource]: https://www.egi.eu
The GISELA Project [Electronic resource]: http://www.gisela-grid.eu
GlusterFS storage file system [Electronic resource]: https://www.gluster.org
Kernel-based Virtual Machine [Electronic resource]: http://www.linux-kvm.org/page/Main_Page
National Laboratory for High Performance Computing [Electronic resource]: http://www.nlhpc.cl
OpenVZ container-based virtualization [Electronic resource]: https://openvz.org
perfSONAR test and measurement infrastructure [Electronic resource]: http://www.perfsonar.net
Red Universitaria Nacional (REUNA) [Electronic resource]: http://www.reuna.cl
UTFSM University [Electronic resource]: http://www.usm.cl
Worldwide LHC Computing Grid [Electronic resource]: http://wlcg.web.cern.ch



