=Paper= {{Paper |id=Vol-3041/229-235-paper-42 |storemode=property |title=New Hardware Testing Methodology at IHEP Data Center |pdfUrl=https://ceur-ws.org/Vol-3041/229-235-paper-42.pdf |volume=Vol-3041 |authors=Victoria Ezhova,Viktor Kotliar }} ==New Hardware Testing Methodology at IHEP Data Center== https://ceur-ws.org/Vol-3041/229-235-paper-42.pdf
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021



   NEW HARDWARE TESTING METHODOLOGY AT IHEP
                 DATA CENTER
                                   V. Ezhova a, V. Kotliar
   Institute for High Energy Physics named by A.A. Logunov of National Research Centre
      “Kurchatov Institute”, Nauki Square 1, Protvino, Moscow region, Russia, 142281

                                   E-mail: a victoria.ezhova@ihep.ru

A modern computing center is judged not only by its production capacity but also by how stably it
works. Stability depends not only on the reliability of the software but also on the hardware
components and their assembly. It is important to verify this before a machine enters production.
One way to do so is to test components for performance, reliability and assembly defects. In this
work we present the methodology for identifying problem points, the specific tests used, and some
results of their application at the IHEP data center.
Keywords: data center, disk system, RAID, disk server, software, assembly components.



                                                                          Victoria Ezhova, Viktor Kotliar



                                                             Copyright © 2021 for this paper by its authors.
                    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).







1. Introduction
      IHEP has a data center that consists of computing nodes and data storage systems. Every day
the IHEP central computing cluster runs thousands of calculations for research activities, both for
IHEP's own experiments and for GRID experiments, including data from the ATLAS, ALICE, CMS and LHCb
experiments. The safety of the calculation results is therefore at stake, and each machine must meet
the reliability requirements. Each new server goes through a series of checks that help to identify
problems before it is put into production. During the testing phase it is possible to adjust the
settings and to identify weak points, including incompatible components or physical defects.
      This article describes some methods of checking servers before they are started up, together
with additional monitoring tools.


2. RAM check
        Bad RAM is one of the most frustrating computer problems, as the symptoms are often random
and difficult to identify. Typical causes are:
     • the RAM is not compatible with the motherboard;
     • some memory modules are defective;
     • errors appear only when the memory heats up (long-term operation).
        With complex technical components, the problem is usually not that the memory does not work
at all; rather, it fails only under certain conditions. Unstable operation of the RAM can begin
after it "warms up" or when the load increases. In that case the temperature rises, and the higher
the temperature, the less stable the memory becomes.
         The main task of memory testing programs is to load the RAM as heavily as possible. This
raises the temperature and makes it possible to check the stability of operation in this mode. The
problem is that not all programs are able to load the RAM to the maximum. This is exactly where
MemTest86 succeeds [1].
        Memtest boots from a USB flash drive [fig. 1]. It executes a series of numbered test sections
to check for errors. Each test section is a combination of a test algorithm, a data pattern and a
cache setting. The execution order of these tests was arranged so that errors are detected as
rapidly as possible [2].




                                         Figure 1. Memtest work screen
         The program writes information to each memory block, then reads it back and checks for
errors. During testing the utility makes several passes, which allows it to identify bad memory
blocks and compile a list of them in the BadRAM format. The passes repeat in a loop. Special
attention should be paid to the Errors and Walltime fields. As soon as one pass is completed, the
message "***** Pass complete, (no, 1, 2, 10 ...) errors, press Esc to exit *****" appears at the
bottom of the screen.
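
        The write, read back and compare cycle described above can be illustrated with a minimal
sketch. It is not MemTest86 code: it runs in user space on an ordinary buffer, the four data
patterns are a common textbook choice, and the fault hook and the stuck_bit helper are hypothetical
names that exist only to show how a mismatch would be detected.

```python
from typing import Callable, List, Optional

# Classic data patterns: all zeros, all ones, and two alternating-bit bytes.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def pattern_pass(buf: bytearray,
                 fault: Optional[Callable[[bytearray], None]] = None) -> List[int]:
    """Run one pass over buf and return the sorted offsets of mismatching bytes.

    `fault` is a hypothetical hook that corrupts the buffer between the
    write phase and the read phase; it is only used to demonstrate detection.
    """
    bad = set()
    for pattern in PATTERNS:
        # Write phase: fill the whole block with the current pattern.
        buf[:] = bytes([pattern]) * len(buf)
        if fault is not None:
            fault(buf)
        # Read phase: compare every byte with the value that was just written.
        for offset, value in enumerate(buf):
            if value != pattern:
                bad.add(offset)
    return sorted(bad)


def stuck_bit(buf: bytearray) -> None:
    """Simulate a cell at offset 1234 whose lowest bit is stuck at 1."""
    buf[1234] |= 0x01
```

        Calling pattern_pass(bytearray(4096)) on healthy memory returns an empty list, while
pattern_pass(bytearray(4096), stuck_bit) reports offset 1234. Several patterns are needed because a
stuck bit is invisible whenever the written pattern happens to agree with it.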






3. CPU check
         Here we focus on testing the processor for speed and stability under extreme load. Mprime
is well suited to this purpose. It is a freeware application that is commonly used to check
stability when overclocking computers. Overclocking is the practice of increasing the speed of the
CPU and/or memory to make a machine faster at little cost. Typically, overclocking involves pushing
a machine past its limits and then backing off just a little bit. Mprime works by performing a
large number of mathematical calculations.
       This program is a good stress test for the CPU, memory, L1 and L2 caches, CPU cooling, and
case cooling. The torture test runs continuously, comparing your computer's results to results that are
known to be correct. Any mismatch and you've got a problem [3].
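
        The idea of comparing computed results with known-correct ones can be sketched as follows.
This is only an illustration under stated assumptions: mprime works on very large numbers with
FFT-based arithmetic, whereas the sketch applies the Lucas-Lehmer test to a few small Mersenne
exponents whose primality is well known. The names KNOWN and torture_pass are hypothetical.

```python
from typing import Dict, List


def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime (p must be an odd prime)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


# Known-correct answers for small odd prime exponents.
KNOWN: Dict[int, bool] = {
    3: True, 5: True, 7: True, 11: False, 13: True,
    17: True, 19: True, 23: False, 29: False, 31: True,
}


def torture_pass() -> List[int]:
    """Recompute every known case and return the exponents whose result differs."""
    return [p for p, expected in KNOWN.items() if lucas_lehmer(p) != expected]
```

        On healthy hardware torture_pass() returns an empty list; any exponent in the result means
that a calculation produced a wrong answer, which is what a torture test treats as a failure.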
        There are several options for the torture test (menu option 15):
    • Small FFTs (option 1) to stress the CPU;
    • In-place large FFTs (option 2) to test the CPU and memory controller;
    • Blend (option 3), the default, a hybrid mode which stresses the CPU and RAM.
        Below is a part of the script execution [fig. 2].




                                     Figure 2. The script execution

        The archive contains a script and some information files. At the end of the test, two more
files are created:
       • prime.txt - a configuration file containing information about the GUI's output behaviour
and the connection with the PrimeNet server;
       • results.txt - a history file containing completed work units and related information.
        If errors occur, they are reported both to stdout and to ~/results.txt for later review.
Many people do not consider a system 'stable' unless it can run the Large FFTs for a 24-hour
period [4].
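
        Reviewing results.txt after a long run can be automated with a small filter such as the
sketch below. The marker strings are an assumption about how failures are worded and are not taken
from the paper; they should be adjusted to the messages that the installed mprime version actually
writes. The function name find_failures is hypothetical.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed failure markers (case-sensitive); adjust them to your mprime version.
DEFAULT_MARKERS = ("ERROR", "Hardware failure")


def find_failures(results_file: Path,
                  markers: Sequence[str] = DEFAULT_MARKERS) -> List[str]:
    """Return the lines of results_file that contain at least one marker."""
    if not results_file.exists():
        return []
    failures = []
    for line in results_file.read_text(errors="replace").splitlines():
        if any(marker in line for marker in markers):
            failures.append(line.strip())
    return failures
```

        A call such as find_failures(Path.home() / "results.txt") returns an empty list when no
marked line is present, so the check can be added to a post-test script.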
        This stress-testing software is classified as a high-power load in terms of electrical
power draw. It is widely recognized as one de facto measure of system stability.







4. Test HDD speed
        How can we find out the maximum speed that a hard drive supports? One option is to use the
hdparm or dd command. Hdparm provides a command line interface to various hard disk ioctls
supported by the stock Linux ATA/IDE/SATA device driver subsystem.
      Hdparm can set parameters such as drive caches, sleep mode, power management, acoustic
management, and DMA settings. Here is an example of measuring hard disk data transfer speed [fig. 3].




                                 Figure 3. Hard disk data transfer speed

        The second utility, dd, can also measure write and read speed [fig. 4].




                                     Figure 4. Write and read speed

        The temporary file has been cached in a buffer, so its read speed will appear faster than
the real one. To get the real speed, the cache needs to be cleared first [fig. 5] [5].




                                   Figure 5. Speed without the cache
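
        The effect of caching on such measurements can also be shown with a short sketch. It is
not the dd invocation from the figures: it writes a temporary file from Python and calls fsync
before stopping the clock, so that the result reflects data that has been handed to the device
rather than data still sitting in the page cache. The function name and its parameters are
hypothetical.

```python
import os
import tempfile
import time


def write_speed_mib_s(directory: str, size_mib: int = 64, block_kib: int = 1024) -> float:
    """Write about size_mib MiB to a temporary file in directory and return MiB/s."""
    block = b"\0" * (block_kib * 1024)
    blocks = max(1, (size_mib * 1024) // block_kib)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            # Push the data out of the page cache before stopping the clock.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    written_mib = blocks * block_kib / 1024
    return written_mib / elapsed
```

        The directory should be located on the disk under test, for example
write_speed_mib_s("/mnt/test"), where the mount point is only a placeholder. A read measurement
taken right after the write would mostly hit the cache; on Linux the cache can be dropped as root
by writing 3 to /proc/sys/vm/drop_caches after a sync.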


5. Smartmontools HDD check
        To get all the available SMART information about a storage device, we can launch the
smartctl utility from the smartmontools package with the -a option (short for --all), passing the
path of the device we want to check as the last argument of the command.
        The smartctl utility can be used to launch a variety of self-tests:
     • short
     • long [fig. 6]
     • conveyance (ATA devices only)
     • select (ATA devices only)
        In our case the second type is used. The long test is basically a more thorough version of
the “short” test. It can take a long time to complete: as stated in the smartctl manual, it can
last from tens of minutes to several hours [6].
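
        For scripted use, the two smartctl invocations mentioned above (-t to start a self-test and
-a to print all SMART information) can be wrapped as in the sketch below. The helper names are
hypothetical, the select test is left out because it needs an additional span argument, and running
the commands requires root privileges and an installed smartmontools package.

```python
import subprocess
from typing import List

# Self-test types that take no extra argument.
SELF_TESTS = ("short", "long", "conveyance")


def self_test_cmd(device: str, kind: str = "long") -> List[str]:
    """Build the smartctl command line that starts a self-test on device."""
    if kind not in SELF_TESTS:
        raise ValueError(f"unsupported self-test type: {kind}")
    return ["smartctl", "-t", kind, device]


def report_cmd(device: str) -> List[str]:
    """Build the smartctl command line that prints all SMART information."""
    return ["smartctl", "-a", device]


def run(cmd: List[str]) -> str:
    """Run a command and return its standard output as text."""
    # smartctl encodes status bits in its exit code, so a non-zero code is not raised.
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
```

        A typical sequence would be run(self_test_cmd("/dev/sda")), a wait until the test has
finished, and then run(report_cmd("/dev/sda")) to read the result.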




                                    Figure 6. The long test execution

        The current status of the /dev/sda device can then be checked [fig. 7].




                                        Figure 7. Result of long test

       The command above produces a lot of output. Among other things, we can see the status of
various SMART parameters. Two very important parameters to check are “Reallocated_Sector_Ct” and
“Current_Pending_Sector”. In both cases, if the RAW_VALUE is anything other than 0, we should be
very careful and start backing up the data on the hard drive. Reallocated_Sector_Ct is the count of
sectors on the block device which cannot be used correctly. When such a sector is found, it is
remapped to one of the available spare sectors of the storage device, and the data contained in it
is relocated. The Current_Pending_Sector attribute, instead, is the count of bad sectors that are
still waiting to be remapped [6].
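
        Checking these two attributes can be automated by parsing the attribute table that smartctl
prints. The sketch below assumes the usual ten-column ATA layout, in which the attribute name is
the second column and RAW_VALUE the tenth. The SAMPLE text is illustrative only and is not output
from a real drive; the function names are hypothetical.

```python
from typing import Dict, List

WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Illustrative text in the layout of the smartctl attribute table; not real data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""


def raw_values(report: str) -> Dict[str, int]:
    """Extract RAW_VALUE for the watched attributes from a smartctl report."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns: name is column 2, RAW_VALUE column 10.
        if len(fields) >= 10 and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def suspicious(report: str) -> List[str]:
    """Return the watched attributes whose RAW_VALUE is not zero."""
    return sorted(name for name, raw in raw_values(report).items() if raw != 0)
```

        For the sample text, suspicious(SAMPLE) returns only Current_Pending_Sector, which
corresponds to the rule above that any non-zero raw value calls for caution.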







6. RAID Controller
       Storage servers work with a huge volume of data, so the safety of the information is very
important. For IHEP storage systems the optimal solution was to use RAID 6. After the array is
assembled, a disk failure is simulated while the server is in active operation. The disk is then
replaced, and afterwards the file system is checked.
        Here we discuss the hardware controller. The IHEP data center uses Adaptec and 3ware RAID
controllers. A hardware RAID controller is justified in most cases where high speed and reliability
of the disk subsystem are required.
        First, the array is split into GPT partitions of the same size with the parted utility. The
best solution is to use mkfs.xfs to create the XFS file system; XFS is a high-performance
journaling file system. It is a good idea to enable the write cache, as this speeds up the rebuild
process (at least for RAID 6). The procedure is shown below [fig. 8].




                                  Figure 8. File system integrity check

        The number 20,000 is derived from the size of the disk array, 20 TB [7]. The xfs_repair
command is also used to repair an XFS file system. Running xfs_repair -n reports the problems that
could be fixed without actually fixing them.
        If no errors are found, the test is considered successful.
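
        A file-level check can complement the file system check after a simulated failure and
rebuild. The sketch below is not the procedure from [7]; it is one possible approach, assuming that
test files are written before the disk is pulled and verified after the rebuild. File names, sizes
and function names are hypothetical.

```python
import hashlib
import os
from pathlib import Path
from typing import Dict, List


def write_test_files(root: Path, count: int, size: int) -> Dict[str, str]:
    """Create count files of size random bytes under root; return {name: sha256}."""
    manifest = {}
    for i in range(count):
        data = os.urandom(size)
        name = f"testfile_{i:05d}.bin"
        (root / name).write_bytes(data)
        manifest[name] = hashlib.sha256(data).hexdigest()
    return manifest


def verify(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the names of files that are missing or whose checksum has changed."""
    damaged = []
    for name, digest in manifest.items():
        path = root / name
        if not path.is_file():
            damaged.append(name)
        elif hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            damaged.append(name)
    return damaged
```

        The manifest would be created before the failure is simulated and kept on another machine;
an empty list from verify after the rebuild indicates that the test files survived intact.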


7. Conclusions
         The article presents the methodology for identifying problem points, together with the
software tests and specific tests that are used at IHEP. The results are based on their application
at the IHEP data center. At this stage, components are tested for performance, reliability and
assembly defects. Network connectivity is not yet covered. This year a gigabit channel was connected
to IHEP. The available bandwidth is sufficient for now, but network bandwidth tests are planned for
the future, with the advent of a 10-gigabit channel.






References
[1] «The article about memtest». Available at:
http://reanimator.mixed-spb.ru/memtest/what-is-memtest.php (accessed 30.06.2021)
[2] «Memtest86 technical information». Available at:
https://www.memtest86.com/tech_individual-test-descr.html (accessed 30.06.2021)
[3] «Free Mersenne Prime Search Software» (Prime95 Version 30.3 build 6). Available at:
https://www.mersenne.org/download/ (accessed 30.06.2021)
[4] «Stress testing software». Available at: https://wiki.archlinux.org/title/Stress_testing
(accessed 30.06.2021)
[5] «Test HDD speed in Linux». Available at: https://wdocs.readthedocs.io/hdd_speed_test.html
(accessed 30.06.2021)
[6] «How to check an hard drive health from the command line using smartctl». Available at:
https://linuxconfig.org/how-to-check-an-hard-drive-health-from-the-command-line-using-smartctl
(accessed 30.06.2021)
[7] «The disk server testing» (dp0031 — lustre). Available at:
https://redmine.ihep.su/projects/lpoit/knowledgebase/articles/5 (accessed 30.06.2021)



