=Paper=
{{Paper
|id=Vol-3041/236-240-paper-43
|storemode=property
|title=Research of Improving the Performance of Explicit Numerical Methods on the X86 and ARM CPU
|pdfUrl=https://ceur-ws.org/Vol-3041/236-240-paper-43.pdf
|volume=Vol-3041
|authors=Vladislav Furgailo,Egor Elchinov,Nikolay Khohlov
}}
==Research of Improving the Performance of Explicit Numerical Methods on the X86 and ARM CPU==
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

V. Furgailo<sup>a</sup>, E. Elchinov, N. Khohlov

Moscow Institute of Physics and Technology (MIPT), Russia

E-mail: <sup>a</sup> furgailo@phystech.edu

This paper continues our research on improving the computing performance of explicit numerical methods on the CPU. We consider the capabilities of two common computing architectures, x86 and ARM, for data-level optimizations such as vectorization and tiling. Other aspects of high-performance optimization of explicit numerical methods are also explored: metaprogramming, code generation, and OpenMP. The novelty of this research is the optimization of FDTD computations for the ARM architecture and an assessment of how effective the ARM architecture is for this class of scientific problems. The paper describes a number of optimization algorithms and reports test calculations on both architectures. The results of the research and directions for further work are also presented.

Keywords: explicit numerical methods, FDTD, tiling, x86, ARM

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1. Introduction===
Explicit numerical methods are used to solve and simulate a wide range of mathematical problems that can originate from mathematical models of physical conditions.
However, simulations with large model spaces can require a tremendous number of floating-point operations, and run times of several months or more are possible even on large HPC systems. The vast majority of HPC systems in the field today are powered by x86 and ARM CPUs [1]. Our aim is to investigate methods of increasing computational speed for simulation on CPUs and to compare the performance and energy efficiency of x86 and ARM CPUs. In this work we use a high-order finite-difference time-domain (FDTD) method to solve the 3D acoustic equation. In conjunction with parallel computing, we use CPU capabilities such as SIMD computing (AVX on x86 and NEON on ARM) [2] and the hierarchical memory structure of the CPU caches to optimize data locality. To improve data locality we change the order in which the iteration space is traversed, a technique known as loop tiling [3]. We consider a number of tiling algorithms and report test calculations for the x86 and ARM architectures. In particular, we consider recursive and non-recursive cube tiling [4] and the ZCube data locality optimization.

===2. Mathematical Problem===
The explicit numerical method considered in this paper is the solution of the acoustic equation in three-dimensional space with a free-boundary condition by the central finite-difference time-domain (FDTD) method with fourth-order accuracy. The corresponding stencils are shown in Fig. 1.

Figure 1. Stencils for the 1D, 2D, and 3D wave equation.

===3. Optimization algorithms===
====3.1 Tiling====
To increase computational performance, the iteration space is partitioned into cube tiles, and the partitioning is combined with vectorization. The algorithm performs best on tiles whose size matches the first-level (L1) CPU cache, as shown in Fig. 2 and Fig. 3.

Figure 2.
Graph of the number of points computed per second for a fixed tile size and various numbers of threads. Grid size varies. Computed on ARM Cortex-A53.

Figure 3. Graph of the number of points computed per second for a fixed tile size and various numbers of threads. Grid size varies. Computed on x86 Intel(R) Xeon(R) CPU E5-2620 v2.

====3.2 ZCube====
To increase spatial locality, a new data storage layout was used: the grid cells are grouped into small cubes that form a new type of cell, each storing its original cells in Z-order, as shown in Fig. 4.

Combining ZCube with recursive tiling (nested tiles) gave the following results. As shown in Fig. 5, the low performance of NEON computing is caused by an overflow of the Cortex-A53 data cache [5]. Consequently, ZCube tiling improves the utilization of NEON data registers and instructions. On x86, as shown in Fig. 6, the combination of AVX vectorization and ZCube recursive tiling provides up to a 2-fold speedup compared to the naive implementation.

Figure 4. Z-cube 2×2×2, with stencil representation.

Figure 5. ZCube efficiency for variable grid size on ARM.

Figure 6. ZCube efficiency for variable grid size on x86.

===4. Conclusion===
In this paper, we implemented a new data locality algorithm that increases the performance of multi-threaded explicit stencil computation. Vectorization over the outer iteration space and ZCube recursive tiling were applied to achieve data locality and to speed up multithreaded computing. However, non-recursive tiling remains a more effective data locality algorithm for the FDTD problem. Also, as shown in Fig.
7, computing explicit numerical methods on the ARM architecture with non-recursive tiling is 12 times more energy efficient than on x86 in terms of peak power consumption. In this respect, extending our experiments to ARM cluster computing and further increasing the performance of non-recursive and recursive tiling would be of interest.

Figure 7. Performance/power efficiency of non-recursive vectorized tiling on x86 and ARM.

===5. Acknowledgement===
This work was performed with the financial support of the Russian Science Foundation (project No. 21-11-00139).

===References===
[1] http://www.top500.org/

[2] S. M. et al., "Vector instructions to enable efficient synchronization and parallel reduction operations," U.S. Patent WO2009120981A2, Oct. 2009.

[3] J. Xue, "On tiling as a loop transformation," Parallel Processing Letters, vol. 7, no. 4, pp. 409–424, 1997.

[4] V. Furgailo, A. Ivanov, and N. Khokhlov, "Research of techniques to improve the performance of explicit numerical methods on the CPU," pp. 79–85, Sep. 2019.

[5] J. Bakos, Embedded Systems: ARM Programming and Optimization. Elsevier Science, 2015.
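
As a minimal illustration of the ideas in Sections 3.1 and 3.2, the following Python sketch applies a fourth-order central-difference Laplacian (the spatial part of the acoustic stencil) in a plain loop order and in a cube-tiled loop order, and computes a Z-order index inside a small cube. It is not the implementation used in the paper, which is vectorized with AVX/NEON and is not listed in the text. The function names (`make_grid`, `laplacian`, `sweep_naive`, `sweep_tiled`, `morton3`), the unit grid spacing, the flat row-major storage, the test field, and the bit order chosen in `morton3` are all assumptions made for this example.

```python
# Standard fourth-order central-difference weights for the second
# derivative at offsets -2..2 (unit grid spacing is assumed).
C = (-1.0 / 12.0, 4.0 / 3.0, -5.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0)
HALO = 2  # stencil radius: two neighbours on each side along every axis


def make_grid(n):
    """Hypothetical test field on an n*n*n grid, flat row-major (k fastest)."""
    return [float((i * i + 2 * j + 3 * k) % 7)
            for i in range(n) for j in range(n) for k in range(n)]


def laplacian(u, n, i, j, k):
    """Fourth-order Laplacian of u at (i, j, k), summed over the three axes."""
    s = 0.0
    for off, w in zip(range(-2, 3), C):
        s += w * (u[((i + off) * n + j) * n + k]
                  + u[(i * n + j + off) * n + k]
                  + u[(i * n + j) * n + k + off])
    return s


def sweep_naive(u, n):
    """Visit interior points in plain i, j, k order."""
    out = [0.0] * len(u)
    interior = range(HALO, n - HALO)
    for i in interior:
        for j in interior:
            for k in interior:
                out[(i * n + j) * n + k] = laplacian(u, n, i, j, k)
    return out


def sweep_tiled(u, n, tile):
    """Visit the same interior points, one tile*tile*tile cube at a time."""
    out = [0.0] * len(u)
    lo, hi = HALO, n - HALO
    for ti in range(lo, hi, tile):
        for tj in range(lo, hi, tile):
            for tk in range(lo, hi, tile):
                # min() clips the last tile when the interior size is not
                # a multiple of the tile size.
                for i in range(ti, min(ti + tile, hi)):
                    for j in range(tj, min(tj + tile, hi)):
                        for k in range(tk, min(tk + tile, hi)):
                            out[(i * n + j) * n + k] = laplacian(u, n, i, j, k)
    return out


def morton3(x, y, z, bits=1):
    """Z-order index: interleave the low `bits` bits of x, y, z.

    Assumed bit order: x is the least significant bit of each triple.
    With bits=1 this indexes the 8 cells of a 2x2x2 cube.
    """
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code
```

Calling `sweep_tiled(u, n, tile)` and `sweep_naive(u, n)` on the same grid yields identical output, because tiling only changes the traversal order; any performance benefit comes from cache reuse in a compiled, vectorized implementation and is not observable in this sketch. In a ZCube-style layout, a cell's position inside its small cube could be given by `morton3` of its local coordinates.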