Experimental Considerations Towards Effective Memory Bandwidth Evaluation on Large-Scale ccNUMA Systems

Pavel Drobintsev, Vsevolod Kotlyarov, Aleksei Levchenko, and Evgeniy Petukhov

Peter the Great St. Petersburg Polytechnic University, Saint Petersburg, Russia
vpk@spbstu.ru

Abstract. In order to predict the performance of a wide range of scientific applications on current high-end ccNUMA architectures, this paper introduces benchmark-related modeling considerations for memory bandwidth and hybrid MPI/OpenMP performance. We use HPCG, a state-of-the-art benchmark, to create a workload representative of a multitude of computational and communication tasks. We ran our model validation experiments on a real ccNUMA machine with 12Tb of RAM in single operating system image mode to define the boundaries of the problem size and to demonstrate improved indicators for the target architecture as compared with the fundamental model. Our model will make it possible to reliably evaluate the performance of contemporary and future ccNUMA systems with more than 20Tb of RAM and to compare their experimental results with those of other problem-oriented architectures worldwide.

Keywords: benchmarking · ccNUMA · HPCG · memory bandwidth · NUMA effects

1 Introduction

Current Cache-Coherent Non-Uniform Memory Access (ccNUMA) systems are able to provide a larger amount of random access memory per node under a single operating system image than is accessible on a typical cluster. The asymmetric nature of ccNUMA gives rise to a number of potentially overwhelming NUMA effects, such as memory hot-spotting, the substantial penalty of incorrect NUMA assignment, a varying and complex multilevel latency structure, and a mismatch between data access models and the actual distribution of data in memory [9, 4, 22]. These factors have a multidirectional impact on memory bandwidth, which remains a major system challenge for memory-bound scientific applications. Deducing the memory bandwidth available to a specific computing procedure from the theoretical peak is a sophisticated problem [14]. Hence, a purely hypothetical prediction of ccNUMA memory bandwidth is unconvincing.

Our ultimate goal is to measure reliably the performance of current and future ccNUMA systems. In this work, we present only preliminary considerations for the experimental benchmarking, modeling and prediction of ccNUMA memory bandwidth. The High Performance Conjugate Gradients (HPCG) Benchmark was used to create a workload with a low ratio of computation to data access that is representative of the major communication and computational patterns [6]. Extending the existing HPCG performance model, we predict the effective memory bandwidth of a real system with globally addressable memory, a so-called jumbonode, equipped with 12Tb of RAM and loaded as a single operating system image. We compare the obtained results with other problem-oriented architectures worldwide and predict the effective memory bandwidth of future ccNUMA machines. We also demonstrate valuable technical ccNUMA-related aspects of launching hybrid HPCG.

The remaining sections of this paper are organized as follows. In Section 2 we mention the most important previous works, including the reference model. In Section 3 we describe the factors that, in our view, extend the existing general-purpose model to the ccNUMA architecture. The model validation and experimental results are discussed in Section 4. Finally, we summarize our conclusions in Section 5, where we also consider aspects of the future development of the model.
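As a concrete, hedged illustration of the incorrect NUMA assignment penalty discussed above, the fragment below relies on Linux's default first-touch page placement; it is our own minimal OpenMP sketch, not part of HPCG or of our measurement setup. One pair of arrays is initialized serially and another in parallel, and the same bandwidth-bound sweep is then timed over both; on a ccNUMA machine the serially initialized pair is concentrated on a single NUMA node and typically sustains a noticeably lower bandwidth.

```c
/* Minimal sketch (assumed build: gcc -O2 -fopenmp first_touch.c): under
 * Linux's default first-touch policy, the thread that first writes a page
 * decides which NUMA node the page lands on.  Serial initialization puts
 * all pages on one node (memory hot-spotting); parallel initialization
 * with the same static schedule as the sweep keeps accesses local.      */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 27)                       /* 1 GiB per double array */

static double sweep(const double *restrict b, double *restrict a)
{
    double t = omp_get_wtime();
#pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.5 * b[i];                 /* bandwidth-bound scale kernel */
    return omp_get_wtime() - t;
}

int main(void)
{
    double *a1 = malloc(N * sizeof *a1), *b1 = malloc(N * sizeof *b1);
    double *a2 = malloc(N * sizeof *a2), *b2 = malloc(N * sizeof *b2);

    /* Serial first touch: all pages of a1/b1 end up on the master's node. */
    for (long i = 0; i < N; i++) { a1[i] = 0.0; b1[i] = 1.0; }

    /* Parallel first touch: pages of a2/b2 follow the threads that will
     * access them in sweep(), since the static schedule is identical.    */
#pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a2[i] = 0.0; b2[i] = 1.0; }

    printf("serial   first touch: %.3f s\n", sweep(b1, a1));
    printf("parallel first touch: %.3f s\n", sweep(b2, a2));

    free(a1); free(b1); free(a2); free(b2);
    return 0;
}
```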
2 Background and Related Work

We review previous work on NUMA- and HPCG-related aspects, which helps us to take into account the additional challenges posed by the ccNUMA architecture, namely (1) hybrid MPI/OpenMP performance modeling, (2) NUMA effects that have an impact on performance, and (3) HPCG-related publications, including the reference model of HPCG performance.

Wang et al. [19] present a model that predicts both memory bandwidth usage and optimal core allocations. Luo et al. [13] provide valuable insights into off-socket and inter-socket bandwidth modeling for analyzing the performance of different thread and data placements. A hybrid approach to the development of high-level performance models of large-scale computing systems, which combines mathematical modeling and discrete-event simulation, is presented in [17]. Work [18] shows the advantages of hybrid OpenMP/MPI programming on large-scale NUMA clusters. Other work on performance modeling of communication and computation in hybrid MPI/OpenMP applications is carried out in [1].

As for HPCG, a number of important works have appeared since 2013. Dongarra et al. [6] describe allowed and disallowed HPCG optimizations. Several studies [24, 3, 11, 10, 12, 5, 2] describe early experience with HPCG optimizations on large systems such as Tianhe-2, systems with the Angara interconnect, the Sunway TaihuLight system, etc.

A general-purpose performance model [14] of the HPCG Benchmark includes the execution time of the main kernels, namely the Symmetric Gauss-Seidel smoother (SymGS), Sparse Matrix Vector Multiplication (SpMV), Vector Update, Global Dot Product (DDOT), as well as the Multigrid preconditioner (MG). Together with the model of the two communication procedures, the complete model allows HPCG performance to be predicted reliably. As implied by the foregoing, HPCG can provide insight into the comparison of ccNUMA systems with the results of other problem-oriented (non-ccNUMA) architectures. The evolutionary extension and experimental application of the works mentioned above are the contributions of this paper.

3 The Extended Model Features

The contribution made by our work is the prediction of ccNUMA system memory bandwidth using the reference model from [14]. The main performance challenges on ccNUMA are (1) the locality of data access, (2) the amount of data sharing between threads, and (3) the effective memory bandwidth [21]. The effective memory bandwidth from main memory, which enters the model of every computing procedure, is of the greatest significance. A further contribution is the use of hybrid HPCG, not only pure MPI as in the model of [14]. Although HPCG is well balanced at the MPI level, the pure MPI realization performs better, and OpenMP provides no explicit support for ccNUMA, our core point is that the hybrid version is in itself an additional challenge for ccNUMA architectures, since it provokes a number of effects detrimental to performance, such as memory hot-spotting. Table 1 shows the range of model options that we have considered or plan to consider. The features of our model include (1) the execution times in seconds of the main kernels (SYMGS, SpMV, etc.), previously presented in [14] and extended in this work to take into account the effective memory bandwidth and the interconnect latency, and (2) the effects of hybrid MPI+OpenMP parallelism in a ccNUMA environment. In this paper, we describe only the experimental aspects of effective bandwidth evaluation.

Table 1. Comparison of model editions

Model Features                   Reference        Extended
SYMGS exec time (sec)            Considered       +BW_eff
SpMV exec time (sec)             Considered       +BW_eff
WAXPB exec time (sec)            Considered       +BW_eff
DDOT exec time (sec)             Considered       +BW_eff
Allreduce, Halo exec time (sec)  Considered       +IC_latency
Hybrid MPI+OpenMP                Not considered   Considered
Effective bandwidth              Not considered   Considered
IC latency                       Not considered   Considered
Optimization techniques          Not considered   Future work

We already know the total execution time from the non-hybrid HPCG model [14]:

Iter_{time(sec)} = MG + SpMV(depth = 0) + 3 (DDOT + WAXPB)    (1)

Hybrid HPCG is more memory-bound than pure MPI and can deliver better performance [15], especially in the ccNUMA case. For OpenMP, the execution time model proposed by Wu and Taylor for hybrid MPI/OpenMP scientific applications [20] is rewritten as follows:

Perf = (Ref_{MPI} + OMP) \times \frac{Total_{exec\,time(sec)}}{Comp_{exec\,time(sec)} + Comm_{exec\,time(sec)}}    (2)

where OMP represents the model of intranode OpenMP performance:

OMP = T_{c1} + \frac{T_{c2} - T_{c1}}{BW_2 - 1} (BW_n - 1)    (3)

Here we use Eqn. (3) to model the OpenMP application execution time on n cores based on the performance on single and dual cores (T_{c1}, T_{c2}) and the memory bandwidth ratio (BW_n) [20].
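As a worked illustration of Eqn. (3), the short sketch below extrapolates the OpenMP kernel time on n cores from the single-core and dual-core times and the achieved bandwidth ratio. All input values are hypothetical placeholders chosen for readability, not measurements from our system, and the interpretation of the inputs follows our reading of [20].

```c
/* Hedged numerical sketch of Eqn. (3): the OpenMP execution time on n
 * cores is extrapolated linearly in the achieved memory bandwidth ratio
 * BW_n (bandwidth of n cores over one core), calibrated by the measured
 * single-core and dual-core times T_c1 and T_c2 [20].  All input values
 * below are illustrative placeholders, not measurements.                */
#include <stdio.h>

static double omp_time(double t_c1, double t_c2, double bw2, double bwn)
{
    return t_c1 + (t_c2 - t_c1) / (bw2 - 1.0) * (bwn - 1.0);  /* Eqn. (3) */
}

int main(void)
{
    const double t_c1 = 100.0;  /* kernel time on 1 core, s    (assumed) */
    const double t_c2 =  60.0;  /* kernel time on 2 cores, s   (assumed) */
    const double bw2  =   1.8;  /* 2-core / 1-core bandwidth   (assumed) */
    const double bw48 =   2.6;  /* saturated 48-core ratio     (assumed) */

    printf("predicted OpenMP time on 48 cores: %.1f s\n",
           omp_time(t_c1, t_c2, bw2, bw48));   /* prints 20.0 s */
    return 0;
}
```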
The effective memory bandwidth can be deduced from the reference model [14] for every HPCG kernel as follows:

BW_{SYMGS}(Bytes/sec) = 2 \times \frac{(nx \times ny \times nz)/2^{3 \times d} \times (20 + 20 \times 27)\,(Bytes)}{SYMGS_{exec\,time(sec)}}    (4)

BW_{SpMV}(Bytes/sec) = \frac{(nx \times ny \times nz)/2^{3 \times d} \times (20 + 20 \times 27)\,(Bytes)}{SpMV_{exec\,time(sec)}}    (5)

BW_{WAXPB}(Bytes/sec) = \frac{(nx \times ny \times nz)/2^{3 \times d} \times 24\,(Bytes)}{WAXPB_{exec\,time(sec)}}    (6)

BW_{DDOT}(Bytes/sec) = \frac{(nx \times ny \times nz)/2^{3 \times d} \times 16\,(Bytes)}{DDOT_{exec\,time(sec)}}    (7)

where the most expensive routine is SYMGS [16].

While the computing procedures are modeled exhaustively, important empirically obtained factors remain. The second of them, after the effective memory bandwidth from main memory, is the interconnect (IC) latency, whose influence on the prediction is considered insignificant by the authors of [14]. We evaluate the IC latency empirically with KNEM, a Linux kernel module enabling high-performance intra-node MPI communication for large messages [8].
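To make the byte counts of Eqns. (4)-(7) concrete, the sketch below evaluates each kernel's effective bandwidth from its execution time. The problem dimensions and the kernel timings are illustrative placeholders only, not results measured on the jumbonode.

```c
/* Hedged sketch of Eqns. (4)-(7): effective per-kernel bandwidth for a
 * local HPCG grid of nx*ny*nz points at multigrid depth d.  The kernel
 * times used below are illustrative placeholders only.                  */
#include <stdio.h>
#include <math.h>

/* Bytes moved per grid point, as assumed by the reference model [14]:
 * SYMGS/SpMV touch the sparse matrix (20 + 20*27 bytes per row), WAXPB
 * moves 24 bytes, DDOT 16 bytes; SYMGS sweeps the data twice, hence the
 * extra factor 2 in Eqn. (4).                                            */
static double bw(double points, double bytes_per_point,
                 double sweeps, double t_exec)
{
    return points * bytes_per_point * sweeps / t_exec;   /* Bytes/sec */
}

int main(void)
{
    const double nx = 256, ny = 256, nz = 256;  /* local problem size     */
    const int    d  = 0;                        /* finest multigrid level */
    const double points = nx * ny * nz / pow(2.0, 3.0 * d);

    /* Placeholder kernel times in seconds (not measured values). */
    const double t_symgs = 4.0, t_spmv = 2.0, t_waxpb = 0.2, t_ddot = 0.15;

    printf("BW_SYMGS = %.2f GB/s\n", bw(points, 20 + 20 * 27, 2, t_symgs) / 1e9);
    printf("BW_SpMV  = %.2f GB/s\n", bw(points, 20 + 20 * 27, 1, t_spmv)  / 1e9);
    printf("BW_WAXPB = %.2f GB/s\n", bw(points, 24,           1, t_waxpb) / 1e9);
    printf("BW_DDOT  = %.2f GB/s\n", bw(points, 16,           1, t_ddot)  / 1e9);
    return 0;
}
```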
Table 2. Target system configuration

Architecture details     Standalone server   Single OS image macronodes
                                             Minimal    Medium     Jumbonode
RAM                      188Gb               752Gb      3Tb        12Tb
NUMA node(s)             6                   24         96         384
Board/Socket/Core(s)     1/3/48              4/12/192   16/48/768  64/192/3072

4 Experimental Results and Discussion

Since ccNUMA systems with more than 3Tb of memory are exotic and it is hardly feasible to obtain a set of different gigantic ccNUMA systems, we use our target system in the different configurations presented in Table 2. For a more in-depth study of NUMA-related challenges, we performed our early-stage experiments with hybrid HPCG on macronodes starting from 188Gb of RAM (48 cores), then aggregated the macronode memory to 3Tb of RAM (768 cores), and finally integrated the machine into a single macronode with up to ≈12Tb of RAM (3072 cores). A standalone server is based on the AMD Opteron 6380 processor; the interconnect has a 3D torus topology. We use Linux 4.12 with a patchset supporting the Block Transfer Engine driver for the NumaChip node controllers, which provide a large number of outstanding memory transactions, a memory controller for the cache and memory tags, a crossbar switch for the interconnect fabric, and a number of interconnect fabric link controllers.

Running hybrid HPCG on a ccNUMA system with 12Tb of RAM is in itself a nontrivial problem which, to the best of our knowledge, has not been described before. The operating system as well as HPCG have been compiled with an optimized libgomp that keeps the stack and thread-local storage (TLS) local for more than 1024 threads. A private stack of up to 2Gb is allocated to each HPCG thread in order to increase the problem size, which is highly relevant. All MPI processes are mapped by NUMA nodes to reduce memory traffic and keep the data close to the cores [15]. Generation of memory prefetch instructions is used to increase the performance of loops that access large arrays. The load is balanced to improve the efficiency of the OpenMP application by distributing threads across all accessible NUMA nodes, using more FPUs and reducing the load on the memory interface and the L3 cache. The largest allowable problem size was 256 × 256 × 256. All start-up options described above have a significant impact on HPCG performance on ccNUMA.
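The mapping of MPI processes to NUMA nodes described above was done through the launcher. A minimal sketch of the same idea expressed in code, using libnuma together with MPI, is given below; it is a hypothetical fragment for illustration, not the launch mechanism actually used for HPCG.

```c
/* Hedged sketch of per-rank NUMA pinning: each MPI rank is bound to one
 * NUMA node and told to allocate from the local node, which is the intent
 * of the "processes mapped by NUMA nodes" setting described above.
 * Assumed build: mpicc numa_bind.c -lnuma                                */
#include <stdio.h>
#include <mpi.h>
#include <numa.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numa_available() != -1) {
        int nodes = numa_num_configured_nodes();
        int node  = rank % nodes;     /* simple round-robin rank->node map */

        numa_run_on_node(node);       /* restrict this rank (and the OpenMP
                                         threads it spawns) to `node`      */
        numa_set_localalloc();        /* first-touch allocations stay local */

        printf("rank %d bound to NUMA node %d of %d\n", rank, node, nodes);
    }

    /* ... the benchmark computation would follow here ... */

    MPI_Finalize();
    return 0;
}
```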
Figure 1 shows the results of modeling with the fundamental model, which does not take these characteristics into account.

Fig. 1. Reference model prediction for HPCG

Figure 2 compares our predictions with the actual measured results of hybrid HPCG on the jumbonode with 12Tb of RAM; the predictions of the reference model are included in the comparison as well. In contrast to the results of [14], hybrid HPCG scales non-linearly on the ccNUMA system; the non-uniformity of the system results in a separation of the surfaces, whose causes remain to be studied.

Fig. 2. Modeled and measured HPCG results on the target full-sized jumbonode with 12Tb of RAM

Finally, Figure 3 compares the modeled ccNUMA bandwidth with the STREAM Benchmark results.

Fig. 3. STREAM benchmark results vs. prediction

As to predicting the performance of future ccNUMA systems with more than 20Tb of memory, our view is that HPCG will remain memory-bound. Given the memory consumption of about 7Tb upon HPCG start with the maximum jumbonode task size, we expect a proportionally high memory consumption, since future ccNUMAs will have at least 4Gb of RAM per core. The IC latency, whose weight in the general HPCG model is insignificant, will grow. Based on our model, we expect a performance of at least 400 GFlops for a macronode with 20Tb of RAM. With respect to current non-ccNUMA machines, HPCG offers a single metric for comparing various problem-oriented architectures and narrows the gap between them created by LINPACK. For example, the experimental ccNUMA system demonstrates satisfactory HPCG performance compared to the results of the technical report [5] on the Sunway TaihuLight supercomputer [7], suggesting that the ccNUMA memory is only slightly slower than that of the current TOP500 leaders.

5 Concluding Remarks and Future Work

In this work, we presented an experimental approach to memory bandwidth evaluation on contemporary ccNUMA systems. The HPCG Benchmark was used to create a workload comparable to contemporary scientific applications. The existing HPCG performance model was extended to consider hybrid MPI/OpenMP and supplemented with the factors influencing memory bandwidth. As a result, the effective memory bandwidth of a real ccNUMA system with 12Tb of RAM was predicted. Our approach can be applied to the comparison of current and future ccNUMA machines. The divergence between the results actually obtained with the STREAM Benchmark and those deduced from the reference model is up to 12%.

As demonstrated in Section 4, the whole software environment was optimized on a wide scale, namely the Linux kernel, gcc, libgomp, etc. However, large-scale optimizations of HPCG itself are still possible. In the near future we plan to concentrate on realizing the existing HPCG optimizations for the ccNUMA case, as "improving the performance of HPCG will improve the performance of real applications" (J. Dongarra et al. [6]).

First of all, we consider refining the cache locality model with the help of the novel HPCG optimization technique proposed in [2], namely coloring along two XY areas at a time in SYMGS. Among other improvements, a number of works argue for replacing the default CSR matrix storage format with a simplified ELLPACK format for the SpMV and SYMGS kernels [24, 2]; a sketch of this layout is given after Table 3. Table 3 shows the expected speedup. Recent work [23] also demonstrates a new data redeployment model that reduces the remote memory access overhead for computation-intensive applications with a large problem size on the ccNUMA architecture.

Table 3. Planned optimizations and expected speedup

Optimization Techniques                              Expected Results
Coloring along two areas XY at a time in SYMGS [2]   ×3 SYMGS performance
CSR → ELLPACK [2]                                    5% speedup in SYMGS and SPMV
Data redeployment [23]                               Better performance for large-scale problems
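To indicate what the CSR → ELLPACK change in Table 3 amounts to, the following generic sketch shows an ELLPACK SpMV kernel; it is our own illustration of the storage format, not the optimized HPCG kernels of [24, 2].

```c
/* Generic ELLPACK SpMV sketch: every row stores a fixed number `width` of
 * column/value slots (27 for the interior of the HPCG 27-point stencil),
 * padded with zero values, so the inner loop has a constant trip count and
 * unit-stride index arithmetic instead of CSR's row-pointer indirection.  */
#include <stddef.h>

typedef struct {
    size_t  nrows;
    size_t  width;   /* slots per row, e.g. 27 for the HPCG stencil       */
    int    *cols;    /* nrows * width column indices; padded slots repeat
                        a valid index and carry a zero value              */
    double *vals;    /* nrows * width matrix values, zero-padded          */
} ell_matrix;

/* y = A * x for a matrix stored in ELLPACK layout. */
static void spmv_ell(const ell_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        const size_t base = i * A->width;     /* row-major ELLPACK layout */
        for (size_t k = 0; k < A->width; k++)
            sum += A->vals[base + k] * x[A->cols[base + k]];
        y[i] = sum;
    }
}
```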
These optimizations presume an analysis that will allow us to study in more depth the challenges posed by the ccNUMA architecture. Finally, we plan to propose an IC latency model for ccNUMA systems in the near future.

Acknowledgments. This work was financially supported by the Ministry of Education and Science of the Russian Federation within the framework of the Federal Targeted Programme for Research and Development in Priority Areas of Advancement of the Russian Scientific and Technological Complex for 2014-2020 (№ 14.584.21.0022, ID RFMEFI58417X0022). The results were obtained using the ccNUMA system of the Supercomputer Center of Peter the Great St. Petersburg Polytechnic University.

References

1. Adhianto, L., Chapman, B.: Performance modeling of communication and computation in hybrid MPI and OpenMP applications. In: 12th International Conference on Parallel and Distributed Systems (ICPADS'06), vol. 2, 6 pp. (2006)
2. Agarkov, A., Semenov, A., Simonov, A.: Optimized implementation of HPCG benchmark on supercomputer with "Angara" interconnect. In: Voevodin, V., Sobolev, S. (eds.) Proceedings of the 1st Russian Conference on Supercomputing (Supercomputing Days 2015), CEUR Workshop Proceedings, vol. 1482, pp. 294–302. CEUR-WS.org, Moscow (Sep 28–29, 2015), http://ceur-ws.org/Vol-1482/294.pdf
3. Chen, C., Du, Y., Jiang, H., Zuo, K., Yang, C.: HPCG: Preliminary evaluation and optimization on Tianhe-2 CPU-only nodes. In: 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 41–48 (Oct 2014)
4. Diener, M., Cruz, E.H., Navaux, P.O.: Modeling memory access behavior for data mapping. International Journal of High Performance Computing Applications (2016), http://hpc.sagepub.com/content/early/2016/04/13/1094342016640056.abstract
5. Dongarra, J.: Report on the Sunway TaihuLight System. Tech. Rep. UT-EECS-16-742, Oak Ridge National Laboratory, Department of Electrical Engineering and Computer Science, University of Tennessee (Jun 2016)
6. Dongarra, J., Heroux, M.A., Luszczek, P.: High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems. International Journal of High Performance Computing Applications 30(1), 3–10 (2016)
7. Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C., Xue, W., Liu, F., Qiao, F., Zhao, W., Yin, X., Hou, C., Zhang, C., Ge, W., Zhang, J., Wang, Y., Zhou, C., Yang, G.: The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59(7), 1–16 (2016), http://dx.doi.org/10.1007/s11432-016-5588-7
8. Goglin, B., Moreaud, S.: KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework. J. Parallel Distrib. Comput. 73(2), 176–188 (Feb 2013), http://dx.doi.org/10.1016/j.jpdc.2012.09.016
9. Li, T., Ren, Y., Yu, D., Jin, S., Robertazzi, T.: Characterization of input/output bandwidth performance models in NUMA architecture for data intensive applications. In: 2013 42nd International Conference on Parallel Processing, pp. 369–378 (Oct 2013)
10. Liu, F., Yang, C., Liu, Y., Zhang, X., Lu, Y.: Reducing communication overhead in the high performance conjugate gradient benchmark on Tianhe-2. In: 2014 13th International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES), pp. 13–18 (Nov 2014)
11. Liu, Y., Zhang, X., Yang, C., Liu, F., Lu, Y.: Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm. In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), pp. 542–551 (Dec 2014)
12. Liu, Y., Yang, C., Liu, F., Zhang, X., Lu, Y., Du, Y., Yang, C., Xie, M., Liao, X.: 623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores. International Journal of High Performance Computing Applications 30(1), 39–54 (2016), http://hpc.sagepub.com/content/30/1/39.abstract
13. Luo, H., Brock, J., Li, P., Ding, C., Ye, C.: Compositional model of coherence and NUMA effects for optimizing thread and data placement. In: 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 151–152 (April 2016)
14. Marjanović, V., Gracia, J., Glass, C.W.: Performance modeling of the HPCG benchmark. In: Jarvis, S.A., Wright, S.A., Hammond, S.D. (eds.) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation: 5th International Workshop, PMBS 2014, New Orleans, LA, USA, November 16, 2014, Revised Selected Papers, pp. 172–192. Springer International Publishing, Cham (2015)
15. Nakajima, K.: Flat MPI vs. Hybrid: Evaluation of parallel programming models for preconditioned iterative solvers on "T2K Open Supercomputer". In: 2009 International Conference on Parallel Processing Workshops, pp. 73–80 (Sept 2009)
16. Park, J., Smelyanskiy, M., Vaidyanathan, K., Heinecke, A., Kalamkar, D.D., Liu, X., Patwary, M.M.A., Lu, Y., Dubey, P.: Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices. In: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 945–955 (Nov 2014)
17. Pllana, S., Benkner, S., Xhafa, F., Barolli, L.: Hybrid performance modeling and prediction of large-scale computing systems. In: 2008 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS 2008), pp. 132–138 (March 2008)
18. Tsuji, M., Sato, M.: Performance evaluation of OpenMP and MPI hybrid programs on a large scale multi-core multi-socket cluster, T2K Open Supercomputer. In: 2009 International Conference on Parallel Processing Workshops, pp. 206–213 (Sept 2009)
19. Wang, W., Davidson, J.W., Soffa, M.L.: Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 419–431 (March 2016)
20. Wu, X., Taylor, V.: Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore cluster systems. In: 2011 IEEE 14th International Conference on Computational Science and Engineering (CSE), pp. 181–190 (Aug 2011)
21. Yang, R., Antony, J., Rendell, A.P.: A simple performance model for multithreaded applications executing on non-uniform memory access computers. In: 2009 11th IEEE International Conference on High Performance Computing and Communications (HPCC '09), pp. 79–86 (June 2009)
22. Zeng, D., Zhu, L., Liao, X., Jin, H.: A Data-Centric Tool to Improve the Performance of Multithreaded Program on NUMA, pp. 74–87. Springer International Publishing, Cham (2015)
23. Zhang, M., Gu, N., Ren, K.: Optimization of computation-intensive applications in cc-NUMA architecture. In: 2016 International Conference on Networking and Network Applications (NaNA), pp. 244–249 (July 2016)
24. Zhang, X., Yang, C., Liu, F., Liu, Y., Lu, Y.: Optimizing and scaling HPCG on Tianhe-2: Early experience. In: Sun, X.h., Qu, W., Stojmenovic, I., Zhou, W., Li, Z., Guo, H., Min, G., Yang, T., Wu, Y., Liu, L. (eds.) Algorithms and Architectures for Parallel Processing: 14th International Conference, ICA3PP 2014, Dalian, China, August 24-27, 2014, Proceedings, Part I, pp. 28–41. Springer International Publishing, Cham (2014)