<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Modern methods of energy consumption optimization in FPGA-based heterogeneous HPC systems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Oleksandr</forename><forename type="middle">V</forename><surname>Hryshchuk</surname></persName>
							<email>oleksandr_hryshchuk@knu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Sergiy</forename><forename type="middle">P</forename><surname>Zagorodnyuk</surname></persName>
							<email>szagorodniuk@gmail.com</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>64/13 Volodymyrska Str</addrLine>
									<postCode>01601</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<address>
									<settlement>Kryvyi Rih</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Modern methods of energy consumption optimization in FPGA-based heterogeneous HPC systems</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">1AEEBCAA38012CEC5ABF75CD39580DBF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>high-performance computing (HPC)</term>
					<term>FPGA</term>
					<term>power modeling</term>
					<term>power analysis</term>
					<term>heterogeneous computing</term>
					<term>power saving</term>
					<term>task scheduling</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>High-Performance Computing (HPC) systems play a pivotal role in addressing complex computational challenges across various domains, but their escalating energy consumption has raised concerns regarding sustainability and operational costs. This paper presents a comprehensive investigation into the parametrization and modeling of energy consumption in heterogeneous HPC systems, aiming to provide valuable insights for optimizing energy efficiency while preserving performance. We begin by characterizing the heterogeneity within modern HPC environments, which encompass diverse hardware components, such as CPUs, GPUs, FPGAs, and accelerators. Our research delves into modeling techniques, leveraging heuristics methods and statistical approaches to construct accurate predictive models for energy consumption. Furthermore, we explore the integration of dynamic power management strategies, such as DVFS (Dynamic Voltage and Frequency Scaling) and task scheduling, to optimize energy usage without compromising performance. This paper provides a vital foundation for sustainable HPC practices, enabling researchers and practitioners to make informed decisions for achieving enhanced energy efficiency without sacrificing computational performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Today's large-scale computing systems, such as data centers and high-performance computing (HPC) clusters, are severely limited by power and cooling costs for extremely large-scale (or exascale) problems. The steady increase in electricity consumption is a growing concern for several reasons, including cost, reliability, scalability, and environmental impact. Data centers currently use 200 TWh per year and contribute nearly 0.3% of global carbon emissions, while the entire complex of ICT (information and communications technology) devices produces up to 2% of them <ref type="bibr" target="#b0">[1]</ref>. A best-case scenario model predicts that by 2030 ICT will account for 8% of global electricity consumption <ref type="bibr" target="#b1">[2]</ref>, while the worst-case scenario anticipates 51%. This potential increase in power consumption and, consequently, in the cost of computing operations leads researchers and engineers to investigate and develop new techniques and approaches to optimize power management in HPC systems and in the ICT domain in general. At present there is a set of methods and approaches to address this energy optimization issue, though mainly for homogeneous CPU-based HPC systems. A general taxonomy of these techniques, suggested in <ref type="bibr" target="#b2">[3]</ref> and depicted in figure <ref type="figure" target="#fig_0">1</ref>, can be divided into two main groups: SPM (static power management) and DPM (dynamic power management). SPM methods, divided into two separate groups (for hardware-level and software-level management), are usually defined at design time and cannot be changed at runtime. Hardware SPM techniques can be further split into three separate groups <ref type="bibr" target="#b2">[3]</ref>:</p><p>1. Circuit level 2. Logic level 3. 
Architecture level.</p><p>DPM methods widely used in HPC systems <ref type="bibr" target="#b3">[4]</ref> can be divided into two main groups: DCD (dynamic component deactivation), based on predictive and heuristic approaches, and DPS (dynamic power scaling), such as resource throttling and DVFS (dynamic voltage and frequency scaling). These techniques can serve as a foundation for more complicated optimization methods, for example, task scheduling based on DVFS <ref type="bibr" target="#b4">[5]</ref> or DCD heuristic applications <ref type="bibr" target="#b3">[4]</ref>.</p><p>The methods described above can be used on different hardware platforms, both homogeneous (well studied nowadays) and heterogeneous (with GPUs, TPUs, FPGAs, and CGRAs), which have become popular in HPC according to a survey on deep learning hardware accelerators for heterogeneous HPC platforms <ref type="bibr" target="#b5">[6]</ref>. At the same time, the number of scientific papers on energy-aware optimization for HPC systems with FPGA controllers is extremely low (1-3 per year) compared to all research on "FPGA heterogeneous computing" (see figure <ref type="figure" target="#fig_1">2</ref>, with data obtained from app.dimensions.ai), which indicates a limited number of solutions in this domain; this work therefore focuses on heterogeneous applications of energy-aware optimizations in HPC systems.</p></div>
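As a minimal illustration of why DPS techniques such as DVFS save energy, the classic dynamic CMOS power relation P ≈ C·V²·f can be sketched as follows; the capacitance, voltage, and frequency values are illustrative assumptions, not measurements from any cited system.

```python
# Classic dynamic CMOS power model behind DVFS: P_dyn ~ C_eff * V^2 * f.
# All numeric values are illustrative assumptions, not measured data.

def dynamic_power(c_eff_farads: float, voltage_v: float, freq_hz: float) -> float:
    """Dynamic power (watts) for effective switched capacitance C_eff,
    supply voltage V, and clock frequency f."""
    return c_eff_farads * voltage_v ** 2 * freq_hz

# Lowering voltage and frequency together cuts power superlinearly:
nominal = dynamic_power(1e-9, 1.0, 2.0e9)  # 2.0 W at nominal V/f
scaled = dynamic_power(1e-9, 0.8, 1.5e9)   # 0.96 W after scaling down
print(f"DVFS cuts dynamic power by {100 * (1 - scaled / nominal):.0f}%")
```

Because voltage enters quadratically, even modest voltage reductions dominate the savings, which is what makes DVFS-based scheduling attractive despite the longer execution times.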
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Energy optimization theory</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Optimization problem definition for task scheduling</head><p>As mentioned in the introduction, optimization techniques can be divided into hardware and software types. The former are case-specific to particular hardware such as CPUs, memory chips, NICs, etc., while software-defined approaches can be generalized to provide a solution for disparate equipment with the same characteristics or types, for example, homogeneous or heterogeneous GPU- and TPU-based HPC clusters <ref type="bibr" target="#b6">[7]</ref>. Such software solutions often lead to energy-efficient task-scheduling methods, whose optimization problem can be defined as described next.</p><p>For a finite set of jobs (tasks) 𝐽 and a finite set of resources 𝑅, 𝑡𝑖𝑚𝑒(𝑗, 𝑟) is a function that returns the execution time of job 𝑗 ∈ 𝐽 on resource 𝑟 ∈ 𝑅 <ref type="bibr" target="#b3">[4]</ref>. Scheduling can then be described as the task of finding a set of start times {𝑠 1 , 𝑠 2 , . . . , 𝑠 |𝐽| } for jobs allocated to resources {𝑎 1 , 𝑎 2 , . . . , 𝑎 |𝐽| } under the conditions:</p><formula xml:id="formula_0">∀𝑠 𝑥 : ∄𝑠 𝑦 : 𝑠 𝑥 ≤ 𝑠 𝑦 + time (𝑦, 𝐴 𝑦 ) ∧ 𝑠 𝑦 ≤ 𝑠 𝑥 + time (𝑥, 𝐴 𝑥 ) ∧ 𝑎 𝑥 = 𝑎 𝑦 , ∀𝑎 𝑥 : 𝑥 ∈ 𝑅 (1)</formula><p>An additional optimization condition (see equation 2) can be applied to the resulting schedule, where the optimization criterion is a maximum or minimum, depending on the formulation of a function involving simple metrics such as execution time, consumed energy, etc. <ref type="bibr" target="#b3">[4]</ref>.  </p><p>This model is extremely simplified and not suitable for real applications for several reasons: it assumes that one resource can take only one task at a time, it assumes that the number of available resources is always equal to or higher than the number of jobs to complete, and it does not include the impact of communication between tasks on nodes or computing elements. 
To resolve these problems and adapt the model to the real world, an upgraded model was suggested <ref type="bibr" target="#b3">[4]</ref>: for two tasks 𝑥 and 𝑦 from the set of job pairs 𝐷, 𝑃 𝑗 is the set of devices that can be assigned to job 𝑗 ∈ 𝐽, and the communication time between jobs is obtained from the function 𝑐𝑜𝑚𝑚(𝑥, 𝑦, 𝑎 𝑥 , 𝑎 𝑦 ). The solution is then a set of assignments 𝐴 𝑗 and start times {𝑠 1 , 𝑠 2 , . . . , 𝑠 |𝐽| } for each job, as described in equations 3-6:</p><formula xml:id="formula_2">∀𝑥 ∈ 𝐴 𝑗 : 𝑥 ∈ 𝑃 𝑗 <label>(3)</label></formula><formula xml:id="formula_4">∀𝑠 𝑥 : ∄𝑠 𝑦 : 𝑠 𝑥 ≤ 𝑠 𝑦 + time (𝑦, 𝐴 𝑦 ) ∧ 𝑠 𝑦 ≤ 𝑠 𝑥 + time (𝑥, 𝐴 𝑥 ) ∧ 𝐴 𝑥 ∩ 𝐴 𝑦 = ∅ (4) ∀{𝑥, 𝑦} ∈ 𝐷 : 𝑠 𝑥 + time (𝑥, 𝐴 𝑥 ) + comm (𝑥, 𝑦, 𝐴 𝑥 , 𝐴 𝑦 ) ≤ 𝑠 𝑦<label>(5)</label></formula><p>With the optimization condition:</p><formula xml:id="formula_5">min / max (︀ OptimizationCriteria (︀{︀ 𝑠 1 , 𝑠 2 , . . . , 𝑠 |𝐽| }︀ , 𝐴 1 , . . . , 𝐴 |𝐽| , 𝐷 )︀)︀<label>(6)</label></formula><p>This method involves enumerating all jobs over all available resources, which suggests that a solution cannot be found in polynomial time; indeed, the energy-efficient active time scheduling problem <ref type="bibr" target="#b7">[8]</ref> has been proved NP-complete <ref type="bibr" target="#b4">[5]</ref>. To use this model in practice there are two possible ways: use predefined constraints and precalculated configurations, or use heuristic methods, for example genetic algorithms <ref type="bibr" target="#b8">[9]</ref>, to find a solution at runtime.</p></div>
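The constraints of the upgraded model can be sketched as a feasibility check; the function and parameter names below (feasible, time_fn, comm_fn) are ours, not from the cited work, and we read "jobs whose device sets intersect must not overlap in time" as the conflict condition of equation 4.

```python
# Hypothetical sketch of the upgraded model's feasibility constraints
# (equations 3-5). starts maps job -> start time, assigns maps job ->
# set of assigned devices, allowed maps job -> permitted devices P_j,
# deps lists dependent pairs (x, y).

def feasible(jobs, starts, assigns, allowed, deps, time_fn, comm_fn):
    # (3): every assigned device must be permitted for that job
    for j in jobs:
        if not assigns[j] <= allowed[j]:
            return False
    # (4): jobs sharing any device must not overlap in time
    for i, x in enumerate(jobs):
        for y in jobs[i + 1:]:
            if assigns[x] & assigns[y]:
                x_end = starts[x] + time_fn(x, assigns[x])
                y_end = starts[y] + time_fn(y, assigns[y])
                if starts[x] < y_end and starts[y] < x_end:
                    return False
    # (5): for a dependent pair (x, y), job y may start only after x
    # finishes plus the communication delay comm(x, y, A_x, A_y)
    for x, y in deps:
        finish_x = starts[x] + time_fn(x, assigns[x])
        if finish_x + comm_fn(x, y, assigns[x], assigns[y]) > starts[y]:
            return False
    return True
```

A heuristic scheduler, such as the genetic algorithms mentioned above, would then search the space of (starts, assigns) pairs for feasible schedules that minimize or maximize the chosen criterion.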
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Optimization criteria</head><p>The general optimization problem was described in the previous section; to be used in real HPC systems it requires properly defined optimization criteria. Existing solutions in this domain are based on an energy consumption (EC) metric, or can take other properties into consideration, for example, execution time <ref type="bibr" target="#b3">[4]</ref>. Power consumption can be described via energy itself (in joules or watts), or represented with more complex models such as instructions per joule or performance per watt <ref type="bibr" target="#b9">[10]</ref>. The latter approach is used in the Green500 rating as the FLOPS per Watt metric <ref type="bibr" target="#b10">[11]</ref>.</p><p>More sophisticated criteria can combine metrics such as EC (energy consumption), ExecT (execution time), utilization, average weighted time, wait time, power, Pareto front, AST, AFT, clock frequency, work (jobs) per energy, reliability, electricity cost, temperature, EDP, EDF, number of cores, probability of execution, branch transition rate, cache efficiency, and issue width <ref type="bibr" target="#b3">[4]</ref>. For example, a new algorithm was proposed for a reformed scheduling method with an energy consumption constraint (RSMECC), based on the AST, AFT, and energy consumption metrics <ref type="bibr" target="#b11">[12]</ref>. This algorithm makes it possible to solve a wide range of computing tasks more efficiently, including in the fields of neural networks, complex 3D modeling, and artificial intelligence.</p></div>
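The Green500-style performance-per-watt criterion is straightforward to compute; the node figures below are illustrative assumptions, not values from the Green500 list.

```python
# Green500-style efficiency metric: sustained performance per watt.
# The numeric values are illustrative, not measurements.

def flops_per_watt(rmax_gflops: float, avg_power_watts: float) -> float:
    """Energy efficiency in GFLOPS/W, the ranking metric of the Green500."""
    return rmax_gflops / avg_power_watts

# A node sustaining 7,000 GFLOPS while drawing 350 W on average:
print(flops_per_watt(7000.0, 350.0), "GFLOPS/W")
```

Richer criteria such as EDP (energy-delay product) follow the same pattern: they are scalar functions of the simple measurements above, so any of them can be plugged into the min/max optimization condition of the previous section.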
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Cluster architecture</head><p>Nowadays HPC clusters are widespread around the world in different forms and variations, but most of them are based on the homogeneous massively parallel processing (MPP) architecture, which is inherited from the older NUMA (non-uniform memory access) architecture <ref type="bibr" target="#b12">[13]</ref>. This approach looks similar to shared-memory technology, but in this case each processor in the cluster is connected to its own part of memory and forms a single independent node, which is connected to other nodes via a network interface card and a common network (see figure <ref type="figure" target="#fig_2">3</ref>). The absence of shared memory between nodes (not counting common NAS) simplifies the design and reduces inefficient components, therefore improving the scalability and stability of the HPC system <ref type="bibr" target="#b12">[13]</ref>. At the same time, due to the lack of shared memory, a processor core in one group must employ a different method to exchange data and coordinate with cores of other processor groups <ref type="bibr" target="#b13">[14]</ref>. This issue becomes more visible for heterogeneous systems based on CPUs from different series or types, or even for GRID computing systems <ref type="bibr" target="#b14">[15]</ref>. Another popular approach to building HPC systems is the use of symmetric multiprocessors (SMP). It embodies a category of parallel architectures that harness the power of multiple processor cores to enhance performance by leveraging parallel processing, all the while upholding a unified memory structure that spans the entirety of the parallel computing system <ref type="bibr" target="#b12">[13]</ref>.</p><p>An SMP defines a self-contained and self-sustaining computer system equipped with all the subsystems and components essential for fulfilling the demands and facilitating the execution of various applications. 
It can operate independently to support user applications designed as shared-memory multi-threaded programs, serve as one among several equivalent subsystems in a scalable MPP system or commodity cluster, and work as a throughput computer for the simultaneous execution of independent concurrent tasks <ref type="bibr" target="#b13">[14]</ref>. The general architecture of an SMP system is depicted in figure <ref type="figure" target="#fig_3">4</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Heterogeneous cluster architecture comparison</head><p>Heterogeneous computing in HPC refers to the utilization of diverse hardware accelerators, such as general-purpose graphics processing units (GPGPU), field-programmable gate arrays (FPGA), coarse-grained reconfigurable arrays (CGRA) <ref type="bibr" target="#b14">[15]</ref>, and specialized coprocessors, alongside traditional CPUs. This approach harnesses the strengths of different computing components to optimize performance and energy efficiency, making it particularly well suited for workloads that can benefit from parallel processing. The most common heterogeneous clusters couple a CPU and a GPGPU as a single node, so energy-efficient solutions already exist for this kind of HPC system, as analyzed in <ref type="bibr" target="#b3">[4]</ref>.</p><p>FPGAs, in contrast, are a newer type of accelerator in HPC and remain less studied, as shown in the introduction of this paper. Nevertheless, there are existing works on this topic: for example, a technique for cooperative CPU, GPU, and FPGA task execution based on the EngineCL framework was suggested in <ref type="bibr" target="#b15">[16]</ref>. Also, a new approach called Cooperative Heterogeneous Acceleration with Reconfigurable Multi-devices (CHARM) was proposed for a multi-hybrid accelerated cluster with GPU and FPGA coupling, implemented in the "Albireo" nodes of the Cygnus cluster, based on Intel Xeon Gold CPUs, four NVIDIA Tesla V100 GPUs, and Nallatech 520N FPGAs with Intel Stratix 10 <ref type="bibr" target="#b16">[17]</ref>. The architecture of these nodes is shown in figure <ref type="figure" target="#fig_4">5</ref>.</p><p>A characteristic comparison of a Cygnus supercomputer node and the heterogeneous system from the EngineCL test setup is shown in table <ref type="table" target="#tab_1">1</ref>. 
At the same time, for EngineCL it was shown that a performance improvement from heterogeneity was obtained for all benchmark tasks ("Matrix multiplication", "Mersenne Twister", "Watermarking", "Sobel Filter", "Nearest Neighbor", "AES Decrypt"), but an energy consumption improvement was detected only for "Sobel Filter" <ref type="bibr" target="#b15">[16]</ref>, which leaves a research gap in the search for energy-optimization methods for this kind of system. Consequently, these two works lack energy consumption optimization for the described systems, despite existing power management and optimization methods described in a survey of FPGA optimization methods for data center energy efficiency <ref type="bibr" target="#b17">[18]</ref>. Finding a "general" solution for FPGA-based systems is complicated by the necessity of reconfiguring the hardware for each specific task (job); nevertheless, energy optimization constraints with proper criteria, described in the "Energy optimization theory" section of this paper, can be applied to multi-hybrid hardware FPGA systems to optimize power consumption.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>This paper reviews modern theories and approaches for power consumption planning and optimization in heterogeneous HPC systems, including the optimization model described in the second section of this paper. As this problem is NP-complete, heuristic approaches for finding solutions were discussed. Results from the mentioned solutions can be implemented at the hardware or software level via DPM technologies. At the same time, the mentioned solutions are well suited only to CPU-GPU coupled systems, not to CPU-GPU-FPGA coupled systems. For the latter there are existing power management techniques, such as DCD, which is easy to apply to FPGAs, but there is a lack of schedulers and general approaches for implementing solutions derived from the theoretical optimal model. Therefore, future work involves further searching for ways to strengthen these methods, including heuristic solutions for power consumption planning in FPGA-coupled HPC systems.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: General classification of power management methods in computer systems.</figDesc><graphic coords="2,89.29,84.19,416.68,154.81" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Count of scientific publications per year on topic "FPGA heterogeneous computing" and "Energy-aware FPGA heterogeneous computing" from 2014 to 2023.</figDesc><graphic coords="3,130.96,367.42,333.36,275.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: MPP HPC cluster architecture.</figDesc><graphic coords="5,89.29,282.61,416.68,250.33" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Internal architecture of SMP HPC system.</figDesc><graphic coords="6,89.29,136.78,416.68,425.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Internal architecture of Albireo-node from Cygnus cluster.</figDesc><graphic coords="8,89.29,84.19,416.69,392.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>min / max (︀ OptimizationCriteria (︀{︀ 𝑠 1 , 𝑠 2 , . . . , 𝑠 |𝐽| }︀ , {︀ 𝑎 1 , 𝑎 2 , . . . , 𝑎 |𝐽| }︀)︀)︀ (2)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Comparison of Cygnus and EngineCL setup node specifications.</figDesc><table><row><cell>Characteristic</cell><cell>Cygnus</cell><cell>EngineCL test setup</cell></row><row><cell>CPU</cell><cell>Intel Xeon Gold x2</cell><cell>Intel Core i7-G700k</cell></row><row><cell>GPU</cell><cell>Nvidia Tesla V100 x4 (32 GB x4)</cell><cell>Nvidia GeForce GTX Titan X (12 GB)</cell></row><row><cell>FPGA</cell><cell>Intel Stratix 10 x2</cell><cell>Altera DE5NET Stratix V</cell></row><row><cell>RAM</cell><cell>192 GB</cell><cell>64 GB</cell></row><row><cell>Number of nodes</cell><cell>32 GPU+FPGA, 46 CPU-only</cell><cell>1</cell></row><row><cell>Energy-efficiency</cell><cell>N/A</cell><cell>1 of 6 benchmark tasks</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">How to stop data centres from gobbling up the world&apos;s electricity</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jones</surname></persName>
		</author>
		<idno type="DOI">10.1038/d41586-018-06610-y</idno>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">561</biblScope>
			<biblScope unit="page" from="163" to="166" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S G</forename><surname>Andrae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Edler</surname></persName>
		</author>
		<idno type="DOI">10.3390/challe6010117</idno>
	</analytic>
	<monogr>
		<title level="m">On Global Electricity Usage of Communication Technology: Trends to 2030</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="117" to="157" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Energy Efficient High Performance Processors: Recent Approaches for Designing Green High Performance Computing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Haj-Yahya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mendelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">B</forename><surname>Asher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chattopadhyay</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Energy-Aware Scheduling for High-Performance Computing Systems: A Survey</title>
		<author>
			<persName><forename type="first">B</forename><surname>Kocot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Czarnul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Proficz</surname></persName>
		</author>
		<idno type="DOI">10.3390/en16020890</idno>
	</analytic>
	<monogr>
		<title level="j">Energies</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">NP-completeness of the Active Time Scheduling Problem</title>
		<author>
			<persName><forename type="first">S</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Purohit</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/2112.03255" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms</title>
		<author>
			<persName><forename type="first">C</forename><surname>Silvano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ielmini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ferrandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fiorin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Curzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Benini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Conti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garofalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zambelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Calore</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.15552</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Runtime and energy constrained work scheduling for heterogeneous systems</title>
		<author>
			<persName><forename type="first">V</forename><surname>Raca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Umboh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mehofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Scholz</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11227-022-04556-7</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="volume">78</biblScope>
			<biblScope unit="page" from="17150" to="17177" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A Model for Minimizing Active Processor Time</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">N</forename><surname>Gabow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khuller</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-642-33090-2_26</idno>
	</analytic>
	<monogr>
		<title level="m">Algorithms -ESA 2012</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Epstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Ferragina</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="289" to="300" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Energy-efficient allocation of computing node slots in HPC clusters through parameter learning and hybrid genetic fuzzy system modeling</title>
		<author>
			<persName><forename type="first">A</forename><surname>Cocaña-Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ranilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sánchez</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11227-014-1320-9</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="volume">71</biblScope>
			<biblScope unit="page" from="1163" to="1174" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Energy-aware scheduling algorithm for time-constrained workflow tasks in DVFS-enabled cloud environment</title>
		<author>
			<persName><forename type="first">M</forename><surname>Safari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Khorsand</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.simpat.2018.07.006</idno>
	</analytic>
	<monogr>
		<title level="j">Simulation Modelling Practice and Theory</title>
		<imprint>
			<biblScope unit="volume">87</biblScope>
			<biblScope unit="page" from="311" to="326" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Node variability in large-scale power measurements: perspectives from the Green500, Top500 and EEHPCWG</title>
		<author>
			<persName><forename type="first">T</forename><surname>Scogland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Azose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rohr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rivoire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hackenberg</surname></persName>
		</author>
		<idno type="DOI">10.1145/2807591.2807653</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC &apos;15</title>
				<meeting>the International Conference for High Performance Computing, Networking, Storage and Analysis, SC &apos;15<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A reformed task scheduling algorithm for heterogeneous distributed systems with energy consumption constraints</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
		<idno type="DOI">10.1007/s00521-019-04415-2</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Computing and Applications</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Sterling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brodowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Anderson</surname></persName>
		</author>
		<title level="m">High Performance Computing: Modern Systems and Practices</title>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ramos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hoefler</surname></persName>
		</author>
		<idno type="DOI">10.1145/2462902.2462916</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC &apos;13</title>
		<meeting>the 22nd international symposium on High-performance parallel and distributed computing, HPDC &apos;13</meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="97" to="108" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A Coarse-Grained Reconfigurable Array for High-Performance Computing Applications</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Käsgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Weinhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hochberger</surname></persName>
		</author>
		<idno type="DOI">10.1109/RECONFIG.2018.8641720</idno>
	</analytic>
	<monogr>
		<title level="m">2018 International Conference on Re-ConFigurable Computing and FPGAs (ReConFig)</title>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dávila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nozal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gran Tejero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Villarroya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Suárez Gracia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Bosque</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11227-019-02768-y</idno>
	</analytic>
	<monogr>
		<title level="j">The Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="volume">75</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling</title>
		<author>
			<persName><forename type="first">T</forename><surname>Boku</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fujita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kobayashi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tatebe</surname></persName>
		</author>
		<idno type="DOI">10.1145/3547276.3548629</idno>
	</analytic>
	<monogr>
		<title level="m">Workshop Proceedings of the 51st International Conference on Parallel Processing, ICPP Workshops &apos;22</title>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A Survey of FPGA Optimization Methods for Data Center Energy Efficiency</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tibaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pilato</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSUSC.2023.3273852</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Sustainable Computing</title>
		<imprint>
			<biblScope unit="page" from="343" to="362" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
