<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modern methods of energy consumption optimization in FPGA-based heterogeneous HPC systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksandr V. Hryshchuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergiy P. Zagorodnyuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13 Volodymyrska Str., Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>167</fpage>
      <lpage>176</lpage>
      <abstract>
<p>High-Performance Computing (HPC) systems play a pivotal role in addressing complex computational challenges across various domains, but their escalating energy consumption has raised concerns regarding sustainability and operational costs. This paper presents a comprehensive investigation into the parametrization and modeling of energy consumption in heterogeneous HPC systems, aiming to provide valuable insights for optimizing energy efficiency while preserving performance. We begin by characterizing the heterogeneity within modern HPC environments, which encompass diverse hardware components such as CPUs, GPUs, FPGAs, and accelerators. Our research delves into modeling techniques, leveraging heuristic methods and statistical approaches to construct accurate predictive models for energy consumption. Furthermore, we explore the integration of dynamic power management strategies, such as DVFS (Dynamic Voltage and Frequency Scaling) and task scheduling, to optimize energy usage without compromising performance. This paper provides a vital foundation for sustainable HPC practices, enabling researchers and practitioners to make informed decisions for achieving enhanced energy efficiency without sacrificing computational performance.</p>
      </abstract>
      <kwd-group>
        <kwd>high-performance computing (HPC)</kwd>
        <kwd>FPGA</kwd>
        <kwd>power modeling</kwd>
        <kwd>power analysis</kwd>
        <kwd>heterogeneous computing</kwd>
        <kwd>power saving</kwd>
        <kwd>task scheduling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Today’s large-scale computing systems, such as data centers and high-performance computing
(HPC) clusters, are severely limited by power and cooling costs for extremely large-scale (or
exascale) problems. The steady increase in electricity consumption is a growing concern for
several reasons, such as cost, reliability, scalability, and environmental impact. Nowadays data
centers use 200 TWh per year and contribute nearly 0.3% of global carbon emissions,
while the entire complex of ICT (information and communication technology) devices produces up to 2%
of them [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The best-case scenario model predicts that by 2030 ICT will account for 8% of global electricity
consumption [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], while the worst-case scenario anticipates 51% of global electricity usage.
This potential increase in power consumption and, consequently, in the cost of computing operations
leads researchers and engineers to investigate and develop new techniques and approaches to
optimize power management in HPC systems and in the ICT domain in general.
      </p>
      <p>
        At present there is a set of methods and approaches to resolve this energy optimization issue,
mainly for homogeneous CPU-based HPC systems. A general taxonomy of these techniques,
suggested in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and depicted in figure 1, can be divided into two main groups: SPM (static
power management) and DPM (dynamic power management). SPM methods, divided into two
separate groups (for hardware-level and software-level management), are usually defined at design
time and cannot be changed at runtime. Hardware SPM techniques can be further split
into three separate groups [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
      </p>
      <sec id="sec-1-1">
        <title>Hardware SPM levels</title>
        <p>1. Circuit level
2. Logic level
3. Architecture level</p>
        <p>
          DPM methods widely used in HPC [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] systems can be divided into two main groups: DCD
(Dynamic Component Deactivation), based on predictive and heuristic approaches, and DPS
(Dynamic Power Scaling), such as resource throttling and DVFS (Dynamic Voltage and Frequency
Scaling). These techniques can be a foundation for more complicated optimization methods, for
example, task scheduling based on DVFS [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] or DCD heuristics applications [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
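        <p>The effect of DPS techniques such as DVFS can be illustrated with the classic dynamic power model P = C·V²·f. Below is a minimal sketch; the effective capacitance and the voltage/frequency operating points are hypothetical illustration values:</p>
        <preformat><![CDATA[
```python
# Simplified dynamic CMOS power model under DVFS: P = C * V^2 * f.
# C_EFF and the (voltage, frequency) operating points below are
# hypothetical illustration values, not measurements.

def dynamic_power(c_eff, voltage, freq_hz):
    """Dynamic power in watts: P = C * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz

C_EFF = 1.0e-9  # effective switched capacitance in farads (assumed)

# Scaling voltage and frequency down together gives a roughly cubic drop.
p_high = dynamic_power(C_EFF, 1.2, 3.0e9)  # high-performance state
p_low = dynamic_power(C_EFF, 0.9, 1.5e9)   # power-saving state
print(p_high, p_low)
```
]]></preformat>
        <p>Because power scales with the square of voltage, and voltage can usually be lowered together with frequency, DVFS trades roughly cubic power savings for a linear slowdown.</p>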
        <p>
          The methods described above can be used on different hardware platforms, both homogeneous
(well studied nowadays) and heterogeneous (with GPUs, TPUs, FPGAs and CGRAs), which have become
popular in HPC according to a survey on Deep Learning hardware accelerators for heterogeneous
HPC platforms [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. At the same time, the number of scientific papers on energy-aware optimization
for HPC systems with FPGA controllers is extremely low (1-3 per year) compared to all
research on “FPGA heterogeneous computing” (see figure 2 with data obtained from
app.dimensions.ai), which indicates a limited number of solutions in this domain, so this work
focuses on heterogeneous applications of energy-aware optimizations in HPC systems.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Energy optimization theory</title>
      <sec id="sec-2-1">
        <title>2.1. Optimization problem definition for task scheduling</title>
        <p>
          As mentioned in the introduction, optimization techniques can be divided into
hardware and software types. The former are case-specific for different variations of hardware
such as CPUs, memory chips, NICs, etc., while software-defined approaches can be generalized
to provide a solution for disparate equipment with the same characteristics/types, for example,
homogeneous or heterogeneous GPU- and TPU-based HPC clusters [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Such software solutions
often lead to energy-efficient task-scheduling methods, for which the optimization problem
can be defined as described next.
        </p>
        <p>
          For a finite set of jobs (tasks) J and a finite set of resources R, time(j, r) is a function that
returns the execution time of job j ∈ J on resource r ∈ R [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Scheduling can then be
described as the task of finding a set of start times {s1, s2, . . . , s|J|} for the jobs, allocated to resources
{r1, r2, . . . , r|J|}, under the condition:
∀i ∄k ≠ i : si ≤ sk + time(jk, rk) ∧ sk ≤ si + time(ji, ri) ∧ ri = rk, ∀i : ji ∈ J
(1)
        </p>
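        <p>Condition (1) is a pairwise non-overlap check on jobs sharing a resource and can be verified directly. A minimal sketch follows; the job/resource names, the time(j, r) table and the example schedules are hypothetical illustration data:</p>
        <preformat><![CDATA[
```python
# Check constraint (1): no two jobs assigned to the same resource may
# overlap in time. A schedule maps job -> (start_time, resource).
# The time(j, r) table and the schedules are hypothetical examples.
from itertools import combinations

def time_fn(job, resource):
    # Hypothetical execution-time table time(j, r).
    durations = {("j1", "r1"): 4, ("j2", "r1"): 3, ("j3", "r2"): 5}
    return durations[(job, resource)]

def is_feasible(schedule):
    """schedule: dict mapping job -> (start, resource)."""
    for (ja, (sa, ra)), (jb, (sb, rb)) in combinations(schedule.items(), 2):
        if ra == rb:
            end_a = sa + time_fn(ja, ra)
            end_b = sb + time_fn(jb, rb)
            # Jobs overlap iff each starts before the other one ends.
            if sa < end_b and sb < end_a:
                return False
    return True

ok = is_feasible({"j1": (0, "r1"), "j2": (4, "r1"), "j3": (0, "r2")})
bad = is_feasible({"j1": (0, "r1"), "j2": (2, "r1"), "j3": (0, "r2")})
print(ok, bad)  # True False
```
]]></preformat>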
        <p>
          Additional optimization conditions (see equation (2)) can be applied to the provided scheduling,
where the optimization criterion can be a maximum or a minimum, depending on the formulation
of a function that involves simple metrics such as execution time, consumed energy, etc. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
min / max OptimizationCriteria ({s1, s2, . . . , s|J|}, r1, . . . , r|J|)
(2)
        </p>
        <p>
          This model is extremely simplified and is not suitable for real applications for several
reasons: it assumes that one resource can take only one task at a time, that the number of available
resources is always equal to or greater than the number of jobs to complete, and it does not include the impact
of communication between tasks on nodes or computing elements. To resolve these problems
and adapt the model to the real world, an upgraded model was suggested [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: for two tasks ji and jk
from the set of job pairs P, Di is the set of devices that can be assigned to job ji ∈ J, and the time of
communication between jobs is obtained from a function comm(ji, jk, ri, rk); then the solution is a set
of assignments Di and start times {s1, s2, . . . , s|J|} for each job, as described in equations (3)-(6):
∀ji ∈ J : ri ∈ Di
(3)
∀i ∄k ≠ i : si ≤ sk + time(jk, rk) ∧ sk ≤ si + time(ji, ri) ∧ ri ∩ rk ≠ ∅
(4)
∀{ji, jk} ∈ P : si + time(ji, ri) + comm(ji, jk, ri, rk) ≤ sk
(5)
        </p>
        <sec id="sec-2-1-1">
          <title>With optimization condition:</title>
          <p>min / max OptimizationCriteria ({s1, s2, . . . , s|J|}, r1, . . . , r|J|, D)
(6)</p>
          <p>
            This method involves enumeration of all jobs over all available resources, which leads to the
conclusion that a solution cannot be found in polynomial time; indeed, it was proved that the problem of
energy-efficient active time [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] scheduling is NP-complete [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], so to be able to use this model
there are two possible ways: use predefined constraints and precalculated configurations,
or use heuristic methods, for example genetic algorithms [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], to find a solution at runtime.
          </p>
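          <p>As a minimal illustration of the heuristic route, the sketch below uses a simple greedy earliest-finish rule (a lightweight stand-in, not the genetic algorithms cited above); the jobs, resources and execution-time table are hypothetical:</p>
          <preformat><![CDATA[
```python
# Greedy heuristic sketch: assign each job to the resource on which it
# would finish earliest, given current resource availability. This is a
# simple illustrative stand-in for heavier runtime heuristics; the jobs,
# resources and time table are hypothetical example data.

def greedy_schedule(jobs, resources, time_fn):
    free_at = {r: 0 for r in resources}  # when each resource becomes idle
    plan = {}
    for job in jobs:
        # Pick the resource giving the earliest finish time for this job.
        best = min(resources, key=lambda r: free_at[r] + time_fn(job, r))
        start = free_at[best]
        plan[job] = (best, start)
        free_at[best] = start + time_fn(job, best)
    return plan, max(free_at.values())  # assignments and makespan

times = {("a", "cpu"): 4, ("a", "fpga"): 2,
         ("b", "cpu"): 3, ("b", "fpga"): 6,
         ("c", "cpu"): 5, ("c", "fpga"): 2}
plan, makespan = greedy_schedule(["a", "b", "c"], ["cpu", "fpga"],
                                 lambda j, r: times[(j, r)])
print(plan, makespan)
```
]]></preformat>
          <p>Such greedy rules run in polynomial time but give no optimality guarantee, which is exactly the trade-off accepted when the exact NP-complete formulation is replaced by a heuristic.</p>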
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Optimization criteria</title>
        <p>
          The general optimization problem was described in the previous section; to be used in real HPC
systems it requires properly defined optimization criteria. Existing solutions in this domain are
based on the energy consumption (EC) metric, or can take into consideration other properties,
for example, execution time, etc. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Power consumption can be described via energy itself (in
joules or watts), or can be represented with more complicated models such as instructions per joule
or performance per watt [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This approach is used in the Green500 rating as the FLOPS per Watt metric [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
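        <p>The FLOPS per Watt criterion reduces to a simple ratio of sustained performance to average power draw; a minimal sketch with hypothetical benchmark measurements:</p>
        <preformat><![CDATA[
```python
# FLOPS-per-watt efficiency metric in the style of the Green500 ranking.
# The performance and power figures below are hypothetical examples.

def flops_per_watt(flops, watts):
    """Energy efficiency: sustained FLOPS divided by average power draw."""
    return flops / watts

# A node sustaining 50 TFLOPS at 2.5 kW average power:
eff = flops_per_watt(50e12, 2500.0)
print(eff / 1e9, "GFLOPS/W")  # 20.0 GFLOPS/W
```
]]></preformat>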
        <p>
          More sophisticated approaches can use a combination of metrics such as EC (energy
consumption), ExecT (execution time), utilization, average weighted time, wait time, power, Pareto front,
AST, AFT, clock frequency, work (jobs) per energy, reliability, electricity cost, temperature, EDP,
EDF, number of cores, probability of execution, branch transition rate, cache efficiency, and issue
width [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. For example, a new algorithm was proposed for a reformed scheduling method with an energy
consumption constraint (RSMECC), based on AST, AFT and energy consumption metrics [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
This algorithm makes it possible to solve a wide range of computing tasks more efficiently,
including in the fields of neural networks, complex 3D modeling and artificial intelligence.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Cluster architecture</title>
      <p>
        Nowadays HPC clusters are widespread around the world in different forms and variations, but
generally most of them are based on the homogeneous massively parallel processor (MPP) architecture,
which is inherited from the older NUMA (non-uniform memory access) architecture [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This
approach looks similar to shared-memory technology, but in this case each processor in the cluster
is connected to its own part of memory and forms a single independent node, which is
connected with other nodes via a network interface card and a common network (see figure 3).
The absence of shared memory between nodes (not counting a common NAS) simplifies the design and
reduces inefficient components, therefore improving the scalability and stability of the HPC system [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
At the same time, due to the lack of shared memory, a processor core in one group must employ a
different method to exchange data and coordinate with cores of other processor groups [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
This issue becomes more visible for heterogeneous systems based on CPUs from different series
or types, or even for GRID computing systems [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Another popular approach to building HPC systems is the use of symmetric multiprocessors
(SMP). This embodies a category of parallel architectures that harness the power of multiple
processor cores to enhance performance by leveraging parallel processing, all the while upholding
a unified memory structure that spans the entirety of the parallel computing system [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        An SMP is a self-contained and self-sustaining computer system equipped with all the
subsystems and components essential for fulfilling the demands and facilitating the execution
of various applications. It can operate independently to support user applications designed as
shared-memory multi-threaded programs, serve as one among several equivalent subsystems
in a scalable MPP system or commodity cluster, and work as a throughput computer for the
simultaneous execution of independent concurrent tasks [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The general architecture of an SMP
system is depicted in figure 4.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Heterogeneous cluster architecture comparison</title>
        <p>
          Heterogeneous computing in HPC refers to the utilization of diverse hardware accelerators, such as
general-purpose graphics processing units (GPGPU), field-programmable gate arrays (FPGA),
coarse-grained reconfigurable arrays (CGRA) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and specialized coprocessors, alongside traditional
CPUs. This approach harnesses the strengths of different computing components to optimize
performance and energy efficiency, making it particularly well suited for workloads that can
benefit from parallel processing. The most common heterogeneous clusters couple a
CPU and a GPGPU in a single node, so energy-efficient solutions already exist for this
kind of HPC system, as analyzed in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>FPGAs, at the same time, are a newer and less studied type of accelerator in HPC, as shown
in the introduction of this paper. Nevertheless, there are existing works on this topic; for
example, a technique of cooperative CPU, GPU and FPGA task execution based on the EngineCL
framework was suggested in [16]. Also, a new approach called Cooperative Heterogeneous
Acceleration with Reconfigurable Multi-devices (CHARM) was proposed for a multi-hybrid
accelerated cluster with GPU and FPGA coupling, which was implemented in the “Albireo” nodes of the
Cygnus cluster, based on an Intel Xeon Gold CPU, four NVIDIA Tesla V100 GPUs and a Nallatech
520N FPGA with Intel Stratix 10 [17]. The architecture of these nodes is shown in figure 5.</p>
        <p>A characteristic comparison of the Cygnus supercomputer node and the heterogeneous system from
the EngineCL test setup is shown in table 1. At the same time, for EngineCL it was shown that
a performance improvement from heterogeneity was obtained for all benchmark tasks ("Matrix
multiplication", "Mersenne Twister", "Watermarking", "Sobel Filter", "Nearest Neighbor", "AES
Decrypt"), but an energy consumption improvement was detected only for "Sobel Filter" [16],
which leaves a research gap in the search for energy-optimization methods for this kind of system.</p>
        <p>Consequently, these two works lack energy consumption optimization for the described
systems, despite the existing methods of power management and optimization described in the
survey of FPGA optimization methods for data center energy efficiency [18]. Finding a “general”
solution for FPGA-based systems is complicated by the necessity of reconfiguring the
hardware for each specific task (job); nevertheless, the energy optimization constraints with
proper criteria described in the “Energy optimization theory” section of this paper can be applied
to multi-hybrid hardware FPGA systems to optimize power consumption.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>This paper reviews modern theories and approaches for power consumption planning and
optimization in heterogeneous HPC systems, including the optimization model for MPP systems
described above. As this problem is NP-complete, heuristic approaches
for finding solutions were discussed. Results from the mentioned solutions can be implemented at the
hardware or software level via DPM technologies. At the same time, the mentioned solutions are well
suited only to CPU-GPU coupled systems, not to CPU-GPU-FPGA coupled systems. For
the latter there are existing power management techniques, such as DCD, which is easy to use on FPGAs, but
there is a lack of schedulers and general approaches for implementing solutions derived from the theoretically
optimal model. Therefore, future work involves a further search for ways to amplify
heuristic methods of power consumption planning in FPGA-coupled HPC systems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>How to stop data centres from gobbling up the world's electricity</article-title>
          ,
          <source>Nature</source>
          <volume>561</volume>
          (
          <year>2018</year>
          )
          <fpage>163</fpage>
          -
          <lpage>166</lpage>
          . doi:10.1038/d41586-018-06610-y.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. S. G.</given-names>
            <surname>Andrae</surname>
          </string-name>
          , T. Edler,
          <article-title>On Global Electricity Usage of Communication Technology: Trends to 2030</article-title>
          ,
          <source>Challenges</source>
          <volume>6</volume>
          (
          <year>2015</year>
          )
          <fpage>117</fpage>
          -
          <lpage>157</lpage>
          . doi:10.3390/challe6010117.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Haj-Yahya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mendelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. B.</given-names>
            <surname>Asher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chattopadhyay</surname>
          </string-name>
          ,
          <article-title>Energy Eficient High Performance Processors: Recent Approaches for Designing Green High Performance Computing</article-title>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kocot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Czarnul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Proficz</surname>
          </string-name>
          ,
          <article-title>Energy-Aware Scheduling for High-Performance Computing Systems:</article-title>
          A Survey,
          <source>Energies</source>
          <volume>16</volume>
          (
          <year>2023</year>
          ). doi:10.3390/en16020890.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <article-title>NP-completeness of the Active Time Scheduling Problem</article-title>
          ,
          <year>2021</year>
          . URL: http://arxiv.org/abs/2112.03255.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Silvano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ielmini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ferrandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fiorin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Curzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Benini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garofalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zambelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Calore</surname>
          </string-name>
          , et. al.,
          <source>A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms</source>
          ,
          <year>2023</year>
          . doi:10.48550/arXiv.2306.15552.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Raca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Umboh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mehofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Scholz</surname>
          </string-name>
          ,
          <article-title>Runtime and energy constrained work scheduling for heterogeneous systems</article-title>
          ,
          <source>Journal of Supercomputing</source>
          <volume>78</volume>
          (
          <year>2022</year>
          )
          <fpage>17150</fpage>
          -
          <lpage>17177</lpage>
          . doi:10.1007/s11227-022-04556-7.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. N.</given-names>
            <surname>Gabow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khuller</surname>
          </string-name>
          ,
          <article-title>A Model for Minimizing Active Processor Time</article-title>
          , in: L. Epstein, P. Ferragina (Eds.),
          <source>Algorithms - ESA 2012, Lecture Notes in Computer Science</source>
          , Springer,
          <year>2012</year>
          , pp.
          <fpage>289</fpage>
          -
          <lpage>300</lpage>
          . doi:10.1007/978-3-642-33090-2_26.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cocaña-Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ranilla</surname>
          </string-name>
          , L. Sánchez,
          <article-title>Energy-eficient allocation of computing node slots in HPC clusters through parameter learning and hybrid genetic fuzzy system modeling</article-title>
          ,
          <source>Journal of Supercomputing</source>
          <volume>71</volume>
          (
          <year>2015</year>
          )
          <fpage>1163</fpage>
          -
          <lpage>1174</lpage>
          . doi:10.1007/s11227-014-1320-9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Safari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khorsand</surname>
          </string-name>
          ,
          <article-title>Energy-aware scheduling algorithm for time-constrained workflow tasks in DVFS-enabled cloud environment</article-title>
          ,
          <source>Simulation Modelling Practice and Theory</source>
          <volume>87</volume>
          (
          <year>2018</year>
          )
          <fpage>311</fpage>
          -
          <lpage>326</lpage>
          . doi:10.1016/j.simpat.2018.07.006.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Scogland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Azose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rohr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rivoire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hackenberg</surname>
          </string-name>
          ,
          <article-title>Node variability in large-scale power measurements: perspectives from the Green500, Top500 and EEHPCWG, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</article-title>
          ,
          <source>SC '15</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2015</year>
          . doi:10.1145/2807591.2807653.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A reformed task scheduling algorithm for heterogeneous distributed systems with energy consumption constraints</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>32</volume>
          (
          <year>2020</year>
          ). doi:10.1007/s00521-019-04415-2.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sterling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brodowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <source>High Performance Computing: Modern Systems and Practices</source>
          , Morgan Kaufmann,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramos</surname>
          </string-name>
          , T. Hoefler,
          <article-title>Modeling communication in cache-coherent SMP systems: a case study with Xeon Phi, in: Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing</article-title>
          ,
          <source>HPDC '13</source>
          , Association for Computing Machinery,
          <year>2013</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          . doi:10.1145/2462902.2462916.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Käsgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hochberger</surname>
          </string-name>
          ,
          <article-title>A Coarse-Grained Reconfigurable Array for High-Performance Computing Applications</article-title>
          , in: 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2018, pp. 1-4. doi:10.1109/RECONFIG.2018.8641720.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] M. Dávila, R. Nozal, R. Gran Tejero, M. Villarroya, D. Suárez Gracia, J. Bosque,
          <article-title>Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL</article-title>
          ,
          <source>The Journal of Supercomputing</source>
          <volume>75</volume>
          (
          <year>2019</year>
          ). doi:10.1007/s11227-019-02768-y.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] T. Boku, N. Fujita, R. Kobayashi, O. Tatebe,
          <article-title>Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling</article-title>
          , in: Workshop Proceedings of the 51st International Conference on Parallel Processing, ICPP Workshops '22, Association for Computing Machinery,
          <year>2023</year>
          , pp. 1-8. doi:10.1145/3547276.3548629.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] M. Tibaldi, C. Pilato,
          <article-title>A Survey of FPGA Optimization Methods for Data Center Energy Efficiency</article-title>
          ,
          <source>IEEE Transactions on Sustainable Computing</source>
          (
          <year>2023</year>
          ) 343-362. doi:10.1109/TSUSC.2023.3273852.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>