Energy Efficiency Platform Characterization for Heterogeneous Multicore Architectures

Hergys Rexha (hrexha@abo.fi) and Sébastien Lafond (slafond@abo.fi)
Faculty of Science and Engineering, Åbo Akademi University, Turku, Finland

Abstract—Runtime estimation of power dissipation and performance is crucial in every computing platform. In mobile systems, special focus is set on energy efficiency in order to achieve the longest possible battery life while adhering to performance requirements. Powered by heterogeneous SoCs, mobile systems are called to reach an energy efficient state of execution, with a runtime system or scheduler that requires knowledge of the current performance and power dissipation. Today, highly heterogeneous architectures provide many actuators to reach better efficiency, the effect of which is usually unknown at runtime. In this paper, we propose a fast approach to build an energy efficiency model based on hardware performance counters. Our approach obviates the need for power sensors present at the chip level and deals with high numbers of execution modes. In building the energy efficiency model we account for the change in temperature which, as we show, has an impact on the optimal energy efficiency choice. The proposed approach significantly reduces the time to characterize the energy efficiency of a Multiprocessor System-on-Chip (MPSoC) and includes the environment temperature as a variable in determining the energy efficiency.

Index Terms—MPSoC, energy efficiency models, platform configuration point, PMC, power models

I. INTRODUCTION

The past years have seen rapid development in the amount of data produced, processed and exchanged through computing systems, ranging from high-end server farms to simple household devices, and the trend of technology seems to fuel this direction even more. Based on electricity usage ascribed to Information and Communication Technology (ICT), it is predicted that by the end of 2030 this sector will use as much as 51% of global electricity production [5]. Following this scenario, by the year 2030, the ICT industry alone will be responsible for up to 23% of the globally released greenhouse gas emissions [5]. A 2016 report [24] says that the US datacenters held 350 million terabytes of data in 2015, and by 2020 they will require 100TWh of electricity to operate. This is the equivalent of 7 nuclear power stations like Olkiluoto 3 in Finland. There is also an increase of datacenter capacity in Europe, with London, Frankfurt, Paris, and Amsterdam growing their electricity consumption by 200MW in 2017. Countries like Ireland and Denmark in Europe are becoming a data base for the world's biggest tech companies and within the next 5 years promise to increase the power consumption by 1TW [12]. The emergence of the Internet of Things (IoT), with devices operating at the edge of the network, poses a new challenge to the Cloud to provide efficient service provisioning. IoT devices are low powered devices and their usage promises to decrease the overall power consumption by increasing energy efficiency, but their number could be overwhelming, with the consequence of having a "rebound effect" [9]. Cisco predicts that by the year 2020 there will be 50 billion IoT devices in the world, which is an order of magnitude bigger than the number of smartphones and tablets working today. So in this scenario, using the cloud services offered by large datacenters to receive the data generated by IoT devices will not be a sustainable solution in terms of cost, latency, and environmental impact [6]. Recently the idea of edge devices that provide the computation and storage closer to the source of data has been formulated under the term of Edge or Fog computing [25]. As examples of edge devices, we can mention smartphones, as intermediates between body sensors and the cloud services, gateways as intermediates for smart homes, or nano data centers that manage the caching or processing of video contents. By using these edge devices in the proximity of data sources, the end result could be a reduction of energy consumption w.r.t. implementing the logic in the cloud, while at the same time keeping the latency requirements of certain applications [17].

Therefore one key requirement of such computing systems is undoubtedly energy efficiency. Basically, this means that systems should minimize their energy consumption to complete the required task and achieve a satisfying energy proportionality [20]. One of the largest consumers of energy in computing environments is the CPU [8], which requires special attention especially in the multicore era. Today mobile devices are using the same CPU as traditional gateways or cloudlets in Edge Computing. The need to achieve energy efficiency in today's MPSoC is stringent, especially for mobile devices that operate on battery, and that is a clear scenario where the end user wants a better experience and longer battery life.

Workload variability makes the control of energy expenditure especially difficult in mobile CPUs. Mobile devices are not the only ones that require energy efficient solutions; cloud providers also need to lower the energy cost of computations and cooling [19]. Today large scale computing facilities are using energy as a resource to be scheduled and are charging according to the energy consumption [14]. Heterogeneity shows a promise to increase the energy efficiency levels achieved in MPSoC, hence several paths have been followed by research and industry. Examples include exploring heterogeneity inside the CPU chip by using multiple technologies with different power and performance characteristics, or using cores that alternatively behave as out-of-order computing elements or as in-order cores [22]. Probably one of the most popular and researched types of heterogeneity is the one provided by different computing cores integrated into the same physical chip. This type of heterogeneity is the one where computing cores share the same Instruction Set Architecture (ISA) but have different microarchitectures. However, an intelligent use of these power and performance tradeoffs proves to be no simple challenge [23]. Being able to predict the optimal choice between a number of hardware actuators such as the number of cores, type of core and operating performance point, or Dynamic Voltage and Frequency Scaling (DVFS), is a difficult task that must be handled well in order to achieve energy efficiency.

With asymmetric multiprocessing (AMP) architecture there is a better way to respond to the diversity of applications present in the mobile environment. We have compute-intensive applications which need to produce results in real time and must use fast cores in order to meet the deadlines. On the other side, background processes that may be memory bound require little computation and are more suitable to run on simple cores that achieve better levels of energy efficiency. Even within a single application, we have different "windows of activity" which may require varying levels of computing intensity, e.g. reading, scrolling, responding through different messages inside a social media application. Recently industry has moved towards increasing the level of heterogeneity found inside a single chip, from examples such as ARM big.LITTLE with two types of cores, to the Mediatek tri-cluster MPSoC [16], which promise to increase performance and reduce power dissipation. DynamIQ from ARM [1] advances the concept of big.LITTLE by providing better flexibility in the cluster organization and frequency setting.

The high levels of heterogeneity present in recent embedded architectures increase the design space that must be explored to find an efficient use of platform actuators. By increasing the number and type of cores and the number of voltage and frequency levels for each computing element, there is an increasing number of operating points on which the platform may perform. In this scenario making the right choice for execution could have a tremendous impact on energy efficiency. Temperature also has a major effect on the power dissipation of today's systems [15], which makes it an important factor to account for in order to make the optimal energy efficient choice.

To manage efficiently the workload scenarios faced by mobile devices, edge devices in IoT, or nano data centers, there is a need to continuously monitor power data in order to choose the optimal power and performance trade-off. Unfortunately, most of the hardware platforms today are not equipped with power sensors, which significantly complicates energy-efficient management of the system settings.

This paper follows our previous work, which experimentally builds an energy efficiency model based on platform configuration points for the ARM big.LITTLE architecture [21]. As platform configuration point we denoted the set of platform actuators such as number, type of core, core performance level or DVFS and core utilization level. The model is derived by testing all the possible configuration points of the platform. Following the recent trend in platform complexity, this approach is difficult to apply because of the combinatorial explosion in the number of configuration points. The goal of this paper is to explore new approaches in providing knowledge of the platform energy efficiency to a runtime system based on the concept of platform configuration points. We redefine the set of parameters in the configuration point by removing utilization level from the aforementioned description. The meaning of the notion of platform configuration point is demonstrated with several examples (from x to v) in a multicore platform (Figure 1). In our energy efficiency model, we account for the environment temperature variable, which provides valuable information for the correct accounting of the CPU dissipated power. Knowing the large impact that static power has on the energy efficiency achieved in today's CPUs, the second purpose of this work is to build thermally aware energy efficiency models.

Fig. 1. Examples of possible platform configuration points in a multicore architecture. [Panels: Configuration x, y, z, t, u and v of an MPSoC with big cores and small cores, labeled fB and fs.]

The contributions of this paper are the following:
• we propose an approach to characterize the energy efficiency of a hardware platform based on the notion of configuration points.
• we include environment temperature in the energy efficiency model and show the impact this variable has on the relative efficiency of the points from the model.

II. RELATED WORK

The usage of platform actuators for energy management has been studied by different research works. The authors in [23], [10], and [18] all propose the creation of a runtime system which is able to manage the scheduling and mapping of threads dynamically with the objective of maximizing the energy efficiency of MPSoC. In [23] a load balancer schedules the workload in periodic time frames called epochs; in each epoch, a set of actions is performed to place the threads on the appropriate core type. The platform considered is highly heterogeneous with 4 types of core and in each epoch the load balancer estimates the performance and power of every thread in each core type. This information is used by the internal algorithm to decide where to map the threads. Similarly, a runtime scheme is proposed in [18] which is used to dynamically schedule workloads in an MPSoC. The approach is based on the sense-decide-act policy and operates on an aggressive heterogeneous environment. It uses regression models for estimating performance and power of threads in different core types and also the contribution of a thread to the total load of a core. An evolutionary algorithm is used to decide in each term the scheduling of the threads. The authors in [10] propose a run-time task allocation approach called SPARTA which categorizes tasks as compute bound or memory bound and a heuristic that selects the configuration that achieves the requested throughput with the minimal power consumption. These works do not consider the possibility of DVFS as a mechanism to reduce power consumption, and the hardware counters used for estimating performance are not easily found in real hardware platforms. Sensors for estimating the power consumption of different mapping decisions are not available in many of today's platforms. Finding the optimal configuration for executing workloads in a data-center in order to achieve better energy efficiency is the goal presented in [11]. The authors present a programming and execution platform called Empya that uses hardware and software techniques to determine the best trade-off between performance and energy consumption. The run-time system continuously monitors application performance and energy consumption through Running Average Power Limit (RAPL) registers. As actuators, the system operates on the number of threads to use and the power cap on the CPU. In contrast with this, our work focuses on heterogeneous platforms where, for achieving energy efficiency, we use actuators such as number, type of core and DVFS point. In [26] the authors again target High-Performance Computing applications running on a single node with the goal of reducing the energy consumption by choosing the right configuration, which is composed of the number of cores and DVFS level. The work is based on an application-agnostic power model, and the performance model of the application is obtained with a supervised learning method of regression. Frequency, number of cores and input size are used in the regression model. The methodology is clear and straightforward, but there is no mention of the performance requirement, which is the value we trade off for less energy consumption.

III. CMOS POWER DISSIPATION

CMOS technology has been mostly used in MPSoCs due to the fact that it has quite good noise immunity and low heat production while the device is in operation mode. Power in these circuits can be divided into two categories: dynamic power and static power. Dynamic power is created by the circuit activity (transistor switching) and is dependent on the usage scenario, clock rates, and I/O activity. Switching power is dissipated when a transistor changes from 0 to 1 and vice versa. The dynamic power is defined as:

$$P_{dynamic} = \alpha \cdot C \cdot V_{DD}^{2} \cdot f_{clk} \qquad (1)$$

where C is the load capacitance, V_DD is the source voltage, α is the activity factor and f_clk is the operating frequency.

Static power is dissipated due to the leakage currents on the transistors while they are in the "OFF" mode. There are several sources of the leakage current, which are strongly influenced by the chip temperature. The dynamic part of the power dissipated from the chip is modeled by two terms in Equation 2, as a dynamic activity which relates to the active running workloads and the background activity that represents the system processes that run in the background. In Equation 3 the dynamic power is modeled by a single term due to the low power dissipated by background processes in the A7 cluster. Static power is modeled by the third term in Equation 2 and is dependent on temperature and the supply voltage. For the A7 cluster, there is no temperature sensor to monitor, hence the static part is modeled together with the dynamic power dissipation of background activity.

IV. PROPOSED APPROACH

Today embedded systems face a multitude of working scenarios that range from bursts of high performance requests to low power operation modes, going through the need to provide sustainable performance in thermally constrained situations. To manage such a number of use cases efficiently, the runtime scheduling manager needs to have refreshed information about the effect of changing different actuators on the running applications. Thus there is a need for an energy efficiency model which is based on the current runtime power data. The envisioned system diagram is shown in Figure 2, where our work in this paper is focused on providing the platform configuration points database for helping the scheduler decisions in reaching the optimal efficiency level of the running applications.

Fig. 2. Proposed Approach schematics.

The work in this paper is based on power models for mobile CPUs based on hardware performance counters (HPC). The methodology for building such models is adopted from [27], which presents a statistical method for identifying and using hardware counters. Their analyses propose the usage of counters which show a high correlation to power and also have the smallest multicollinearity. The authors in [27] show that this brings high model stability with an average error of 3.8%. We start by building power models for two popular ARM v7a architecture CPUs, which are ARM Cortex-A7 and ARM Cortex-A15. The micro-architecture limits the number of events which can be sampled at once: 6 counters for A15 and 4 counters for A7 plus the cycle counter. The goal is to search for those events which have the highest correlation with power dissipation and at the same time show the smallest intercorrelation with each other. To have high model stability the predictors should be chosen to keep low levels of multicollinearity in multivariate models. First, the correlation of all available events with the power is measured, then the counters are divided into clusters which include events with high intercorrelation. Then, from each cluster the event is selected which has the most impact on the power dissipation while keeping a low Variance Inflation Factor (VIF). The total amount of events for the A7 is 40 and for the A15 is 120; among these, 7 are selected for the A15 and 5 for the A7. The events used in the models are general and can be found on most core types used in mobile systems. For each core type, the events are listed in Table I. The power for A15 and A7 is divided into dynamic and static, plus the background power which is related to the operating system activities.

TABLE I
HARDWARE EVENTS USED IN THE POWER MODELS

Nr   ARM Cortex-A7            ARM Cortex-A15
1    L2D CACHE ACCESS:0x16    L2D CACHE LD:0x50
2    MEM ACCESS:0x13          DP SPEC:0x73
3    L1I CACHE ACCESS:0x14    L1I CACHE ACCESS:0x14
4    UNALIGNED LDST:0x0F      UNALIGNED LDST SP:0x6A
5    CYCLE COUNT:0x11         BUS ACCESS:0x19
6                             INST SPEC:0x1B
7                             CYCLE COUNT:0x11

The modelled formula for the power dissipation is shown in Equations 2 and 3,

$$P_{A15} = \underbrace{\Big(\sum_{n=0}^{N-1} \beta_n E_n V_{DD}^{2} f_{clk}\Big)}_{\text{dynamic activity}} + \underbrace{\beta_b V_{DD}^{2} f_{clk}}_{\text{BG dynamic}} + \underbrace{f(V_{DD}, T)}_{\text{static}} \qquad (2)$$

$$P_{A7} = \underbrace{\Big(\sum_{n=0}^{N-1} \beta_n E_n V_{DD}^{2} f_{clk}\Big)}_{\text{dynamic activity}} + \underbrace{f(V_{DD}, f_{clk})}_{\text{static and BG dynamic}} \qquad (3)$$

where N is the number of events selected, β_n is the weight given to a certain event, E_n is the number of events per second divided by the frequency (f_clk) in MHz, V_DD is the operating voltage and T is the temperature of the core.

The power model for the A15 has a thermal compensation term for calculating the static power and background dissipated power when the system is idling (Equation 2). In the power model for A7 the static and background power are included in the second term of Equation 3. This is related to the absence of a thermal monitoring sensor in the A7 cluster. We have calculated four sets of model coefficients for the parameters in each cluster, representing the power with a different number of cores for each CPU type. The model parameters for each core type are given in Tables II and III. The tables show the event rate divided by the frequency in MHz, the weight given to each coefficient and the statistical significance. In some model terms, f and V are respectively the operating frequency and voltage of each cluster (Table IV). The event rates are divided by the operating frequency in order to avoid correlation with it in the first term of the power equations. The power models need to be obtained only once by running on the target platform a set of embedded representative workloads which we call the platform characterization set. After obtaining the power model we compute the energy efficiency table, which provides a sort of database of all the possible platform configuration points and the resulting performance, power and energy efficiency values. By having this information the runtime system is able to make decisions about the mapping of a certain application with regard to the performance. If there is a change in the environment temperature above a certain threshold, then the power dissipation can be recomputed and the table is redefined for the new thermal level.

These models are built by running the characterization workload set in each of the operating points of both CPUs. The set contains workloads that test different levels of the microarchitecture and memory subsystem. It is composed in part of real applications from the embedded domain, and in part of synthetic benchmarks designed to stress specific parts of the CPU. Having the power models and by measuring the performance in terms of instructions per second (IPS) we can obtain an energy efficiency model of the platform. The model is presented as a table that lists all the platform configuration points with the energy efficiency levels achieved in

The runtime system inputs temperature variations inside the model and can recompute the energy efficiency table by taking into account the new level of static power. The new table

TABLE II
MODEL PARAMETERS AND P-VALUES FOR THE A15

Nr   Coefficient   Weight   p-Value
1    Intercept     -5e-4    p