=Paper= {{Paper |id=Vol-2382/ICT4S2019_paper_9 |storemode=property |title=Energy Efficiency Platform Characterization for Heterogeneous Multicore Architectures |pdfUrl=https://ceur-ws.org/Vol-2382/ICT4S2019_paper_9.pdf |volume=Vol-2382 |authors=Hergys Rexha,Sébastien Lafond |dblpUrl=https://dblp.org/rec/conf/ict4s/RexhaL19 }} ==Energy Efficiency Platform Characterization for Heterogeneous Multicore Architectures== https://ceur-ws.org/Vol-2382/ICT4S2019_paper_9.pdf
      Energy Efficiency Platform Characterization for
          Heterogeneous Multicore Architectures
                               Hergys Rexha                                             Sébastien Lafond
                   Faculty of Science and Engineering                         Faculty of Science and Engineering
                         Åbo Akademi University                                    Åbo Akademi University
                              Turku, Finland                                             Turku, Finland
                              hrexha@abo.fi                                              slafond@abo.fi



   Abstract—Runtime estimation of power dissipation and performance is crucial in every computing platform. In mobile systems, a special focus is set on energy efficiency in order to achieve the longest possible battery life while at the same time adhering to performance requirements. Powered by heterogeneous SoCs, mobile systems are called to reach an energy-efficient state of execution, with a runtime system or scheduler that requires knowledge of the current performance and power dissipation. Today, highly heterogeneous architectures provide many actuators to reach better efficiency, the effect of which is usually unknown at runtime. In this paper, we propose a fast approach to build an energy efficiency model based on hardware performance counters. Our approach obviates the need for power sensors present at the chip level and deals with high numbers of execution modes. In building the energy efficiency model we account for the change in temperature which, as we show, has an impact on the optimal energy efficiency choice. The proposed approach significantly reduces the time to characterize the energy efficiency of a Multiprocessor System-on-Chip (MPSoC) and includes the environment temperature as a variable in determining the energy efficiency.

   Index Terms—MPSoC, energy efficiency models, platform configuration point, PMC, power models

                       I. INTRODUCTION

   The past years have seen rapid development in the amount of data produced, processed and exchanged through computing systems, ranging from high-end server farms to simple household devices, and the trend of technology seems to fuel this direction even more. Based on electricity usage ascribed to Information and Communication Technology (ICT), it is predicted that by the end of 2030 this sector will use as much as 51% of global electricity production [5]. Following this scenario, by the year 2030 the ICT industry alone will be responsible for up to 23% of the globally released greenhouse gas emissions [5]. A 2016 report [24] says that US datacenters held 350 million terabytes of data in 2015, and that by 2020 they will require 100TWh of electricity to operate. This is the equivalent of 7 nuclear power stations like Olkiluoto 3 in Finland. There is also an increase in datacenter capacity in Europe, with London, Frankfurt, Paris, and Amsterdam growing their electricity consumption by 200MW in 2017. European countries like Ireland and Denmark are becoming a data base for the world's biggest tech companies and in the next 5 years promise to increase the power consumption by 1TW [12]. The emergence of the Internet of Things (IoT), with devices operating at the edge of the network, poses a new challenge to the Cloud to provide efficient service provisioning. IoT devices are low-powered devices and their usage promises to decrease the overall power consumption by increasing energy efficiency, but their number could be overwhelming, with the consequence of a "rebound effect" [9]. Cisco predicts that by the year 2020 there will be 50 billion IoT devices in the world, which is an order of magnitude more than the number of smartphones and tablets working today. In this scenario, using the cloud services offered by large datacenters to receive the data generated by IoT devices will not be a sustainable solution in terms of cost, latency, and environmental impact [6]. Recently the idea of edge devices that provide computation and storage closer to the source of data has been formulated under the term Edge or Fog computing [25]. Examples of edge devices are smartphones, as intermediaries between body sensors and cloud services, gateways as intermediaries for smart homes, or nano data centers that manage the caching or processing of video content. By using these edge devices in the proximity of data sources, we could obtain a reduction of energy consumption w.r.t. implementing the logic in the cloud, while at the same time meeting the latency requirements of certain applications [17].

   Therefore one key requirement of such computing systems is undoubtedly energy efficiency. Basically, this means that systems should minimize the energy they consume to complete the required task and achieve a satisfying energy proportionality [20]. One of the largest consumers of energy in computing environments is the CPU [8], which requires special attention, especially in the multicore era. Today mobile devices are using the same CPU as traditional gateways or cloudlets in Edge Computing. The need to achieve energy efficiency in today's MPSoCs is stringent, especially for mobile devices that operate on battery, a clear scenario where the end user wants a better experience and longer battery life.

   Workload variability makes the control of energy expenditure especially difficult in mobile CPUs. Mobile devices are not the only ones that require energy-efficient solutions; cloud providers also need to lower the energy cost of
computations and cooling [19]. Today large-scale computing facilities are using energy as a resource to be scheduled and charge according to the energy consumption [14]. Heterogeneity shows promise to increase the energy efficiency levels achieved in MPSoCs, hence several paths have been followed by research and industry. For example, heterogeneity has been explored inside the CPU chip by using multiple technologies with different power and performance characteristics, or by using cores that alternatively behave as out-of-order computing elements or as in-order cores [22]. Probably one of the most popular and researched types of heterogeneity is the one provided by different computing cores integrated into the same physical chip. This type of heterogeneity is the one where computing cores share the same Instruction Set Architecture (ISA) but have different microarchitectures. However, an intelligent use of these power and performance tradeoffs proves to be not a simple challenge [23]. Being able to predict the optimal choice between a number of hardware actuators such as the number of cores, type of core and operating performance point, or Dynamic Voltage and Frequency Scaling (DVFS), is a difficult task that must be handled well in order to achieve energy efficiency.

   With asymmetric multiprocessing (AMP) architecture there is a better way to respond to the diversity of applications present in the mobile environment. We have compute-intensive applications which need to produce results in real time and must use fast cores in order to meet the deadlines. On the other side, background processes that may be memory bound require little computation and are more suitable to run on simple cores that achieve better levels of energy efficiency. Even within a single application, we have different "windows of activity" which may require varying levels of computing intensity, e.g. reading, scrolling, responding through different messages inside a social media application. Recently industry has moved towards increasing the level of heterogeneity found inside a single chip. Examples range from ARM big.LITTLE with two types of cores to the Mediatek tri-cluster MPSoC [16], which promise to increase performance and reduce power dissipation. DynamIQ from ARM [1] advances the concept of big.LITTLE by providing better flexibility in the cluster organization and frequency setting.

   The high levels of heterogeneity present in recent embedded architectures enlarge the design space that must be explored to find an efficient use of platform actuators. By increasing the number and type of cores and the number of voltage and frequency levels for each computing element, there is an increasing number of operating points on which the platform may perform. In this scenario making the right choice for execution could have a tremendous impact on energy efficiency. Temperature also has a major effect on the power dissipation of today's systems [15], which makes it an important factor to account for in order to make the optimal energy-efficient choice.

   To manage efficiently the workload scenarios faced by mobile devices, edge devices in IoT, or nano data centers, there is a need to continuously monitor power data in order to choose the optimal power and performance trade-off. Unfortunately, most of the hardware platforms today are not equipped with power sensors, which significantly complicates energy-efficient management of the system settings.

   This paper follows our previous work, which experimentally builds an energy efficiency model based on platform configuration points for the ARM big.LITTLE architecture [21]. As a platform configuration point we denoted the set of platform actuators such as number and type of cores, core performance level or DVFS, and core utilization level. The model is derived by testing all the possible configuration points of the platform. Following the recent trend in platform complexity, this approach is difficult to apply because of the combinatorial explosion in the number of configuration points. The goal of this paper is to explore new approaches to providing knowledge of the platform energy efficiency to a runtime system based on the concept of platform configuration points. We redefine the set of parameters in the configuration point by removing the utilization level from the aforementioned description. The meaning of the notion of platform configuration point is demonstrated with several examples (from x to v) in a multicore platform (Figure 1). In our energy efficiency model, we account for the environment temperature variable, which provides valuable information for the correct accounting of the CPU dissipated power. Knowing the large impact that static power has on the energy efficiency achieved in today's CPUs, the second purpose of this work is to build thermally aware energy efficiency models.

   [Figure 1: six example configurations (x, y, z, t, u, v) of the MPSoC, each drawn as a different set of active cores labelled fB (big cores) and fs (small cores).]
   Fig. 1. Examples of possible platform configuration points in a multicore architecture

   The contributions of this paper are the following:
   • we propose an approach to characterize the energy efficiency of a hardware platform based on the notion of configuration points.
   • we include environment temperature in the energy efficiency model and show the impact this variable has on the relative efficiency of the points from the model.
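
To illustrate why exhaustively testing every configuration point scales poorly, the following minimal sketch enumerates configuration points (number of active cores per type plus a per-cluster DVFS level) for a two-cluster platform. The core counts and frequency lists are hypothetical values chosen for illustration, not the operating points of the platform studied in this paper, and `ConfigPoint` and `enumerate_points` are names introduced only for this example.

```python
from itertools import product
from typing import Iterator, List, NamedTuple


class ConfigPoint(NamedTuple):
    """One platform configuration point (utilization level is not included)."""
    big_cores: int
    big_freq_mhz: int    # 0 when no big core is active
    small_cores: int
    small_freq_mhz: int  # 0 when no small core is active


def enumerate_points(max_big: int, big_freqs: List[int],
                     max_small: int, small_freqs: List[int]) -> Iterator[ConfigPoint]:
    """Yield every point with at least one active core.

    Assumption: DVFS is applied per cluster, so all active cores of one
    type share a single frequency.
    """
    # Each cluster is either switched off or runs n cores at frequency f.
    big_options = [(0, 0)] + [(n, f) for n in range(1, max_big + 1) for f in big_freqs]
    small_options = [(0, 0)] + [(n, f) for n in range(1, max_small + 1) for f in small_freqs]
    for (nb, fb), (ns, fs) in product(big_options, small_options):
        if nb + ns > 0:  # skip the point where the whole platform is off
            yield ConfigPoint(nb, fb, ns, fs)


# Hypothetical platform: 4 big + 4 small cores and a few DVFS levels per cluster.
BIG_FREQS_MHZ = [800, 1200, 1600, 2000]
SMALL_FREQS_MHZ = [600, 1000, 1400]

points = list(enumerate_points(4, BIG_FREQS_MHZ, 4, SMALL_FREQS_MHZ))
print(len(points))  # 220
```

Even with these small assumed lists the enumeration yields 220 points, and every additional cluster or frequency level multiplies the count, which is the combinatorial growth that makes exhaustive characterization impractical.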



                    II. RELATED WORK

   The usage of platform actuators for energy management has been explored in several research works. The authors in [23], [10], and [18] all propose the creation of a runtime system which is able to manage the scheduling and mapping of threads dynamically with the objective of maximizing the energy efficiency of the MPSoC. In [23] a load balancer schedules the workload in periodic time frames called epochs, in each of which a set of actions is performed to place the threads on the appropriate core type. The platform considered is highly heterogeneous with 4 types of core, and in each epoch the load balancer estimates the performance and power of every thread on each core type. This information is used by the internal algorithm to decide where to map the threads. Similarly, [18] proposes a runtime scheme which is used to dynamically schedule workloads in an MPSoC. The approach is based on the sense-decide-act policy and operates on an aggressively heterogeneous environment. It uses regression models for estimating the performance and power of threads on different core types and also the contribution of a thread to the total load of a core. An evolutionary algorithm is used to decide in each term the scheduling of the threads. The authors in [10] propose a run-time task allocation approach called SPARTA which categorizes tasks as compute bound or memory bound, and a heuristic that selects the configuration that achieves the requested throughput with the minimal power consumption. These works do not consider DVFS as a mechanism to reduce power consumption, and the hardware counters used for estimating performance are not easily found in real hardware platforms. Sensors for estimating the power consumption of different mapping decisions are not available in many of today's platforms. Finding the optimal configuration for executing workloads in a data center in order to achieve better energy efficiency is the goal presented in [11]. The authors present a programming and execution platform called Empya that uses hardware and software techniques to determine the best trade-off between performance and energy consumption. The run-time system continuously monitors application performance and energy consumption through Running Average Power Limit (RAPL) registers. As actuators, the system operates on the number of threads to use and the power cap on the CPU. In contrast, our work focuses on heterogeneous platforms where, to achieve energy efficiency, we use actuators such as number and type of cores and DVFS point. In [26] the authors again target High-Performance Computing applications running on a single node, with the goal of reducing the energy consumption by choosing the right configuration, which is composed of the number of cores and the DVFS level. The work is based on an application-agnostic power model, and the performance model of the application is obtained with a supervised learning method of regression. Frequency, number of cores and input size are used in the regression model. The methodology is clear and straightforward, but there is no mention of the performance requirement, which is the value we trade off for less energy consumption.

                    III. CMOS POWER DISSIPATION

   CMOS technology has been mostly used in MPSoCs because it has quite good noise immunity and low heat production while the device is in operation mode. Power in these circuits can be divided into two categories: dynamic power and static power. Dynamic power is created by the circuit activity (transistor switching) and is dependent on the usage scenario, clock rates, and I/O activity. Switching power is dissipated while a transistor changes from 0 to 1 and vice versa. The dynamic power is defined as:

$$P_{dynamic} = \alpha \cdot C \cdot V_{DD}^{2} \cdot f_{clk} \qquad (1)$$

where C is the load capacitance, VDD is the source voltage, α is the activity factor and fclk is the operating frequency. Static power is dissipated due to the leakage currents on the transistors while they are in the "OFF" mode. There are several sources of leakage current, which are strongly influenced by the chip temperature. The dynamic part of the power dissipated from the chip is modeled by two terms in Equation 2: a dynamic activity term, which relates to the active running workloads, and a background activity term, which represents the system processes that run in the background. In Equation 3 the dynamic power is modeled by a single term due to the low power dissipated by background processes in the A7 cluster. Static power is modeled by the third term in Equation 2 and is dependent on temperature and the supply voltage. For the A7 cluster, there is no temperature sensor to monitor, hence the static part is modeled together with the dynamic power dissipation of background activity.

                    IV. PROPOSED APPROACH

   Today embedded systems face a multitude of working scenarios that range from bursts of high-performance requests to low-power operation modes, going through the need to provide sustainable performance in thermally constrained situations. To manage such a number of use cases efficiently, the runtime scheduling manager needs up-to-date information about the effect of changing different actuators on the running applications. Thus there is a need for an energy efficiency model which is based on the current runtime power data. The envisioned system diagram is shown in Figure 2, where our work in this paper is focused on providing the platform configuration points database that helps the scheduler reach the optimal efficiency level of the running applications.

   The work in this paper is based on power models for mobile CPUs built on hardware performance counters (HPC). The methodology for building such models is adopted from [27], which presents a statistical method for identifying and using hardware counters. Their analyses propose the usage of counters which show a high correlation to power and have also the



smallest multicollinearity. The authors in [27] show that this brings high model stability with an average error of 3.8%.

               Fig. 2. Proposed Approach schematics.

   We start by building power models for two popular ARMv7-A architecture CPUs, the ARM Cortex-A7 and the ARM Cortex-A15. The micro-architecture limits the number of events which can be sampled at once: 6 counters for the A15 and 4 counters for the A7, plus the cycle counter. The goal is to search for those events which have the highest correlation with power dissipation and at the same time show the smallest intercorrelation with each other. To have high model stability the predictors should be chosen to keep low levels of multicollinearity in multivariate models. First, the correlation of all available events with the power is measured; then the counters are divided into clusters which include events with high intercorrelation. Then, from each cluster the event is selected which has the most impact on the power dissipation while keeping a low Variance Inflation Factor (VIF). The total number of events is 40 for the A7 and 120 for the A15; among these, 7 are selected for the A15 and 5 for the A7. The events used in the models are general and can be found on most core types used in mobile systems. For each core type, the events are listed in Table I. The power for the A15 and A7 is divided into dynamic and static, plus the background power which is related to the operating system activities.

                           TABLE I
          HARDWARE EVENTS USED IN THE POWER MODELS

  Nr   ARM Cortex-A7             ARM Cortex-A15
  1    L2D CACHE ACCESS:0x16     L2D CACHE LD:0x50
  2    MEM ACCESS:0x13           DP SPEC:0x73
  3    L1I CACHE ACCESS:0x14     L1I CACHE ACCESS:0x14
  4    UNALIGNED LDST:0x0F       UNALIGNED LDST SP:0x6A
  5    CYCLE COUNT:0x11          BUS ACCESS:0x19
  6                              INST SPEC:0x1B
  7                              CYCLE COUNT:0x11

   The modelled formula for the power dissipation is shown in Equations 2 and 3,

$$P_{A15} = \underbrace{\left(\sum_{n=0}^{N-1} \beta_n E_n V_{DD}^{2} f_{clk}\right)}_{\text{dynamic activity}} + \underbrace{\beta_b V_{DD}^{2} f_{clk}}_{\text{BG dynamic}} + \underbrace{f(V_{DD}, T)}_{\text{static}} \qquad (2)$$

$$P_{A7} = \underbrace{\left(\sum_{n=0}^{N-1} \beta_n E_n V_{DD}^{2} f_{clk}\right)}_{\text{dynamic activity}} + \underbrace{f(V_{DD}, f_{clk})}_{\text{static and BG dynamic}} \qquad (3)$$

where N is the number of events selected, βn is the weight given to a certain event, En is the number of events per second divided by the frequency (fclk) in MHz, VDD is the operating voltage and T is the temperature of the core.

   The power model for the A15 has a thermal compensation term for calculating the static power and background dissipated power when the system is idling (Equation 2). In the power model for the A7 the static and background power are included in the second term of Equation 3. This is related to the absence of a thermal monitoring sensor in the A7 cluster. We have calculated four sets of model coefficients for the parameters in each cluster, representing the power with a different number of cores for each CPU type. The model parameters for each core type are given in Tables II and III. The tables show the event rate divided by the frequency in MHz, the weight given to each coefficient and the statistical significance. In some model terms, f and V are respectively the operating frequency and voltage of each cluster (Table IV). The event rates are divided by the operating frequency in order to avoid correlation with it in the first term of the power equations. The power models need to be obtained only once, by running on the target platform a set of representative embedded workloads which we call the platform characterization set. After obtaining the power model we compute the energy efficiency table, which provides a sort of database of all the possible platform configuration points and the resulting performance, power and energy efficiency values. With this information the runtime system is able to make decisions about the mapping of a certain application with regard to performance. If there is a change in the environment temperature above a certain threshold, then the power dissipation can be recomputed and the table is redefined for the new thermal level.

   These models are built by running the characterization workload set at each of the operating points of both CPUs. The set contains workloads that test different levels of the microarchitecture and memory subsystem. It is composed in part of real applications from the embedded domain and in part of synthetic benchmarks designed to stress specific parts of the CPU. Having the power models and by measuring the performance in terms of instructions per second (IPS) we can obtain an energy efficiency model of the platform. The model is presented as a table that lists all the platform configuration points with the energy efficiency levels achieved in



                         TABLE II                                         The runtime system inputs temperature variations inside the
         M ODEL PARAMETERS AND P - VALUES FOR THE A15                  model and can recompute the energy efficiency table by taking
  Nr               Coefficient                 Weight    p-Value       into account the new level of static power. The new table
   1                Intercept                   -5e-4     p