Energy Efficiency Platform Characterization for Heterogeneous Multicore Architectures

Hergys Rexha, Faculty of Science and Engineering, Åbo Akademi University, Turku, Finland, hrexha@abo.fi
Sébastien Lafond, Faculty of Science and Engineering, Åbo Akademi University, Turku, Finland, slafond@abo.fi


Abstract—Runtime estimation of power dissipation and performance is crucial in every computing platform. In mobile systems, a special focus is set on energy efficiency in order to achieve the longest possible battery life while still adhering to performance requirements. Powered by heterogeneous SoCs, mobile systems are expected to reach an energy-efficient state of execution, with a runtime system or scheduler that requires knowledge of the current performance and power dissipation. Today, highly heterogeneous architectures provide many actuators to reach better efficiency, the effect of which is usually unknown at runtime. In this paper, we propose a fast approach to build an energy efficiency model based on hardware performance counters. Our approach obviates the need for power sensors at the chip level and deals with high numbers of execution modes. In building the energy efficiency model we account for the change in temperature which, as we show, has an impact on the optimal energy efficiency choice. The proposed approach significantly reduces the time to characterize the energy efficiency of a Multiprocessor System-on-Chip (MPSoC) and includes the environment temperature as a variable in determining the energy efficiency.

Index Terms—MPSoC, energy efficiency models, platform configuration point, PMC, power models

I. INTRODUCTION

The past years have seen rapid development in the amount of data produced, processed and exchanged through computing systems, ranging from high-end server farms to simple household devices, and the trend of technology seems to fuel this direction even more. Based on electricity usage ascribed to Information and Communication Technology (ICT), it is predicted that by the end of 2030 this sector will use as much as 51% of global electricity production [5]. Following this scenario, by the year 2030, the ICT industry alone will be responsible for up to 23% of the globally released greenhouse gas emissions [5]. A 2016 report [24] says that US datacenters held 350 million terabytes of data in 2015, and that by 2020 they will require 100TWh of electricity to operate. This is the equivalent of 7 nuclear power stations like Olkiluoto 3 in Finland. Datacenter capacity is also increasing in Europe, where London, Frankfurt, Paris, and Amsterdam grew their electricity consumption by 200MW in 2017. European countries like Ireland and Denmark are becoming a data base for the world's biggest tech companies and promise to increase their power consumption by 1TW over the next 5 years [12]. The emergence of the Internet of Things (IoT), with devices operating at the edge of the network, poses a new challenge to the Cloud to provide efficient service provisioning. IoT devices are low-power devices and their usage promises to decrease the overall power consumption by increasing energy efficiency, but their number could be overwhelming, with the consequence of a "rebound effect" [9]. Cisco predicts that by the year 2020 there will be 50 billion IoT devices in the world, which is an order of magnitude more than the number of smartphones and tablets in use today. In this scenario, using the cloud services offered by large datacenters to receive the data generated by IoT devices will not be a sustainable solution in terms of cost, latency, and environmental impact [6]. Recently the idea of edge devices that provide computation and storage closer to the source of data has been formulated under the term Edge or Fog computing [25]. Examples of edge devices are smartphones, as intermediates between body sensors and cloud services, gateways as intermediates for smart homes, or nano data centers that manage the caching or processing of video content. By using these edge devices in the proximity of data sources, we could reduce energy consumption with respect to implementing the logic in the cloud, while still meeting the latency requirements of certain applications [17].

Therefore one key requirement of such computing systems is undoubtedly energy efficiency. This means that systems should minimize the energy consumed to complete the required task and achieve a satisfactory energy proportionality [20]. One of the largest consumers of energy in computing environments is the CPU [8], which requires special attention, especially in the multicore era. Today mobile devices use the same CPUs as traditional gateways or cloudlets in Edge Computing. The need to achieve energy efficiency in today's MPSoCs is pressing, especially for mobile devices that operate on battery, a clear scenario where the end user wants a better experience and longer battery life.

Workload variability makes the control of energy expenditure especially difficult in mobile CPUs. Mobile devices are not the only ones that require energy-efficient solutions; cloud providers also need to lower the energy cost of
computations and cooling [19]. Today large-scale computing facilities treat energy as a resource to be scheduled and charge according to the energy consumption [14]. Heterogeneity promises to increase the energy efficiency levels achieved in MPSoCs, hence several paths have been followed by research and industry. Examples include exploring heterogeneity inside the CPU chip by using multiple technologies with different power and performance characteristics, or using cores that alternatively behave as out-of-order computing elements or as in-order cores [22]. Probably one of the most popular and researched types of heterogeneity is the one provided by different computing cores integrated into the same physical chip, where the cores share the same Instruction Set Architecture (ISA) but have different microarchitectures. However, making intelligent use of these power and performance tradeoffs proves to be no simple challenge [23]. Being able to predict the optimal choice among a number of hardware actuators, such as the number of cores, type of core and operating performance point, or Dynamic Voltage and Frequency Scaling (DVFS), is a difficult task that must be handled well in order to achieve energy efficiency.

With an asymmetric multiprocessing (AMP) architecture there is a better way to respond to the diversity of applications present in the mobile environment. There are compute-intensive applications which need to produce results in real time and must use fast cores in order to meet their deadlines. On the other side, background processes that may be memory bound require little computation and are more suitable to run on simple cores that achieve better levels of energy efficiency. Even within a single application, there are different "windows of activity" which may require varying levels of computing intensity, e.g. reading, scrolling, or responding to different messages inside a social media application. Recently industry has moved towards increasing the level of heterogeneity found inside a single chip, from examples such as ARM big.LITTLE with two types of cores to the Mediatek tri-cluster MPSoC [16], which promise to increase performance and reduce power dissipation. DynamIQ from ARM [1] advances the concept of big.LITTLE by providing better flexibility in the cluster organization and frequency setting.

The high levels of heterogeneity present in recent embedded architectures enlarge the design space that must be explored to find an efficient use of platform actuators. By increasing the number and type of cores and the number of voltage and frequency levels for each computing element, there is an increasing number of operating points at which the platform may run. In this scenario making the right choice for execution could have a tremendous impact on energy efficiency. Temperature also has a major effect on the power dissipation of today's systems [15], which makes it an important factor to account for in order to make the optimal energy-efficient choice.

To efficiently manage the workload scenarios faced by mobile devices, edge devices in IoT, or nano data centers, there is a need to continuously monitor power data in order to choose the optimal power and performance trade-off. Unfortunately, most hardware platforms today are not equipped with power sensors, which significantly complicates energy-efficient management of the system settings.

This paper follows our previous work, which experimentally builds an energy efficiency model based on platform configuration points for the ARM big.LITTLE architecture [21]. There, a platform configuration point denoted the set of platform actuators such as number and type of cores, core performance level or DVFS, and core utilization level. The model is derived by testing all the possible configuration points of the platform. Following the recent trend in platform complexity, this approach is difficult to apply because of the combinatorial explosion in the number of configuration points. The goal of this paper is to explore new approaches to providing knowledge of the platform energy efficiency to a runtime system based on the concept of platform configuration points. We redefine the set of parameters in the configuration point by removing the utilization level from the aforementioned description. The notion of platform configuration point is illustrated with several examples (from x to v) in a multicore platform (Figure 1). In our energy efficiency model, we account for the environment temperature, which provides valuable information for the correct accounting of the power dissipated by the CPU. Knowing the large impact that static power has on the energy efficiency achieved in today's CPUs, the second purpose of this work is to build thermally aware energy efficiency models.

[Figure 1: Configurations x, y, z, t, u and v, each showing a different mix of big cores (labelled fB) and small cores (labelled fs) on the MPSoC.]
Fig. 1. Examples of possible platform configuration points in a multicore architecture

The contributions of this paper are the following:
   • we propose an approach to characterize the energy efficiency of a hardware platform based on the notion of configuration points.
   • we include environment temperature in the energy efficiency model and show the impact this variable has on the relative efficiency of the points from the model.
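
To give a feel for the combinatorial growth in configuration points mentioned in the introduction, the following is a minimal sketch, not the authors' tooling. It assumes that a configuration point is a choice of active core count and one shared frequency per cluster, and that a cluster may be switched off. The function name, the cluster names and the frequency steps are hypothetical.

```python
from itertools import product


def enumerate_configuration_points(clusters):
    """Return every combination of (cluster, active cores, frequency).

    `clusters` maps a cluster name to (max_cores, frequencies_mhz).
    Assumes one shared frequency per cluster and allows a cluster to be off.
    """
    per_cluster_options = []
    for name, (max_cores, freqs_mhz) in clusters.items():
        options = [(name, 0, None)]  # cluster switched off
        options += [(name, n, f)
                    for n in range(1, max_cores + 1)
                    for f in freqs_mhz]
        per_cluster_options.append(options)

    # Keep only points where at least one core is active.
    return [combo for combo in product(*per_cluster_options)
            if any(n > 0 for _, n, _ in combo)]


# Hypothetical platform: two 4-core clusters with three frequency steps each.
platform = {
    "big": (4, [600, 1000, 1400]),
    "little": (4, [600, 1000, 1400]),
}
points = enumerate_configuration_points(platform)
print(len(points))
```

With these assumed values each cluster has 13 options (off, or 4 core counts times 3 frequencies), so the sketch yields 13 × 13 − 1 = 168 points. Adding a third cluster or more frequency steps multiplies this count, which is why exhaustive measurement of every point quickly becomes impractical.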


II. RELATED WORK

The use of platform actuators for energy management has been studied in several research works. The authors in [23], [10], and [18] all propose a runtime system which is able to manage the scheduling and mapping of threads dynamically with the objective of maximizing the energy efficiency of the MPSoC. In [23] a load balancer schedules the workload in periodic time frames called epochs; in each epoch, a set of actions is performed to place the threads on the appropriate core type. The platform considered is highly heterogeneous with 4 types of core, and in each epoch the load balancer estimates the performance and power of every thread on each core type. This information is used by the internal algorithm to decide where to map the threads. Similarly, [18] proposes a runtime scheme which dynamically schedules workloads in an MPSoC. The approach is based on the sense-decide-act policy and operates on an aggressively heterogeneous environment. It uses regression models for estimating the performance and power of threads on different core types, as well as the contribution of a thread to the total load of a core. An evolutionary algorithm is used to decide in each term the scheduling of the threads. The authors in [10] propose a run-time task allocation approach called SPARTA, which categorizes tasks as compute bound or memory bound, and a heuristic that selects the configuration that achieves the requested throughput with minimal power consumption. These works do not consider DVFS as a mechanism to reduce power consumption, and the hardware counters used for estimating performance are not easily found in real hardware platforms. Moreover, sensors for estimating the power consumption of different mapping decisions are not available in many of today's platforms. Finding the optimal configuration for executing workloads in a data center in order to achieve better energy efficiency is the goal presented in [11]. The authors present a programming and execution platform called Empya that uses hardware and software techniques to determine the best trade-off between performance and energy consumption. The run-time system continuously monitors application performance and energy consumption through Running Average Power Limit (RAPL) registers. As actuators, the system operates on the number of threads to use and the power cap on the CPU. In contrast, our work focuses on heterogeneous platforms where, to achieve energy efficiency, we use actuators such as number and type of cores and DVFS point. In [26] the authors again target High-Performance Computing applications running on a single node, with the goal of reducing the energy consumption by choosing the right configuration, which is composed of the number of cores and the DVFS level. The work is based on an application-agnostic power model, and the performance model of the application is obtained with a supervised regression learning method. Frequency, number of cores and input size are used in the regression model. The methodology is clear and straightforward, but there is no mention of the performance requirement, which is the value we trade off for less energy consumption.

III. CMOS POWER DISSIPATION

CMOS technology has been widely used in MPSoCs because it has good noise immunity and low heat production while the device is in operation. Power in these circuits can be divided into two categories: dynamic power and static power. Dynamic power is created by the circuit activity (transistor switching) and depends on the usage scenario, clock rates, and I/O activity. Switching power is dissipated while a transistor changes from 0 to 1 and vice versa. The dynamic power is defined as:

$$P_{dynamic} = \alpha \cdot C \cdot V_{DD}^{2} \cdot f_{clk} \tag{1}$$

where C is the load capacitance, V_DD is the supply voltage, α is the activity factor and f_clk is the operating frequency. Static power is dissipated due to the leakage currents in the transistors while they are in the "OFF" mode. There are several sources of leakage current, which are strongly influenced by the chip temperature. The dynamic part of the power dissipated by the chip is modeled by two terms in Equation 2: a dynamic activity term, which relates to the actively running workloads, and a background activity term, which represents the system processes that run in the background. In Equation 3 the dynamic power is modeled by a single term due to the low power dissipated by background processes in the A7 cluster. Static power is modeled by the third term in Equation 2 and depends on the temperature and the supply voltage. For the A7 cluster, there is no temperature sensor to monitor, hence the static part is modeled together with the dynamic power dissipation of background activity.

IV. PROPOSED APPROACH

Today embedded systems face a multitude of working scenarios that range from bursts of high-performance requests to low-power operation modes, including the need to provide sustained performance in thermally constrained situations. To manage such a number of use cases efficiently, the runtime scheduling manager needs up-to-date information about the effect of changing different actuators on the running applications. Thus there is a need for an energy efficiency model which is based on current runtime power data. The envisioned system diagram is shown in Figure 2, where our work in this paper is focused on providing the platform configuration points database that helps the scheduler reach the optimal efficiency level for the running applications.

The work in this paper is based on power models for mobile CPUs built on hardware performance counters (HPC). The methodology for building such models is adopted from [27], which presents a statistical method for identifying and using hardware counters. Their analysis proposes the use of counters which show a high correlation to power and also have the


smallest multicollinearity. The authors in [27] show that this brings high model stability with an average error of 3.8%.

We start by building power models for two popular ARMv7-A CPUs, the ARM Cortex-A7 and the ARM Cortex-A15. The micro-architecture limits the number of events which can be sampled at once: 6 counters for the A15 and 4 counters for the A7, plus the cycle counter. The goal is to search for those events which have the highest correlation with power dissipation and at the same time show the smallest intercorrelation with each other. To have high model stability the predictors should be chosen to keep low levels of multicollinearity in multivariate models. First, the correlation of all available events with the power is measured; then the counters are divided into clusters of events with high intercorrelation. Then, from each cluster the event is selected which has the most impact on the power dissipation while keeping a low Variance Inflation Factor (VIF). The total number of events is 40 for the A7 and 120 for the A15; among these, 7 are selected for the A15 and 5 for the A7. The events used in the models are general and can be found on most core types used in mobile systems. For each core type, the events are listed in Table I. The power for the A15 and A7 is divided into dynamic and static power, plus the background power which is related to operating system activities.

[Figure 2]
Fig. 2. Proposed Approach schematics.

TABLE I
HARDWARE EVENTS USED IN THE POWER MODELS

  Nr   ARM Cortex-A7              ARM Cortex-A15
  1    L2D_CACHE_ACCESS:0x16      L2D_CACHE_LD:0x50
  2    MEM_ACCESS:0x13            DP_SPEC:0x73
  3    L1I_CACHE_ACCESS:0x14      L1I_CACHE_ACCESS:0x14
  4    UNALIGNED_LDST:0x0F        UNALIGNED_LDST_SP:0x6A
  5    CYCLE_COUNT:0x11           BUS_ACCESS:0x19
  6                               INST_SPEC:0x1B
  7                               CYCLE_COUNT:0x11

The modelled power dissipation is shown in Equations 2 and 3,

$$P_{A15} = \underbrace{\left(\sum_{n=0}^{N-1} \beta_n E_n V_{DD}^{2} f_{clk}\right)}_{\text{dynamic activity}} + \underbrace{\beta_b V_{DD}^{2} f_{clk}}_{\text{BG dynamic}} + \underbrace{f(V_{DD}, T)}_{\text{static}} \tag{2}$$

$$P_{A7} = \underbrace{\left(\sum_{n=0}^{N-1} \beta_n E_n V_{DD}^{2} f_{clk}\right)}_{\text{dynamic activity}} + \underbrace{f(V_{DD}, f_{clk})}_{\text{static and BG dynamic}} \tag{3}$$

where N is the number of events selected, β_n is the weight given to event n, E_n is the number of events per second divided by the frequency (f_clk) in MHz, V_DD is the operating voltage and T is the temperature of the core.

The power model for the A15 has a thermal compensation term for calculating the static power and the background power dissipated when the system is idling (Equation 2). In the power model for the A7 the static and background power are included in the second term of Equation 3. This is due to the absence of a thermal monitoring sensor in the A7 cluster. We have calculated four sets of model coefficients for the parameters in each cluster, representing the power with a different number of cores for each CPU type. The model parameters for each core type are given in Tables II and III. The tables show the event rate divided by the frequency in MHz, the weight given to each coefficient and the statistical significance. In some model terms, f and V are respectively the operating frequency and voltage of each cluster (Table IV). The event rates are divided by the operating frequency in order to avoid correlation with it in the first term of the power equations. The power models need to be obtained only once, by running on the target platform a set of representative embedded workloads which we call the platform characterization set. After obtaining the power model we compute the energy efficiency table, which provides a sort of database of all the possible platform configuration points and the resulting performance, power and energy efficiency values. With this information the runtime system is able to make decisions about the mapping of a certain application with regard to performance. If the change in the environment temperature exceeds a certain threshold, the power dissipation can be recomputed and the table is rebuilt for the new thermal level.

These models are built by running the characterization workload set at each of the operating points of both CPUs. The set contains workloads that exercise different levels of the microarchitecture and memory subsystem. It is composed in part of real applications from the embedded domain and in part of synthetic benchmarks designed to stress specific parts of the CPU. Having the power models and measuring the performance in terms of instructions per second (IPS), we can obtain an energy efficiency model of the platform. The model is presented as a table that lists all the platform configuration points with the energy efficiency levels achieved in


The runtime system feeds temperature variations into the model and can recompute the energy efficiency table by taking into account the new level of static power. The new table is then searched for the configuration point that satisfies the performance request with the highest level of efficiency. A basic schematic of the proposed approach is given in Figure 2.

V. EXPERIMENTAL SETUP

To evaluate our approach we used an ODROID XU3 development board from HARDKERNEL. The application processor implements the ARM big.LITTLE architecture with two clusters of 4 cores each: the big cluster consists of a high-performance Cortex-A15 quad-core block, and the LITTLE cluster of a low-power Cortex-A7 quad-core block. The board also includes a Mali-T628 GPU and 2GB of LPDDR3 memory. It contains 4 current sensors that make it possible to measure power dissipation in four different domains: big cluster (A15), LITTLE cluster (A7), GPU and memory. Besides this, the board contains 4 temperature sensors for the cores in the big cluster and one temperature sensor for the GPU. The characteristics of the hardware can be found in Table IV.

TABLE II
MODEL PARAMETERS AND P-VALUES FOR THE A15

  Nr   Coefficient                           Weight     p-Value
  1    Intercept                             -5e-4      p<1e-4
  2    EPH_0x11 * f * V^2                    7.9e-10    p<1e-4
  3    (EPH_0x1b - EPH_0x73) * f * V^2       1e-10      p<1e-4
  4    EPH_0x50 * f * V^2                    8.7e-9     p<1e-4
  5    EPH_0x6a * f * V^2                    1e-8       p<1e-4
  6    EPH_0x73 * f * V^2                    2.6e-11    p<2e-3
  7    EPH_0x14 * f * V^2                    6.4e-11    p<1e-3
  8    EPH_0x19 * f * V^2                    1.9e-9     p<1e-4
  9    V                                     0.17       p<1e-4
  10   f * V^2                               1.6e-4     p<1e-4
  11   T                                     2.3e-2     p<1e-3
  12   T^2                                   2.9e-4     p<4e-3
  13   V * T^2                               -3.5e-5    p<1e-3
  14   V * T                                 1.1e-2     p<1e-3

TABLE III
MODEL PARAMETERS AND P-VALUES FOR THE A7

  Nr   Coefficient             Weight     p-Value
  1    Intercept               -7.2e-4    p<0.003
  2    EPH_0x11 * f * V^2      1.9e-10    p<1e-4
  3    EPH_0x14 * f * V^2      2.2e-10    p<1e-4
  4    EPH_0x13 * f * V^2      4.3e-10    p<1e-4
  5    EPH_0x16 * f * V^2      1.4e-9     p<1e-4
  6    EPH_0x0f * f * V^2      9.4e-11    p<0.0004
                                                                                                  TABLE IV
         6    EP H 0x0f ∗ f ∗ V 2    9.4e-11    p<0.0004                         C HARACTERISTICS OF THE EXPERIMENTAL BOARD

                                                                               Characteristic        ODROID Development Board
                                                                               Model                 XU3
terms of instructions per Joule, performance point (instructions               SoC                   Exynos 5422 Octa core
per second) and the power dissipation (W). The table is used to                CPU’s                 Cortex-A15/A7
decide the optimal configuration point for an application that                      cores            4+4
                                                                               Frequency A7 (MHz)
has defined performance requirements. Once an application                           min              200
is submitted into the system or is resumed by the scheduler,                        max              1400
the runtime system can sample the hardware counters in a                       Frequency A15 (MHz)
                                                                                    min              200
single frequency level and scans the table to find the optimal                      max              2000
configuration point, to run the application, in terms of energy                Voltage A7 (V)
efficiency. In this work, we consider multi-threaded applica-                       min              0.9
                                                                                    max              1.24
tions, which matches our methodology of achieving optimal                      Voltage A15 (V)
levels of energy efficiency by using configuration points that                      min              0.9
possibly use several cores. In the case where the performance                       max              1.36
requirement of the application changes, the control logic of
the runtime system can select another configuration point that            To build the power model we used a set of benchmarks
provides the requested performance level and has a high level          from different application domains. We call the training set as
of energy efficiency. When the temperature of the environment          the platform characterization workloads. In the platform char-
changes above a certain threshold, the power model can be              acterization set we include a sequence of 76 workloads which
used to recompute the energy efficiency table in accordance            consists of a collection of synthetic and real world applications
with the new temperature conditions. A temperature increase            from Roy Longbottom [4], PARSEC [7], CoremarkPro [2],
in the outside environment produces an increased level of static       ParMiBench [13] and Multibench [3]. A full list of the used
power in the CPU, which affects the relative efficiencies of the       workloads is in Table V.
configurations inside the energy efficiency table. The runtime            The choice of the workload set is based on the idea of all-
system can continuously monitor the power usage of the                 inclusiveness of applications that characterize the embedded
running application in order to not exceed the Thermal Design          systems domain.
Power (TDP) of the CPU. By sampling the performance                       Experiments were conducted in different environments to
counters of each running application the power model shows             account for the outside temperature change in the SoC power
the power dissipation at runtime of the running applications,          dissipation. The goal here is to evaluate the change in the
thus the runtime system can make a decision of reducing the            energy efficiency table in accordance with temperature. For the
power dissipation of certain applications by choosing another          first environment, the board fan is running with 100% speed
configuration point from the system.                                   with the system located in a highly refrigerated environment.
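As an illustration of the table lookup described in Section IV, the following Python sketch selects, from an energy efficiency table, the most efficient configuration point that meets a performance requirement and, optionally, a power cap. The `ConfigPoint` type, the `select_config` function and its parameters are hypothetical names introduced here, not part of the proposed runtime system. The sample rows are the top entries listed in Table VI for Environments 1 and 2 only, so the result illustrates the mechanism and is not the scenario of Figure 5.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ConfigPoint:
    name: str           # cores and frequency per cluster
    efficiency: float   # instructions per Joule
    power: float        # Watts
    performance: float  # instructions per second


def select_config(table: Iterable[ConfigPoint],
                  min_perf: float,
                  max_power: Optional[float] = None) -> Optional[ConfigPoint]:
    """Return the most efficient point with performance >= min_perf and,
    if a cap is given, power <= max_power; None if no point qualifies."""
    candidates = [
        p for p in table
        if p.performance >= min_perf
        and (max_power is None or p.power <= max_power)
    ]
    return max(candidates, key=lambda p: p.efficiency, default=None)


# Sample data: top rows of Table VI (assumption: used here only as a tiny table).
COLD = [  # Temperature Environment 1
    ConfigPoint("4a7/200MHz 4a15/500MHz", 1.517e10, 0.465, 1.61e9),
    ConfigPoint("4a7/200MHz 4a15/700MHz", 1.515e10, 0.599, 2.00e9),
    ConfigPoint("4a7/200MHz 4a15/400MHz", 1.512e10, 0.382, 1.39e9),
    ConfigPoint("4a7/200MHz 4a15/300MHz", 1.511e10, 0.305, 1.17e9),
    ConfigPoint("4a7/200MHz 4a15/200MHz", 1.50e10, 0.219, 9.37e8),
]
HOT = [  # Temperature Environment 2
    ConfigPoint("4a7/200MHz 3a15/300MHz", 1.424e10, 0.333, 9.92e8),
    ConfigPoint("4a7/200MHz 3a15/500MHz", 1.421e10, 0.518, 1.32e9),
    ConfigPoint("4a7/200MHz 3a15/400MHz", 1.420e10, 0.428, 1.15e9),
    ConfigPoint("4a7/200MHz 3a15/600MHz", 1.420e10, 0.608, 1.48e9),
    ConfigPoint("4a7/200MHz 3a15/700MHz", 1.416e10, 0.697, 1.61e9),
]

if __name__ == "__main__":
    for label, table in (("cold", COLD), ("hot", HOT)):
        choice = select_config(table, min_perf=1.61e9)
        print(label, choice.name if choice else "no configuration found")
```

With these sample rows, the same performance requirement maps to a different configuration point in each environment, which is the reconfiguration need discussed in the text. The optional `max_power` argument is one possible way to express a power budget such as the TDP limit.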


                                              Fig. 3. Configuration points from the model


In the second case, the board runs with the fan disabled at a normal outside temperature, which
represents a high-temperature environment. In the third case, the board runs with the fan always
on in a normal environment, which represents the middle case. In Table VI in Section VI we show
the energy efficiency table computed in the different environments.

                         VI. RESULTS

   By using the power and performance models defined previously we are able to derive an energy
efficiency model which is based on platform configuration points. In Figure 3 we show the
efficiency of all configuration points from the model. Each point describes a single
configuration that provides a certain level of performance, in terms of instructions per second,
and of energy efficiency. Towards high levels of performance we notice a decrease in the density
of the points. This means that fewer options are available for achieving good energy efficiency
levels. The list of configurations is organized as an energy efficiency table that lists all
possible configuration points with their efficiency and performance. An example of the table
derived from the workloads in the training set of the power model is shown in Table VII. By
searching inside the table we find several sets of configuration points that provide the same
performance but with different energy efficiency levels; some of the sets are shown in Figure 4.
The first use of the table is to choose the optimal configuration point for a given performance
requirement. As shown in Figure 4, it is possible to gain in terms of energy efficiency if we
make the right choice of configuration point. As a second objective of our work, we wanted to
test the effect of temperature on the relative energy efficiency of configuration points in the
model. To test thermal effects on the efficiency model, we run a test application with the
system located in different environments. We run the Basicmath application from the ParMiBench
suite [13]. In Environment 1, the system is running in a highly refrigerated environment (we
call it the "cold" case). In Environment 2, the system is running without a fan at an outside
temperature of 25 °C (we call it the "hot" case). Environment 3 consists of the system running
at a 25 °C outside temperature with the fan always on at 100% speed (we call it the "middle"
case). We noticed that the relative order of the configuration points changes between the
environments, and so do the energy efficiency levels achieved.
   The top rows of the energy efficiency table for the different temperature environments are
shown in Table VI. Different temperature levels produce a different order of configuration
points and different efficiency levels. This shows that there is a need to change the platform
configuration point when the temperature changes significantly, in order to keep high levels of
energy efficiency.
   In Figure 5 we show a possible runtime scenario. We are running the Basicmath test application
with a required level of performance of, for example, 1.61E+9 inst/s in a system with a
temperature t1. According to the model, the optimal configuration point for this performance
level is 2a7@400MHz + 4a15@500MHz. If the temperature increases to t2, the efficiency of that
configuration point decreases, and we need to reconfigure using the new table, which shows that
we should execute the application with the configuration 4a7@700MHz + 4a15@200MHz. Another
example is shown with a performance requirement of 3.27E+9 inst/s, where again there is a need
for reconfiguration in order to keep high levels of energy efficiency.
   The change in the environment temperature of the system (from "cold" to "hot") produces large
differences in the energy efficiency levels of the configuration points that the model defines
as optimal for the required performance. By looking at the first 100 most energy efficient
configurations in the energy efficiency table, we find a few test cases where, by changing the
configuration point when the system temperature changes, the


     Fig. 4. List of configuration points grouped in different performance classes

     Fig. 5. Reconfigure examples in two temperature environments

     Fig. 6. Configuration points with high energy efficiency levels

     Fig. 7. Power errors for configuration points with high level of energy efficiency


gain in terms of energy efficiency is up to 33%. When searching for new target reconfiguration
points, we accept points with the same performance or up to 5% higher. An interesting
observation can be made in Figure 3, where all points are plotted in the energy efficiency and
performance graph. If we take the points from the upper outer layer of the scatter plot we
obtain the situation shown in Figure 6. Those points are the configurations with the optimal
energy efficiency for a certain level of performance at a defined temperature. Alternatively, we
can think of the graph as the result of scanning the model from the lowest


performance point and keeping only those points which have higher performance and the highest
possible level of energy efficiency. As a further validation of our approach, we measure, as a
percentage, the difference between the predicted power dissipation and the measured power in
configuration points with high levels of energy efficiency. The results are shown in Figure 7,
where the highest error is 2.82%. We measure the model errors in the configurations that provide
the highest levels of energy efficiency for different performance levels. These are the most
interesting configuration points, since they give the best of the platform's energy efficiency.
Because these points will be used as configuration options most of the time, a low error rate
from the model is very useful.

                         TABLE V
             PLATFORM CHARACTERIZATION SET

                     List of benchmarks

  Suite            Workload
  CoremarkPro      core
                   linear_alg-mid-100x100-sp
                   loops-all-mid-10k-sp
                   nnet_test
                   parser-125k
                   radix2-big-64k
                   sha-test
                   zip-test
  MultiBench       4M-check
                   4M-check-reassembly
                   4M-check-reassembly-tcp
                   4M-check-reassembly-tcp-cmykw2-rotatew2
                   4M-check-reassembly-tcp-x264w2
                   4M-cmykw2
                   4M-cmykw2-rotatew2
                   4M-reassembly
                   4M-rotatew2
                   4M-tcp-mixed
                   4M-x264w2
                   empty-wld
                   iDCT-4M
                   iDCT-4Mw1
                   ippktcheck-4M
                   ippktcheck-4Mw1
                   ipres-4M
                   ipres-4Mw1
                   md5-4M
                   md5-4Mw1
                   rgbcmyk-4M
                   rgbcmyk-4Mw1
                   rotate-4Ms1
                   rotate-4Ms1w1
                   rotate-4Ms64
                   rotate-4Ms64w1
                   x264-4Mq
                   x264-4Mqw1
  MiBench          automotive/qsort
                   network/dijkstra
                   consumer/typeset
                   telecomm/adpcm
  Parsec-3.0       blackscholes
                   bodytrack
                   canneal
                   dedup
                   ferret
                   fluidanimate
                   freqmine
                   streamcluster
                   swaptions
  ParMiBench       Office/stringsearch
                   Network/Patricia/Parallel
                   Automotive/Susan/Parallel
                   Automotive/Bitcount/Parallel
                   Network/Dijkstra/Parallel
                   Office/stringsearch/Parallel
  Roy-Longbottom   rl-linpack-neon
                   rl-linpack-FSSP
                   rl-whetstone
                   rl-busspeed
                   rl-dhrystone
  Lmbench          lat_ctx
                   lat_fs
                   lat_ops
                   lat_proc
                   lat_fifo
                   lat_http
                   lat_pagefault
                   lat_select
                   lat_sem
                   lat_unix_connect
                   lat_mem_rd
                   bw_mem
                   tlb lmb3-tlb
                   line
  Whetstone        whetstone
  Dhrystone        dhrystone

                         TABLE VI
  TOP ENERGY EFFICIENCY CONFIGURATIONS FOR THREE ENVIRONMENTS

  Configuration             Energy Efficiency (Ins/J)   Power (W)   Performance (Ins/s)

                         Temperature Environment 1
  4a7/200MHz 4a15/500MHz    1.517e+10                   0.465       1.61e+09
  4a7/200MHz 4a15/700MHz    1.515e+10                   0.599       2e+09
  4a7/200MHz 4a15/400MHz    1.512e+10                   0.382       1.39e+09
  4a7/200MHz 4a15/300MHz    1.511e+10                   0.305       1.17e+09
  4a7/200MHz 4a15/200MHz    1.50e+10                    0.219       9.37e+08
  ....                      ....                        ....        ....

                         Temperature Environment 2
  4a7/200MHz 3a15/300MHz    1.424e+10                   0.333       9.92e+08
  4a7/200MHz 3a15/500MHz    1.421e+10                   0.518       1.32e+09
  4a7/200MHz 3a15/400MHz    1.420e+10                   0.428       1.15e+09
  4a7/200MHz 3a15/600MHz    1.420e+10                   0.608       1.48e+09
  4a7/200MHz 3a15/700MHz    1.416e+10                   0.697       1.61e+09
  ....                      ....                        ....        ....

                         Temperature Environment 3
  4a7/200MHz 4a15/600MHz    1.49e+10                    0.586       1.82e+09
  4a7/200MHz 4a15/400MHz    1.49e+10                    0.415       1.39e+09
  4a7/200MHz 3a15/700MHz    1.49e+10                    0.668       2e+09
  4a7/200MHz 3a15/500MHz    1.480e+10                   0.511       1.61e+09
  4a7/200MHz 3a15/300MHz    1.486e+10                   0.337       1.17e+09
  ....                      ....                        ....        ....

                         TABLE VII
             ORDERED ENERGY EFFICIENCY TABLE

  C      C(Nl/Fl/Nb/Fb)             Perf. (inst/s)   Pavg (W)    Efficiency (inst/J)
  1      4a7/200MHz/4a15/600MHz     2.219115e+09     0.699744    7.889801e+09
  2      4a7/200MHz/4a15/500MHz     1.916094e+09     0.600826    7.885497e+09
  3      4a7/200MHz/4a15/700MHz     2.475814e+09     0.788427    7.872383e+09
  4      4a7/200MHz/4a15/800MHz     2.723064e+09     0.873142    7.861730e+09
  5      4a7/200MHz/4a15/400MHz     1.601398e+09     0.501352    7.857119e+09
  6      4a7/200MHz/4a15/300MHz     1.294310e+09     0.402370    7.830159e+09
  7      4a7/200MHz/4a15/900MHz     3.042998e+09     1.010040    7.765476e+09
  8      4a7/200MHz/4a15/200MHz     9.541939e+08     0.293673    7.763320e+09
  9      4a7/300MHz/4a15/600MHz     2.338974e+09     0.728441    7.647120e+09
  10     4a7/300MHz/4a15/500MHz     2.035953e+09     0.629523    7.642816e+09
  11     4a7/300MHz/4a15/700MHz     2.595672e+09     0.817124    7.629703e+09
  12     4a7/300MHz/4a15/800MHz     2.842923e+09     0.901839    7.619049e+09
  13     4a7/300MHz/4a15/400MHz     1.721256e+09     0.530049    7.614439e+09
  14     4a7/300MHz/4a15/300MHz     1.414169e+09     0.431067    7.587478e+09
  15     4a7/200MHz/4a15/1000MHz    3.310742e+09     1.173238    7.580142e+09
  .      .                          .                .           .
  4078   1a15/1800MHz               1.193975e+09     1.795146    6.651129e+08
  4079   1a15/1700MHz               1.176482e+09     1.776230    6.623477e+08
  4080   1a15/1600MHz               1.101565e+09     1.670471    6.594337e+08

                         VII. CONCLUSION

   In this work, we present an approach for building an energy efficiency model which is based
on platform configuration points. The approach targets heterogeneous platforms, whose depth of
heterogeneity is continuously increasing. The model is based on hardware performance counters,
which are widely available in today's CPU architectures. The set of workloads for building the
model is representative of the


embedded domain, which has been shown to be more critical for energy-efficient application
execution. The training set is also inclusive of the IoT world. The novelty of this approach
compared to previous works is that it does not necessarily need power sensors to measure the
power dissipation in each configuration point: by sampling the counters on one configuration
point we can characterize the efficiency of the other configuration points. From all the points
in the model, we show that less than 1% of them (see points in Figure 7) represent the highest
levels of energy efficiency possible across the whole performance spectrum offered by the
platform. We also include the environment temperature as a variable for defining the need for
application reconfiguration. As the tests show, when the temperature changes, reconfiguring the
application execution can gain up to 33% in terms of energy efficiency.

                         REFERENCES

 [1] Arm DynamIQ Technology for the next era of compute - Processors blog - Processors - Arm
     Community.
     https://community.arm.com/processors/b/blog/posts/arm-dynamiq-technology-for-the-next-era-of-compute.
 [2] CPU Benchmark – CoreMark-PRO – EEMBC Embedded Microprocessor Benchmark Consortium.
     https://www.eembc.org/coremark-pro/index.php.
 [3] EEMBC - MultiBench - Multicore Benchmark. https://www.eembc.org/multibench/.
 [4] Roy Longbottom's PC Benchmark Collection - Free PC Benchmarks.
     http://www.roylongbottom.org.uk/.
 [5] Anders S. G. Andrae and Tomas Edler. On global electricity usage of communication
     technology: Trends to 2030. Challenges, 6(1):117–157, 2015.
 [6] M. Ashouri, P. Davidsson, and R. Spalazzese. Cloud, edge, or both? Towards decision support
     for designing IoT applications. In 2018 Fifth International Conference on Internet of
     Things: Systems, Management and Security, pages 155–162, October 2018.
 [7] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University,
     January 2011.
 [8] W. L. Bircher and L. K. John. Complete System Power Estimation Using Processor Performance
     Events. IEEE Transactions on Computers, 61(4):563–577, April 2012.
 [9] Peter M. Corcoran. Third time is the charm - why the world just might be ready for the
     internet of things this time around. CoRR, abs/1704.00384, 2017.
[10] Bryan Donyanavard, Tiago Mück, Santanu Sarma, and Nikil Dutt. SPARTA: Runtime Task
     Allocation for Energy Efficient Heterogeneous Many-cores. In Proceedings of the Eleventh
     IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis,
     CODES '16, pages 27:1–27:10, New York, NY, USA, 2016. ACM.
[11] C. Eibel, T. Do, R. Meissner, and T. Distler. Empya: Saving Energy in the Face of Varying
     Workloads. In 2018 IEEE International Conference on Cloud Engineering (IC2E), pages
     134–140, April 2018.
[12] EIRGRID. All-island generation capacity statement 2017-2026, 2017.
[13] S. M. Z. Iqbal, Y. Liang, and H. Grahn. ParMiBench - An Open-Source Benchmark for Embedded
     Multiprocessor Systems. IEEE Computer Architecture Letters, 9(2):45–48, February 2010.
[14] V. Jimenez, F. Cazorla, R. Gioiosa, E. Kursun, C. Isci, A. Buyuktosunoglu, P. Bose, and
     M. Valero. Energy-Aware Accounting and Billing in Large-Scale Computing Facilities. IEEE
     Micro, 31(3):60–71, May 2011.
[15] J. S. Lee, K. Skadron, and S. W. Chung. Predictive Temperature-Aware DVFS. IEEE
     Transactions on Computers, 59(1):127–133, January 2010.
[16] H. Mair, E. Wang, A. Wang, P. Kao, Y. Tsai, S. Gururajarao, R. Lagerquist, J. Son,
     G. Gammie, G. Lin, A. Thippana, K. Li, M. Rahman, W. Kuo, D. Yen, Y. Zhuang, U. Fu,
     H. Wang, M. Peng, C. Wu, T. Dosluoglu, A. Gelman, D. Dia, G. Gurumurthy, T. Hsieh, W. Lin,
     R. Tzeng, J. Wu, C. Wang, and U. Ko. 3.4 A 10nm FinFET 2.8GHz tri-gear deca-core CPU
     complex with optimized power-delivery network for mobile SoC performance. In 2017 IEEE
     International Solid-State Circuits Conference (ISSCC), pages 56–57, February 2017.
[17] Jozef Mocnej, Martin Miškuf, Peter Papcun, and Iveta Zolotová. Impact of Edge Computing
     Paradigm on Energy Consumption in IoT. IFAC-PapersOnLine, 51(6):162–167, 2018. 15th IFAC
     Conference on Programmable Devices and Embedded Systems PDeS 2018.
[18] Tiago Mück, Santanu Sarma, and Nikil Dutt. Run-DMC: Runtime Dynamic Heterogeneous Multicore
     Performance and Power Estimation for Energy Efficiency. In Proceedings of the 10th
     International Conference on Hardware/Software Codesign and System Synthesis, CODES '15,
     pages 173–182, Piscataway, NJ, USA, 2015. IEEE Press.
[19] V. Petrucci, M. A. Laurenzano, J. Doherty, Y. Zhang, D. Mossé, J. Mars, and L. Tang.
     Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale
     computers. In 2015 IEEE 21st International Symposium on High Performance Computer
     Architecture (HPCA), pages 246–258, February 2015.
[20] L. Ramapantulu, D. Loghin, and Y. M. Teo. On Energy Proportionality and Time-Energy
     Performance of Heterogeneous Clusters. In 2016 IEEE International Conference on Cluster
     Computing (CLUSTER), pages 221–230, September 2016.
[21] Hergys Rexha, Simon Holmbacka, and Sebastien Lafond. Core Level Utilization for Achieving
     Energy Efficiency in Heterogeneous Systems. In 2017 25th Euromicro International Conference
     on Parallel, Distributed and Network-Based Processing (PDP), pages 401–407, St. Petersburg,
     Russia, 2017. IEEE.
[22] V. Saripalli, G. Sun, A. Mishra, Y. Xie, S. Datta, and V. Narayanan. Exploiting
     Heterogeneity for Energy Efficiency in Chip Multiprocessors. IEEE Journal on Emerging and
     Selected Topics in Circuits and Systems, 1(2):109–119, June 2011.
[23] Santanu Sarma, T. Muck, Luis A. D. Bathen, N. Dutt, and A. Nicolau. SmartBalance: A
     Sensing-driven Linux Load Balancer for Energy Efficiency of Heterogeneous MPSoCs. In
     Proceedings of the 52nd Annual Design Automation Conference, DAC '15, pages 109:1–109:6,
     New York, NY, USA, 2015. ACM.
[24] Arman Shehabi, Sarah Josephine Smith, Dale A. Sartor, Richard E. Brown, Magnus Herrlin,
     Jonathan G. Koomey, Eric R. Masanet, Nathaniel Horner, Inês Lima Azevedo, and William
     Lintner. United States data center energy usage report. Technical report, June 2016.
[25] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu. Edge computing: Vision and challenges. IEEE
     Internet of Things Journal, 3(5):637–646, October 2016.
[26] Vitor R. G. Silva, Alex Furtunato, Kyriakos Georgiou, Kerstin Eder, and Samuel
     Xavier-de-Souza. Energy-Optimal Configurations for Single-Node HPC Applications.
     arXiv:1805.00998 [cs], May 2018.
[27] M. J. Walker, S. Diestelhorst, A. Hansson, A. K. Das, S. Yang, B. M. Al-Hashimi, and
     G. V. Merrett. Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs.
     IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
     36(1):106–119, January 2017.
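
The selection of the upper outer layer of configuration points described in Section VI (Figure 6) can be read as a Pareto-front filter over performance and energy efficiency. The following Python sketch is one possible way to compute such a frontier, under stated assumptions: the function name `efficiency_frontier` and the tuple layout are hypothetical, the scan runs from the highest performance downwards (a direction chosen here for simplicity), and the sample points are rows 1 to 8 of Table VII.

```python
from typing import List, Tuple

# (row number in Table VII, performance in inst/s, efficiency in inst/J)
Point = Tuple[int, float, float]


def efficiency_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that are not dominated, i.e. for which no faster
    (or equally fast) point also has a higher energy efficiency."""
    frontier: List[Point] = []
    best_eff = float("-inf")
    # Visit points from the highest performance downwards; a point survives
    # only if it is more efficient than every point already visited.
    for point in sorted(points, key=lambda p: (p[1], p[2]), reverse=True):
        if point[2] > best_eff:
            frontier.append(point)
            best_eff = point[2]
    frontier.reverse()  # report in order of increasing performance
    return frontier


# Sample data: rows 1 to 8 of Table VII.
TABLE_VII_TOP: List[Point] = [
    (1, 2.219115e9, 7.889801e9),
    (2, 1.916094e9, 7.885497e9),
    (3, 2.475814e9, 7.872383e9),
    (4, 2.723064e9, 7.861730e9),
    (5, 1.601398e9, 7.857119e9),
    (6, 1.294310e9, 7.830159e9),
    (7, 3.042998e9, 7.765476e9),
    (8, 9.541939e8, 7.763320e9),
]

if __name__ == "__main__":
    for row, perf, eff in efficiency_frontier(TABLE_VII_TOP):
        print(f"row {row}: {perf:.3e} inst/s, {eff:.3e} inst/J")
```

On this small sample only rows 1, 3, 4 and 7 remain, because every slower point is less efficient than row 1. On the full table the same filter would return one point per attainable performance step.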

