<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enabling power-awareness for the Xen Hypervisor</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Ferroni</string-name>
          <email>matteo.ferroni@polimi.it</email>
          <aff>Politecnico di Milano</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan A. Colmenares</string-name>
          <email>juan.col@samsung.com</email>
          <aff>Samsung Research America</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John D. Kubiatowicz</string-name>
          <aff>University of California Berkeley, <country country="US">USA</country></aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco D. Santambrogio</string-name>
          <aff>Politecnico di Milano</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Hofmeyr</string-name>
          <aff>Lawrence Berkeley National Laboratory</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>Virtualization allows the simultaneous execution of multi-tenant workloads on the same platform, be it a server or an embedded system. Unfortunately, it is non-trivial to attribute hardware events to multiple virtual tenants, as some system metrics relate to the whole system (e.g., RAPL energy counters). Virtualized environments thus have a rather incomplete picture of how tenants use the hardware, which limits their optimization capabilities. We therefore propose XeMPower, a lightweight monitoring solution for Xen that precisely accounts hardware events to guest workloads. It also enables the attribution of CPU power consumption to individual tenants. We show that XeMPower introduces negligible overhead in power consumption, and we intend it as a reference design for power-aware virtualized environments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.4 [Software and its engineering, Virtual machines,
Performance monitoring]</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>In the last few years, embedded systems have experienced
a shift from microcontrollers to multi-core processors, as
these have become cheaper, smaller, and less power-hungry.
This shift brings two advantages: 1) multiple embedded
applications can be consolidated on the same
System-on-Chip (SoC), improving the overall resource utilization, and
2) some applications can exploit concurrency and parallelism
to obtain better performance.</p>
      <p>
        In the context of embedded systems, hardware-assisted
and software virtualization technologies have been developed
to allow colocated applications to share physical resources
while having strong security and isolation [
        <xref ref-type="bibr" rid="ref22 ref24 ref26">24, 26, 22</xref>
        ].
      </p>
      <p>Those technologies seek to offer a stable and predictable
execution environment that makes it easier for embedded
applications to meet different Quality of Service (QoS)
requirements.</p>
      <p>
        The virtualized runtime can be a full-fledged guest
Operating System (OS) or, more suitably for embedded systems, a
lightweight OS (e.g., [
        <xref ref-type="bibr" rid="ref14 ref25">14, 25</xref>
        ]) customized for a specific
application. Applications executing in such runtimes generally
have different performance objectives, such as hard and
soft deadlines, and peak throughput. Moreover, they often
differ from one another in terms of workload
characteristics (i.e., memory-bound, I/O-bound, or CPU-bound)
and evolving load patterns (e.g., algorithmic phases).
      </p>
      <p>Unfortunately, this high heterogeneity comes at a price:
The isolation between simultaneously resident applications,
enforced by virtualization, shifts the burden of optimization
from developers to the hypervisor itself because only a
privileged arbiter can thoroughly observe what happens on the
bare metal. Hence, it becomes clear that a smart online
monitoring strategy is necessary to accurately observe and
model applications’ behavior to guarantee requirements and
optimize physical resource utilization.</p>
      <p>
        Since power consumption is currently a key technological
limitation [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], recent works propose approaches to
optimizing power [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], while maintaining Service Level Agreements
(SLAs) with each hosted guest. Again, these approaches
have an essential need for precise and thorough observation
of both hardware and guests’ behavior over time. Lacking
appropriate tools, many of these approaches employ custom
monitoring solutions or rely on outdated tools that do not
provide support for the latest hardware monitoring features.
Often, these ad-hoc approaches overlook the impact of
measurements on the overall system’s behavior.
      </p>
      <p>
        Seeking to fill the gap, this paper proposes XeMPower, a
lightweight hardware and resource monitoring solution for
the Xen hypervisor [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It is meant to be agnostic to the
hosted applications, and results show it incurs negligible
power consumption overhead. XeMPower has been released
as open source and aims to be a reference design for
future works in the field of virtualized systems.
      </p>
      <p>
        To prove its effectiveness, we present a use case in which
XeMPower precisely accounts hardware events to virtual
guests, enabling real-time attribution of CPU power
consumption to each guest or “domain”. XeMPower starts with
socket-level energy measurements through the Intel Running
Average Power Limit (RAPL) interface [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and then
utilizes a performance-counter-driven model to account for the
proportional use of energy by simultaneously resident
domains over time. This proportional attribution of power is
XeMPower’s secret sauce: each domain’s contribution is evaluated by
measuring a subset of architectural performance counters
related to that domain, regardless of the physical core.
      </p>
      <p>The paper is organized as follows. Section 2 presents an
overview of XeMPower. Section 3 details the
implementation of the tool, while Section 4 shows how to attribute
power to each domain. Next, Section 5 investigates the
performance overhead of XeMPower, while its limitations are
discussed in Section 6. Finally, Section 7 presents the
related work and Section 8 concludes. XeMPower is available
at https://bitbucket.org/necst/xempower-4.6. Adopting Xen
terminology, the remainder of this paper refers to virtual
guests as domains.</p>
    </sec>
    <sec id="sec-3">
      <title>2. PROPOSED APPROACH</title>
      <p>XeMPower is a lightweight monitoring solution for Xen
designed to: 1) provide precise attribution of hardware
events to virtual tenants, 2) be agnostic to the mapping
between virtual and physical resources, hosted applications
and scheduling policies, and 3) add negligible overhead.</p>
      <p>Our approach uses hypervisor-level instrumentation to
monitor every context switch between domains. More
precisely, the monitoring flow proceeds as follows:</p>
      <p>A. At each context switch and before the domain
chosen by the scheduler starts running on a CPU, we
begin counting the hardware events of interest. From
that moment the configured Performance Monitoring
Counter (PMC) registers in the CPU store the counts
associated with the domain that is about to run.
B. At the next context switch, the PMC values are read
from the registers and accounted to the domain that
was running. The counters are then cleared for the
next domain to run.</p>
      <p>C. Steps A and B are performed at every context switch
on every system’s CPU (i.e., physical core or hardware
thread). The reason is that each domain may have
multiple virtual CPUs (VCPUs). Socket-level energy
measurements are also read (via Intel RAPL interface)
at each context switch.</p>
      <p>D. Finally, the PMC values are aggregated by domain
and finally reported or used for other estimations (e.g.,
power consumption per domain).</p>
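      <p>The flow in steps A-D can be sketched as the following simulation (a minimal sketch: the structure, names, and delta-based accounting are illustrative, not the actual Xen patch, which clears the counters instead of taking deltas):</p>

```c
#include <stdint.h>

/* Hypothetical per-domain accumulator for one hardware event,
 * illustrating steps A-D: counting (re)starts for a domain when it is
 * scheduled in, and the accumulated count is attributed to it when it
 * is scheduled out. */
#define MAX_DOMS 8

struct cpu_state {
    int cur_dom;            /* domain currently running on this CPU */
    uint64_t pmc_at_switch; /* PMC value when cur_dom was scheduled in */
};

static uint64_t dom_events[MAX_DOMS]; /* aggregated per domain (step D) */

/* Called at every context switch on one CPU (steps A and B).
 * 'pmc_now' stands in for a live read of the counter register. */
void account_context_switch(struct cpu_state *cpu, int next_dom,
                            uint64_t pmc_now)
{
    /* Step B: charge the events counted since the last switch to the
     * domain that was running. */
    if (cpu->cur_dom >= 0)
        dom_events[cpu->cur_dom] += pmc_now - cpu->pmc_at_switch;
    /* Step A: (re)start counting for the domain about to run; the real
     * implementation clears the PMC registers here. */
    cpu->cur_dom = next_dom;
    cpu->pmc_at_switch = pmc_now;
}
```

Step C corresponds to invoking this routine from every CPU's scheduler instance; step D to reading out `dom_events`.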
    </sec>
    <sec id="sec-4">
      <title>3. IMPLEMENTATION</title>
      <p>
        XeMPower implementation is inspired by XenMon [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
a performance monitoring tool for Xen. Unlike XeMPower
and other works discussed in Section 7, XenMon does not
collect PMC reads. Nevertheless, since XenMon’s authors
report a maximum overhead of 1-2%, their implementation
approach was an interesting starting point for our work and
a reasonable baseline to compare our overhead with.
      </p>
      <p>XeMPower operates at two levels (see Figure 1). At the
first level, PMC reads are collected inside the Xen kernel
and then aggregated by the XeMPower daemon running in
Dom0, while at the second level, a CLI program reports
aggregated values. In this section, we describe implementation
details of the components forming the proposed toolchain.</p>
      <p>[Figure 1: XeMPower architecture. At every context switch on
every core, hardware events per core and energy per socket are
traced by the Xen kernel and collected by the XeMPower daemon
and the XeMPower CLI in Dom0.]</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Xen Kernel Instrumentation</title>
      <p>
        Xen runs a separate scheduler instance on each CPU,
and each scheduler instance has its own queue
containing runnable VCPUs of domains [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Xen kernel’s
schedule() function (in xen/common/schedule.c) preempts the currently running VCPU
(scheduler-independent), chooses the VCPU that will run
next (scheduler-dependent), and then makes the chosen
VCPU run (scheduler-independent). Hence, this function is
a suitable place to incorporate the steps A and B presented
in Section 2.
      </p>
      <p>
        Even though there are libraries and APIs (e.g., PAPI [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ])
that give developers access to hardware events
independently of the underlying architecture, we decided to
use the RDMSR and WRMSR assembly instructions directly to
configure the counting of the desired hardware events, as well as to
read and clear the CPU's PMCs. The reason is that these operations are
performed at every context switch and we want the
overhead to be as low as possible at the kernel level, in terms
of execution time and memory footprint. We thus accept
the trade-off of tying the current implementation to the
Intel instruction set; however, other architectures (e.g., ARM
and AMD) can be supported by modifying the register
addresses at compile time.
      </p>
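      <p>As a sketch of what this looks like (the MSR addresses are the architectural ones from the Intel SDM; the helper names are ours, not the patch's): RDMSR returns a 64-bit counter split across the EDX:EAX register pair, which must be recombined, and the actual read is ring-0 only.</p>

```c
#include <stdint.h>

/* Architectural MSR addresses (Intel SDM); illustrative subset. */
#define MSR_IA32_PMC0        0x0C1
#define MSR_IA32_PERFEVTSEL0 0x186
#define MSR_IA32_FIXED_CTR0  0x309

/* RDMSR delivers the 64-bit value split across EDX:EAX; the kernel
 * instrumentation recombines the two halves like this. */
static inline uint64_t msr_value(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

#ifdef __XEN__ /* ring-0 only: sketch of the actual counter read */
static inline uint64_t xempower_rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return msr_value(lo, hi);
}
#endif
```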
      <p>
        Our current XeMPower implementation only counts
architectural performance monitoring events. We made that
decision because these events have consistent visible
behavior across processor implementations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover,
previous work shows that they are the most significant metrics for
correlating CPU power consumption [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], which is the focus
of our motivating use case in Section 4. Since the available
PMCs are limited (e.g., 8 per core and 4 per hardware thread
on Intel Sandy Bridge 2nd Gen processors), we map some
monitoring events onto 4 PMCs and count the others
using auxiliary fixed-function counters. Table 1 summarizes
the monitored events and their register mapping.
      </p>
      <p>Table 1: Monitored events and their register mapping.
Instruction Retired → IA32_FIXED_CTR0;
UnHalted Core Cycles → IA32_FIXED_CTR1;
UnHalted Reference Cycles → IA32_FIXED_CTR2;
LLC Reference → IA32_PMC0;
LLC Misses → IA32_PMC1;
Branch Instruction Retired → IA32_PMC2;
Branch Misses Retired → IA32_PMC3.</p>
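      <p>Assuming the general-purpose counters are programmed through the IA32_PERFEVTSELx registers with the architectural event encodings from the Intel SDM, the Table 1 events mapped onto PMCs could be selected as follows (the helper and macro names are illustrative, not taken from the patch):</p>

```c
#include <stdint.h>

/* IA32_PERFEVTSELx bit layout (Intel SDM): event[7:0], umask[15:8],
 * USR = bit 16, OS = bit 17, EN = bit 22. */
#define EVTSEL_USR (1ull << 16)
#define EVTSEL_OS  (1ull << 17)
#define EVTSEL_EN  (1ull << 22)

/* Build a selector value that counts in both user and kernel mode. */
static inline uint64_t evtsel(uint8_t event, uint8_t umask)
{
    return (uint64_t)event | ((uint64_t)umask << 8)
         | EVTSEL_USR | EVTSEL_OS | EVTSEL_EN;
}

/* Architectural event encodings (Intel SDM): */
#define EV_LLC_REFERENCE  evtsel(0x2E, 0x4F) /* -> IA32_PMC0 */
#define EV_LLC_MISSES     evtsel(0x2E, 0x41) /* -> IA32_PMC1 */
#define EV_BRANCH_RETIRED evtsel(0xC4, 0x00) /* -> IA32_PMC2 */
#define EV_BRANCH_MISSES  evtsel(0xC5, 0x00) /* -> IA32_PMC3 */
```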
      <p>Regarding power monitoring, the Intel RAPL interface
provides dedicated read-only registers that can be accessed like
standard PMCs. These registers have been available since
2nd-generation (Sandy Bridge) processors and provide CPU power
measurements with a time granularity of approximately 1 ms.
XeMPower currently samples the register
MSR_PKG_ENERGY_STATUS, which accumulates the actual energy consumption (in
Joules) of the whole processor package; the average power
consumption is then easily obtained as energy/time for the
time window considered. For the moment, we decided not
to sample the other RAPL power planes (related to
on-chip DRAM and “uncore” devices) because their availability
varies across different processors.</p>
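      <p>The energy-to-average-power conversion can be sketched as follows (a minimal sketch: the function name is ours, and the 1/65536 J unit used in the test is a typical value that is read from MSR_RAPL_POWER_UNIT in practice; the status register is a 32-bit accumulator, so the subtraction must be wrap-safe):</p>

```c
#include <stdint.h>

/* Average power over a window, from two raw readings of the 32-bit
 * MSR_PKG_ENERGY_STATUS accumulator. */
static double avg_power_watts(uint32_t raw_start, uint32_t raw_end,
                              double joules_per_unit, double seconds)
{
    /* Unsigned 32-bit subtraction handles counter wraparound. */
    uint32_t delta = raw_end - raw_start;
    return (double)delta * joules_per_unit / seconds;
}
```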
      <p>
        Finally, we need to expose the collected data to a higher
level. For that, we use xentrace [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a lightweight trace
capturing facility present in Xen that can record events at
arbitrary control points in the hypervisor. We tag every
trace record with the ID of the scheduled domain and its
current VCPU, as well as a timestamp to be able to later
reconstruct the trace flow.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.2 XeMPower Daemon</title>
      <p>The stream of trace records produced by xentrace flows
from the Xen kernel to the XeMPower daemon running in
Dom0 (see Figure 1). The daemon, a user-space program
written in C, receives the records and performs aggregation
operations on them. Note that we do not use the xentrace
user-space tool, as it can produce a very large amount of
data that may potentially cause intense disk writes. Our
daemon directly accesses the xentrace memory buffers to avoid
any additional access to disk.</p>
      <p>We defined two bitmasks, TRC_POWER_PMC and
TRC_POWER_RAPL, to differentiate trace records with
PMC and RAPL events in the xentrace buffers (one per
hardware thread). These buffers are constantly monitored
by the XeMPower daemon: when a new record arrives, a
callback function is invoked to process and store it.</p>
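      <p>A plausible shape for such a record and its dispatch is sketched below; only the two bitmask names come from our implementation, while their values, the field layout, and the helper are illustrative assumptions:</p>

```c
#include <stdint.h>

/* Illustrative bitmask values; the patch's actual values may differ. */
#define TRC_POWER_PMC  0x00801000u
#define TRC_POWER_RAPL 0x00802000u

/* Hypothetical layout of a trace record consumed by the daemon. */
struct xempower_rec {
    uint32_t event;   /* TRC_POWER_PMC or TRC_POWER_RAPL */
    uint32_t dom_id;  /* scheduled domain */
    uint32_t vcpu_id; /* its current VCPU */
    uint64_t tsc;     /* timestamp for trace reconstruction */
    uint64_t value;   /* counter or energy reading */
};

/* Dispatch test used by the (hypothetical) record callback. */
static int is_pmc_record(const struct xempower_rec *r)
{
    return (r->event & TRC_POWER_PMC) == TRC_POWER_PMC;
}
```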
      <p>The XeMPower daemon performs aggregations in three
stream processing stages. First, records are grouped into
tumbling windows with a configurable time interval. Second,
in each tumbling window an aggregation is performed per
hardware event. In this stage, the daemon also stores the
difference between the values of the RAPL energy counter
at the beginning and the end of the tumbling window.
Finally, in each tumbling window and for each hardware event,
PMCs are collated per domain. Note that after aggregating
the records, the notions of physical and virtual CPUs
disappear, yielding a hardware-agnostic data structure.</p>
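      <p>The bucketing and per-domain collation can be sketched as follows (a minimal sketch: the 100 ms default comes from the text, while the data layout and names are assumptions):</p>

```c
#include <stdint.h>

#define WINDOW_NS 100000000ull /* 100 ms default tumbling-window interval */
#define MAX_DOMS  8

/* One tumbling window: per-domain totals for one hardware event. */
struct window {
    uint64_t start_ns;
    uint64_t dom_cycles[MAX_DOMS]; /* e.g., non-halted cycles */
};

/* Tumbling (non-overlapping) windows: the index is simply the
 * timestamp divided by the window length. */
static uint64_t window_of(uint64_t ts_ns)
{
    return ts_ns / WINDOW_NS;
}

/* Collate one record's count into its domain's slot. */
static void add_record(struct window *w, uint32_t dom, uint64_t cycles)
{
    if (dom < MAX_DOMS)
        w->dom_cycles[dom] += cycles;
}
```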
      <p>The XeMPower daemon allocates a shared memory region
to store a configurable number of tumbling windows in a
circular buffer. Processes other than the daemon can only read
from the region. Shared access to the tumbling windows
allows multiple front-end applications to read and display
different statistics from the same data. The tumbling window
time interval, the capacity of the circular buffer of tumbling
windows, and other configuration parameters can be
specified at compilation time. Currently, the default value for the
tumbling window interval is 100 ms and the circular buffer’s
capacity is 100. These values are used in our experiments
reported in Section 5.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 XeMPower Command Line Interface</title>
      <p>The XeMPower CLI is a basic command line tool written in
Python. It periodically scans the tumbling windows
produced by the XeMPower daemon (in the shared memory
region), and performs aggregations over two time intervals:
every second and every 10 seconds. It is also in charge of
converting the RAPL counter values into energy
consumption values (in Joules). The conversion factor is given by
the MSR_RAPL_POWER_UNIT register, which is
architecture-specific and can be read once when the XeMPower daemon is
started. The socket power consumption is then obtained as
the ratio of the energy consumption to the considered time
interval. The XeMPower CLI is designed to show live statistics
on the console or to log them into a file for later processing.</p>
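      <p>The conversion factor can be derived as follows (a sketch based on the Intel SDM: bits 12:8 of MSR_RAPL_POWER_UNIT give the Energy Status Unit ESU, and one energy counter tick equals 1/2^ESU Joules; the function name is ours):</p>

```c
#include <stdint.h>

/* Extract the Joules-per-tick conversion factor from a raw reading of
 * MSR_RAPL_POWER_UNIT (Energy Status Unit in bits 12:8, per the SDM). */
static double rapl_energy_unit(uint64_t power_unit_msr)
{
    unsigned esu = (unsigned)(power_unit_msr >> 8) & 0x1F;
    return 1.0 / (double)(1u << esu);
}
```

For example, with the raw value 0xA1003 (typical for Sandy Bridge-class parts), ESU is 16, i.e., one tick is 1/65536 J, roughly 15.3 µJ.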
    </sec>
    <sec id="sec-8">
      <title>4. USE CASE: PER-DOMAIN CPU POWER ATTRIBUTION</title>
      <p>As a motivating use case, we describe how XeMPower can
perform per-domain attribution of CPU power consumption.</p>
      <p>
        Zhai et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] examined multiple metrics (such as
instruction counts, and last-level-cache references and misses)
in a wide range of microbenchmarks, including a busy-loop
benchmark (high instruction issue rate), a pointer chasing
benchmark (high cache miss rate), a CPU and memory
intensive benchmark (to mimic virus behavior), and a set
of bubble-up benchmarks that incur adjustable amounts of
pressure on the memory system. They concluded that
the non-halted cycle count is the best metric to correlate with power
consumption (linear correlation coefficient above 0.95). Such high
correlation suggests that the higher the rate of non-halted
cycles for a domain is, the more CPU power the domain
consumes.
      </p>
      <p>We then decided to use this result along with the data
produced by XeMPower. The approach is simple:
1. For each tumbling window, XeMPower CLI calculates
the power consumed by the whole socket, while
the XeMPower daemon calculates the total number of
non-halted cycles (one of the traced PMCs).
2. Since we have the number of non-halted cycles per
domain, we estimate the percentage of non-halted cycles
for each domain over the total number of non-halted
cycles. This percentage is adopted as the contribution
of each domain to the whole CPU power consumption.
3. Finally, we split the socket power consumption
proportionally to the estimated contributions for each
domain.</p>
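      <p>The three steps above reduce to simple arithmetic; a minimal sketch (the function name and signature are illustrative):</p>

```c
/* Split the measured socket power among domains in proportion to each
 * domain's share of the total non-halted cycles in the window. */
static void attribute_power(double socket_watts,
                            const unsigned long long *dom_cycles,
                            int ndoms, double *dom_watts)
{
    unsigned long long total = 0;
    for (int i = 0; i < ndoms; i++)
        total += dom_cycles[i];
    for (int i = 0; i < ndoms; i++)
        dom_watts[i] = total ? socket_watts * (double)dom_cycles[i]
                                            / (double)total
                             : 0.0;
}
```

For instance, with two domains that accumulated 300 and 100 non-halted cycles in a window where the socket consumed 40 W, the split is 30 W and 10 W.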
      <p>The proposed approach works well even when CPU power
states (i.e., C-states and P-states) are enabled. XeMPower
is not affected by CPU voltage and frequency scaling, as it
continues to measure the actual socket power consumption
and to trace and account hardware events consistently.</p>
      <p>Note that we do not claim that this simple use case is
highly accurate. Instead, we present it as an
example of how XeMPower enables online attribution of
coarse-grained measurements to multiple tenants in a virtualized
environment, thanks to per-domain accounting of hardware
events.</p>
    </sec>
    <sec id="sec-10">
      <title>5. EXPERIMENTAL RESULTS</title>
      <p>XeMPower aims to be the tool of choice for any
computing system demanding precise and thorough observations of
hardware events attributed to domains in Xen. Since the
tool is meant to continuously provide statistics at run-time,
one of its key requirements is to add negligible overhead to
the monitored system. Therefore, in this section we
empirically show that the XeMPower monitoring components incur
very low overhead under different configurations and
workload conditions. We define the overhead metric as the
difference in the system’s power consumption while using
XeMPower versus an off-the-shelf Xen 4.6 installation.</p>
    </sec>
    <sec id="sec-11">
      <title>5.1 Experimental Setup and Test Cases</title>
      <p>
        Our test platform is a machine equipped with a
2.8GHz quad-core Intel Xeon E5-1410 processor (4 hardware
threads) and 32GB of RAM. We use a Watts up? PRO
meter [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to independently monitor the entire machine’s power
consumption without being influenced by the system
configuration in use.
      </p>
      <p>
        We conduct our experiments under three system
configurations: 1) the baseline configuration uses off-the-shelf Xen
4.4, 2) the patched configuration uses Xen modified as
described in Section 3 without running the XeMPower
daemon, and 3) the monitoring configuration is the same as the
patched configuration but with the XeMPower daemon
actually running and reporting statistics to an attached
console. In all three configurations we assign a single virtual
CPU (VCPU) and 4GB of RAM to Dom0, and also
dedicate physical core 0 to it. Dedicating core 0 to Dom0, besides
adhering to Xen best practices [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], means that any
computational overhead introduced by the XeMPower monitoring
phase in Dom0 can be measured as an increment in power
consumption on core 0 and in the whole system.
      </p>
      <p>We consider four runtime scenarios: an idle scenario in
which the system only runs Dom0, and the running-n
scenarios, where n = {1, 2, 3} indicates the number of guest
domains in addition to Dom0. Each guest domain repeatedly
runs a multi-threaded compute-bound microbenchmark on
three VCPUs and uses a stripped-down Linux 3.14 as the
guest OS. The idea in the running-n scenarios is to stress the
system with an increasing number of CPU-intensive tenant
applications, thus increasing the amount of data traced by
the Xen kernel and collected in Dom0. The microbenchmark is
CoEVP, a simplified proxy material science application from
the ExMatEx Center, available at
https://github.com/exmatex/CoEVP.</p>
      <p>Finally, we define two test cases for the running-n
scenarios. In the pinned-VCPU case, each guest domain has
each VCPU assigned to a dedicated physical CPU. In the
unpinned-VCPU case, on the other hand, the guest domains
are assigned VCPUs with no physical mapping (i.e., VCPUs
can migrate between physical CPUs). The idea is to increase
the number of context switches and thereby the amount of
trace data reported to Dom0.</p>
    </sec>
    <sec id="sec-12">
      <title>5.2 Results and Discussion</title>
      <p>We compare the power that our test platform consumes
for the different scenarios and test cases under the
baseline (b), patched (p), and monitoring (m) configurations.
Under each configuration, we run the idle scenario and the
running-1,2,3 scenarios, with and without VCPUs pinned to
dedicated physical CPUs (i.e., pinned-VCPU and
unpinned-VCPU test cases). We report the system’s mean power
consumption (µ) in Watts over a 60-second interval. We
performed a set of 40 independent experiments for each [test
case, scenario, configuration] combination.</p>
      <p>Table 2 and Table 3 present the system’s mean power
consumption for the pinned-VCPU and unpinned-VCPU test
cases, respectively, across the considered scenarios and
configurations. Empirical mean power values are reported with
their 95% confidence interval.</p>
      <p>
        At a glance, the measurements appear quite
close. However, given the limited accuracy of the power
meter, some of them may seem misleading; e.g., the mean
power consumption of the baseline case is sometimes higher
than the others. This is why we estimate an
upper bound ε on the maximum overhead by performing the
following hypothesis test [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]:
      </p>
      <p>T(µ) := { H0: µ ≥ ε + µb;  H1: µ &lt; ε + µb },
where a rejection of the null hypothesis H0 means that there
is strong statistical evidence that the power consumption
overhead is lower than ε (or equivalently, that the mean µ is lower
than the baseline mean µb increased by ε). We compute ε
for the considered test cases and scenarios, estimating the average
power consumption (µ) at significance level α = 5%.</p>
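      <p>One way to read how ε is obtained (a sketch under the assumption of a one-sided test on the difference of sample means with a normal approximation and standard error SE; the paper does not give the exact estimator):</p>

```latex
% Reject H0 when the standardized difference falls below the critical
% value; the smallest epsilon for which rejection occurs is then:
\[
  \text{reject } H_0 \iff
  \frac{\bar{x} - \bar{x}_b - \epsilon}{SE} < -z_{1-\alpha}
  \quad\Longrightarrow\quad
  \epsilon_{\min} = (\bar{x} - \bar{x}_b) + z_{1-\alpha}\,SE .
\]
```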
      <p>Table 4 shows the values of ε across the considered test
cases and scenarios for the patched and monitoring
configurations. The values in parentheses represent the
percentage overheads relative to the mean power consumption (i.e.,
µp and µm, respectively). Our results indicate (at
significance level α = 5%) that XeMPower introduces an overhead no
greater than 1.18 W (1.58%), observed for the
[unpinned-VCPU, running-3, patched] case. In all the other cases, the
overhead is less than 1 W, and less than 1% in relative terms.</p>
      <p>
        This is a satisfactory result when compared to a
maximum overhead of 1-2% observed for XenMon [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which we
adopted as a reference point for our XeMPower
implementation. We consider this overhead a negligible and reasonable
price to pay, given the high-precision information that
XeMPower can provide at runtime.
      </p>
    </sec>
    <sec id="sec-13">
      <title>6. LIMITATIONS</title>
      <p>We are actively working to bring XeMPower to the next
level. It currently offers little flexibility, since it monitors a
fixed set of PMCs; we want to be able to configure the
set of monitored PMCs at runtime, as well as to parametrize the
tumbling windows used for the per-domain attribution of CPU
power consumption. Moreover, we want to extend the tool
to deal with Non-Uniform Memory Access (NUMA)
systems. We plan to evaluate the overhead introduced by such
flexibility improvements.</p>
      <p>
        Additional experimental studies will involve different
hardware platforms, like ARM-based mobile systems and
microservers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, XeMPower should be evaluated
with other than compute-bound workloads.
      </p>
      <p>
        Finally, the presented approach to attributing power
consumption to domains is very simple, as it is a mere example
to show the tool’s potential. We are currently exploring ways
to improve its accuracy, for example with offline
characterization of both hardware and guest workloads. As shown
in [
        <xref ref-type="bibr" rid="ref17 ref19 ref28">17, 28, 19</xref>
        ], data-driven power models can be exploited
at runtime to improve the accuracy of power estimations
and to make predictions for the near future.
      </p>
    </sec>
    <sec id="sec-14">
      <title>7. RELATED WORK</title>
      <p>
        Performance monitoring and profiling have been
crucial in every computing system over the last 30 years
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The need for constant monitoring solutions has since
grown, especially in virtualized environments, where the
same hardware is shared among multiple tenants.
Unfortunately, every monitoring tool is affected by a trade-off
between accuracy and overhead; the effective implementation
of these systems is thus far from trivial. In the literature,
this problem has been tackled with two different approaches:
code instrumentation and performance counter monitoring.
      </p>
      <p>
        Code instrumentation solutions, like Valgrind [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and
IgProf [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], inject extra code in the applications at compile
time and/or runtime, allowing complex analysis, e.g., on
memory and cache accesses. These tools are excellent for
an initial analysis of errors and inefficiencies in programs,
but are not suitable for performing runtime analysis in
production, as the overhead introduced is often high [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Performance counter tools, on the other hand, focus
on sampling system events at different granularities (e.g.,
thread level, process level, a set of processors, or the entire
system). These tools provide information on hardware
utilization that may not be closely related to the application
domain, but their overhead can be tuned according to the
actual needs [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. They differ in functionality, data
granularity, level of abstraction, and the interfaces they rely on.
      </p>
      <p>
        Low-level performance counter libraries do not hide
architecture-specific event types from the user and sit
directly on top of the hardware. Perf [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and OProfile [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] are the
most popular tools available; they make use of kernel
modules to access different categories of events: hardware events,
software events (context switches or minor faults), and
tracepoint events (disk I/O and TCP events).
      </p>
      <p>
        Higher-level libraries (e.g., PAPI [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) hide
microarchitecture event types behind a uniform API. They
support event multiplexing to compensate for the limited
number of performance counter registers that can be monitored
at a time: only a subset of the desired event sets is monitored
during subsections of a program’s execution, then results are
scaled to statistically estimate rates for the entire program.
      </p>
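      <p>The multiplexing extrapolation described above amounts to a linear scaling (a minimal sketch; the function name is illustrative, not PAPI's API):</p>

```c
/* A counter that was active for only part of the run is linearly
 * extrapolated to estimate the count over the whole run. */
static double scale_multiplexed(unsigned long long raw_count,
                                double time_active, double time_total)
{
    return time_active > 0.0
         ? (double)raw_count * (time_total / time_active)
         : 0.0;
}
```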
      <p>
        In addition, some works in the literature focus on PMC
virtualization [
        <xref ref-type="bibr" rid="ref16 ref21 ref27">27, 16, 21</xref>
        ], providing low-level metrics to
virtual tenants. Like XeMPower, all these solutions require
patching the Xen hypervisor’s kernel to implement operations that
require privileged access, such as reprogramming counters or
setting up interrupt handlers.
      </p>
      <p>
        In the context of Xen, the most common solution is
Xenoprof [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], a system-wide statistical profiling toolkit based on
OProfile and specifically crafted for the hypervisor. It is
a valid solution to profile a standard workload running in
Dom0 or other domains in active mode (i.e., the domain
itself collects its own hardware event counters). However,
when profiling in passive mode (i.e., the domain is treated
as a “black box”), the results indicate which domain is
running at sample time but do not delve more deeply into what
is being executed. Therefore, it does not satisfy the
requirement of being agnostic to hosted applications.
      </p>
      <p>
        Another interesting tool is Perfctr-Xen [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. It supports
performance counter virtualization in Xen for: (1)
paravirtualized guest kernels, using hypercalls to communicate
performance counter configuration changes to the hypervisor; (2)
fully-virtualized guest kernels, using the “save-and-restore”
approach for all registers; and (3) a hybrid approach that
offers a trade-off between the first two. Similar to XeMPower,
Perfctr-Xen re-programs the Performance Monitoring Unit
(PMU) configuration registers (e.g., event selectors) at
every context switch. Although this tool is good for workload
profiling inside a domain, it is not designed as a centralized
runtime monitoring solution.
      </p>
    </sec>
    <sec id="sec-15">
      <title>8. CONCLUSION AND FUTURE WORK</title>
      <p>We presented XeMPower, a lightweight monitoring
solution for Xen that precisely accounts hardware events to
virtual guests. As a motivating use case, we described its use
in online attribution of CPU power consumption to
individual domains. Our results show that XeMPower can provide
continuous statistics with very low overhead compared to an
off-the-shelf Xen installation.</p>
      <p>As future work, we plan to adopt the tool as a starting
point and improve the accuracy of CPU power consumption
attribution to domains, considering, for example, other
Performance Monitoring Counters (PMCs) in the estimation of
domains’ contributions. In addition, we plan to explore the
complementary use of offline characterization of both
hardware and guest workloads in order to predict power
consumption before their final deployment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] Intel Xeon processor D product family technical overview</article-title>
          . https://software.intel.com/en-us/articles/intel-xeon-processor-d-product-family-technical-overview. Accessed: 2016-07-11.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] Tuning Xen for performance</article-title>
          . http://wiki.xenproject.org/wiki/Tuning_Xen_for_Performance.
          <source>Accessed: 2015-11-19.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] The unofficial Linux perf_events web-page</article-title>
          . http://web.eece.maine.edu/~vweaver/projects/perf_events/. Accessed: 2015-11-13.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>[4] Watts up plug load meters</article-title>
          . https://www.wattsupmeters.com/secure/products.php. Accessed: 2015-11-19.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[5] Intel 64 and IA-32 Architectures Software Developer's Manual</source>
          , volume B,
          <year>2015</year>
          , 19-2.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dragovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fraser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Neugebauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pratt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Warfield</surname>
          </string-name>
          .
          <article-title>Xen and the art of virtualization</article-title>
          .
          <source>In 19th ACM Symposium on Operating Systems Principles</source>
          , pages
          <fpage>164</fpage>
          -
          <lpage>177</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Browne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dongarra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Garner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mucci</surname>
          </string-name>
          .
          <article-title>A portable programming interface for performance evaluation on modern processors</article-title>
          .
          <source>Int. J. High Perform. Comput. Appl.</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <fpage>189</fpage>
          -
          <lpage>204</lpage>
          , Aug.
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chisnall</surname>
          </string-name>
          .
          <article-title>The Definitive Guide to the Xen Hypervisor</article-title>
          . Prentice Hall Press, Upper Saddle River, NJ, USA, first edition,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Delimitrou</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Kozyrakis</surname>
          </string-name>
          .
          <article-title>Paragon: QoS-aware scheduling for heterogeneous datacenters</article-title>
          .
          <source>In 18th ACM Int'l Conference on Architectural Support for Programming Languages and Operating Systems</source>
          , pages
          <fpage>77</fpage>
          -
          <lpage>88</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Eulisse</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tuura</surname>
          </string-name>
          .
          <article-title>IgProf profiling tool</article-title>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Kessler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Mckusick</surname>
          </string-name>
          .
          <article-title>Gprof: A call graph execution profiler</article-title>
          .
          <volume>17</volume>
          (
          <issue>6</issue>
          ):
          <fpage>120</fpage>
          -
          <lpage>126</lpage>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gardner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Cherkasova</surname>
          </string-name>
          .
          <article-title>XenMon: QoS monitoring and performance profiling tool</article-title>
          . Hewlett-Packard Labs
          ,
          <source>Tech. Rep. HPL-2005-187</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Henkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Khdr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pagani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Shafique</surname>
          </string-name>
          .
          <article-title>New trends in dark silicon</article-title>
          .
          <source>In 52nd ACM/EDAC/IEEE Design Automation Conference (DAC)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kivity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Laor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Enberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Har'El</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Zolotarov</surname>
          </string-name>
          .
          <article-title>OSv: optimizing the operating system for virtual machines</article-title>
          .
          <source>In USENIX Annual Technical Conference</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Levon</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Elie</surname>
          </string-name>
          .
          <article-title>OProfile: A system profiler for Linux</article-title>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Janakiraman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Zwaenepoel</surname>
          </string-name>
          .
          <article-title>Diagnosing performance overheads in the Xen virtual machine environment</article-title>
          .
          <source>In 1st ACM/USENIX Int'l Conference on Virtual Execution Environments</source>
          , pages
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mobius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dargie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Schill</surname>
          </string-name>
          .
          <article-title>Power consumption estimation models for processors, virtual machines, and servers</article-title>
          .
          <source>IEEE Transactions on Parallel and Distributed Systems</source>
          ,
          <volume>25</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1600</fpage>
          -
          <lpage>1614</lpage>
          ,
          <year>June 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Montgomery</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Runger</surname>
          </string-name>
          .
          <source>Applied statistics and probability for engineers</source>
          . Wiley,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nacci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Trovò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Maggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cazzola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sciuto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Santambrogio</surname>
          </string-name>
          .
          <article-title>Adaptive and flexible smartphone power modeling</article-title>
          .
          <source>Mobile Networks and Applications</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>600</fpage>
          -
          <lpage>609</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nethercote</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Seward</surname>
          </string-name>
          .
          <article-title>Valgrind: A framework for heavyweight dynamic binary instrumentation</article-title>
          .
          <source>In 28th ACM SIGPLAN Conference on Programming Language Design and Implementation</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>100</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nikolaev</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Back</surname>
          </string-name>
          .
          <article-title>Perfctr-Xen: a framework for performance counter virtualization</article-title>
          .
          <volume>46</volume>
          (
          <issue>7</issue>
          ):
          <fpage>15</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rossier</surname>
          </string-name>
          .
          <article-title>EmbeddedXen: A revisited architecture of the Xen hypervisor to support ARM-based embedded virtualization</article-title>
          .
          <source>White paper, Switzerland</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rotem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Naveh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ananthakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Weissmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Rajwan</surname>
          </string-name>
          .
          <article-title>Power-management architecture of the Intel microarchitecture code-named Sandy Bridge</article-title>
          .
          <source>IEEE Micro</source>
          ,
          <volume>32</volume>
          (
          <issue>2</issue>
          ):
          <fpage>20</fpage>
          -
          <lpage>27</lpage>
          , Mar.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Semnanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Englert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Virtualization technology and its impact on computer hardware architecture</article-title>
          .
          <source>In Eighth Int'l Conference on Information Technology: New Generations (ITNG)</source>
          , pages
          <fpage>719</fpage>
          -
          <lpage>724</lpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jannen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kalodner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>Cooperation and security isolation of library OSes for multi-process applications</article-title>
          .
          <source>In 9th European Conference on Computer Systems</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Gill</surname>
          </string-name>
          .
          <article-title>RT-Xen: Towards real-time hypervisor scheduling in Xen</article-title>
          .
          <source>In 2011 IEEE Int'l Conference on Embedded Software</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Metis: a profiling toolkit based on the virtualization of hardware performance counters</article-title>
          .
          <source>Human-centric Computing and Information Sciences</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eranian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mars</surname>
          </string-name>
          .
          <article-title>HaPPy: Hyperthread-aware power profiling dynamically</article-title>
          .
          <source>In USENIX Annual Technical Conference</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>217</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>