<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Preliminary Approach for Modeling Energy Efficiency for K-Means Clustering Applications in Data Centers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Da Qi Ren</string-name>
          <email>daqi.ren@huawei.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianhuan Wen</string-name>
          <email>wenjianhuan@huawei.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhenya Li</string-name>
          <email>lizhenya@huawei.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Futurewei Technologies 2330 Central Expressway</institution>
          ,
          <addr-line>Santa Clara, CA, 95050</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>57</fpage>
      <lpage>70</lpage>
      <abstract>
        <p>Energy efficiency is at the forefront of evaluating the performance of a data center in delivering green solutions for analyzing information in big data. A large scale distributed system is usually composed of a large number of powerhungry components. A methodology to model data processing flows inside the architecture of a data center is proposed to analyze the critical power constraints at the level of software. The model allows obtaining system characteristic values, thus benefiting analysts by providing the necessary environmental information to predict the power-efficiency alternative with the evaluation of energy consumption. An apparatus is designed comprising a receiver configured to receive a plurality of power measurements from a plurality of power sensors, and a processor coupled to the receiver and configured to determine an amount of power used by a processing element in a data center. The functionality and capability of this method for quantitative energy analysis of big data are validated by benchmarks and measurements performed on a real data center platform.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data</kwd>
        <kwd>Energy Aware Computing</kwd>
        <kwd>Performance modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Power dissipation is one of the crucial problems in the design of data intensive computing
because big data applications need hundreds of hours of computations, consuming
enormous amounts of energy. Large storage, many-core computing device and high speed
network become the major choice in various big data processing platforms, the power
consumption of such systems have been continually increasing. MapReduce/Hadoop has
become one of the most important approaches in solving data intensive applications,
where Hadoop Distributed File System (HDFS) works as the data storage mechanism and
MapReduce as the computing engine. MapReduce/Hadoop does not change the nature of
power consumption, however, much less research has been carried out to improve the
____________________________
Copyright © 2017 for the individual papers by the papers' authors. Copying permitted for private
and academic purposes. This volume is published and copyrighted by its editors.
power performance with the integrated parallel programming paradigms in the
architecture. Towards optimizing the power efficiency, we investigate software
methodologies to analyze the power utilization through algorithm design and
programming techniques.</p>
      <p>
        Many algorithm level energy-aware design methods have been studied on parallel and
distributed platforms. A typical approach in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduces an integrated power model for
a many-core computer to predict execution times and calculate dynamic power events. By
integrating an analytical timing model and an empirical power model, the power
consumption of workloads is predicted. Higher level models use more indirect and
approximate design parameters, such as the algorithm level power model that we have
introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The advantage of the approach is that instruction mixture
information, pipelining structure and out of order processing information can be covered
in the data flows that are measured.
2
2.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>Power Model</title>
      <sec id="sec-2-1">
        <title>Power Measurement and Modeling</title>
        <p>An apparatus is designed comprising a receiver configured to receive a plurality of power
measurements from a plurality of power measurement instruments; and a processor with
an energy measurement and estimation daemon coupled to the receiver and configured to:
determine average power usage by a processing element in a data center by determining a
summation of the plurality of power measurements; determine data to watt ratio that
indicates power cost for the processing element to process that amount of data; determine
the estimated execution time in a performance run for processing the amount of data by
the processing element; and determine the estimated energy consumption that indicates
the amount of energy to be used by the processing element to process the amount of data.
In the apparatus shown in Fig.1, the plurality of power measurements are received from
the plurality of power measurement instruments located at the device level; located at a
circuit board level; and located at an integrated circuit level of a processing device in the
data center. The data to watt ratio is determined by dividing data throughput of the
processing element over a given time period by the amount of power used by the
processing element over the given time period.</p>
        <p>The estimated execution time is determined by dividing amount of data to be processed
by the processing element by maximum data throughput of the processing element. the
estimated energy consumption is determined by multiplying the estimated execution time
by average amount of power used, determined according to the summation of the
plurality of power measurements, wherein the apparatus is disposed within a data center
system comprising a plurality of additional apparatuses, wherein the estimated energy for
each of the additional apparatuses is determined in a manner similar to the apparatus, and
wherein the estimated energy for the apparatus and the estimated energy for the additional
apparatuses are aggregated and recorded by the data center.</p>
        <p>As shown in Fig. 2, a method for tuning performance of a processing element for
handling data flow is introduced in this work. It comprises controlling the behavior of
low-level code of a program executing on the processing element; abstracting and
modeling the behavior of the program for analysis and study; determining an estimate of
energy used by the processing element in performing computational tasks; determining an
overall power model for a multiprocessing platform; determining computational capacity
of a parallel processing device; optimizing coding strategies according to the overall
power model; and tuning the processing element to increase a power efficiency level of
the processing element. In the method, determining the estimate of energy used by the
processing element in performing computational tasks comprises multiplying an estimated
execution time for executing the computational tasks by an average amount of power
expected to be used in performing the computational tasks. The average amount of power
expected to be used in performing the computational tasks is determined according to the
amount of power used by the processing element while operating in a steady-state. The
computational capacity of the parallel processing device is determined according to
micro-processors of the parallel processing device, programming language used in the
parallel processing device, and characteristics of the computation performed on the
parallel processing device. Tuning the processing element to increase power efficiency
level comprises at least one of domain partitioning, load parallelization, dynamic
frequency scaling, or workload scheduling. The overall power model is determined
according to the accumulation of power measurements for each component in the
processing element. The method operates in both time domain and at hardware domain,
frequency of each component is varied, computing capacity and power consumption vary
according to the frequency of each component, and wherein the frequency of each
component is varied to tune the processing element to satisfy the performance
requirement. The method facilitates prediction of power usage, and comprises energy
model that is based on workload characteristics of the processing element.</p>
        <p>Fig. 3. Combined CPU usage, physical memory usage, disk IO read and write rate for all 6 nodes
in the cluster, when executing K-means clustering with iterations I = 25.</p>
        <p>Fig. 4. Combined CPU usage, physical memory usage, disk IO read and write rate for the cluster,
when executing K-means clustering with iterations I = 25.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Energy Consumption Characteristics</title>
        <p>
          An application can be modeled based on the computing patterns and characteristics of its
workload. The energy approximation is the summation of the products of each component
power and its execution time [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. An overall power model can be built for the entire
multiprocessing platform based on the same. Computation capability of a parallel processing
device, e.g. many-core compute device and storage device, is determined by its
microarchitecture, programming language and characteristics of the computation performed on
it [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The methodology imports hardware power parameters to the software algorithm
study, then estimates power consumption with program analysis. One of the advantages is
that it allows obtaining design characteristic values at the early programming stage, thus
benefiting programmers by providing necessary environment information for choosing the
best power-efficient alternative.
3
3.1
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Workload Analysis</title>
      <sec id="sec-3-1">
        <title>K-means Clustering</title>
        <p>
          K-means clustering is one of the most popular flat clustering algorithms and can be
configured to match the complexity of real-world use cases [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. It is an unsupervised
clustering technique comprising the following steps: Given total objects n, and a number
of clusters k, repeat calculating and regrouping the n objects into k (≤ n) sets so as to
minimize the sum of squares (WCSS) (i.e. variance) within each cluster. The algorithm is
repeated until desired convergence level is achieved.
3.2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Power Measurement</title>
        <p>Energy is consumed by each device in a System Under Test (SUT). A total power is the
summation of each power source P  1im pi , where m is the total number of devices, pi
is the power measurement of each device i. In real measurement, power results in each
sampling period can be plotted together to obtain a power chart for the program, and the
chart shows the power usage against time during the execution. For each device or
subsystem, the calculation defined for Energy metrics is E  T P(t) dt , where T is the
0
elapsed time for the performance run, P(t) is the power measured at time t.</p>
        <p>Errors
0.000%
9.026%
0.000%
6.389%
0.000%
9.031%
0.000%
6.760%
0.000%
7.300%
0.000%
5.800%
0.000%
3.813%
0.000%
3.114%
0.000%
4.573%
0.000%
3.954%
0.000%
5.200%
0.000%
4.600%
3.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Program Segments</title>
        <p>
          The K-means clustering computation includes three major phases as the follows [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: 1)
data generation phase: to generate random object data and store the data in storage; 2)
clustering phase: to drop outliers, scale observations, distance measure, clustering, adjust
parameters, repeat multiple iterations and finally choose representative components; 3)
writing phase: marking data with the cluster information and write the data back to
storage. Let T DataGen represents the time for data generation in phase 1; T iteration
represents the time for clustering computation in phase 2; and T Writeback represents the
time to write the data in phase 3. The estimated energy consumption for completing the
program is the average power multiplied by the execution time, as shown in Equation (1).
        </p>
        <p>Eestimation   PApvhearasegePower T phase
(1)
where E is the energy estimation, PApvhearasegePower is the average power consumption while
the application run in each steps, T phase is the computation time for the K-means
clustering phases of data generation, clustering and writing.
3.4</p>
      </sec>
      <sec id="sec-3-4">
        <title>Equations and Measurements</title>
        <p>Power consumption of a computing platform is measured or calculated for each node
separately. The total power consumption can be modeled as</p>
        <p>P(w)  N PCiompute (wi )  M PStorage (w j )  PNetwork (w) (2)</p>
        <p>i1 j1
where P, PCompute, PStorage and PNetwork represent the power of the total, computing device,
storage device and network devices, respectively. N and M are the numbers of computing
devices and storage devices involved in the computation of workload w. wi and w j
represent the workload assigned to Compute _ devicei and Compute _ devicej respectively.
The power consumption of computing device is dominant among all the entities
consuming power in each node. The analysis and prediction of the power consumption of
a Hadoop cluster are introduced below, by using the benchmark results of K-means
clustering.
3.5</p>
      </sec>
      <sec id="sec-3-5">
        <title>Power Model</title>
        <p>
          While data size and a computing platform and configuration are fixed, the time for data
generation T DataGen will take an approximate identical time in each run of K-mean
clustering on the platform. According to the conditions of convergence and the accuracy
threshold setup, a K-means clustering may take a different number of iterations to
complete. Based on the algorithm, the time to complete each iteration T iteration can be
considered as a fixed value [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. After finished the last iteration, the time to write the data
back T Writeback can be assumed as the same since the data size will not be changed.
Assuming the total number of iterations is k, the total computing time can be represented
in Equation (3).
        </p>
        <p>And the total energy consumption can be represented in Equation (4).</p>
        <sec id="sec-3-5-1">
          <title>TTotal  T DaraGen +kT iteration +T write</title>
        </sec>
        <sec id="sec-3-5-2">
          <title>ETotal  EDaraGen +kEiteration +Ewrite</title>
          <p>(3)
(4)
4
4.1</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Performance Results</title>
      <sec id="sec-4-1">
        <title>Power Consuming Experiments</title>
        <p>
          For the experiments in this work, we use six HUAWEI Tecal RH2288 V2 Rack servers,
each with 2 Intel Xeon Processor E5-2680 (Sandy Bridge-EP) running at 2.7 GHz [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
Each processor has 8 cores (16 hyperthreaded) and a 20MB L3 cache. They are connected
through two QuickPath links, each providing a unidirectional transmission rate of up to
8.0 GT/s. Each server has 24 8GB double data rate 3 (DDR3) at 1066Mhz dimms of main
memory with a total ot 192GB. Each server is configured with eight 2.5” SAS HDDs with
7.2TB capacity. One SAS disk hosts the OS and the remaining 7 are configured for
HDFS. The server provides four onboard gigabit Ethernet (GE) ports, of which two were
bonded to double bandwidth. Our system runs CDH 5 on centos 6.5. Cloudera is
configured as one master node and five worker nodes. Each of the nodes has 189.1GBytes
of memory. We use Apache Hadoop 2.6 and open source HiBench K-means that is
popular for processing large datasets built for distributed applications. The power analyzer
used in this test is HIOKI 3600.
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Power Consuming Characterization</title>
        <p>The low level processing is finally turned to be a sparse metrics production to computer
the vector distances. It makes the computing devices to execute the input data follow
inside each K iteration Single Instruction Multiple Data (SIMD), i.e. multiply and plus
operations repeatedly in a streaming way. Therefore, when the data size, the processor’s
frequency and temperature are invariant, the CPU power can be modeled as a constant
value inside each K-iteration.</p>
        <p>The resource usage and the consumption of K-means clustering computations for 25
iterations are shown in Fig.3 and Fig. 4 for the whole cluster and one single node,
respectively. This status is true in all Hadoop nodes, the total energy used to complete the
K-means clustering task is linearly related to the input data size, I iterations and K number
of clusters. For experimental purposes, a random sample of 1.2 billion records with 20
variables was used. Power characterization and stress testing was done by varying the
number of K for a fix data set of 225GB.
5
5.1</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Energy and Estimation</title>
      <sec id="sec-5-1">
        <title>CPU Usage and CPU Power</title>
        <p>For estimating the relationships between the CPU usage and the energy consumption, we
first plotted the CPU usage and power against time. Visual inspection indicates a strong
relationship between CPU usage and power. For the SIMD character of this experiment,
the relationship is considered to be linear, allowing us to use statistical methods to predict
power by CPU usage.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Workload Characters, Performance and Power</title>
        <p>Fig. 5 shows the computation steps include data generation; clustering phase; and data
marking and writing back for K-mean iteration I=10, I=15 and I=25, respectively. Fig. 6
shows the energy measurement of one worker node for K-mean iteration I=25. While the
total data size is identical, the time for data generation can be fixed as a constant,
therefore the corresponding energy consumption is determined. The clustering
computation follows the SIMD rule, the time taken is linear to the number of computing
iterations. If a single iteration takes time T and consumes energy E, the total energy
consumption during K-means with I=25 can be modeled as by one E energy multiplied by
number of iterations, i.e. 25E. Because the number of objects N will not change at the
marking and writing back stage, the corresponding time and energy consumption can be
identically determined. A detailed energy consumption is listed in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Error Analysis</title>
      <p>Power modeling and estimation at software level is relative accurate because of the
integration between resource usage, the platform and physical measurement. However, the
errors in power estimation of K-means clustering algorithm lies in the throughput to
power ratio displacement. It causes the power values to not precisely fix to the same
throughput when the workload size is small. This can be seen in the measurement of the
peak dynamic power where the curve is not a straight line. The K-means R/W overheads
and under-foots may also slightly change at each time when a new iteration of clustering
launches. The power phase leg to the assembly execution will cause the calculation error,
especially when the data size is small.</p>
      <p>The clustering phase of K-means has two stages, namely iteration (CPU-bound) and
regrouping (I/O-bound). On the platform, CPU dominate the overall power consumption.
The overhead introduced by storage devices, network devices are negligible compared
with CPU. A linear model was developed to predict the power consumption based on
CPU usage. The results of modeling are shown in Table 1. The energy consumption
during clustering was calculated in multiple ways. The energy can simply be calculated
from power using area under the curve. From Table.1, when iteration number I=10 the
energy used by one server node during actual computation was 2,816,095 J, while the
same energy calculated from the model is found to be 3,006,472 J, with an error of
6.76%. When iteration number I=15 and I =25, the energy used by one server node
estimated using the model is found with an error of 3.11% and 4.60% compared with the
real measurement values, respectively. All calculations verify the results despite having
synchronization between measurements and process runtime. This is significant because it
allows a programmer to tune the power efficiency of an application by simply tune the
CPU usage such as frequency, etc.
7</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>A general approach is introduced for quantitative power estimation and analysis on
Hadoop clusters for big data computing. The workload characteristics have been analyzed.
Based on this, power parameters are captured by measuring the power of each component
in a PE on a real system. The Hadoop PE’s power feature in executing the K-means
clustering program can then be analyzed and concluded. The power consumption of a
Kmeans program with same computational characteristics of any size that is running on the
PE can be predicted based on the power consumption feature. Finally the approach is
validated by measuring real computation power on the target hardware. The power
analysis method can be refined to enhance its precision by including more components,
and based on it the power parameters can be tuned for obtaining the best power
performance for given problems. These will be considered in our future work.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>We thank Mr. Chaitanya Kundety for his assistance with experiments and the comments that
improved the manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Sunpyo</given-names>
            <surname>Hong</surname>
          </string-name>
          , Hyesoon Kim, “
          <article-title>An integrated GPU power and performance model”</article-title>
          ,
          <source>International Symposiumon Computer Architecture</source>
          , Saint-Malo, France, (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luan</surname>
          </string-name>
          , Depei Qian:
          <article-title>iMeter: An integrated VM power model based on performance profiling</article-title>
          .
          <source>In: Future Generation Computer Systems</source>
          , Volume
          <volume>36</volume>
          ,
          <string-name>
            <surname>July</surname>
            <given-names>2014</given-names>
          </string-name>
          , Pages
          <fpage>267</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Da</given-names>
            <surname>Qi</surname>
          </string-name>
          <article-title>Ren: Algorithm level power efficiency optimization for CPU-GPU processing element in data intensive SIMD/SPMD computing</article-title>
          .
          <source>In: Journal of Parallel and Distributed Computing</source>
          <volume>71</volume>
          (
          <issue>2</issue>
          ):
          <fpage>245</fpage>
          -
          <lpage>253</lpage>
          ,
          <year>Feb 2011</year>
          , DOI: 10.1016/j.jpdc.
          <year>2010</year>
          .
          <volume>10</volume>
          .007.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rafatirah</surname>
          </string-name>
          , A. Sasan1,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <article-title>Homayoun1: System and Architecture Level Characterization of Big Data Applications on Big and Little Core Server Architectures</article-title>
          .
          <source>In: 2015 IEEE International Conference on Big Data</source>
          , Santa Clara, USA, Oct 29-Nov 1,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tou</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gonzalez</surname>
          </string-name>
          , R. C:
          <article-title>Pattern Recognition Principles</article-title>
          .
          <source>In: Addison-Wesley; 2nd edition (August</source>
          <year>1977</year>
          ), ISBN-
          <volume>10</volume>
          :
          <fpage>0201075873</fpage>
          , ISBN-
          <volume>13</volume>
          :
          <fpage>978</fpage>
          -
          <lpage>0201075878</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Intel® Xeon®
          <article-title>Processor E5 v3 Product Family, Processor Specification Update</article-title>
          ,
          <year>Sep 2016</year>
          . https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeone5-v3
          <string-name>
            <surname>-</surname>
          </string-name>
          spec-update.
          <source>pdf .</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>