Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


                       A Preliminary Approach for Modeling Energy
                       Efficiency for K-Means Clustering Applications
                                       in Data Centers

                                       Da Qi Ren, Jianhuan Wen and Zhenya Li

                                               Futurewei Technologies
                                2330 Central Expressway, Santa Clara, CA, 95050, USA
                             {daqi.ren, wenjianhuan, lizhenya}@huawei.com


                    Abstract. Energy efficiency is at the forefront of evaluating the performance of a
                    data center in delivering green solutions for analyzing information in big data. A
                    large scale distributed system is usually composed of a large number of power-
                    hungry components. A methodology to model data processing flows inside the
                    architecture of a data center is proposed to analyze the critical power constraints at
                    the level of software. The model allows obtaining system characteristic values, thus
                    benefiting analysts by providing the necessary environmental information to predict
                    the power-efficiency alternative with the evaluation of energy consumption. An
                    apparatus is designed comprising a receiver configured to receive a plurality of
                    power measurements from a plurality of power sensors, and a processor coupled to
                    the receiver and configured to determine an amount of power used by a processing
                    element in a data center. The functionality and capability of this method for
                    quantitative energy analysis of big data are validated by benchmarks and
                    measurements performed on a real data center platform.

                    Keywords: Big Data, Energy Aware Computing, Performance modeling.


             1      Introduction

             Power dissipation is one of the crucial problems in the design of data intensive computing
             because big data applications need hundreds of hours of computations, consuming
             enormous amounts of energy. Large storage, many-core computing device and high speed
             network become the major choice in various big data processing platforms, the power
             consumption of such systems have been continually increasing. MapReduce/Hadoop has
             become one of the most important approaches in solving data intensive applications,
             where Hadoop Distributed File System (HDFS) works as the data storage mechanism and
             MapReduce as the computing engine. MapReduce/Hadoop does not change the nature of
             power consumption, however, much less research has been carried out to improve the
             ____________________________
             Copyright © 2017 for the individual papers by the papers' authors. Copying permitted for private
             and academic purposes. This volume is published and copyrighted by its editors.


                                                             57
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             power performance with the integrated parallel programming paradigms in the
             architecture. Towards optimizing the power efficiency, we investigate software
             methodologies to analyze the power utilization through algorithm design and
             programming techniques.
                Many algorithm level energy-aware design methods have been studied on parallel and
             distributed platforms. A typical approach in [1] introduces an integrated power model for
             a many-core computer to predict execution times and calculate dynamic power events. By
             integrating an analytical timing model and an empirical power model, the power
             consumption of workloads is predicted. Higher level models use more indirect and
             approximate design parameters, such as the algorithm level power model that we have
             introduced in [2], [3]. The advantage of the approach is that instruction mixture
             information, pipelining structure and out of order processing information can be covered
             in the data flows that are measured.


                    Fig. 1. An apparatus is designed to collect power usage data from data center, model the
                                      characterizations and tune the power efficiency.


                                                            58
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             2      Power Model

             2.1    Power Measurement and Modeling

             An apparatus is designed comprising a receiver configured to receive a plurality of power
             measurements from a plurality of power measurement instruments; and a processor with
             an energy measurement and estimation daemon coupled to the receiver and configured to:
             determine average power usage by a processing element in a data center by determining a
             summation of the plurality of power measurements; determine data to watt ratio that
             indicates power cost for the processing element to process that amount of data; determine
             the estimated execution time in a performance run for processing the amount of data by
             the processing element; and determine the estimated energy consumption that indicates
             the amount of energy to be used by the processing element to process the amount of data.
             In the apparatus shown in Fig.1, the plurality of power measurements are received from
             the plurality of power measurement instruments located at the device level; located at a
             circuit board level; and located at an integrated circuit level of a processing device in the
             data center. The data to watt ratio is determined by dividing data throughput of the
             processing element over a given time period by the amount of power used by the
             processing element over the given time period.


             Fig. 2. A method for tuning performance of a processing element for handling data flow, comprises
             controlling the behavior of low-level code; abstracting and modeling the behavior of the program
             for analysis and study.


                                                            59
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


                The estimated execution time is determined by dividing amount of data to be processed
             by the processing element by maximum data throughput of the processing element. the
             estimated energy consumption is determined by multiplying the estimated execution time
             by average amount of power used, determined according to the summation of the
             plurality of power measurements, wherein the apparatus is disposed within a data center
             system comprising a plurality of additional apparatuses, wherein the estimated energy for
             each of the additional apparatuses is determined in a manner similar to the apparatus, and
             wherein the estimated energy for the apparatus and the estimated energy for the additional
             apparatuses are aggregated and recorded by the data center.
                As shown in Fig. 2, a method for tuning performance of a processing element for
             handling data flow is introduced in this work. It comprises controlling the behavior of
             low-level code of a program executing on the processing element; abstracting and
             modeling the behavior of the program for analysis and study; determining an estimate of
             energy used by the processing element in performing computational tasks; determining an
             overall power model for a multiprocessing platform; determining computational capacity
             of a parallel processing device; optimizing coding strategies according to the overall
             power model; and tuning the processing element to increase a power efficiency level of
             the processing element. In the method, determining the estimate of energy used by the
             processing element in performing computational tasks comprises multiplying an estimated
             execution time for executing the computational tasks by an average amount of power
             expected to be used in performing the computational tasks. The average amount of power
             expected to be used in performing the computational tasks is determined according to the
             amount of power used by the processing element while operating in a steady-state. The
             computational capacity of the parallel processing device is determined according to
             micro-processors of the parallel processing device, programming language used in the
             parallel processing device, and characteristics of the computation performed on the
             parallel processing device. Tuning the processing element to increase power efficiency
             level comprises at least one of domain partitioning, load parallelization, dynamic
             frequency scaling, or workload scheduling. The overall power model is determined
             according to the accumulation of power measurements for each component in the
             processing element. The method operates in both time domain and at hardware domain,
             frequency of each component is varied, computing capacity and power consumption vary
             according to the frequency of each component, and wherein the frequency of each
             component is varied to tune the processing element to satisfy the performance
             requirement. The method facilitates prediction of power usage, and comprises energy
             model that is based on workload characteristics of the processing element.


                                                        60
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


              Fig. 3. Combined CPU usage, physical memory usage, disk IO read and write rate for all 6 nodes
                         in the cluster, when executing K-means clustering with iterations I = 25.


                                                           61
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


              Fig. 4. Combined CPU usage, physical memory usage, disk IO read and write rate for the cluster,
                                when executing K-means clustering with iterations I = 25.


                                                           62
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             2.2    Energy Consumption Characteristics
             An application can be modeled based on the computing patterns and characteristics of its
             workload. The energy approximation is the summation of the products of each component
             power and its execution time [2]. An overall power model can be built for the entire multi-
             processing platform based on the same. Computation capability of a parallel processing
             device, e.g. many-core compute device and storage device, is determined by its micro-
             architecture, programming language and characteristics of the computation performed on
             it [4], [5]. The methodology imports hardware power parameters to the software algorithm
             study, then estimates power consumption with program analysis. One of the advantages is
             that it allows obtaining design characteristic values at the early programming stage, thus
             benefiting programmers by providing necessary environment information for choosing the
             best power-efficient alternative.


             3      Workload Analysis

             3.1    K-means Clustering
             K-means clustering is one of the most popular flat clustering algorithms and can be
             configured to match the complexity of real-world use cases [5]. It is an unsupervised
             clustering technique comprising the following steps: Given total objects n, and a number
             of clusters k, repeat calculating and regrouping the n objects into k (≤ n) sets so as to
             minimize the sum of squares (WCSS) (i.e. variance) within each cluster. The algorithm is
             repeated until desired convergence level is achieved.


             3.2    Power Measurement
             Energy is consumed by each device in a System Under Test (SUT). A total power is the
             summation of each power source P   pi , where m is the total number of devices, pi
                                                   1 i  m

             is the power measurement of each device i. In real measurement, power results in each
             sampling period can be plotted together to obtain a power chart for the program, and the
             chart shows the power usage against time during the execution. For each device or
                                                                              T
             subsystem, the calculation defined for Energy metrics is E   P(t ) dt , where T is the
                                                                              0

             elapsed time for the performance run, P(t ) is the power measured at time t.


                                                              63
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


            Fig. 5. CPU usage for the cluster, when executing K-means clustering with iterations I=10; I=15; and
                                                            I=25.


                        Fig. 6. Energy consumption model of K-means clustering with three phases.


                                                            64
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


                 Table 1. The experiment results by compared the estimated energy consumption and the real
                                                    measurement values.

                                      Energy consumption of K-means Cluster
                                      Phases        Estimation(J)       Measurement(J)       Errors
                                    Generation      407,971.00                               0.000%
                                    Clustering      1,712,980.00        1,867,599.00         9.026%
                         CPU
                                    Writeback       299,213.00                               0.000%
                K=10                Overall         2,420,164.00        2,574,783.00         6.389%
                                    Generation      408,879.00                               0.000%
                                    Clustering      2,108,040.00        2,298,417.00         9.031%
                         Server
                                    Writeback       299,176.00                               0.000%
                                    Overall         2,816,095.00        3,006,472.00         6.760%
                                    Generation      408,031.00                               0.000%
                                    Clustering      2,644,470.00        2,851,399.00         7.300%
                         CPU
                                    Writeback       298,970.00                               0.000%
                K=15                Overall         3,351,471.00        3,558,400.00         5.800%
                                    Generation      409,612.00                               0.000%
                                    Clustering      3,162,060.00        3,282,625.00         3.813%
                         Server
                                    Writeback       299,891.00                               0.000%
                                    Overall         3,871,563.00        3,992,128.00         3.114%
                                    Generation      407,280.00                               0.000%
                                    Clustering      4,512,413.00        4,718,772.00         4.573%
                         CPU
                                    Writeback       299,170.00                               0.000%
                                    Overall         5,218,863.00        5,425,222.00         3.954%
                K=25
                                    Generation      408,311.00                               0.000%
                                    Clustering      5,130,147.00        5,411,022.00         5.200%
                         Server
                                    Writeback       299,771.00                               0.000%
                                    Overall         5,838,229.00        6,119,104.00         4.600%


                                                           65
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             3.3    Program Segments
             The K-means clustering computation includes three major phases as the follows [5]: 1)
             data generation phase: to generate random object data and store the data in storage; 2)
             clustering phase: to drop outliers, scale observations, distance measure, clustering, adjust
             parameters, repeat multiple iterations and finally choose representative components; 3)
             writing phase: marking data with the cluster information and write the data back to
                              DataGen                                                                  iteration
             storage. Let T             represents the time for data generation in phase 1; T
                                                                                           Writeback
             represents the time for clustering computation in phase 2; and T           represents the
             time to write the data in phase 3. The estimated energy consumption for completing the
             program is the average power multiplied by the execution time, as shown in Equation (1).

                                   Eestimation   PAveragePower
                                                     phase
                                                                  T phase               (1)

                                                       phase
             where E is the energy estimation, PAveragePower is the average power consumption while
                                                          phase
             the application run in each steps, T           is the computation time for the K-means
             clustering phases of data generation, clustering and writing.


             3.4    Equations and Measurements
             Power consumption of a computing platform is measured or calculated for each node
             separately. The total power consumption can be modeled as


                                             N                    M
                                   P(w)   PCompute
                                              i
                                                     (wi )   PStorage ( w j )  PNetwork ( w) (2)
                                            i 1                  j 1


             where P, PCompute , PStorage and PNetwork represent the power of the total, computing device,
             storage device and network devices, respectively. N and M are the numbers of computing
             devices and storage devices involved in the computation of workload w. wi and w j
             represent the workload assigned to Compute _ devicei and Compute _ device j respectively.
             The power consumption of computing device is dominant among all the entities
             consuming power in each node. The analysis and prediction of the power consumption of
             a Hadoop cluster are introduced below, by using the benchmark results of K-means
             clustering.


                                                               66
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             3.5    Power Model

             While data size and a computing platform and configuration are fixed, the time for data
                            DataGen
             generation T          will take an approximate identical time in each run of K-mean
             clustering on the platform. According to the conditions of convergence and the accuracy
             threshold setup, a K-means clustering may take a different number of iterations to
                                                                                            iteration
             complete. Based on the algorithm, the time to complete each iteration T                can be
             considered as a fixed value [5]. After finished the last iteration, the time to write the data
                     Writeback
             back T          can be assumed as the same since the data size will not be changed.
             Assuming the total number of iterations is k, the total computing time can be represented
             in Equation (3).

                                       TTotal  T DaraGen +kT iteration +T write   (3)
             And the total energy consumption can be represented in Equation (4).

                                       ETotal  E DaraGen +kE iteration +E write   (4)


             4      Performance Results

             4.1    Power Consuming Experiments

             For the experiments in this work, we use six HUAWEI Tecal RH2288 V2 Rack servers,
             each with 2 Intel Xeon Processor E5-2680 (Sandy Bridge-EP) running at 2.7 GHz [6].
             Each processor has 8 cores (16 hyperthreaded) and a 20MB L3 cache. They are connected
             through two QuickPath links, each providing a unidirectional transmission rate of up to
             8.0 GT/s. Each server has 24 8GB double data rate 3 (DDR3) at 1066Mhz dimms of main
             memory with a total ot 192GB. Each server is configured with eight 2.5” SAS HDDs with
             7.2TB capacity. One SAS disk hosts the OS and the remaining 7 are configured for
             HDFS. The server provides four onboard gigabit Ethernet (GE) ports, of which two were
             bonded to double bandwidth. Our system runs CDH 5 on centos 6.5. Cloudera is
             configured as one master node and five worker nodes. Each of the nodes has 189.1GBytes
             of memory. We use Apache Hadoop 2.6 and open source HiBench K-means that is
             popular for processing large datasets built for distributed applications. The power analyzer
             used in this test is HIOKI 3600.


                                                             67
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             4.2    Power Consuming Characterization

             The low level processing is finally turned to be a sparse metrics production to computer
             the vector distances. It makes the computing devices to execute the input data follow
             inside each K iteration Single Instruction Multiple Data (SIMD), i.e. multiply and plus
             operations repeatedly in a streaming way. Therefore, when the data size, the processor’s
             frequency and temperature are invariant, the CPU power can be modeled as a constant
             value inside each K-iteration.
                The resource usage and the consumption of K-means clustering computations for 25
             iterations are shown in Fig.3 and Fig. 4 for the whole cluster and one single node,
             respectively. This status is true in all Hadoop nodes, the total energy used to complete the
             K-means clustering task is linearly related to the input data size, I iterations and K number
             of clusters. For experimental purposes, a random sample of 1.2 billion records with 20
             variables was used. Power characterization and stress testing was done by varying the
             number of K for a fix data set of 225GB.


             5      Energy and Estimation

             5.1    CPU Usage and CPU Power
             For estimating the relationships between the CPU usage and the energy consumption, we
             first plotted the CPU usage and power against time. Visual inspection indicates a strong
             relationship between CPU usage and power. For the SIMD character of this experiment,
             the relationship is considered to be linear, allowing us to use statistical methods to predict
             power by CPU usage.


             5.2    Workload Characters, Performance and Power
             Fig. 5 shows the computation steps include data generation; clustering phase; and data
             marking and writing back for K-mean iteration I=10, I=15 and I=25, respectively. Fig. 6
             shows the energy measurement of one worker node for K-mean iteration I=25. While the
             total data size is identical, the time for data generation can be fixed as a constant,
             therefore the corresponding energy consumption is determined. The clustering
             computation follows the SIMD rule, the time taken is linear to the number of computing
             iterations. If a single iteration takes time T and consumes energy E, the total energy
             consumption during K-means with I=25 can be modeled as by one E energy multiplied by
             number of iterations, i.e. 25E. Because the number of objects N will not change at the
             marking and writing back stage, the corresponding time and energy consumption can be
             identically determined. A detailed energy consumption is listed in Table 1.


                                                          68
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             6     Error Analysis

             Power modeling and estimation at software level is relative accurate because of the
             integration between resource usage, the platform and physical measurement. However, the
             errors in power estimation of K-means clustering algorithm lies in the throughput to
             power ratio displacement. It causes the power values to not precisely fix to the same
             throughput when the workload size is small. This can be seen in the measurement of the
             peak dynamic power where the curve is not a straight line. The K-means R/W overheads
             and under-foots may also slightly change at each time when a new iteration of clustering
             launches. The power phase leg to the assembly execution will cause the calculation error,
             especially when the data size is small.
                The clustering phase of K-means has two stages, namely iteration (CPU-bound) and
             regrouping (I/O-bound). On the platform, CPU dominate the overall power consumption.
             The overhead introduced by storage devices, network devices are negligible compared
             with CPU. A linear model was developed to predict the power consumption based on
             CPU usage. The results of modeling are shown in Table 1. The energy consumption
             during clustering was calculated in multiple ways. The energy can simply be calculated
             from power using area under the curve. From Table.1, when iteration number I=10 the
             energy used by one server node during actual computation was 2,816,095 J, while the
             same energy calculated from the model is found to be 3,006,472 J, with an error of
             6.76%. When iteration number I=15 and I =25, the energy used by one server node
             estimated using the model is found with an error of 3.11% and 4.60% compared with the
             real measurement values, respectively. All calculations verify the results despite having
             synchronization between measurements and process runtime. This is significant because it
             allows a programmer to tune the power efficiency of an application by simply tune the
             CPU usage such as frequency, etc.


             7     Conclusion and Future Work

             A general approach is introduced for quantitative power estimation and analysis on
             Hadoop clusters for big data computing. The workload characteristics have been analyzed.
             Based on this, power parameters are captured by measuring the power of each component
             in a PE on a real system. The Hadoop PE’s power feature in executing the K-means
             clustering program can then be analyzed and concluded. The power consumption of a K-
             means program with same computational characteristics of any size that is running on the
             PE can be predicted based on the power consumption feature. Finally the approach is
             validated by measuring real computation power on the target hardware. The power
             analysis method can be refined to enhance its precision by including more components,
             and based on it the power parameters can be tuned for obtaining the best power
             performance for given problems. These will be considered in our future work.


                                                        69
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia


             Acknowledgement
             We thank Mr. Chaitanya Kundety for his assistance with experiments and the comments that
             improved the manuscript.


             References
              1. Sunpyo Hong, Hyesoon Kim, “An integrated GPU power and performance model”, Inter-
                 national Symposiumon Computer Architecture, Saint-Malo, France, (2010).
              2. H. Yang, Q. Zhao, Z. Luan, Depei Qian: iMeter: An integrated VM power model based on
                 performance profiling. In: Future Generation Computer Systems, Volume 36, July 2014, Pages
                 267–286.
              3. Da Qi Ren: Algorithm level power efficiency optimization for CPU–GPU processing element
                 in data intensive SIMD/SPMD computing. In: Journal of Parallel and Distributed Computing
                 71(2):245-253, Feb 2011, DOI: 10.1016/j.jpdc.2010.10.007.
              4. M. Malik, S. Rafatirah, A. Sasan1, H. Homayoun1: System and Architecture Level
                 Characterization of Big Data Applications on Big and Little Core Server Architectures. In:
                 2015 IEEE International Conference on Big Data, Santa Clara, USA, Oct 29-Nov 1, 2015.
              5. Tou, J. T., and Gonzalez, R. C: Pattern Recognition Principles. In: Addison-Wesley; 2nd
                 edition (August 1977), ISBN-10: 0201075873, ISBN-13: 978-0201075878
              6. Intel® Xeon®Processor E5 v3 Product Family, Processor Specification Update, Sep 2016.
                 https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-
                 e5-v3-spec-update.pdf .


                                                          70