FKmeans: A Fast Data Classification Technique to
   Handle Big Data Collected in Sensor Network
                                         Ola Majeda,‡, Hassan Harbb,†, Mohamad Hamzea,‡,, Ali Jabera,†
                                   a
                                       Computer Science department, Lebanese University, Beirut, Lebanon
                 b
                     Computer Science department, American University of Culture and Education (AUCE), Tyre, Lebanon
                     Emails: olamajed0@gmail.com, hassanhareb@auce.edu.lb, {mohammad.hamze,ali.jaber}@ul.edu.lb


     Abstract— In recent times, The development of wireless              together to evacuate huge amounts of redundant data steered
sensor netwoks (WSNs) assumes a noteworthy job in the ascent             on the networks, consequently to limit the amount of
of big data as a large number of their applications gather               transmission and conserve energy.
enormous amounts of data that require processing. Therefore,
WSN faces two noteworthy difficulties. To begin with, it
                                                                             In this work, we propose a fast kmeans, abbreviated by
handles the big data collection, and second, the energy of
sensors will be drained rapidly because of the immense volume            FKmeans clustering technique used for wireless sensor
of data gathering and transmission. Subsequently, flow look              networks in order to reduce the amount of transmitted data
into has been centered around data classification as a                   in the network, and thus save energy consumption. Two
proficient technique to decrease big data accumulation in                stage algorithms is proposed in this work which strongly
WSNs along these lines upgrading their lifetime. This paper              outperform the conventional Kmeans regarding time cost of
proposes a fast data classification technique called FKmeans,            distance calculation among data and center clusters. The
i.e. Fast Kmeans, committed to periodic applications in WSNs.            main purpose of the first stage (center selection stage) is to
FKmeans comprises of two phase algorithm to improve the                  find the ideal location of the centers by selecting a small
time cost of separation estimation of conventional Kmeans
                                                                         part of datasets instead of selecting the whole datasets. After
calculation accordingly, guarantee ensure fast data delivery to
the sink node. The main stage, i.e. center selection stage,              getting the center clusters from stage one, the second stage
chooses a little segment of datasets with the end goal to locate         will use the conventional Kmeans algorithm which is
the most ideal area of the centers. The second stage, i.e. cluster       adopted to the Euclidean distance for assigning each dataset
formation stage, uses the conventional Kmeans algorithm                  to its corresponding clusters. Hence, FKmeans will show a
adopted to the Euclidean distance where the underlying                   remarkable reduce of time cost when compared with
centers utilized are taken from the primary stage. Our                   traditional Kmeans, this is caused by small amount of
proposed technique is validated via simulations on real sensor           training data that is used in the first stage which resulted in
data and comparison with the conventional Kmeans algorithm.              few iteration loops in the second one.
The gotten outcomes demonstrate the adequacy of our
technique regarding enhancing the energy utilization and data
conveyance delay, without loss in data fidelity.                             The rest of the paper is talks about: Works related to
                                                                         data clustering in WSNs are shown in section II, next,
   Keywords—Wireless sensor networks, Periodic applications,             section III shows the architecture used in our network, after
Data Clustering, FKmeans Algorithm, Real Sensor Data.                    that, section IV talk about data clustering model which
                                                                         displayed at the aggregator level, section V show the real
                                                                         data sensors simulation, and finally section VI derives our
                        I. INTRODUCTION                                  paper and presents some perspectives.
   Wireless sensor networks (WSNs) have become a highly
active research today. Their applicability can be seen
diverse domains such as environment, human, medical,                                          II. RELATED WORK
industrial, etc. Typically, a WSN consists of a large number                 Data clustering, as defined by some researchers [1, 2, 3,
of sensor nodes which are densely deployed over the                      4, 5], is the process of grouping similar packets coming from
monitored area in order to collect, then transmit, the data              different sensors into groups or clusters thus, to eliminate
periodically to the remote sink node.                                    redundancy before sending final datasets to the sink; so that
                                                                         the number of transmissions through the network is
   Such type of data collection, usually referred to periodic            consequently reduced. The main goal behind classifying data
sensor network (PSN), generates a huge number of data                    is to eliminate redundant data transmission in order to
which makes data studying and analyzing a difficult task for             enhance the lifetime of the network and send only the
the enduser. Furthermore, sensor nodes have a non-                       information desired by the end user. Nowadays, data
renewable power supply and, once deployed, must work                     aggregation and clustering techniques are the most data
                                                                         classification methods in order to minimize the data
unattended. Thus, data collection and transmission in WSNs
                                                                         redundancy in WSNs.
should be minimized in order to reduce the energy
consumption and increase the network lifetime.                              Recently, clustering-based data reduction techniques
                                                                         have been used due to their importance in reducing the
    Hence, to dodge the previously mentioned issues,data                 energy consumption in WSNs [6, 7, 8]. In [6], the authors
clustering and classification techniques have been presented.            propose EBDSC scheme to enhance the lifetime of the
Clustering/Classification means to amass similar data                    network where the different devices balance the power


                                                                                                                                           12
consumption among them. The life duration of the node is
calculated if it is selected as a cluster head. The next cluster
head is the node that has the highest lifetime in the same
cluster. The authors in [9] propose DMLDA, an efficient
data aggregation technique that works on three tasks:
activating nodes, clustering nodes, and filtering messages.
    The authors propose in [10] a semi-structured
aggregation protocol based on multi-objective tree in WSNs.
Their main objective is to increase the aggregation
probability and then extend the network lifetime. The authors
present in [11] a Cycle-Based Data Aggregation Scheme                     Fig. 1. Distribution of sensors deployed in Intel lab
(CBDAS) for energy saving. In this technique, a grid of cells                                   network.
is created in the WSN, each with a head. These heads are
linked together to form a cyclic chain and then the data
transmission is reduced and the network lifetime is                      On the other hand, in the periodic collection model, data
prolonged. The authors suggest in [12] SFEB, which is a              are collected in a periodic basis where each period p is
Structure-Free and Energy-Balanced data aggregation                  partitioned into time slots. At each slot t, each sensor node Ni
protocol. This technique relies on two-phase aggregation             captures a new reading ri. At the end of the period p, Ni
process and a selection mechanism for dynamic aggregator             collects a vector of τ readings, i.e. Ri = {r1, r2, ..., rτ}, then it
that realizes the data gathering and reduces the number of           sends it to the sink (Fig. 2(a)). In our system, each sensor
transmissions.                                                       node sends periodically (period p) its data to the appropriate
                                                                     aggregator, which in turn sends it to the sink (Fig. 2(b)). Our
    Although most of the proposed techniques allow efficient         objective is to allow aggregator to classify datasets coming
data reduction, however they present several disadvantages.          from the sensors into groups of similar data then to send only
They are almost complex, sometimes they generate                     one useful information to the sink node.
communication overhead, and the sink may need some
transmissions to detect failures. In this paper, we present a
fast kmeans, abbreviated FKmeans, clustering technique for
wireless sensor networks to decrease the data transmission in
the network thus save the energy consumption. Then, in
order to evaluate our technique, we conducted a set of
simulations followed by experiments on a real environment
sensor networks.

         III. CLUSTER-BASED PERIODIC NETWORK
    After being deployed in the field of interest, sensor nodes
organize themselves in the network with the sink node. The
network’s topology plays an important role in WSNs because              (a)     periodic        data      (b) data filtering scheme
of its impact on energy consumption and the network
                                                                        acquisition
reliability. Indeed, There are two major topology have been
proposed for WSNs: tree-based and cluster-based. However,                         Fig. 2. Periodic data filtering scheme.
due to its ability to reduce transmission distance or hops
between sensors and the sink as well as to perform
aggregation processing at intermediate nodes, cluster-based                                IV. OUR TECHNIQUE
scheme has been more used compared to tree-based network.                At the end of each period, the aggregator will receive
Thus, in this paper, we are interested in the cluster-based          datasets coming from all sensors. Mostly, the spatial-
topology with the periodic data collection model in sensor           temporal correlation between nodes leads to high redundancy
networks.                                                            between the received datasets. Hence, the redundancy should
    From one hand, with cluster-based topology, we assume            be eliminated by the aggregator in order to save its energy
that each set of sensor nodes send their collected data to an        and reduce the big size of data sent to the sink.
intermediate nodes, called aggregators. Each aggregator has              In order to find redundant data sets received by the
an objective to clean data, using a specific filter defined later,   aggregator, we propose to use data clustering approach. Data
coming from neighboring sensor nodes before sending them             clustering is a data classification technique aims to group
to the sink. The aggregators can be defined prior to the             object having similar values with each other in order to
network deployment and could have more power than                    simplify their processing. In the literature, one can find a
normal sensor nodes, depending on the application                    huge number of data clustering algorithms like Kmeans,
requirements. Fig. 1 shows our sensor network architecture,          TopK neighboring, etc. However, Kmeans is one the most
where data transmission between sensor nodes and their               popular algorithms used in data classification/clustering.
appropriate aggregators is based on single-hop                       Unfortunately, traditional Kmeans suffer from its huge
communication.                                                       calculation time cost needed to find final datasets clusters. In
                                                                     order to overcome this problem, we propose a new version of
                                                                     Kmeans called FKmeans, Fast Kmeans, which highly
                                                                     enhances the time cost of traditional Kmeans. Our FKmeans


                                                                                                                                             13
consists of two stages of calculation, center selection and           In mathematics, the Euclidean distance is the ordinary
cluster formation stages, and uses Euclidean distance to          distance, i.e. straight line distance, between two points, sets
assign datasets to their proper clusters. In the next sections,   or objects. Let us consider two data sets, Ri and Rj, then the
we first recall traditional Kmeans and Euclidean distance         Euclidean distance (Ed) between them can be calculated as
then we details the two stages of our technique.                  follows:

A. Recall of Kmeans Algorithm
    Kmeans algorithm is based on the concept of
classifying/grouping data sets into K clusters using the          Where ri in Ri and rj in Rj.
means of sets. As a result, the similarity between sets in the
same cluster is high while the similarity between those in
different clusters is low. Kmeans clustering is a well-known      C. FKmeans: Fast Kmeans Algorithm
and well-studied exploratory data analysis technique. The             In data classification, the data latency represents one of
number of clusters is defined by K which is a positive integer    the main constraint that, in one hand, consumes energy of the
number. The main idea of Kmeans is to define K centroids,         aggregator due to the computational power and on other
one for each cluster. The process of Kmeans starts by taking,     hand, affects timely delivery of data to the sink node. One
each time, a data set from a given data sets then assigns it to   drawback of the traditional Kmeans algorithm is the time it
the nearest cluster centroid (Fig. 3).                            puts to generate the final clusters. This is due to the
                                                                  computation of Euclidean distance between all the received
                                                                  datasets and the centers of the clusters. Furthermore, this
                                                                  phase of cluster formation can be very complex, especially
                                                                  when it comes to sensor networks where readings’ sets can
                                                                  have ten hundreds or thousands elements.

                                                                       On the other hand, selecting randomly the centroids of
                                                                  the clusters at the initial step consumes a lot of time
                                                                  calculation at the aggregator level. Thus, it also affects the
                                                                  delivery time packet to the sink. Therefore, in order to
                                                                  minimize the data latency for the cluster formation, we
                                                                  propose a new version of Kmeans called FKmeans, Fast
                                                                  Kmeans, which is dedicated to periodic sensor applications
                                                                  having critical issue about the time delivery of packets to the
                                                                  sink node. Our FKmeans consists of two stages: center
                                                                  selection and cluster formation. In the next sections, we
                                                                  detail each of the proposed stage.

                                                                     1) Center Selection Stage
                                                                      Mostly, the efficiency and performance of the Kmeans
                                                                  algorithm is greatly affected by initial cluster centers as
          Fig. 3. Flow Chart of Kmeans Algorithm.                 different initial cluster centers often lead to different
                                                                  clustering. Thus, calculation time cost for the distance
    The first step is done when all points are assigned and an    between the centroids and the datasets will be high. Hence,
early clusters is forming. Then, we recalculate K new             selection of the initial center clusters is becoming a
centroids as centers of the new clusters. Once the new            challenge for Kmeans algorithm. To overcome this problem,
centroids are calculated, a new assigning has to be computed      researchers have proposed many techniques like density
between the datasets and the nearest new centroid. A loop         based, graph based, random based, etc. Unfortunately, most
has been formed. This loop results to change the location of      of these methods are very complex and not suitable to the
the K centroids step by step until no more changes are            WSN case that is characterized by small processing
noticed.                                                          capacity.
                                                                      The first stage of our adapted Kmeans is called center
                                                                  selection and aims to solve the above problem. We propose
B. Euclidean Distance
                                                                  to select a subset/training from the datasets coming to the
    Assigning data sets to the nearest cluster centroid is a      aggregator node in order to find the approximate final
fundamental process when applying Kmeans algorithm. To            cluster centers. Our intuition is to reduce the number of
do this, we propose to use distance functions as an important     iterations needed in the traditional Kmeans to obtain the
way of calculating the distance between two data sets.            final clusters, thus to enhance the processing time of the
Indeed, one can find a huge number of distance functions          Kmeans.
that have been used in the literature like Hamming, Cosine,           Obviously, the efficiency of the selection center stage is
Euclidean, etc. In this paper, we are interested in the           highly related to the percentage, represented by Ts (i.e.
Euclidean distance that is widely studied and used in             training size), of training datasets. Subsequently, increasing
different domains.                                                the value of Ts leads to increase the calculation time of
                                                                  FKmeans so no profit will be noticed compared to


                                                                                                                                    14
traditional Kmeans. On the other hand, the lowest the value                     V. PERFORMANCE EVALUATION
of Ts is the better processing time, and data delivery could           We introduce, in this section, the setup used to validate
be made but the error in the final obtained clusters will          the relevance and the efficiency of our proposal. We
increase thus the data accuracy at the sink node. Therefore,       conducted multiple series of simulations using a custom
selecting the appropriate value of Ts is very essential in the     Java based simulator. This section shows the simulation we
first stage of our technique. Indeed, we believe that Ts           conducted on data collected in the Intel sensor network [13].
should be determined by the decision makers or experts             In such network, 46 sensors are deployed in the Intel
depending on the application requirements. For instance, in        Berkeley Research Lab for approximately three months
health monitoring applications Ts must be lower than               collecting more than 3 millions of readings about weather
weather monitoring applications. Therefore, this parameter         conditions (temperature, humidity and light). Sensor
is based on the application criticality and the studied            sampling rate is fixed to 1 readings every 31 second. The
phenomenon. After selecting its value, the decision makers         positions of sensors inside the lab are shown in Fig. 5
assign the threshold Ts accordingly into all sensors nodes         (yellow sign indicates the dysfunction of some sensors). For
prior to deployment or they can adjust it online in function       simplicity reason, we show in this section the results of
of the application requirement.                                    temperature condition. The objective of our simulations was
                                                                   to confirm that our technique can successfully achieve
   2) Cluster Formation Stage                                      intended results for reducing the energy consumption in
After having the cluster centroids in the first stage, our         sensor nodes and extending network lifetime. In order to
objective now is to use such centers in the second stage in        evaluate the performance, we compare our results to the
order to form the final clusters. We believe that the obtained     traditional Kmeans.
centers will help in minimizing the within cluster sum of              In our simulations, we evaluated the performance using
squares error in the final clusters. Now, our objective in the     the following parameters:
second stage is to reduce the error with clusters, i.e. between          the period size, τ, takes the following values: 50,
datasets of each cluster. Figure 4 describes the procedure of               100, 150 and 200.
the second stage of our technique.                                       the percentage of data chosen, Ts, takes the
                                                                            following values: 5, 10, 15 and 20.
                                                                         the clusters number, K, takes the following values:
                                                                            4, 5, 6 and 7.


                                                                       Fig. 5. Distribution of sensors deployed in Intel lab
                                                                                             network.
         Fig. 4. Flow Chart of FKmeans Algorithm.

First, we determine the number of sets needed to find the          A. Execution Time
cluster centers in the first stage of our technique. Based on          Sometimes, delivering data as fast time as possible to the
this number, we randomly select the training sets among the        enduser is a crucial operation especially in e-health and
whole datasets R. The datasets in the training set represent       military applications. Fig. 6, show the execution time at
now the approximate centers of the clusters. Then, we              each sensor node for both FKmeans and the traditional
assign each set in training set to the nearest centers. This       Kmeans when varying the period size and the cluster
process is repeated until no more changes in the cluster           number. The results show that FKmean can optimize the
centers. At                                                        execution time, comparing always to the Kmeans, from 10%
this moment, the first stage is accomplished and the initial       (while varying taux from 50 to 100 measures) to 37% (while
centers are determined. After that, the second stage is            varying taux from 150 to 200 measures). Obviously, the
running where the process starts by considering the centers        execution time of FKmeans will be highly affected by the
obtained in the first stage as the initial centers to the          selection of the cluster centroids as well as the number of
clusters. Then, we assign each dataset in the whole datasets       iteration loops to obtain the final clusters.
to the nearest cluster center. Again, the loop is repeated until       Therefore, FKmeans outperforms the normal Kmeans
the cluster centers become fix in two successive loops.            where the processing time at the aggregator is twice
                                                                   accelerated when using FKmeans, compared to Kmeans


                                                                                                                                    15
algorithm to ensure fast data delivery to the sink node and     C. Variation of Sets Number Among Cluster
thus save energy.                                                   In this section, we study the distribution of sets between
                                                                the clusters after applying both FKmeans and Kmeans
                                                                algorithms along with the period number (Fig. 8). The
                                                                obtained results show that the sets are distributed in an
                                                                unequal way into the clusters. This confirms the behavior of
                                                                our FKmeans algorithm by classifying data sets based on
                                                                their dissimilarity and not in an equal way.


      (a) Ts=20, K=4                 (b) Ts=20, T=50
      Fig. 6. Processing time at the aggregator level.
                                                                                        (a) FKmeans
B. Iteration Loop
    One of the factor that can delay the delivery of message
is the number of iterations. In Fig. 7, we show how many
iterations are generated by the sensor at each period to find
the final clusters for both FKmeans and the Kmeans. It is
important to know that a high number of iterations can
increase the complexity of the proposed algorithm as well as
the data latency at the sensor. The obtained results show
that, The number of iterations is reduced by at least 20% as
                                                                                        (b) Kmeans
shown in these figure when applying FKmeans on the data
source. Therefore, FKmeans enhance the data latency by             Fig. 8. Number of sets in each cluster during periods,
reducing the number of iterations.                                                τ = 50, Ts = 20, K = 4.


                                                                D. Illustrative Example of Clusters
                                                                    In this section, we show an illustrative example to which
                                                                sensors are spatio-temporally correlated and how they are
                                                                clustered using our FKmeans algorithm, during a taken
                                                                period. We fixed the parameters as shown in Fig. 9. Based
                                                                on the figure, we can see that data generated by the sensors
                                                                in the lab are highly spatio-temporally correlated.
                                                                Furthermore, we can notice that a sensor is more correlated
                                                                to its nearest neighboring than the other nodes in the cluster.
                                                                However, sometimes, correlation between distant nodes can
                       (a) FKmeans                              be also seen due to the temporal correlation between their
                                                                generated data.


                       (b) Kmeans

      Fig. 7. Iteration loop number for FKmeans and                 Fig. 9. Example of FKmeans correlated sensors for
                   Kmeans, τ = 50, Ts = 20.                          each node during a period, τ = 50, Ts = 20, K = 4.


                                                                                                                                  16
E. Energy Consumption                                                                           REFERENCES
     In our technique, the aggregator will periodically receive    [1]  Jacques M Bahi, Abdallah Makhoul, and Maguy Medlej. An
sets of readings coming from all sensor nodes. After                    optimized in-network aggregation scheme for data collection in
                                                                        periodic sensor networks. In International Conference on Ad-Hoc
clustering redundant ones using FKmeans algorithm, the                  Networks and Wireless, pages 153–166. Springer, 2012.
aggregator selects centers of clusters to be sent to the sink      [2] Abdallah Makhoul, Hassan Harb, and David Laiymani. Residual
node, as a representative set of the cluster. In our simulation,        energy-based adaptive data collection approach for periodic sensor
we implemented the same energy model that used in [14] to               networks. Ad Hoc Networks, 35:149–160, 2015.
calculate the energy consumption in the aggregator level.          [3] Hassan Harb, Abdallah Makhoul, Rami Tawil, and Ali Jaber. Energy-
                                                                        efficient data aggregation and transfer in periodic sensor networks.
The proposed model computes the energy consumption in                   IET Wireless Sensor Systems, 4(4):149–158, 2014.
the aggregator when it receives data from sensors as well as       [4] Jacques M Bahi, Abdallah Makhoul, and Maguy Medlej. A two tiers
sending them to the sink. In addition, we compared our                  data aggregation scheme for periodic sensor networks. Adhoc &
results to those obtained with naïve approach where all                 Sensor Wireless Networks, 21(1), 2014.
datasets are sent from the aggregator to the sink without any      [5] Ramesh Rajagopalan and Pramod K Varshney. Data aggregation
clustering. Fig. 10 shows the energy consumed in aggregator             techniques in sensor networks: A survey. 2006.
depending on the period size. The obtained results show that       [6] Xiaoyan Kui, Jianxin Wang, Shigeng Zhang, and JLANNONG CAO.
                                                                        Energy balanced clustering data collection based on dominating set in
the energy consumption increases with the increasing of the             wireless sensor networks. Adhoc & Sensor Wireless Networks, 24,
period size while it is optimized, using FKmeans, up to 60%             2015.
compared to the naïve approach. Therefore, our proposed            [7] Pinghui Zou and Yun Liu. A data-aggregation scheme for wsn based
technique can be considered very efficiently in terms of                on optimal weight allocation. JNW, 9(1):100–107, 2014.
reducing the network energy consumption, thus, increasing          [8] M Shanmukhi and OBV Ramanaiah. Cluster-based comb-needle
its lifetime.                                                           model for energyefficient data aggregation in wireless sensor
                                                                        networks. In Applications and Innovations in Mobile Computing
                                                                        (AIMoC), 2015, pages 42–47. IEEE, 2015.
                                                                   [9] Tao Du, Zhe Qu, Qingbei Guo, and Shouning Qu. A high efficient
                                                                        and real time data aggregation scheme for wsns. International Journal
                                                                        of Distributed Sensor Networks, 11(6):261381, 2015.
                                                                   [10] Yao Lu, Ioan Sorin Comsa, Pierre Kuonen, and Beat Hirsbrunner.
                                                                        Dynamic data aggregation protocol based on multiple objective tree
                                                                        in wireless sensor networks. In Intelligent Sensors, Sensor Networks
                                                                        and Information Processing (ISSNIP), 2015 IEEE Tenth International
                                                                        Conference on, pages 1–7. IEEE, 2015.
                                                                   [11] Yung-Kuei Chiang, Neng-Chung Wang, and Chih-Hung Hsieh. A
                                                                        cycle-based data aggregation scheme for grid-based wireless sensor
                                                                        networks. Sensors, 14(5):8447–8464, 2014.
                                                                   [12] Chih-Min Chao and Tzu-Ying Hsiao. Design of structure-free and
                                                                        energy-balanced data aggregation in wireless sensor networks.
                                                                        Journal of Network and Computer Applications, 37:229–239, 2014.
         Fig. 10. Energy consumption in aggregator.                [13] Samuel Madden. http://db.csail.mit.edu/labdata/labdata.html. Intel
                                                                        Berkeley Research lab, 2004.
                                                                   [14] Rupali Rohankar, CP Katti, and Sushil Kumar. Comparison of energy
                    VI. CONCLUSION                                      efficient data collection techniques in wireless sensor network.
                                                                        Procedia Computer Science, 57:146–151, 2015.
    Wireless sensor networks (WSNs) are one of the most
advanced technologies nowadays. Therefore, researchers
have paid great attention in the past years to this hot field by
exploring the different challenges that cover it, while
presenting different solutions. Unfortunately, WSN suffers
from two major challenges: The big data collection that
complicate the decision and data analysis, and the energy


                                                                                                                                                17