FKmeans: A Fast Data Classification Technique to Handle Big Data Collected in Sensor Network Ola Majeda,‡, Hassan Harbb,†, Mohamad Hamzea,‡,, Ali Jabera,† a Computer Science department, Lebanese University, Beirut, Lebanon b Computer Science department, American University of Culture and Education (AUCE), Tyre, Lebanon Emails: olamajed0@gmail.com, hassanhareb@auce.edu.lb, {mohammad.hamze,ali.jaber}@ul.edu.lb Abstract— In recent times, The development of wireless together to evacuate huge amounts of redundant data steered sensor netwoks (WSNs) assumes a noteworthy job in the ascent on the networks, consequently to limit the amount of of big data as a large number of their applications gather transmission and conserve energy. enormous amounts of data that require processing. Therefore, WSN faces two noteworthy difficulties. To begin with, it In this work, we propose a fast kmeans, abbreviated by handles the big data collection, and second, the energy of sensors will be drained rapidly because of the immense volume FKmeans clustering technique used for wireless sensor of data gathering and transmission. Subsequently, flow look networks in order to reduce the amount of transmitted data into has been centered around data classification as a in the network, and thus save energy consumption. Two proficient technique to decrease big data accumulation in stage algorithms is proposed in this work which strongly WSNs along these lines upgrading their lifetime. This paper outperform the conventional Kmeans regarding time cost of proposes a fast data classification technique called FKmeans, distance calculation among data and center clusters. The i.e. Fast Kmeans, committed to periodic applications in WSNs. main purpose of the first stage (center selection stage) is to FKmeans comprises of two phase algorithm to improve the find the ideal location of the centers by selecting a small time cost of separation estimation of conventional Kmeans part of datasets instead of selecting the whole datasets. After calculation accordingly, guarantee ensure fast data delivery to the sink node. The main stage, i.e. center selection stage, getting the center clusters from stage one, the second stage chooses a little segment of datasets with the end goal to locate will use the conventional Kmeans algorithm which is the most ideal area of the centers. The second stage, i.e. cluster adopted to the Euclidean distance for assigning each dataset formation stage, uses the conventional Kmeans algorithm to its corresponding clusters. Hence, FKmeans will show a adopted to the Euclidean distance where the underlying remarkable reduce of time cost when compared with centers utilized are taken from the primary stage. Our traditional Kmeans, this is caused by small amount of proposed technique is validated via simulations on real sensor training data that is used in the first stage which resulted in data and comparison with the conventional Kmeans algorithm. few iteration loops in the second one. The gotten outcomes demonstrate the adequacy of our technique regarding enhancing the energy utilization and data conveyance delay, without loss in data fidelity. The rest of the paper is talks about: Works related to data clustering in WSNs are shown in section II, next, Keywords—Wireless sensor networks, Periodic applications, section III shows the architecture used in our network, after Data Clustering, FKmeans Algorithm, Real Sensor Data. that, section IV talk about data clustering model which displayed at the aggregator level, section V show the real data sensors simulation, and finally section VI derives our I. INTRODUCTION paper and presents some perspectives. Wireless sensor networks (WSNs) have become a highly active research today. Their applicability can be seen diverse domains such as environment, human, medical, II. RELATED WORK industrial, etc. Typically, a WSN consists of a large number Data clustering, as defined by some researchers [1, 2, 3, of sensor nodes which are densely deployed over the 4, 5], is the process of grouping similar packets coming from monitored area in order to collect, then transmit, the data different sensors into groups or clusters thus, to eliminate periodically to the remote sink node. redundancy before sending final datasets to the sink; so that the number of transmissions through the network is Such type of data collection, usually referred to periodic consequently reduced. The main goal behind classifying data sensor network (PSN), generates a huge number of data is to eliminate redundant data transmission in order to which makes data studying and analyzing a difficult task for enhance the lifetime of the network and send only the the enduser. Furthermore, sensor nodes have a non- information desired by the end user. Nowadays, data renewable power supply and, once deployed, must work aggregation and clustering techniques are the most data classification methods in order to minimize the data unattended. Thus, data collection and transmission in WSNs redundancy in WSNs. should be minimized in order to reduce the energy consumption and increase the network lifetime. Recently, clustering-based data reduction techniques have been used due to their importance in reducing the Hence, to dodge the previously mentioned issues,data energy consumption in WSNs [6, 7, 8]. In [6], the authors clustering and classification techniques have been presented. propose EBDSC scheme to enhance the lifetime of the Clustering/Classification means to amass similar data network where the different devices balance the power 12 consumption among them. The life duration of the node is calculated if it is selected as a cluster head. The next cluster head is the node that has the highest lifetime in the same cluster. The authors in [9] propose DMLDA, an efficient data aggregation technique that works on three tasks: activating nodes, clustering nodes, and filtering messages. The authors propose in [10] a semi-structured aggregation protocol based on multi-objective tree in WSNs. Their main objective is to increase the aggregation probability and then extend the network lifetime. The authors present in [11] a Cycle-Based Data Aggregation Scheme Fig. 1. Distribution of sensors deployed in Intel lab (CBDAS) for energy saving. In this technique, a grid of cells network. is created in the WSN, each with a head. These heads are linked together to form a cyclic chain and then the data transmission is reduced and the network lifetime is On the other hand, in the periodic collection model, data prolonged. The authors suggest in [12] SFEB, which is a are collected in a periodic basis where each period p is Structure-Free and Energy-Balanced data aggregation partitioned into time slots. At each slot t, each sensor node Ni protocol. This technique relies on two-phase aggregation captures a new reading ri. At the end of the period p, Ni process and a selection mechanism for dynamic aggregator collects a vector of τ readings, i.e. Ri = {r1, r2, ..., rτ}, then it that realizes the data gathering and reduces the number of sends it to the sink (Fig. 2(a)). In our system, each sensor transmissions. node sends periodically (period p) its data to the appropriate aggregator, which in turn sends it to the sink (Fig. 2(b)). Our Although most of the proposed techniques allow efficient objective is to allow aggregator to classify datasets coming data reduction, however they present several disadvantages. from the sensors into groups of similar data then to send only They are almost complex, sometimes they generate one useful information to the sink node. communication overhead, and the sink may need some transmissions to detect failures. In this paper, we present a fast kmeans, abbreviated FKmeans, clustering technique for wireless sensor networks to decrease the data transmission in the network thus save the energy consumption. Then, in order to evaluate our technique, we conducted a set of simulations followed by experiments on a real environment sensor networks. III. CLUSTER-BASED PERIODIC NETWORK After being deployed in the field of interest, sensor nodes organize themselves in the network with the sink node. The network’s topology plays an important role in WSNs because (a) periodic data (b) data filtering scheme of its impact on energy consumption and the network acquisition reliability. Indeed, There are two major topology have been proposed for WSNs: tree-based and cluster-based. However, Fig. 2. Periodic data filtering scheme. due to its ability to reduce transmission distance or hops between sensors and the sink as well as to perform aggregation processing at intermediate nodes, cluster-based IV. OUR TECHNIQUE scheme has been more used compared to tree-based network. At the end of each period, the aggregator will receive Thus, in this paper, we are interested in the cluster-based datasets coming from all sensors. Mostly, the spatial- topology with the periodic data collection model in sensor temporal correlation between nodes leads to high redundancy networks. between the received datasets. Hence, the redundancy should From one hand, with cluster-based topology, we assume be eliminated by the aggregator in order to save its energy that each set of sensor nodes send their collected data to an and reduce the big size of data sent to the sink. intermediate nodes, called aggregators. Each aggregator has In order to find redundant data sets received by the an objective to clean data, using a specific filter defined later, aggregator, we propose to use data clustering approach. Data coming from neighboring sensor nodes before sending them clustering is a data classification technique aims to group to the sink. The aggregators can be defined prior to the object having similar values with each other in order to network deployment and could have more power than simplify their processing. In the literature, one can find a normal sensor nodes, depending on the application huge number of data clustering algorithms like Kmeans, requirements. Fig. 1 shows our sensor network architecture, TopK neighboring, etc. However, Kmeans is one the most where data transmission between sensor nodes and their popular algorithms used in data classification/clustering. appropriate aggregators is based on single-hop Unfortunately, traditional Kmeans suffer from its huge communication. calculation time cost needed to find final datasets clusters. In order to overcome this problem, we propose a new version of Kmeans called FKmeans, Fast Kmeans, which highly enhances the time cost of traditional Kmeans. Our FKmeans 13 consists of two stages of calculation, center selection and In mathematics, the Euclidean distance is the ordinary cluster formation stages, and uses Euclidean distance to distance, i.e. straight line distance, between two points, sets assign datasets to their proper clusters. In the next sections, or objects. Let us consider two data sets, Ri and Rj, then the we first recall traditional Kmeans and Euclidean distance Euclidean distance (Ed) between them can be calculated as then we details the two stages of our technique. follows: A. Recall of Kmeans Algorithm Kmeans algorithm is based on the concept of classifying/grouping data sets into K clusters using the Where ri in Ri and rj in Rj. means of sets. As a result, the similarity between sets in the same cluster is high while the similarity between those in different clusters is low. Kmeans clustering is a well-known C. FKmeans: Fast Kmeans Algorithm and well-studied exploratory data analysis technique. The In data classification, the data latency represents one of number of clusters is defined by K which is a positive integer the main constraint that, in one hand, consumes energy of the number. The main idea of Kmeans is to define K centroids, aggregator due to the computational power and on other one for each cluster. The process of Kmeans starts by taking, hand, affects timely delivery of data to the sink node. One each time, a data set from a given data sets then assigns it to drawback of the traditional Kmeans algorithm is the time it the nearest cluster centroid (Fig. 3). puts to generate the final clusters. This is due to the computation of Euclidean distance between all the received datasets and the centers of the clusters. Furthermore, this phase of cluster formation can be very complex, especially when it comes to sensor networks where readings’ sets can have ten hundreds or thousands elements. On the other hand, selecting randomly the centroids of the clusters at the initial step consumes a lot of time calculation at the aggregator level. Thus, it also affects the delivery time packet to the sink. Therefore, in order to minimize the data latency for the cluster formation, we propose a new version of Kmeans called FKmeans, Fast Kmeans, which is dedicated to periodic sensor applications having critical issue about the time delivery of packets to the sink node. Our FKmeans consists of two stages: center selection and cluster formation. In the next sections, we detail each of the proposed stage. 1) Center Selection Stage Mostly, the efficiency and performance of the Kmeans algorithm is greatly affected by initial cluster centers as Fig. 3. Flow Chart of Kmeans Algorithm. different initial cluster centers often lead to different clustering. Thus, calculation time cost for the distance The first step is done when all points are assigned and an between the centroids and the datasets will be high. Hence, early clusters is forming. Then, we recalculate K new selection of the initial center clusters is becoming a centroids as centers of the new clusters. Once the new challenge for Kmeans algorithm. To overcome this problem, centroids are calculated, a new assigning has to be computed researchers have proposed many techniques like density between the datasets and the nearest new centroid. A loop based, graph based, random based, etc. Unfortunately, most has been formed. This loop results to change the location of of these methods are very complex and not suitable to the the K centroids step by step until no more changes are WSN case that is characterized by small processing noticed. capacity. The first stage of our adapted Kmeans is called center selection and aims to solve the above problem. We propose B. Euclidean Distance to select a subset/training from the datasets coming to the Assigning data sets to the nearest cluster centroid is a aggregator node in order to find the approximate final fundamental process when applying Kmeans algorithm. To cluster centers. Our intuition is to reduce the number of do this, we propose to use distance functions as an important iterations needed in the traditional Kmeans to obtain the way of calculating the distance between two data sets. final clusters, thus to enhance the processing time of the Indeed, one can find a huge number of distance functions Kmeans. that have been used in the literature like Hamming, Cosine, Obviously, the efficiency of the selection center stage is Euclidean, etc. In this paper, we are interested in the highly related to the percentage, represented by Ts (i.e. Euclidean distance that is widely studied and used in training size), of training datasets. Subsequently, increasing different domains. the value of Ts leads to increase the calculation time of FKmeans so no profit will be noticed compared to 14 traditional Kmeans. On the other hand, the lowest the value V. PERFORMANCE EVALUATION of Ts is the better processing time, and data delivery could We introduce, in this section, the setup used to validate be made but the error in the final obtained clusters will the relevance and the efficiency of our proposal. We increase thus the data accuracy at the sink node. Therefore, conducted multiple series of simulations using a custom selecting the appropriate value of Ts is very essential in the Java based simulator. This section shows the simulation we first stage of our technique. Indeed, we believe that Ts conducted on data collected in the Intel sensor network [13]. should be determined by the decision makers or experts In such network, 46 sensors are deployed in the Intel depending on the application requirements. For instance, in Berkeley Research Lab for approximately three months health monitoring applications Ts must be lower than collecting more than 3 millions of readings about weather weather monitoring applications. Therefore, this parameter conditions (temperature, humidity and light). Sensor is based on the application criticality and the studied sampling rate is fixed to 1 readings every 31 second. The phenomenon. After selecting its value, the decision makers positions of sensors inside the lab are shown in Fig. 5 assign the threshold Ts accordingly into all sensors nodes (yellow sign indicates the dysfunction of some sensors). For prior to deployment or they can adjust it online in function simplicity reason, we show in this section the results of of the application requirement. temperature condition. The objective of our simulations was to confirm that our technique can successfully achieve 2) Cluster Formation Stage intended results for reducing the energy consumption in After having the cluster centroids in the first stage, our sensor nodes and extending network lifetime. In order to objective now is to use such centers in the second stage in evaluate the performance, we compare our results to the order to form the final clusters. We believe that the obtained traditional Kmeans. centers will help in minimizing the within cluster sum of In our simulations, we evaluated the performance using squares error in the final clusters. Now, our objective in the the following parameters: second stage is to reduce the error with clusters, i.e. between  the period size, τ, takes the following values: 50, datasets of each cluster. Figure 4 describes the procedure of 100, 150 and 200. the second stage of our technique.  the percentage of data chosen, Ts, takes the following values: 5, 10, 15 and 20.  the clusters number, K, takes the following values: 4, 5, 6 and 7. Fig. 5. Distribution of sensors deployed in Intel lab network. Fig. 4. Flow Chart of FKmeans Algorithm. First, we determine the number of sets needed to find the A. Execution Time cluster centers in the first stage of our technique. Based on Sometimes, delivering data as fast time as possible to the this number, we randomly select the training sets among the enduser is a crucial operation especially in e-health and whole datasets R. The datasets in the training set represent military applications. Fig. 6, show the execution time at now the approximate centers of the clusters. Then, we each sensor node for both FKmeans and the traditional assign each set in training set to the nearest centers. This Kmeans when varying the period size and the cluster process is repeated until no more changes in the cluster number. The results show that FKmean can optimize the centers. At execution time, comparing always to the Kmeans, from 10% this moment, the first stage is accomplished and the initial (while varying taux from 50 to 100 measures) to 37% (while centers are determined. After that, the second stage is varying taux from 150 to 200 measures). Obviously, the running where the process starts by considering the centers execution time of FKmeans will be highly affected by the obtained in the first stage as the initial centers to the selection of the cluster centroids as well as the number of clusters. Then, we assign each dataset in the whole datasets iteration loops to obtain the final clusters. to the nearest cluster center. Again, the loop is repeated until Therefore, FKmeans outperforms the normal Kmeans the cluster centers become fix in two successive loops. where the processing time at the aggregator is twice accelerated when using FKmeans, compared to Kmeans 15 algorithm to ensure fast data delivery to the sink node and C. Variation of Sets Number Among Cluster thus save energy. In this section, we study the distribution of sets between the clusters after applying both FKmeans and Kmeans algorithms along with the period number (Fig. 8). The obtained results show that the sets are distributed in an unequal way into the clusters. This confirms the behavior of our FKmeans algorithm by classifying data sets based on their dissimilarity and not in an equal way. (a) Ts=20, K=4 (b) Ts=20, T=50 Fig. 6. Processing time at the aggregator level. (a) FKmeans B. Iteration Loop One of the factor that can delay the delivery of message is the number of iterations. In Fig. 7, we show how many iterations are generated by the sensor at each period to find the final clusters for both FKmeans and the Kmeans. It is important to know that a high number of iterations can increase the complexity of the proposed algorithm as well as the data latency at the sensor. The obtained results show that, The number of iterations is reduced by at least 20% as (b) Kmeans shown in these figure when applying FKmeans on the data source. Therefore, FKmeans enhance the data latency by Fig. 8. Number of sets in each cluster during periods, reducing the number of iterations. τ = 50, Ts = 20, K = 4. D. Illustrative Example of Clusters In this section, we show an illustrative example to which sensors are spatio-temporally correlated and how they are clustered using our FKmeans algorithm, during a taken period. We fixed the parameters as shown in Fig. 9. Based on the figure, we can see that data generated by the sensors in the lab are highly spatio-temporally correlated. Furthermore, we can notice that a sensor is more correlated to its nearest neighboring than the other nodes in the cluster. However, sometimes, correlation between distant nodes can (a) FKmeans be also seen due to the temporal correlation between their generated data. (b) Kmeans Fig. 7. Iteration loop number for FKmeans and Fig. 9. Example of FKmeans correlated sensors for Kmeans, τ = 50, Ts = 20. each node during a period, τ = 50, Ts = 20, K = 4. 16 E. Energy Consumption REFERENCES In our technique, the aggregator will periodically receive [1] Jacques M Bahi, Abdallah Makhoul, and Maguy Medlej. An sets of readings coming from all sensor nodes. After optimized in-network aggregation scheme for data collection in periodic sensor networks. In International Conference on Ad-Hoc clustering redundant ones using FKmeans algorithm, the Networks and Wireless, pages 153–166. Springer, 2012. aggregator selects centers of clusters to be sent to the sink [2] Abdallah Makhoul, Hassan Harb, and David Laiymani. Residual node, as a representative set of the cluster. In our simulation, energy-based adaptive data collection approach for periodic sensor we implemented the same energy model that used in [14] to networks. Ad Hoc Networks, 35:149–160, 2015. calculate the energy consumption in the aggregator level. [3] Hassan Harb, Abdallah Makhoul, Rami Tawil, and Ali Jaber. Energy- efficient data aggregation and transfer in periodic sensor networks. The proposed model computes the energy consumption in IET Wireless Sensor Systems, 4(4):149–158, 2014. the aggregator when it receives data from sensors as well as [4] Jacques M Bahi, Abdallah Makhoul, and Maguy Medlej. A two tiers sending them to the sink. In addition, we compared our data aggregation scheme for periodic sensor networks. Adhoc & results to those obtained with naïve approach where all Sensor Wireless Networks, 21(1), 2014. datasets are sent from the aggregator to the sink without any [5] Ramesh Rajagopalan and Pramod K Varshney. Data aggregation clustering. Fig. 10 shows the energy consumed in aggregator techniques in sensor networks: A survey. 2006. depending on the period size. The obtained results show that [6] Xiaoyan Kui, Jianxin Wang, Shigeng Zhang, and JLANNONG CAO. Energy balanced clustering data collection based on dominating set in the energy consumption increases with the increasing of the wireless sensor networks. Adhoc & Sensor Wireless Networks, 24, period size while it is optimized, using FKmeans, up to 60% 2015. compared to the naïve approach. Therefore, our proposed [7] Pinghui Zou and Yun Liu. A data-aggregation scheme for wsn based technique can be considered very efficiently in terms of on optimal weight allocation. JNW, 9(1):100–107, 2014. reducing the network energy consumption, thus, increasing [8] M Shanmukhi and OBV Ramanaiah. Cluster-based comb-needle its lifetime. model for energyefficient data aggregation in wireless sensor networks. In Applications and Innovations in Mobile Computing (AIMoC), 2015, pages 42–47. IEEE, 2015. [9] Tao Du, Zhe Qu, Qingbei Guo, and Shouning Qu. A high efficient and real time data aggregation scheme for wsns. International Journal of Distributed Sensor Networks, 11(6):261381, 2015. [10] Yao Lu, Ioan Sorin Comsa, Pierre Kuonen, and Beat Hirsbrunner. Dynamic data aggregation protocol based on multiple objective tree in wireless sensor networks. In Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on, pages 1–7. IEEE, 2015. [11] Yung-Kuei Chiang, Neng-Chung Wang, and Chih-Hung Hsieh. A cycle-based data aggregation scheme for grid-based wireless sensor networks. Sensors, 14(5):8447–8464, 2014. [12] Chih-Min Chao and Tzu-Ying Hsiao. Design of structure-free and energy-balanced data aggregation in wireless sensor networks. Journal of Network and Computer Applications, 37:229–239, 2014. Fig. 10. Energy consumption in aggregator. [13] Samuel Madden. http://db.csail.mit.edu/labdata/labdata.html. Intel Berkeley Research lab, 2004. [14] Rupali Rohankar, CP Katti, and Sushil Kumar. Comparison of energy VI. CONCLUSION efficient data collection techniques in wireless sensor network. Procedia Computer Science, 57:146–151, 2015. Wireless sensor networks (WSNs) are one of the most advanced technologies nowadays. Therefore, researchers have paid great attention in the past years to this hot field by exploring the different challenges that cover it, while presenting different solutions. Unfortunately, WSN suffers from two major challenges: The big data collection that complicate the decision and data analysis, and the energy 17