Holistic distributed stream clustering for smart grids
                                             Pedro Pereira Rodrigues1 and João Gama2


1 LIAAD - INESC TEC & Faculty of Medicine of the University of Porto, Portugal, email: pprodrigues@med.up.pt
2 LIAAD - INESC TEC & Faculty of Economics of the University of Porto, Portugal, email: jgama@fep.up.pt

Abstract. Smart grids consist of millions of automated electronic meters that will be installed in electricity distribution networks and connected to servers that will manage grid supervision, billing and customer services. World sustainability regarding energy management will definitely rely on such grids, so smart grids also need to be sustainable themselves. This sustainability depends on several research problems that emerge from this new setting (from power balance to energy markets), requiring new approaches for knowledge discovery and decision support. This paper presents a holistic distributed stream clustering view of possible solutions for those problems, supported by previous research in related domains. The approach is based on two orthogonal clustering algorithms, combined for a holistic clustering of the grid. Experimental results are included to illustrate the benefits of each algorithm, while the proposal is discussed in terms of application to smart grid problems. This holistic approach could be used to help solve some of the smart grid intelligent layer research problems, thus improving global sustainability.


1 INTRODUCTION
The Smart Grid (SG), regarded as the next-generation power grid, is an electric system that uses two-way digital information, cyber-secure communication technologies, and computational intelligence in an integrated fashion across heterogeneous and distributed electricity generation, transmission, distribution and consumption to achieve energy efficiency. It is a loose integration of complementary components, subsystems, functions, and services under the pervasive control of highly intelligent management-and-control systems [4].
   A key and novel characteristic of smart grids is the intelligent layer that analyses the data produced by smart meters, allowing companies to develop powerful new capabilities in terms of grid management, planning and customer services for energy efficiency. The development of the market, with a growing share of load management incentives and an increasing number of local generators, will bring new difficulties to grid management and exploitation.


1.1    Research problems
Power and current balance is a major goal of all electricity distribution networks, given its impact on the need to produce, buy or sell energy. Moreover, due to the fluctuating power from renewable energy sources and loads, supply-demand balancing of the power system becomes problematic [17]. Several intelligent techniques have been proposed in the past that make use of the amounts of streaming data that are available. As an example, Pasdar and Mahne (2011) proposed using ant colony optimization on smart meter data to improve current balancing on low-voltage distribution networks. Further research could take even more advantage of smart grids if consumption patterns could be extracted [14].
   The energy market is changing to meet the global challenge of power consumption awareness, even at the lower household level [3]. New energy distribution concepts and the advent of smart grids have changed the way energy is priced, negotiated and billed. We are now in a world of hourly real-time pricing [1], which makes use of smart meters to overcome the need for demand prediction precision and, more importantly, demand prediction reliability [13]. Furthermore, with the advent of micro-generation at household level, the market has expanded into a multiplicity of energy buyers and energy sellers. In this setting, new techniques to efficiently auction in the market are required in order to make the smart grid smarter. Ramachandran et al. (2011) developed a profit-maximizing adaptive bidding strategy based on hybrid-immune-system-based particle swarm optimization.


1.2    Components and features
Smart grids are built on different sub-systems and present special features that need to be attended to. The sources of energy are heterogeneous (power plants, wind, sun, sea, etc.) and might be intermittent. A key characteristic of an SG is that it supports two-way flow of electricity and information: a user might generate electricity and put it back into the grid; electric vehicles may be used as mobile batteries, sending power back to the grid when demand is high, etc. This backward flow is relevant mainly in microgrids, where parts of the system might be islanded due to power failures. Following [4], the three major systems in an SG are:

• Smart infrastructure system that supports advanced and heterogeneous electricity generation, delivery and consumption. It is responsible for metering information and monitoring, and for information transmission among systems, devices and sensors.
• Management systems providing advanced management and monitoring, grid topology and control services. The objectives are energy efficiency improvement, supply and demand balance, emission control, operation cost reduction, and utility maximization.
• Protection system providing grid reliability analysis, failure protection, security and privacy protection services.


1.3    Advantages and challenges
Some of the anticipated benefits of an SG include [4]:

• improving power reliability and quality;
• optimizing facility utilization and averting construction of back-up (peak load) power plants;
• enhancing capacity and efficiency of existing electric power networks, hence improving resilience to disruption;
• enabling predictive maintenance and self-healing responses to system disturbances;
• facilitating expanded deployment of renewable energy sources;
• accommodating distributed power sources, while automating maintenance and operation;
• reducing greenhouse gas emissions by enabling electric vehicles and new power sources, thus reducing oil consumption by reducing the need for inefficient generation during peak usage periods;
• presenting opportunities to improve grid security;
• enabling transition to plug-in electric vehicles and new energy storage options;
• increasing consumer choice, new products, services, and markets.

   All these jointly lead to massive research problems that might be tackled by artificial intelligence techniques. Some challenges where machine learning can play a relevant role include:

• The reliability of the system relies on millions of meters and other devices that require online monitoring and global asset management [2].
• Real-time simulation and contingency analysis of the entire grid have to be possible. However, not all operations models currently make use of real-time data [8].
• Interoperability issues arise from the integration of distributed generation and alternate energy sources [17].
• The heterogeneity and volatility of smart grids require mechanisms to allow islanding [9] and self-healing [2].
• Finer granularity in management leads to strong demand response requirements [7] and dynamic pricing strategies [1].


2   THE DATA MINING POINT OF VIEW
Present SG monitoring systems suffer from the lack of machine learning technologies that can adapt the behavior of monitoring systems on the basis of the sequence patterns arriving over time. From a data mining point of view, a smart grid is a network (possibly decomposable) of distributed sources of high-speed data streams.
   Smart meters produce streams of data continuously in real time. A data stream is an ordered sequence of instances that can be read only once or a small number of times [6, 10], using limited computing and storage capabilities. These sources of data are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions. In smart grids the dynamics of data are unknown: the topology of the network changes over time, the number of meters tends to increase, and the context where the meter acts evolves over time.
   In smart grids, several knowledge discovery tasks are involved: prediction, cluster (profiling) analysis, event and anomaly detection, correlation analysis, etc. However, different types of devices present different levels of resources, and care should be taken in data mining methods that aim to extract knowledge from such restricted scenarios. All these characteristics constitute real challenges and opportunities for applied research in ubiquitous data mining. Generally, the main features inherent to ubiquitous learning algorithms are that the system should be capable of processing data incrementally, evolving over time, while monitoring the evolution of its own learning process and self-diagnosing this process. However, learning algorithms differ in the extent of self-awareness they offer in this diagnosis.
   One of the most popular knowledge discovery techniques is clustering, the process of finding groups in data such that data objects clustered in the same group are more alike than objects assigned to different groups [6]. There are two different clustering problems in ubiquitous and streaming settings: clustering sensor streams and clustering streaming sensors. The former problem searches for dense regions of the data space, identifying hot-spots where sensors tend to produce data, while the latter finds groups of sensors that behave similarly through time [15]. We identify two different settings for clustering problems in smart grids. In the first setting, a cluster is defined to be a set of sensors (meters, households, generators, etc.). In the second setting, a cluster is defined to be a set of data points (demand, supply, prices, etc.) generated by multiple sources.


2.1    Research on clustering electrical networks
Several real-world applications use machine learning methods to extract knowledge from sensor networks. The case of electricity load demand analysis is a paradigmatic one that has been (and continues to be) studied. Sensors distributed all around electrical-power distribution networks produce streams of data at high speed. Three major questions arise: a) can we define consumption profiles based on similar sensors? b) can we find global patterns in network consumption? and c) can we manage the uncertainty in sensor data?
   To efficiently find consumption profiles, clustering techniques were applied to the streams produced by each sensor, either hierarchically at a central server [16] or distributed in the network [15]. Although the problem is still very hard to model, given the dimensionality of the networks at stake, the incremental systems evolve and adapt to changes in the data, bridging the gap to future paths of research. Regarding global network patterns, related research has resulted in a system that distributes the clustering process into local and central tasks, based on single sensor data discretization and centralized clustering of frequent states [5]. But data and models are both uncertain. For example, if a sensor reads 100, most of the time it could be 99 or 101. This uncertainty has been tackled by reliability estimators and improved predictions using those estimates [13], but reliability for clustering definitions is still uncharted territory.


2.2    Clustering as a smart grid problem solver
In this work we argue that the major smart grid problems stated above can and should be addressed as unsupervised machine learning problems.

Power balance: Power balance is the most basic-level problem that smart grids need to solve before anything else. The strongest requirement is that energy is available in the entire network. Hence, clustering the data and sources together to find hot-spots can detect specific points of danger in the network.
Multiple alternate sources: In smart grids, supply and demand must be leveled across multiple alternate sources. Hence, combining clustering definitions for power demand and power supply should give indications on how to better level the sources.
Contingency analysis: Contingency analysis tries to produce detection and reaction mechanisms for specific unexpected problems. Hence, monitoring the evolution of clusters of nodes should help in detecting drifting sources of demand or supply.
Islanding: Islanding is a concept that is directly connected with clustering, in the sense that it searches for subnetworks where demand and supply are leveled. Hence, local distributed clustering of sources and data should produce the expected definitions.
Self-healing: Self-healing relates to the ability to rearrange and adapt the network on-the-fly to meet unexpected changes. Hence, ad-hoc distributed clustering of sources, independently from a centralized server, should produce procedures for self-healing.


Online monitoring and asset management: These features are strongly connected with incremental learning and adaptation of learned models. Hence, incremental models for sources and data clustering, and their evolution, should provide basic information.
Dynamic energy pricing: Energy pricing largely depends on supply and demand balance. Hence, clustering power demand and supply together with buy and sell prices should give insights on prospective energy pricing.

[Figure panel: Impact of the number of sensors on Kappa, averaged over values of s for each domain-cluster (d,k) pair. x-axis: Number of sensors (d), from 8 to 128; y-axis: Kappa, from 0.6 to 1.0; series for k = 2 to 7.]
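The islanding item above looks for subnetworks where demand and supply are leveled. As a minimal sketch of that idea only, and assuming a cluster assignment of nodes is already available from some clustering step, the per-cluster net power can be checked as follows. The node names, readings, the `tolerance` parameter and both helper functions are hypothetical and are not part of the algorithms discussed in this paper.

```python
from collections import defaultdict


def cluster_balance(assignment, demand, supply):
    """Return net power (supply minus demand) for each cluster of nodes."""
    net = defaultdict(float)
    for node, cluster in assignment.items():
        net[cluster] += supply.get(node, 0.0) - demand.get(node, 0.0)
    return dict(net)


def leveled_clusters(assignment, demand, supply, tolerance=0.5):
    """List clusters whose absolute net power is within the tolerance."""
    balance = cluster_balance(assignment, demand, supply)
    return sorted(c for c, value in balance.items() if abs(value) <= tolerance)


# Hypothetical readings (same power unit for demand and supply).
assignment = {"m1": "A", "m2": "A", "m3": "B", "m4": "B"}
demand = {"m1": 3.0, "m2": 1.0, "m3": 2.0, "m4": 2.0}
supply = {"m1": 0.0, "m2": 4.0, "m3": 0.0, "m4": 1.0}

balance = cluster_balance(assignment, demand, supply)
candidates = leveled_clusters(assignment, demand, supply)
print(balance)
print(candidates)
```

In this toy input, cluster A generates as much as it consumes, so it is the only candidate for operating as an island; cluster B would need to import power.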


3 HOLISTIC DISTRIBUTED CLUSTERING
The smart grid produces different types of data, on each source (node or subnetwork), which must be taken into account: power demand, power supply, energy sell price, energy buy price. As previously stated, two clustering problems exist: clustering data and clustering data sources. This way, each node might be assigned to a cluster on (at least) eight different clustering definitions (four types of data, each under two clustering problems). For all problems, there is a common requirement: each node (meter) should process its own data locally. Only aggregated data should be shared between the different nodes in the grid.

[Figure panel: Impact of the number of sensors on Kappa, averaged over values of k for each domain-overlap (d,s) pair. x-axis: Number of sensors (d), from 8 to 128; y-axis: Kappa, from 0.6 to 1.0; series for s = 0.01, 0.05 and 0.1.]
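The count of eight clustering definitions and the local-processing requirement stated at the start of this section can be illustrated with a short sketch. It is not L2GClust or DGClust: the `LocalNode` class, its method names and the readings are hypothetical, and the running mean merely stands in for whatever aggregate a real node would share.

```python
from itertools import product

QUANTITIES = ("power demand", "power supply", "energy sell price", "energy buy price")
PROBLEMS = ("clustering data", "clustering data sources")

# Four quantities under two clustering problems: eight clustering definitions.
definitions = list(product(QUANTITIES, PROBLEMS))


class LocalNode:
    """Hypothetical meter that never stores or shares its raw readings."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.mean = 0.0

    def observe(self, value):
        # Incremental mean update: one pass over the stream, constant memory.
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def summary(self):
        # Only this aggregate would be sent to other nodes.
        return {"node": self.name, "count": self.count, "mean": self.mean}


node = LocalNode("meter-1")
for reading in (2.0, 4.0, 6.0):
    node.observe(reading)

shared = node.summary()
print(len(definitions))
print(shared)
```

Keeping one such summary per quantity on each node is one possible way to respect the requirement that only aggregated data leaves the meter.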


   From the previous section it became clear that a holistic approach to clustering in smart grids is needed and should benefit energy sustainability. In this section we present such a proposal, based on two existing works on stream clustering (L2GClust and DGClust) and their prospective integration in a multi-dimensional clustering system. The next sections present the original clustering algorithms, their application to electricity demand sensor data streams, and how they could be merged into a holistic clustering system.

[Figure panel: Impact of Communication Incompleteness in Agreement (σ=0.05). y-axis: Average Proportion of Agreement, P(A), from 0.5 to 1.0.]


3.1 L2GClust: Distributed clustering of grid nodes


                                                                                                                                                                                                  0.55


                                                                                                                                                                                                         0.60


                                                                                                                                                                                                                0.65


                                                                                                                                                                                                                       0.70


                                                                                                                                                                                                                              0.75


                                                                                                                                                                                                                                     0.80


                                                                                                                                                                                                                                            0.85
                                                                                                                                                                                                                                            0.87
                                                                                                                                                                                                                                            0.89
                                                                                                                                                                                                                                            0.91
                                                                                                                                                                                                                                            0.93
                                                                                                                                                                                                                                            0.95
                                                                                                                                                                                                                                            0.97
                                                                                                                                                                                                                                            0.99
                                                                                                                                                                                                                                            1.00
                                                                                                                                                                        Probability of Message Loss (λ)


Figure 1. L2GClust: sensitivity of κ̂ statistic to the number of
sensors (d), for different number (k) and overlap (s) of clusters.
Bottom plot presents the impact of communication incompleteness on
average proportion of agreement for 5 clusters in a 128 sensor network.

Clustering streaming data sources has recently been tackled in
research, but usual clustering algorithms need the data streams to be
fed to a central server [15]. Considering the number of sensors
possibly included in a smart grid, this requirement could be a
bottleneck. A local algorithm was proposed to perform clustering of
sensors on ubiquitous sensor networks, based on the moving average of
each node’s data over time [15]. This algorithm, L2GClust, has two main
characteristics. On one hand, each sensor node keeps a sketch of its
own data. On the other hand, communication is limited to direct
neighbors, so clustering is computed at each node. The moving average
of each node is approximated using a memoryless fading average, while
clustering is based on the furthest point algorithm applied to the
centroids computed by the node’s direct neighbors. This way, each
sensor acts as a data stream source but also as a processing node,
keeping a sketch of its own data and a definition of the clustering
structure of the entire network of data sources.
   Global evaluation of the L2GClust algorithm on synthetic data
revealed high agreement with the centralized, yet streaming,
counterpart, being especially robust in terms of cluster separability.
Also, for stable concepts, empirical evidence of convergence was found.
Moreover, sensitivity analysis exposed the robustness of the local
algorithm approach. Figure 1 shows that agreement levels are robust to
an increase in the number of clusters, being, however, a bit more
sensitive to network size and cluster overlapping. Nonetheless, the
robustness to network communication problems is evident, as the
proportion of agreement is harmed only for high levels of
communication incompleteness.
   One important task in electrical networks is to define profiles of
consumers, to better predict their behavior in the near future.
L2GClust was applied to a sample of an electrical network to try to
find such profiles. From the raw data received at each sub-station,
observations were aggregated on an hourly basis over more than two and
a half years [14]. The log of electricity demand data from active
power sensors was used to check whether consumer profiles would
emerge. The log has hourly data from a subsample (780 sensors) of the
entire data set (∼4000 sensors). Since no information existed on the
actual electricity distribution network, the simulator used this
dataset as input data to a random network and monitored the resulting
clustering structures. Unfortunately, real data is never clean, and
half of the sensors have more than 27% missing values, which naturally
hindered the analysis. Given this, and the dynamic nature of the data,
no convergence was possible in the clustering structures. However, we
could stress that, as more data is fed to the system, better agreement
can be achieved with the centralized approach, as shown in Figure 2.
Hence, not only does the agreement tend to increase with more
observations, but changes in the clustering structure are also
apparently possible to detect. L2GClust presented good characteris-
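The two building blocks of L2GClust named earlier in this section, the memoryless fading average and furthest point clustering over neighbor centroids, can be illustrated with a minimal sketch. It is not the implementation from [15]: the fading factor `alpha`, the names `FadingAverage`, `furthest_point` and `local_clustering`, the restriction to one-dimensional values, and the choice to seed the clustering with the node's own estimate are all assumptions made for illustration.

```python
from typing import List


class FadingAverage:
    """Memoryless fading average: older observations decay by alpha.

    Only two numbers are stored, so no window of past data is kept.
    The value of alpha is an illustrative assumption.
    """

    def __init__(self, alpha: float = 0.9) -> None:
        self.alpha = alpha
        self.s = 0.0  # faded sum of observations
        self.n = 0.0  # faded count of observations

    def update(self, x: float) -> float:
        self.s = x + self.alpha * self.s
        self.n = 1.0 + self.alpha * self.n
        return self.s / self.n


def furthest_point(points: List[float], k: int) -> List[float]:
    """Furthest point heuristic on 1-D points; returns up to k centers."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        # Next center: the point furthest from its closest current center.
        nxt = max(points, key=lambda p: min(abs(p - c) for c in centers))
        centers.append(nxt)
    return centers


def local_clustering(own: float, neighbor_centroids: List[float], k: int) -> List[float]:
    """Cluster the node's own estimate together with its neighbors' centroids."""
    return furthest_point([own] + neighbor_centroids, k)
```

In this sketch a node would call `update` on each new reading and periodically call `local_clustering` with the centroids received from its direct neighbors, so no raw data leaves the node.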


Figure 2. L2GClust evolution of clustering agreement (probability of
agreement and κ̂ statistic) for a real active power sensor data log.
[Plot: "Evolution of Clustering Validity"; x-axis "Days" (0 to 930),
y-axis "Validity" (0.0 to 1.0); series P(A) and Kappa.]

[Plot: "Impact of the number of sensors on the normalized loss",
averaged over values of m for each domain-granularity (d,w) pair and
normalized by number of sensors; x-axis "Number of sensors (d)", y-axis
"Average Normalized Loss"; one series per granularity w (5 to 21).]

   The Distributed Grid Clustering (DGClust) algorithm was proposed
for clustering data points produced on wide sensor networks [5]. The
rationale is to use: a) online discretization of each single sensor's
data, tracking changes of data intervals (states) instead of raw data
(to reduce communication to the central server); b) frequent state
monitoring at the central server, which avoids processing all possible
state combinations (to cut computation); and c) online clustering of
frequent states (to keep high validity and adaptivity). Each local
sensor receives data from a given source, producing a univariate data
stream, which is potentially infinite. Therefore, each sensor’s data is
processed locally, being incrementally discretized into a univariate
adaptive grid. Each new data point triggers a cell in this grid,
reflecting the current state of the data stream at the local site.
Whenever a local site changes its state, that is, the triggered cell
changes, the new state is communicated to a central site. Furthermore,
the central site keeps the global state of the entire network, where
each local site’s state is the cell number of that site’s grid.
Nowadays, sensor networks may include thousands of sensors. This
scenario yields an exponential number of cell combinations to be
monitored by the central site. However, it is expected that only a
small number of these combinations are frequently triggered by the
whole network, so, parallel to the aggregation, the central site keeps
a small list of counters of the most frequent global states. Finally,
the current clustering definition is defined and maintained by an
adaptive partitional clustering algorithm applied to the central
points of the frequent states.
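Steps a) and b) of the DGClust rationale can be illustrated with a small simulation. This is a sketch under stated assumptions, not the algorithm from [5]: the grid is a fixed equal-width grid over a known range rather than an adaptive one, the bounded counter list uses a Space-Saving style replacement rule, and the names `LocalSite`, `CentralSite` and `run` are hypothetical. Step c), the online clustering of frequent states, is left out.

```python
from typing import Dict, List, Optional, Tuple


class LocalSite:
    """Discretizes one univariate stream into w equal-width cells.

    Assumption: the value range [lo, hi] is known in advance; the
    adaptive grid of the original algorithm is not reproduced here.
    """

    def __init__(self, lo: float, hi: float, w: int) -> None:
        self.lo, self.hi, self.w = lo, hi, w
        self.state: Optional[int] = None  # last triggered cell

    def observe(self, x: float) -> Optional[int]:
        """Return the new cell if the state changed, else None."""
        ratio = (x - self.lo) / (self.hi - self.lo)
        cell = min(self.w - 1, max(0, int(ratio * self.w)))
        if cell == self.state:
            return None  # same state: nothing is sent
        self.state = cell
        return cell


class CentralSite:
    """Keeps the global state and at most m frequent-state counters."""

    def __init__(self, d: int, m: int) -> None:
        self.global_state: List[Optional[int]] = [None] * d
        self.m = m
        self.counters: Dict[Tuple[int, ...], int] = {}

    def update(self, site: int, cell: int) -> None:
        self.global_state[site] = cell

    def count_current_state(self) -> None:
        if any(c is None for c in self.global_state):
            return  # some site has not reported yet
        key = tuple(self.global_state)
        if key in self.counters:
            self.counters[key] += 1
        elif len(self.counters) < self.m:
            self.counters[key] = 1
        else:
            # Assumed rule: replace the least frequent monitored state
            # and inherit its count plus one (Space-Saving style).
            victim = min(self.counters, key=self.counters.get)
            self.counters[key] = self.counters.pop(victim) + 1


def run(streams: List[List[float]], w: int, m: int) -> Tuple[CentralSite, int]:
    """Feed d aligned streams in [0, 1]; return central site and messages sent."""
    sites = [LocalSite(0.0, 1.0, w) for _ in streams]
    central = CentralSite(len(streams), m)
    messages = 0
    for t in range(len(streams[0])):
        for i, site in enumerate(sites):
            cell = site.observe(streams[i][t])
            if cell is not None:  # only state changes are communicated
                central.update(i, cell)
                messages += 1
        central.count_current_state()
    return central, messages
```

Comparing `messages` with the total number of readings gives a rough view of the communication saved by sending state changes instead of raw data, while `central.counters` never holds more than m of the possible cell combinations.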
[Plot: "Impact of the number of sensors on communication", averaged
over values of m for each domain-granularity (d,w) pair; y-axis
"Average Communication" (%); one series per granularity w.]

   To evaluate the sensitivity of the system to the number of sensors,
synthetic data was used, and we analyzed the average result for each
value of granularity (w), averaged over all values of the number of
frequent states to monitor (m, as loss seemed to be only lightly
dependent on this factor). In Figure 3 we note no clear trend,
strengthening the evidence of robustness to wide sensor networks.
Regarding communication reduction when compared with centralized
clustering, Figure 3 also shows that the amount of communication
reduction
                                                                                                                                                                                                                                   ●
                                                                                                                                                                                                                                           w=7
                                   60%


                                                                                                                                                                                                                                   ●
                                                                                                                                                                                                                                           w=5


                                                                                                                                                                                                                                                            does not depend on the number of sensors. This way, the benefits of
                                   50%


                                                                                                                                                                                                                                                            reduced transmission rates are extensible to wide sensor networks.
                                   40%


                                            2            8         16                      32                                          64                                                                                    128

                                                                                                                                 Number of sensors (d)
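As a rough illustration of where such a communication reduction can come from, the sketch below assumes that each sensor discretizes its readings into equal-width grid cells and transmits only when its current cell changes, whereas a centralized approach would transmit every reading. The function names, the equal-width grid, the `cells` parameter and the synthetic random-walk streams are assumptions made for this example; they do not reproduce the experiments reported in [5].

```python
import random

def to_cell(value, low, high, cells):
    """Map a reading to one of `cells` equal-width intervals over [low, high]."""
    if value <= low:
        return 0
    if value >= high:
        return cells - 1
    return min(cells - 1, int((value - low) / (high - low) * cells))

def messages_sent(stream, low, high, cells):
    """Count transmissions when a sensor only reports changes of grid cell."""
    sent, last_cell = 0, None
    for value in stream:
        cell = to_cell(value, low, high, cells)
        if cell != last_cell:      # the cell changed, so the sensor transmits
            sent += 1
            last_cell = cell
    return sent

def communication_reduction(streams, low, high, cells):
    """Share of readings that are never transmitted, over all sensors."""
    total = sum(len(s) for s in streams)
    sent = sum(messages_sent(s, low, high, cells) for s in streams)
    return 1.0 - sent / total

def random_walk(n, rng):
    """Synthetic bounded random walk in [0, 1], standing in for meter data."""
    x, out = 0.5, []
    for _ in range(n):
        x = min(1.0, max(0.0, x + rng.uniform(-0.02, 0.02)))
        out.append(x)
    return out

if __name__ == "__main__":
    rng = random.Random(1)
    for d in (2, 8, 32):
        streams = [random_walk(5000, rng) for _ in range(d)]
        print(d, round(communication_reduction(streams, 0.0, 1.0, cells=9), 3))
```

In this toy model each sensor decides on its own when to transmit, so the overall ratio is a length-weighted average of per-sensor ratios. That is one intuition for why the reduction need not change as sensors are added.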
Figure 3. DGClust: impact of the number of sensors on loss to real
centroids (top) and communication reduction (bottom) [5].

tics to find clusters of sensors in wide networks such as smart grids.

3.2 DGClust: Grid clustering of grid data streams

Clustering data points is probably the most common unsupervised
learning process in knowledge discovery. In ubiquitous settings,
however, few solutions are tailored to extracting knowledge that
defines dense regions of the sensor data space. Clustering examples in
sensor networks can be used to search for hotspots where sensors tend
to produce data. In this setting, grid-based clustering represents a
major asset, as regions can be, strictly or loosely, defined by both
the user and the adaptive process [5]. Applying clustering to grid
cells enhances the abstraction of cells as interval regions, which are
better interpreted by humans. Moreover, comparing intervals or grids is
usually easier than comparing exact points, as an external scale is not
required: intervals have intrinsic scaling. The comprehension of how
sensors are interacting in the network is greatly improved by using
grid-based clustering techniques for the data examples produced by
sensors.

3.3 HDClust: Holistic Distributed Clustering

The two algorithms presented above are designed for streaming data, and
work with reduced computational costs in terms of memory and
communication bandwidth. They have strong characteristics that could be
improved further if used together. In L2GClust, each sensor node has an
approximation of the global clustering. In DGClust, a centralized site
maintains the global cluster structure of the entire network at reduced
communication costs. The main idea of Holistic Distributed Clustering
(HDClust) is to integrate the local distributed approach of L2GClust
with the grid data clustering approach of DGClust, in order to achieve
the holistic clustering of data and sources on sensor networks such as
smart grids. Specifically, for each measured dimension:

• each local node (meter) keeps a sketch of its own data streams (as
  in L2GClust) and a local discretization grid (as in DGClust);
• communication is restricted to the neighborhood (as in L2GClust);
• at regular intervals, each local node receives from its neighbors
  the estimates of the cluster centroids (as in L2GClust) and the
  current discretized grid cell of their data (as in DGClust);
• each node keeps an estimate of the global clustering of nodes by
  clustering neighbors’ centroids (as in L2GClust);
• each node keeps a frequent state list and maintains a clustering of
  the most frequent states (as in DGClust) from the neighbors;


• to link clustering of sources with clustering of data, each node also
  receives from the neighbors their self-assignment to a cluster.

In the resulting cluster structure, each sensor maintains C clusters of
data sources, and K clusters of data points.
   In a smart grid context, and taking advantage of the decomposable
property of the grid network (microgrids), L2GClust and DGClust can
work together. Assume a microgrid of D sensors, and 4 dimensions or
quantities of interest: power demand, power supply, energy sell price
and energy buy price. With the resulting HDClust, the network is
summarized by C clusters of data sources and K clusters of data points,
for each quantity of interest. In real time and at each moment, each
sensor is in a state <c_i, k_i> in each dimension. Figure 4 presents
the global schema for a holistic approach to clustering, to be applied
at each node of a smart grid. The combination of the characteristics of
both algorithms seems not only possible, but extremely relevant as
complementary knowledge discovery in a holistic view of the grid.

[Figure 4 diagram. Input: read dimension data. Left branch (L2GClust): sketch data, get nodes centroids, ensemble clustering, yielding global nodes centroids (c). Right branch (DGClust): flag grid cell, get neighbors cells, check frequent items, data clustering, yielding neighbor data centroids (k). Both branches feed bi-clustering, giving <c,k> centroids, which are combined with the other dimensions into the holistic clustering (HDClust).]

Figure 4. HDClust schema to be applied at each node, for each included
dimension. Left branch applies L2GClust while right branch applies
DGClust using data from the neighbors, each node acting also as central
clustering agent. Both clustering definitions are then combined and
integrated with other measured dimensions.

4 REMARKS AND FUTURE PATHS

Smart grids are a paradigmatic example of ubiquitous streaming data
sources. Data is produced at high speed, from a dynamic (time-changing)
environment. Meters are geographically distributed, forming a network.
On top of clustering algorithms, several tasks can be computed:
profiling, anomaly and event detection, outlier detection, trends,
deviations, etc. In this paper, we have discussed distributed
clustering algorithms for data streams produced on wide sensor networks
like smart grids. Furthermore, we have shown how smart grid problems
can be addressed as clustering problems, and proposed a holistic
approach to better extract knowledge from the grid. We believe that
this holistic approach could help solve some of the research problems
of the smart grid intelligent layer. Current research focuses on the
integration of both algorithms into the schema and its evaluation on
real-world electrical network data.

ACKNOWLEDGEMENTS This work is funded by the ERDF through Programme
COMPETE and by the Portuguese Government through FCT, projects
PEst-C/SAU/UI0753/2011 and PTDC/EIA/098355/2008. The authors
acknowledge the help of Luís Lopes and João Araújo.

REFERENCES
 [1] H. Allcott, ‘Rethinking real-time electricity pricing’, Resource and Energy Economics, 33(4), 820–842, (2011).
 [2] S.M. Amin, ‘Smart grid: Overview, issues and opportunities. advances and challenges in sensing, modeling, simulation, optimization and control’, European Journal of Control, 17(5-6), 547–567, (2011).
 [3] F. Benzi, N. Anglani, E. Bassi, and L. Frosini, ‘Electricity smart meters interfacing the households’, IEEE Transactions on Industrial Electronics, 58(10), 4487–4494, (2011).
 [4] Xi Fang, Satyajayant Misra, Guoliang Xue, and Dejun Yang, ‘Smart grid – the new and improved power grid: a survey’, IEEE Communications Surveys & Tutorials, (2012). (to appear).
 [5] João Gama, Pedro Pereira Rodrigues, and Luís Lopes, ‘Clustering distributed sensor data streams using local processing and reduced communication’, Intelligent Data Analysis, 15(1), 3–28, (January 2011).
 [6] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan, ‘Clustering data streams: Theory and practice’, IEEE Transactions on Knowledge and Data Engineering, 15(3), 515–528, (2003).
 [7] A. Iwayemi, P. Yi, X. Dong, and C. Zhou, ‘Knowing when to act: An optimal stopping method for smart grid demand response’, IEEE Network, 25(5), 44–49, (2011).
 [8] J.A. Kavicky, ‘Impacts of smart grid data on parallel path and contingency analysis efforts’, in IEEE PES General Meeting, (2010).
 [9] R.H. Lasseter, ‘Smart distribution: Coupled microgrids’, Proceedings of the IEEE, 99(6), 1074–1082, (2011).
[10] S. Muthukrishnan, Data Streams: Algorithms and Applications, Now Publishers Inc, New York, NY, 2005.
[11] A. Pasdar and H.H. Mehne, ‘Intelligent three-phase current balancing technique for single-phase load based on smart metering’, International Journal of Electrical Power and Energy Systems, 33(3), 693–698, (2011).
[12] B. Ramachandran, S.K. Srivastava, C.S. Edrington, and D.A. Cartes, ‘An intelligent auction scheme for smart grid market using a hybrid immune algorithm’, IEEE Transactions on Industrial Electronics, 58(10), 4603–4612, (2011).
[13] Pedro Pereira Rodrigues, Zoran Bosnić, João Gama, and Igor Kononenko, ‘Estimating reliability for assessing and correcting individual streaming predictions’, in Reliable Knowledge Discovery, 267–287, Springer Verlag, (2012).
[14] Pedro Pereira Rodrigues and João Gama, ‘A system for analysis and prediction of electricity load streams’, Intelligent Data Analysis, 13(3), 477–496, (June 2009).
[15] Pedro Pereira Rodrigues, João Gama, João Araújo, and Luís Lopes, ‘L2GClust: Local-to-global clustering of stream sources’, in Proceedings of ACM SAC 2011, pp. 1011–1016, (March 2011).
[16] Pedro Pereira Rodrigues, João Gama, and João Pedro Pedroso, ‘Hierarchical clustering of time-series data streams’, IEEE Transactions on Knowledge and Data Engineering, 20(5), 615–627, (May 2008).
[17] K. Tanaka, A. Yoza, K. Ogimi, A. Yona, T. Senjyu, T. Funabashi, and C.-H. Kim, ‘Optimal operation of dc smart house system by controllable loads based on smart grid topology’, Renewable Energy, 39(1), 132–139, (2012).
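As an illustration of the per-node bookkeeping listed in Section 3.3, the following is a hypothetical sketch of the state a single node might keep for one measured dimension. The paper leaves the integration of both algorithms as ongoing work, so everything here is an assumption made for the example: the class and field names, the running mean used as the stream sketch, the equal-width grid, the message layout and the nearest-centroid assignment. None of it is taken from the published L2GClust or DGClust implementations.

```python
from collections import Counter
from dataclasses import dataclass, field

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    def dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))

@dataclass
class NeighborReport:
    """Assumed layout of what a neighbor sends at each regular interval."""
    centroids: list   # neighbor's estimates of the source-cluster centroids
    cell: int         # neighbor's current discretized grid cell
    cluster: int      # neighbor's self-assignment to a cluster of sources

@dataclass
class HDClustNode:
    low: float
    high: float
    cells: int
    count: int = 0
    mean: float = 0.0                       # running mean as a stand-in sketch
    cell: int = 0                           # current cell of the local grid
    reports: dict = field(default_factory=dict)
    frequent_states: Counter = field(default_factory=Counter)

    def observe(self, value):
        """Update the local sketch and the local grid cell with one reading."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        ratio = (value - self.low) / (self.high - self.low)
        self.cell = min(self.cells - 1, max(0, int(ratio * self.cells)))

    def receive(self, neighbor_id, report):
        """Store the latest report from one neighbor (neighborhood only)."""
        self.reports[neighbor_id] = report

    def neighborhood_state(self):
        """Own cell first, then the neighbors' cells ordered by neighbor id."""
        return (self.cell,) + tuple(self.reports[n].cell for n in sorted(self.reports))

    def tick(self):
        """Regular-interval step: count the current neighborhood state."""
        state = self.neighborhood_state()
        self.frequent_states[state] += 1
        return state

    def pair_state(self, source_centroids, data_centroids):
        """Return (c, k): nearest source cluster and nearest data cluster."""
        c = nearest((self.mean,), [(s,) for s in source_centroids])
        k = nearest(self.neighborhood_state(), data_centroids)
        return c, k
```

A node would call `observe` for every reading, `receive` and `tick` at each regular interval, and `pair_state` whenever the current <c_i, k_i> is needed. How the centroids themselves are estimated is left out on purpose.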