=Paper= {{Paper |id=Vol-2400/paper-23 |storemode=property |title=Mining Microscopic and Macroscopic Changes in Network Data Streams (Discussion Paper) |pdfUrl=https://ceur-ws.org/Vol-2400/paper-23.pdf |volume=Vol-2400 |authors=Corrado Loglisci,Angelo Impedovo,Michelangelo Ceci,Donato Malerba |dblpUrl=https://dblp.org/rec/conf/sebd/LoglisciICM19 }} ==Mining Microscopic and Macroscopic Changes in Network Data Streams (Discussion Paper)== https://ceur-ws.org/Vol-2400/paper-23.pdf
Mining Microscopic and Macroscopic Changes in
  Network Data Streams (Discussion Paper)

     Corrado Loglisci, Angelo Impedovo, Michelangelo Ceci, Donato Malerba
        Dept. of Computer Science, University of Bari "Aldo Moro", Bari, Italy
           {corrado.loglisci, angelo.impedovo, michelangelo.ceci,
                            donato.malerba}@uniba.it



        Abstract.     Network data streams oer an abstraction of complex sys-
        tems from the real-world, which can be seen as producers of unbounded
        sequences of complex data generated at high speed. Many complex sys-
        tems evolve according to stochastic processes which remain unknown
        to the interested users. As a consequence, changes happen in an un-
        predictable manner and may involve various portions of the observed
        complex systems. In this scenario, an interesting problem concerns the
        identication and characterization of the changes that may concern both
        the whole structure of a complex system and small parts of it. We con-
        jecture that the former can be explained by the latter and conversely,
        the latter can trigger the former. This type of problem requires a quite
        holistic strategy that traditional approaches often do not carry out be-
        cause they focus on either the whole network or on some portions only.
        In this discussion paper, we describe a descriptive data mining approach
        based on frequent pattern discovery that we designed for recent research
        work. It combines frequent pattern with automatic time-window setting,
        in order to identify and characterize  macroscopic changes     and micro-
        scopic changes   as changes that have an impact on a substantial part of
        the network or on specic portions, respectively. We provide arguments
        of the viability to real-world applications through two case studies, more
        precisely, telecommunication networks and geo-sensor networks.

        Keywords:    Change Mining, Network Streams, Frequent Subnetworks




1     Introduction
The dissemination of technologies able to unceasingly record information from
real-world applications has lead to the development of systems based on data
stream models. Often such systems work in highly dynamic and complex do-
mains where data are naturally interconnected. This is the case of social networks
and telecommunication infrastructures. Analyzing data continuously generated
    Copyright c 2019 for the individual papers by the papers' authors. Copying per-
    mitted for private and academic purposes. This volume is published and copyrighted
    by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
in these elds requires data stream mining algorithms that operate on network
data. In this scenario, a natural and interesting problem is that of analyzing data
streams in order to identify the changes in the network interactions over time.
However, in the domains in which human supervision or reference knowledge are
not promptly available we cannot learn models able to recognize changing behav-
iors, and therefore the supervised approaches give way to those unsupervised.
The descriptive data mining techniques and, in particular, the pattern-based
approaches are gaining great attention thanks to their peculiarities to provide
arguments to the change-related discoveries through statistical evidence. In [1],
the discovery of graph evolution rules starting from sequence of graph snapshots
from an evolving graph has been proposed. The rules are graphs in which nodes
denote entities while edges are labeled with the time-stamp. However, this ap-
proach is only able to represent insertions and not deletions of nodes/edges. In
[6], we studied how characterizing the evolution of specic portions of the whole
network by introducing the notion of evolution chains. In [9], the authors track
changes in the trend of statistical parameters of frequent patterns in social net-
works. The patterns are mined from descriptive properties of the nodes of the
networks and therefore do not provide information on the structural changes.
     We addressed the critical points of the aforementioned works through a
method, described in detail in [5], which contributes to the problem of change
mining in three dierent perspectives, that is, the task studied, methodologi-
cal approach and algorithmic solution. As for the task, we argue that changes
concerning the global properties of a network can be ascribed to a combination
of local variations. Changes of single portions (microscopic changes) can trigger
changes in the global properties of the network (macroscopic). More specically,
microscopic changes are associated with subnetworks whose properties change
over time, whereas macroscopic changes are associated with composite aggrega-
tions of microscopic changes.
     As for the methodological approach, we resort to the frequent pattern min-
ing framework to generate a summarizing form of the network data on which
identifying evolutions. Therefore, frequent subnetworks are not the primary ob-
jective of the paper, but the means to capture two kinds of changes. By using
pattern-subnetworks, we can search for changes in an abstract description of the
network [4], while alleviating the computational cost that the analysis of raw
data would require. So, single frequent subnetworks synthesize specic portions,
while sets of frequent subnetworks synthesize larger portions of the network.
This methodological decision is motivated by the fact that frequent subnetworks
provide arguments for the robustness of our method. Indeed, frequency denotes
statistical evidence, therefore, frequently occurring changes turn out to be more
interesting than episodic ones, because they are replicated and, probably, well-
established over time.
     As for the algorithmic solution, the hypothesis on the absence of labeling
of the changes suggest us to monitor data streams and decide when new data
snapshots exhibit a change, that is, when the network diers from the past. To
this end, we use two time-windows, one to collect data snapshots of the "re-
cent past" and the other to collect new data snapshots. This way, we capture
the status of the network in two distinct time-windows, keep patterns and their
statistical properties updated, and adapt the search strategy to the variations
of the underlying data distribution. The combined use of time-windows and fre-
quent subnetworks suggests us to search for i) macroscopic changes as variations
between sets of frequent subnetworks discovered on two time-windows, and ii)
microscopic changes as punctual variations of frequent subnetworks occurring at
the level of data snapshots. The use of two windows ("recent past" and "new
snaphosts") has been studied also for predictive tasks [8].

2    Basics and Problem Statement
In the following, we provide some basic concepts as well as the notions of change
of interest for this work. A network data stream is the time-ordered sequence D =
hG1 , G2 , . . .i of network snapshots Gi observed at time-point ti . Each snapshot
Gi is labeled graph with labeled edges Gi ⊆ N ×N ×L, where N is the set of nodes
and L is set of edge labels. In particular a landmark window W = [ti , tj ] of width
|W | = j − i + 1 is the sequence of consecutive snapshots {Gi , . . . , Gj }. Following
[3], a (landmark) window W 0 is successive to another (landmark) window W
when they share some initial time-points, that is, W = [ti , tn ] and W 0 = [ti , tm ],
with ti < tn < tm .
    The proposed approach relies on the frequent pattern mining framework. In
particular, we i) restrict the pattern language to subnetworks in which there
exists a path between any two nodes, and ii) mine frequent subnetworks from
snapshots of a window W . The support of a subnetwork S in W is dened as
sup(S, W ) = |{Gi ∈W      |S⊆Gi }|
                        |W |       . A subnetwork S is frequent in W if sup(P, W ) ≥
minSU P , where minSU P ∈ [0, 1]. Then, FW denotes the set of all the frequent
subnetworks discovered from the window W . We consider two types of changes:
Denition 1 (Macroscopic change). Given minM C ∈ [0, 1], a macroscopic
                                     |FW 0 −FW |+|FW −FW 0 |
change is found if M C(W, W 0 ) =       |FW |+|FW 0 −FW |    > minM C .

Denition 2 (Microscopic change). Given minGR ∈ [1, +∞), a subnetwork
                                       sup(S,W )
S denotes a microscopic change when i) sup(S,W 0 ) ≥ minGR if S ∈ (FW −FW 0 ),

      sup(S,W 0 )
or ii) sup(S,W ) ≥ minGR if S ∈ (FW 0 − FW ).

where, W = [ti , tn ] and W 0 = [ti , tm ] are two successive windows, FW and FW 0
are the sets of frequent subnetworks mined on W and W 0 , respectively.
    In these terms, the problem of identifying and characterizing macroscopic
and microscopic changes in a network data stream D = hG1 , G2 , . . .i can be
interpreted as the search for pairs of successive windows (W , W 0 ), in which the
i) quantication of the variations between FW and FW 0 exceeds the threshold
minM C and ii) quantication of the variations the support of S (either S ∈
(FW − FW 0 ) or S ∈ (FW − FW 0 )) exceeds the threshold minGR.
3    The proposed mining approach
In this discussion paper, we discuss the KARMA algorithm proposed in [5], which
delivers a computational solution to discover the changes before formulated. The
algorithm (Figure 1) iteratively consumes blocks Π of network snapshots coming
from the stream D (Step 2) by using two successive landmark windows W and
W 0 (Step 3). This way, it mines frequent subnetworks, FW and FW 0 , necessary to
the identication of both macroscopic and microscopic changes (Steps 4-5). The
window grows (W = W 0 , Step 8) with new network snapshots, and the associated
set of frequent subnetworks is kept updated until M C(W, W 0 ) > minM C and a
macroscopic change is found. In that case, the algorithm mines the microscopic
changes (Step 6) and drops the content of the window by retaining only the last
block of transactions (W = Π , Step 7). Then, the analysis restarts.

                                            8                                        ≤ minM C
                                                 W = W0
                                                FW = FW 0

1                     2                 3                   4                    5
     init. W with         new block Π            0                update              evaluate
                                                W =W ∪Π         FW 0 and FW          M C(W, W 0 )
    rst block of D         from D

                                            7               6
                                                 W =Π           micro. changes
                                                FW = FΠ          in (W, W 0 )        > minM C

                      Fig. 1. The KARMA algorithm owchart




    The algorithm relies on the frequent subgraph mining framework, which is
intractable for massive datasets due to the combinatorial explosion of the fre-
quent subgraphs. For this reason, the sets FW and FW 0 are kept updated and not
recomputed from scratch upon the arrival of new transactions. Algorithmically,
this is done by means of incremental solutions of frequent pattern mining.

3.1 Macroscopic change mining
When a new block of network snapshots Π is acquired, KARMA builds the win-
dow W 0 (W 0 = W ∪ Π ) and checks if there is a macroscopic change by matching
W 0 against W . To do that, KARMA oers two peculiarities: i) identication of
relevant changes on a summary of the data (the set of frequent subnetworks)
rather than on the raw data, and ii) quantication of the changes in terms of
the dierences between the set FW of frequent subnetworks discovered on W
and the set FW 0 of frequent subnetworks discovered on W 0 . It is desirable to
seek for changes on data by looking at frequent subnetworks in order to avoid
false alarms, thus achieving higher robustness to noisy data. In this scenario, a
single noisy network snapshot (e.g. a noisy outlier) would not aect the set of
frequent subnetworks, depending on the minimum support minSU P . In fact, the
main assumption behind the macroscopic change is that the whole network does
not exhibit a change between W and W 0 if the frequent subnetworks, FW and
FW 0 , do not signicantly change. To measure the amount of change, KARMA
computes the macroscopic change (MC) as M C(W, W 0 ) = |FW|F0 −F   W |+|FW −FW 0 |
                                                                W |+|FW 0 −FW |
                                                                                   .
Where W and W are two successive landmark windows, |FW − FW | denotes
                   0                                                   0

the number of subnetworks which are frequent in W and infrequent in W 0 , and
|FW 0 − FW | is the number of subnetworks which are frequent in W 0 and in-
frequent in W . The formula quanties the fraction of subnetworks which have
crossed the minimum support threshold minSU P , that is, those which were fre-
quent (infrequent) in W and become infrequent (frequent) in W 0 , and which
therefore indicate a relevant change in the underlying network data distribution.

3.2 Microscopic change mining
In KARMA, the discovery of microscopic changes is performed only when a
macroscopic change is spotted. Indeed, a microscopic change accounts for the
contribution that each subnetwork, which crosses the minimum support thresh-
old minSU P , gives to the macroscopic change. To do that, we resort the no-
tion of emerging pattern [2], which have been proven useful in the classication
setting when mining discriminative patterns between two classes. In KARMA,
we mine emerging subnetworks whose support signicantly spreads between
W and W 0 . This is done by quantifying the growth-rate of the subnetwork
                                                                            0
S , sup(S,W
    sup(S,W )
            0 ) ≥ minGR   if S ∈ (FW − FW 0 ) (or alternatively, sup(S,W      )
                                                                    sup(S,W ) ≥
minGR if S ∈ (FW 0 − FW )). Every emerging subnetwork S satisfying the
growth-rate inequality denotes a single microscopic change. The minGR thresh-
old is a tradeo between the completeness and the simplicity of the descriptive
model. Low values of minGR lead to expressive models, while high values lead
to synthetic models about the change.

3.3 KARMA at work
We show the applicability of KARMA for the analysis of a real-world network by
discussing some discoveries and commenting on their actionability with respect
to facts and events occurred in the domain. In the following, we discuss two case
studies: the rst39 in the domain of telecommunication networks (NODOBO
dataset), and the second in the domain of geo-sensor networks (NOAA dataset).
    The NODOBO 1 dataset concerns a communication network and contains
telecommunication transactions gathered during a study of the mobile phone
usage of 27 students of a Scottish state high-school, from September 2010 to
February 2011. The students communicate by phone calls, text messages (SMS)
and Bluetooth connectivity. When building the network, the nodes represent the
students, while the edges represent the dierent modalities of communication.
1
    http://nodobo.com/release.html
    The dataset NOAA2 was developed in the context of the Reanalysis project
by the National Center for Environmental Prediction and the National Cen-
ter for Atmospheric Research. The project aimed at providing new atmospheric
analysis by gathering daily measurements of various meteorological quantities
(e.g. relative humidity and air temperature) by means of geo-localized sensors
equally distributed over space. In this work, we built the network with the mea-
surements of relative humidity of the time-interval January, 1st 1990 - December,
31st 2010, recorded daily on an area that roughly covers North-Central Amer-
ica. The nodes of the network represent the sensors, while the edges are nominal
values denoting the relative humidity values measured on the two linked sensors.




Fig. 2. Number of microscopic changes (    #ES ) discovered from the NODOBO dataset
as a function of time. The values of the   #ES have been multiplied by 10 .
                                                                       -3




    As for the analysis of NODOBO, in Figure 2, we see a succession of points
with a decreasing trend, which has the peak on October 26th, 2010 and the
lowest number of microscopic changes on December 6th, 2010. To give a practical
interpretation to this behavior, it is useful to say that in Scottish state high-
schools there is a holiday period which covers the second and third week of
October, thereafter the school activities continue. So, the projection of Figure 2
reveals that, when the school activities resume, there is high variability (many
microscopic changes) in the modalities of communication, which, as time goes
by, tends to decrease. This may provide indications on the use of mobile phone
of the students, which can be exploited, for instance, to plan the school policies
and to improve the mobile network services in the area.
    Among the microscopic changes associated to the peak, we nd the pattern
P ={(student_14, student_0, bluetooth ), (student_18, student_0, bluetooth ),
(student_2, student_0, high_length )}. It is involved in the strongest macro-
scopic change (quantied as 0.94), which starts on October 26th 2010 at 3:00
(the time-point after the window [2010 Oct 25-22:00, 2010 Oct 26-2:55]) and
terminates on October 26th 2010 at 7:55. Specically, P denotes the doubling
2
    https://coastwatch.pfeg.noaa.gov/erddap/griddap/esrlNcepRe.html
(growth-rate equals to 2.0) of the occurrences of the associated subnetwork from
the window [2010 Oct 25-22:00, 2010 Oct 26-2:55] to the window [2010 Oct
25-22:00, 2010 Oct 26-7:55]. On the contrary, (we veried) P does not appear
in the set of microscopic changes corresponding to the successive macroscopic
change detected between the landmark windows [2010 Dec 05-22:10, 2010 Dec
06-3:05] and [2010 Dec 05-22:10, 2010 Dec 06-8:05], which may indicate that the
modalities of interaction among the three students become stable.




Fig. 3. Number of microscopic changes (#ES ) discovered from the NOAA dataset as
a function of time.


    As for the NOAA domain (Figure 3), there are several macroscopic changes,
those which have a greater impact on the network are characterized by more than
150 microscopic changes. In particular, there are two macroscopic changes with
the highest number, they start on October 17th, 1995 and August 2008, 9th re-
spectively. We deepened the microscopic changes corresponding to the two points
and spotted several emerging subnetworks in common, one is P ={(<10,300>,
<10,305>, from_70_to_80 ), (<10,300>, <7.5,297.5>, from_80_to_90 ),
(<12.5,297.5>, <7.5,297.5>, from_80_to_90 )}.
    By mapping the nodes of P into a geodesic space, we see they identify two
regions, both cover approximately the area of the state of Venezuela and part
of the Caribbean sea, where there are small dierences in terms of relative hu-
midity (the edge labels refer to consecutive ranges). This meteorological scenario
becomes less frequent (the growth-rate decreases) over the window [1995 Oct 17,
1995 Nov 30] and [2008 Aug 09, 2008 Sep 22] respectively, which suggests the
possibility of dierent behavior, in the same geographic area, occurred before or
after those two macroscopic changes.

4    Conclusions
In this discussion paper, we have investigated the problem of identifying relevant
changes in network data streams, where the changes can be distinguished in two
categories: macroscopic changes and microscopic changes. The system presented
in the paper, called KARMA, is able to simultaneously extract macroscopic
changes and microscopic changes by exploiting the fact that they are inevitably
related to each other. Two case studies have shown the usefulness and the ac-
tionability of the changes in the domain of geo-sensor networks and telecommu-
nication networks. An extensive discussion on the inuence of the parameters
(minSUP, minMC, minGR) on the results can be found in [5], where a compar-
ative evaluation of the running times is also reported. The interested reader can
refer to the journal paper for further details.
    For future work, we plan to investigate two main research directions: i) use
of solutions of big data analytics to detect changes in very large networks, and
ii) study of the closed patterns [7] to discover non-redundant subnetworks.


References
1. Berlingerio, M., Bonchi, F., Bringmann, B., Gionis, A.: Mining graph evolution
   rules. In: Proc. of the European Conference on Machine Learning and Knowledge
   Discovery in Databases: Part I. pp. 115130. ECML PKDD '09, Springer-Verlag,
   Berlin, Heidelberg (2009)
2. Dong, G., Li, J.: Ecient mining of emerging patterns: Discovering trends and
   dierences. In: Proceedings of the Fifth ACM SIGKDD International Conference
   on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 15-18,
   1999. pp. 4352 (1999)
3. Gama, J., Gaber, M.M.: Learning from data streams: processing techniques in sensor
   networks. Springer (2007)
4. Loglisci, C., Berardi, M.: Segmentation of evolving complex data and generation of
   models. In: Workshops Proceedings of the 6th IEEE International Conference on
   Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China. pp. 269273.
   IEEE Computer Society (2006). https://doi.org/10.1109/ICDMW.2006.144
5. Loglisci, C., Ceci, M., Impedovo, A., Malerba, D.: Mining microscopic and macro-
   scopic changes in network data streams. Knowl.-Based Syst. 161, 294312 (2018)
6. Loglisci, C., Ceci, M., Malerba, D.: Relational mining for discovering changes in
   evolving networks. Neurocomputing 150, Part A(0), 265  288 (2015)
7. Loglisci, C., Malerba, D.: Mining multiple level non-redundant association rules
   through two-fold pruning of redundancies. In: Perner, P. (ed.) Machine Learning
   and Data Mining in Pattern Recognition, 6th International Conference, MLDM
   2009, Leipzig, Germany, July 23-25, 2009. Proceedings. Lecture Notes in Computer
   Science, vol. 5632, pp. 251265. Springer (2009). https://doi.org/10.1007/978-3-642-
   03070-3_19
8. Loglisci, C., Malerba, D.: Leveraging temporal autocorrelation of histori-
   cal data for improving accuracy in network regression. Statistical Analy-
   sis and Data Mining 10(1), 4053 (2017). https://doi.org/10.1002/sam.11336,
   https://doi.org/10.1002/sam.11336
9. Nohuddin, P.N.E., Coenen, F., Christley, R., Setzkorn, C., Patel, Y., Williams, S.:
   Finding "interesting" trends in social networks using frequent pattern mining and
   self organizing maps. Knowl.-Based Syst. 29, 104113 (2012)