=Paper= {{Paper |id=None |storemode=property |title=Event and Anomaly Detection using Tucker3 Decomposition |pdfUrl=https://ceur-ws.org/Vol-960/paper2.pdf |volume=Vol-960 }} ==Event and Anomaly Detection using Tucker3 Decomposition== https://ceur-ws.org/Vol-960/paper2.pdf
Event and anomaly detection using Tucker3 decomposition
          Hadi Fanaee Tork1, Márcia Oliveira2, João Gama3, Simon Malinowski4 and Ricardo Morla5

Abstract.1 Failure detection in telecommunication networks is a             informatics, and credit card fraud detection. A significant amount
vital task. So far, several supervised and unsupervised solutions           of research has been devoted to solve this problem. However our
have been provided for discovering failures in such networks.               focus is on unsupervised methods. Anomaly detection techniques
Among them unsupervised approaches has attracted more attention             can be classified into five groups [1]: classification-based,
since no label data is required [1]. Often, network devices are not
                                                                            clustering-based, nearest neighbor based, statistical methods,
able to provide information about the type of failure. In such cases,
unsupervised setting is more appropriate for diagnosis. Among               information theory-based methods and spectral methods. Based on
unsupervised approaches, Principal Component Analysis (PCA)                 this classification, our method is placed in the group of spectral
has been widely used for anomaly detection literature and can be            methods. These approaches first decompose the high-dimensional
applied to matrix data (e.g. Users-Features). However, one of the           data into a lower dimension space and then assume that normal and
important properties of network data is their temporal sequential           abnormal data points appear significantly different from together.
nature. So considering the interaction of dimensions over a third           This some benefits: 1) they can be employed in both unsupervised
dimension, such as time, may provide us better insights into the            and supervised settings 2) they can detect anomalies in high
nature of network failures. In this paper we demonstrate the power          dimensional data, and 3) unlike clustering techniques, they do not
of three-way analysis to detect events and anomalies in time-               require complicated manual parameter estimation. So far, most of
evolving network data.
                                                                            the work related to spectral anomaly detection was based on
                                                                            Principal Component Analysis (PCA) and Singular Value
1         INTRODUCTION                                                      Decomposition (SVD). Two of the most important applications of
                                                                            PCA during recent years has been in the domain of intrusion
Event detection can be briefly described as the task of discovering         detection [2] [3] and traffic anomaly detection [4] [5].
unusual behavior of a system during a specific period of the time.
On the other hand, anomaly detection concentrates on the detection
of abnormal points. So clearly it is different from event detection         1.2       EVENT DETECTION
since it just considers the points rather than a group of points. Our
                                                                            Due to huge amount of sequential data being generated by sensors,
work takes into account both issues using multi-way data analysis.
                                                                            event detection has become an emerging issue with several real-
Our methodology comprises the following steps: 1) Anomaly
                                                                            world applications. Event is a significant occurrence or pattern that
detection: detection of individual abnormal users 2) Generating
                                                                            is unusual comparing to the normal patterns of the behavior of a
user trajectories (i.e. behavior of users over time), 3) Clustering
                                                                            system [6]. This can be natural phenomena or manual system
users’ trajectories to discover abnormal trajectories and 4)
                                                                            interaction. Some examples of events can be an attack on the
Detection of events: group of users who show abnormal behavior
                                                                            network, bioterrorist activities, epidemic disease, damage in an
during specific time periods. Although there is a rich body of
                                                                            aircraft, pipe-breaks, forest fires, etc. A real system behaves
research on the two mentioned issues, to the best of our knowledge
                                                                            normally most of the time, until an anomaly occurs that may cause
we are the first ones applying multi-way analysis to the anomaly
                                                                            damages to the system. Since the effects of an event in the system
and event detection problem. In the remainder of this section we
                                                                            are not known a priori, detecting and characterizing abnormal
explain some basic and related concepts and works. Afterwards, we
                                                                            events is challenging. This is the reason why most of the time we
define the problem, and then discuss three-way analysis methods.
                                                                            cannot evaluate different algorithms. One solution might be
Hereafter, we introduce the dataset and experiments. Finally, we
                                                                            injection of artificial event into the normal data. However,
discuss the results and point out possible future directions.
                                                                            construction of a realistic event pattern is not trivial [7].

1.1       ANOMALY DETECTION                                                 1.3       HIDDEN MARKOV MODELS
Anomaly is as a pattern in the data that does not conform to the
                                                                            Hidden Markov Models (HMMs) have been used at least for the
expected behavior [1]. Anomaly detection has a wide range of
                                                                            last three decades in signal processing, especially in domain of
application in computer network intrusion detection, medical
                                                                            speech recognition. They have also been applied in many other
                                                                            domains as bioinformatics (e.g. biological sequence analysis),
1
  LIAAD-INESC TEC, FEUP- University of Porto, hadi.fanaee@fe.up.pt          environmental studies (e.g. earthquake and wind detection), and
2
  LIAAD-INESC TEC, FEP- University of Porto, marcia@liaad.up.pt             finance (financial time series). HMMs became popular for its
3
  LIAAD-INESC TEC, FEP- University of Porto, jgama@fep.up.pt                simplicity and general mathematical tractability [8].
4
  INESC TEC, FEUP- University of Porto, sjfm@inescporto.pt
5
  INESC TEC, FEUP- University of Porto, ricardo.morla@inescporto.pt




                                                                        8
   HMMs are widely used to describe complex probability
distributions in time series and are well adapted to model time
dependencies in such series. HMMs assume that observations
distribution does not follow a normal distribution and are generated
by different processes. Each process is dependent on the state of an
underlying and unobserved Markov process [7]. Markov process
denotes that the value of a process Xt only depends on the previous
value of X. Using notations of [9] let:

   T = Length of the Observation sequence
   N = Number of states
   Q = {q0, q1, …, q (N-1) } = distinct states of Markov process                            Figure 1. Tucker3 Decomposition
   A = State transition probabilities
   B = set of N observation probability distributions
   π = Initial State Distribution                                          1.4.1     Tucker3 Model
   O = (O0, O1,…, O (T-1)) = observation sequence
                                                                           The Tucker3 model decomposes a three-mode tensor  into set of
   A HMM Model is defined with the triple of λ= (A, B, π). It              component matrices , ,  and a small core tensor . The
assumes that Observations are drawn using the observation                  following mathematical equation reveals the decomposition:
probability distribution associated to the current state. The
                                                                                                  
transition probabilities between states are given in matrix A.                        ≈ ∑ ∑ ∑   ×  ×   ×   (1)

   The three main problems related with HMMs are the following.            Where P, Q and R are parameters of the Tucker3 model and
The first problem consists in computing the probability P(O) that a        represent the number of components retained in the first, the
given observation sequence O is generated by a given HMM λ. The            second and the third mode of the tensor, respectively. This
second problem consists in finding the most probable sequence of           decomposition is illustrated in Figure 1.
hidden states given an observation sequence O and λ and the third
problem is related to parameter inference. It consists in estimating
the parameters of the HMM λ that best fits a given observation             2         PROBLEM DEFINITION
sequence O. The mainly used algorithms to solve these problems             One of significant issues in telecommunication systems, such as
are given in the last column of Table 1. More details about these          IP/TV, is to detect the anomalies at both network and user level. In
algorithms can be found in [10]. In this paper, we deal with the           order to study this, target users are usually equipped with a facility
third problem to estimate the HMM parameters that best describe            in their modem which sends an automatic notification message to
time series, as it will be explained in Section 2.                         the central server when the connection of a client in the network is
                                                                           lost or reestablished. These modems are ubiquitous and
                    Table1. Three HMM Problems                             geographically dispersed.
  Problem       Input    Output                 Solution
  Problem 1      λ, O     P(O)         Forward Backward algorithm          The modeling of such behavior is not straightforward because the
  Problem 2      λ, O    Best Q             Viterbi algorithm              number of notification messages is not equal for each user during
  Problem 3       O         λ            Baum-Welch algorithm              the time period under analysis. For instance, one user may face 40
                                                                           connection problems in an hour, hence generating 40 messages,
                                                                           while others may face 5 or even no problems at all. In standard
                                                                           event detection problems, for each time point there is a
1.4       THREE-WAY DATA ANALYSIS                                          measurement via one or multiple sensors. In the context of our
Traditional data analysis techniques such as PCA, clustering,              application, such measurements do not take place at regular time
regression, etc. are only able to model two dimensional data and           points, since user modems (or sensors) only send messages to the
they do not consider the interaction between more than two                 server when something unexpected occurs. Figure 2 illustrates two
dimensions. However, in several real-world phenomena, there is a           sample users. Each circle represents the time stamp at which a
mutual relationship between more than two dimensions (e.g. a 3D            notification relative to the given user is received, while ∆T
tensor (Users×Features×Time)) and thus, they should be analyzed            represents the inter-arrival time between two consecutive
through a three-way perspective. Three-way analysis considers all          messages. As it can be seen, 2 messages were related to user 1 in
mutual dependencies between the different dimensions and                   that period, while 4 were related to user 2 during the same period.
provides a compact representation of the original tensor in lower-         Also, the ∆T between messages is larger for user 1 than for user 2.
dimensional spaces. The most common three-way analysis models              This means that user 2 sent messages more frequently than user 1.
are Tucker2, Tucker3, and PARAFAC [10] which are generalized               As in many other event detection problems, we could easily use the
versions of two-mode principal component model or, more                    number of events per hour (measurement) at different users
specifically, SVD. Following, we briefly introduce Tucker3 model           (sensors) to detect the events but this way we would lose the
as the best-known method for analysis of three-way data.                   information content provided by the ∆T’s.
                                                                              As the number of ∆T is not the same for each user, this feature
                                                                           cannot be directly integrated in our model. Hence, this would cause




                                                                       9
                                                                             4.1       Abnormal Users
                                                                             We applied Tucker3 model to both datasets X102 and X909 by
                                                                             employing a MATLAB package called Three-mode component
                                                                             analysis (Tucker3) [10]. Before that, we performed ANOVA test
                                                                             [10] to see the significance of three-way and two-way interaction
                                                                             in the data. The results of this test are presented in Table 3. ANOVA
                                                                             Max 2D represents the maximum value obtained via different
                                                                             combinations of two-way modeling (e.g. I-J, J-K, I-K). As it can be
    Figure 2. Two sample users with different number of messages and         seen, bigger numbers are obtained for three-dimension interaction
                           different intervals
                                                                             (ANOVA 3D), which reveals that there is a mutual interaction
                                                                             between the three dimensions in both datasets that can be explained
some vectors to have different lengths, which is not supported by
                                                                             better with three-way modeling like Tucker3, than with two-way
the Tucker3 analysis. To solve this, every time-series of ∆T
                                                                             modeling like PCA.
relative to a given user is modeled by a 2-state HMM obtained by
the Baum-Welch algorithms [11]. 6 parameters are extracted from
                                                                                   Table3. ANOVA test and selected model parameters P-Q-R
the HMM and are used to describe the time-series of ∆T of the
                                                                                         ANOVA         ANOVA         Selected Model
users. Using this approach we obtain the same number of features               Data                                                        fit
                                                                                         max 2D           3D             P-Q-R
for each user and, then, include this information in our feature
                                                                               X102       26.18%        38.90%            3-2-2          42.00
vectors.
                                                                               X909       17.02%        78.04%           40-2-4          51.01
                     Table2. Datasets in tensor format
                                                                                 The next step is to estimate the best parameters P, Q, R of
                   1st mode (I)     2nd mode (J)        3rd mode (K)
       Data                                                                  Equation 1. P-Q-R is similar to what we have in PCA. In PCA we
                      Users           Features             Hours
                                                                             just determine the number of PCs for one dimension but here we
       X102            102               10                 720
                                                                             need to determine the number of principal components for each
       X909            909               10                 720
                                                                             one of the three modes. P, Q and R can assume values that fall
                                                                             within the interval [1, ], where  denotes the maximum
3           DATASET                                                          number of entities in the corresponding mode. For example, in
                                                                             terms of X102 the P-Q-R can go from 1-1-1 to 102-10-720. These
Dataset is extracted from the usage log of a European IP/TV                  parameters are chosen based on a trade-off between model
service provider. The raw dataset includes the notification                  parsimony, or complexity, and goodness of fit. For instance,
messages of users in each line including their occurrence time. As           regarding the mentioned dataset, 1-1-1 gives about 28% fit (less
previously mentioned, it is not possible to use this data directly in        complete and less complex) and model 102-10-720 gives 100% fit
our modeling approach, so some pre-processing steps were                     (most complete and most complex). If we try parameters 3-2-2 the
performed. In addition to the obtained HMM parameters for each               model has a 42% fit. So it can be more reasonable choice because
hour and for each user, we included another features, such as mean,          it finds a good compromise between complexity and fit. In [10] the
variance, entropy and number of messages per hour, to our feature            scree test method is proposed as a guideline to choose these
vector. We generated two separated datasets, each one spanning a             parameters. We used this test to determine the best model for both
time period of one month, which is equivalent to 720 hours. In one           datasets. The selected model parameters and their corresponding
set we selected 102 users and in another we selected 909 users. The          fits are presented in Table 3. This means that, for example, for
latter dataset is an extended version of the former. We then                 dataset X102 if we choose Tucker3 model with 3, 2 and 2
transformed both datasets to the tensor format. These datasets are           components to summarize the users, features and hours modes,
shown in a format of Tucker3 input tensor (figure 1) in Table 2              respectively, the model is able to explain 42% of the total variance
where I, J, K represent users, features and hours modes,                     contained in raw data. After the estimation of model parameters,
respectively.                                                                we used the selected model to decompose the raw data into a lower
                                                                             dimensional subspace, as illustrated in Figure 1 and Equation 1.
                                                                             After the decomposition we obtained matrices ,  and , a small
4           EXPERIMENTS
                                                                             core tensor  and a tensor of residual errors.
This section is divided into three subsections, according to the
steps mentioned in the Introduction section. In subsection 1, we             In order to detect the abnormal users we simply projected the users
explain how we detect the abnormal users. In the next subsection             on the component space yielded by matrix . This projection is
we describe how we generate user trajectories And in the last                presented in Figure 3, for dataset X102. The three-dimensional
subsection we explain how we cluster the trajectories using                  subspace is given by the three obtained components by the model
hierarchical clustering and detect events using user trajectories.           for the 1st mode (users). As mentioned earlier, this number of
                                                                             components is one of the parameters of the model, namely P = 3,
                                                                             which corresponds to the first mode.




                                                                        10
                                                                                       Figure 4. Generation process of user trajectories
                                                                             obtained by sequentially connecting these coordinates. Formally
                                                                             we define user trajectories as:

                                                                                Definition 1 (User Trajectory) : A sequence of time-stamped
                                                                             points, +,- = ./ → . → … → . → ⋯ → . , where . (, %, 3)
                                                                             (4 = 0,1, … 5), and 3 is a time point.

      Figure 3. Projection of Users on Matrix  for dataset X102             Figure 5 shows two abnormal users appearing in the top-10 users
                                                                             ranked based on abnormality. These abnormal users were ranked
In order to evaluate the reliability of the model we used the same           based on decreasing values of distance to the center, as explained
procedure and applied a Tucker3 model to dataset X909, which                 in subsection 4.1, user 10 (right) is ranked 2nd and user 95 (left) is
includes all users of X102. Our idea was to see how this model can           ranked 4th. However, as it is clear from the figure, their behavior
identify abnormal users from both datasets. For this purpose, we             over time is completely different. User 95 just shows two abnormal
computed the Euclidean distance between each user in the                     behaviors that correspond to two time points, while user 10 shows
projection space (see Figure 3) and the corresponding center (0, 0,          this abnormal behavior almost in all time points. This means that
0), for both datasets X102 and X909. Then we normalized the                  user 10 is dealing with a stable problem while user 95 only has
distances for each dataset and computed the Pearson correlation for          problems in specific points in time. This type of interpretation was
the common users of these two datasets, according to their distance          not possible based only on the ranking list of abnormal users,
to the center of the subspace. We obtained a correlation of 68.44%.          obtained in subsection 4.1. Using user trajectories provides us
Although, for X909 we just took 3 out of 40 main components to               richer insights into different kind of problems a user can
and model fit was different for both datasets (42% for X102 and              experience. For instance, what made user 95 be identified as
51.01% for X909),          abnormal or normal users in X102                  abnormal could be something that suddenly happened in the
approximately appeared as the same way in X909 with 68.44%                   network and then was quickly solved, while for user 10, some
confidence. This denotes that Tucker3 is a robust model to detect            problems occurred but they weren’t solved until the end of the time
the abnormal users.                                                          period under analysis.

4.2       User Trajectories
Visualization methods like the one we presented in Figure 3 are
not able to show the evolving behavior of users over time. We
need another solution to enable us understanding the behavior of
users over time. One solution is to project the users on a
decomposed feature space (matrix  of Figure 1) for each time
point. Since both of our selected parameters have Q equal to 2 it
means that after projecting Users on feature space we must have a
coordinate of (, %) for each timepoint and for each user. The
process of generating this coordinates is presented in Figure 4.
':, and ':,) represent the two components that summarize the                Figure 5. Two sample users trajectories in X909, Left) 4th ranked
original entities of the features mode and  represent the three-                    abnormal user Right) 2nd ranked abnormal user
order tensor (see Figure 1). The rows of the front matrix are the
users, the columns correspond to the features and the third mode
(*-axis) represents the hours. If we compute the dot product                 4.3       Event Detection from user trajectories
between each tensor’s rows with the columns of the component
                                                                                 Even though user trajectories can be useful, when the number of
matrix , yielded by the Tucker3 model we obtain the
                                                                             users is too large, the individual analysis of each trajectory can
coordinate(, %) for a given timepoint. If we repeat this procedure
                                                                             become a cumbersome task. If we notice that some group of users
for all time points (e.g. hours), we are able to generate the
                                                                             trajectories behave similarly, this can be understood as something
coordinates of each user for the 720 hours. The user trajectories are
                                                                             abnormal happens in their network level. Then some prevention or
                                                                             surveillance operations can be conducted more quickly.




                                                                        11
                                                                            clustering of trajectories. We are going to apply sliding window on
                                                                            trajectories to find time periods that have the most compact
                                                                            trajectories, which would lead to the discovery of events in a more
                                                                            accurate and reliable way


                                                                            ACKNOWLEDGEMENTS

                                                                            This work was supported by the Institute for Systems and Computer
Figure 6. Center Trajectory of clusters, Left: 1 user, Center : 866
                  users and Right: 22 users.                                Engineering of Porto (INESC TEC) under projects TNET with reference
                                                                            BI/120033/PEst/LIAAD and project KDUS with reference PTDC/EIA-
                                                                            EIA/098355/2008. The authors are grateful for the financial support. This
To explore this goal, we employed Agglomerative Hierarchical
Clustering toolbox from MATLAB to cluster user trajectories. We             work is also financed by the ERDF European Regional Development Fund
defined Euclidean distance between each point in trajectories as            through the COMPETE Programme (operational programme for
our distance function and Ward's criterion as the linkage criterion.        competitiveness) and by National Funds through the FCT Fundação para a
We tested different values of cut-off from 0.6 to 1.2 to examine the        Ciência e a Tecnologia (Portuguese Foundation for Science and
clustering structure. The most suited clustering structure was              Technology) within project CMU-PT/RNQ/0029/2009. We'd like to
obtained for a dendrogram distance of 1, which cuts the tree to             acknowledge the support of J.A. at the European IP/TV service provider.
level that, corresponds to three clusters. The average trajectory of
these clusters is shown in Figure 6. Cluster red has1 user (0.1%),
cluster blue comprises 866 users (97.4%) and cluster green                  REFERENCES
includes 22 users (2.5%). As it can be seen, no specific pattern can
be recognized from the green and the red cluster. The users in these         [1] Banerjee A., Kumar V. Chandola V., "Anomaly detection: a
two clusters show an abnormal behavior almost in all time points.                survey," 2009, pp. 1–58, 2009.
Such event can be due to a stable specific problem such as a                 [2] W. Wang and R. Battiti, "Identifying intrusions in computer
problem in the user device. Regarding the blue cluster, it is                    networks with principal component analysis," in International
possible to detect three events. First significant event occurs                  Conference on Availability, Reliability and Security, Vienna,
between hours 350 to 400. Second and third events also occur                     Austria, 2006, pp. 270 - 279.
between 450 to 480 and 520 to 560, respectively. However, the                [3] X. Guan, and X. Zhang W. Wang, "A novel intrusion
occurrence of the second and the third events should be assessed                 detection method based on principal component analysis," in
with hypothesis testing since they can be due to an accidental                   IEEE Symposium on Neural Networks, 2004, p. 2004.
change.
                                                                             [4] M.-L., Chen, S.-C., Sarinnapakorn, K., and Chang Shyu, "A
                                                                                 novel anomaly detection scheme-based on principal
5         CONCLUSIONS AND FUTURE WORK                                            component classifier," in 3rd IEEE International Conference
                                                                                 on Data Mining, Melbourne, Florida, 2003, pp. 353--365.
In this paper, we present a study on using the Tucker3
                                                                             [5] C. Callegari, "A novel PCA-based Network Anomaly
decomposition to discover abnormal users in an IP/TV network.
                                                                                 Detection," in IEEE International Conference on
Our results indicate that Tucker3 is a robust method for detecting
                                                                                 Communications (ICC), Pisa, Italy , 2011, pp. 1 - 5.
abnormal users in situations where interactions between the three
dimensions are present. From the tensor decomposition, we can                [6] M. C. et al. Kerman, "Event Detection Challenges, Methods,
define user trajectories. The trajectories allow us to observe the               and Applications in Natural and Artificial Systems," in 14th
behavior of these users over time. We were able to identify two                  International Command and Control Research and
kinds of abnormal users: those who show frequent abnormal                        Technology Symposium, Washington, DC, 2009,
behavior over the whole time period and those who are associated             [7] G. Shmueli, "Wavelet-based Monitoring in Modern
to one or few severe abnormal behaviors over the time period.                    Biosurveillance," University of Maryland, College Park,
Without resorting to the analysis of user temporal trajectories it               Report Working Paper RHS-06-002, 2005.
would have been harder to uncover such facts. Furthermore, from              [8] M. I. Zucchini W., Hidden Markov Models for Time Series.
the clusters of the users’ trajectories, we have identified three                USA/FL: Chapman & Hall/CRC, 2009.
events that occurred during three time points in the network. The
                                                                             [9] M. Stamp. (2004, Jan) Department of Computer Science, San
result of this work can be used in a real network surveillance
                                                                                 Jose           State           University.       [Online].
system to identify failures in the quickest possible time. In this
                                                                                 http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
work, we did not consider the spatial relation of users. Taking into
account spatial relationships between network nodes could lead to           [10] H.A.L., & van Mechelen, I. Kiers, "Three-way component
a better clustering of users. Since some users might show similar                analysis:   Principles   and       illustrative  application,"
behavior, with some delays, other distance measures for clustering               Psychological Methods, vol. 6, pp. 84-110, 2001.
should be tested. Currently we are employing another distance               [11] Lawrence R. Rabiner, "A tutorial on hidden Markov models
function using dynamic time warping, which assigns two users                     and selected applications in speech recognition," Proceedings
with same behavior but with a time shift in the same cluster. The                of the IEEE, vol. 77, no. 2, pp. 257-286, Feb 1989.
solution we presented for detection of events was based on




                                                                       12