Event and anomaly detection using Tucker3 decomposition

Hadi Fanaee Tork (LIAAD-INESC TEC, FEUP, University of Porto, hadi.fanaee@fe.up.pt), Márcia Oliveira (LIAAD-INESC TEC, FEP, University of Porto, marcia@liaad.up.pt), João Gama (LIAAD-INESC TEC, FEP, University of Porto, jgama@fep.up.pt), Simon Malinowski (INESC TEC, FEUP, University of Porto, sjfm@inescporto.pt) and Ricardo Morla (INESC TEC, FEUP, University of Porto, ricardo.morla@inescporto.pt)

Abstract. Failure detection in telecommunication networks is a vital task. So far, several supervised and unsupervised solutions have been proposed for discovering failures in such networks. Among them, unsupervised approaches have attracted more attention since no labeled data are required [1]. Often, network devices are not able to provide information about the type of failure; in such cases, an unsupervised setting is more appropriate for diagnosis. Among unsupervised approaches, Principal Component Analysis (PCA) has been widely used in the anomaly detection literature and can be applied to matrix data (e.g. Users × Features). However, one of the important properties of network data is its temporal, sequential nature. Considering the interaction of these dimensions over a third dimension, such as time, may therefore provide better insights into the nature of network failures. In this paper we demonstrate the power of three-way analysis to detect events and anomalies in time-evolving network data.

1 INTRODUCTION

Event detection can be briefly described as the task of discovering unusual behavior of a system during a specific period of time. Anomaly detection, on the other hand, concentrates on the detection of abnormal points. It is therefore different from event detection, since it considers individual points rather than groups of points. Our work addresses both problems using multi-way data analysis. Our methodology comprises the following steps: 1) anomaly detection: detection of individual abnormal users; 2) generation of user trajectories (i.e. the behavior of users over time); 3) clustering of users' trajectories to discover abnormal trajectories; and 4) detection of events: groups of users who show abnormal behavior during specific time periods. Although there is a rich body of research on the two mentioned problems, to the best of our knowledge we are the first to apply multi-way analysis to anomaly and event detection. In the remainder of this section we explain some basic concepts and related work. Afterwards, we define the problem and then discuss three-way analysis methods. Hereafter, we introduce the dataset and experiments. Finally, we discuss the results and point out possible future directions.

1.1 ANOMALY DETECTION

An anomaly is a pattern in the data that does not conform to the expected behavior [1]. Anomaly detection has a wide range of applications in computer network intrusion detection, medical informatics, and credit card fraud detection. A significant amount of research has been devoted to this problem; our focus, however, is on unsupervised methods. Anomaly detection techniques can be classified into six groups [1]: classification-based, clustering-based, nearest-neighbor-based, statistical, information-theoretic and spectral methods. Based on this classification, our method belongs to the group of spectral methods. These approaches first decompose the high-dimensional data into a lower-dimensional space and then assume that normal and abnormal data points appear significantly different from each other. This has several benefits: 1) they can be employed in both unsupervised and supervised settings; 2) they can detect anomalies in high-dimensional data; and 3) unlike clustering techniques, they do not require complicated manual parameter estimation. So far, most of the work on spectral anomaly detection has been based on Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). Two of the most important applications of PCA in recent years have been in the domains of intrusion detection [2] [3] and traffic anomaly detection [4] [5].

1.2 EVENT DETECTION

Due to the huge amount of sequential data being generated by sensors, event detection has become an emerging issue with several real-world applications. An event is a significant occurrence or pattern that is unusual compared to the normal behavior of a system [6]. It can be caused by natural phenomena or by human interaction with the system. Examples of events include an attack on the network, bioterrorist activities, epidemic diseases, damage in an aircraft, pipe breaks, forest fires, etc. A real system behaves normally most of the time, until an anomaly occurs that may cause damage to the system. Since the effects of an event on the system are not known a priori, detecting and characterizing abnormal events is challenging. This is also why, most of the time, we cannot evaluate different algorithms against each other. One solution might be the injection of artificial events into normal data; however, the construction of a realistic event pattern is not trivial [7].

1.3 HIDDEN MARKOV MODELS

Hidden Markov Models (HMMs) have been used for at least three decades in signal processing, especially in the domain of speech recognition. They have also been applied in many other domains, such as bioinformatics (e.g. biological sequence analysis), environmental studies (e.g. earthquake and wind detection), and finance (financial time series). HMMs became popular for their simplicity and general mathematical tractability [8].

HMMs are widely used to describe complex probability distributions in time series and are well adapted to model time dependencies in such series. HMMs assume that the observations are not drawn from a single distribution but are generated by different processes, each of which depends on the state of an underlying and unobserved Markov process [7]. The Markov property means that the value of the process X_t depends only on its previous value. Using the notation of [9], let:

T = length of the observation sequence
N = number of states
Q = {q_0, q_1, ..., q_{N-1}} = distinct states of the Markov process
A = state transition probabilities
B = set of N observation probability distributions
π = initial state distribution
O = (O_0, O_1, ..., O_{T-1}) = observation sequence

An HMM is defined by the triple λ = (A, B, π). It assumes that observations are drawn from the observation probability distribution associated with the current state, while the transition probabilities between states are given in matrix A.

The three main problems related to HMMs are the following. The first problem consists in computing the probability P(O) that a given observation sequence O is generated by a given HMM λ. The second problem consists in finding the most probable sequence of hidden states given an observation sequence O and λ. The third problem is related to parameter inference: it consists in estimating the parameters of the HMM λ that best fits a given observation sequence O. The most commonly used algorithms to solve these problems are given in the last column of Table 1; more details about these algorithms can be found in [10]. In this paper, we deal with the third problem, to estimate the HMM parameters that best describe a time series, as explained in Section 2.

Table 1. Three HMM problems
Problem     Input   Output   Solution
Problem 1   λ, O    P(O)     Forward-Backward algorithm
Problem 2   λ, O    Best Q   Viterbi algorithm
Problem 3   O       λ        Baum-Welch algorithm
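As an illustration of the third problem (parameter estimation with Baum-Welch), the following minimal Python sketch fits a 2-state Gaussian HMM to a toy observation sequence. The paper's own experiments use MATLAB; the hmmlearn library, the toy data and the variable names here are illustrative assumptions only.

import numpy as np
from hmmlearn import hmm  # third-party HMM library, assumed available

# Toy observation sequence: short gaps mixed with long quiet periods,
# suggesting two hidden regimes.
observations = np.array([5.0, 7.0, 6.0, 300.0, 280.0, 4.0, 6.0, 310.0, 295.0, 5.0])
X = observations.reshape(-1, 1)   # hmmlearn expects (n_samples, n_features)

# Problem 3: estimate lambda = (A, B, pi) with the Baum-Welch (EM) algorithm.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(X)

print("Transition matrix A:\n", model.transmat_)
print("Initial distribution pi:", model.startprob_)
print("State means:", model.means_.ravel())

# Problem 2: most probable hidden state sequence (Viterbi).
states = model.predict(X)
# Problem 1: log-likelihood of the observations (forward algorithm).
log_likelihood = model.score(X)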
1.4 THREE-WAY DATA ANALYSIS

Traditional data analysis techniques such as PCA, clustering, regression, etc. are only able to model two-dimensional data and do not consider the interaction between more than two dimensions. However, in several real-world phenomena there is a mutual relationship between more than two dimensions (e.g. a 3D tensor Users × Features × Time) and thus the data should be analyzed from a three-way perspective. Three-way analysis considers all mutual dependencies between the different dimensions and provides a compact representation of the original tensor in lower-dimensional spaces. The most common three-way analysis models are Tucker2, Tucker3, and PARAFAC [10], which are generalized versions of the two-mode principal component model or, more specifically, of SVD. In the following, we briefly introduce the Tucker3 model as the best-known method for the analysis of three-way data.

1.4.1 Tucker3 Model

The Tucker3 model decomposes a three-mode tensor X into a set of component matrices A, B, C and a small core tensor G. The following equation describes the decomposition:

x_{ijk} ≈ \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr} a_{ip} b_{jq} c_{kr}    (1)

where P, Q and R are the parameters of the Tucker3 model and represent the number of components retained in the first, second and third mode of the tensor, respectively. This decomposition is illustrated in Figure 1.

Figure 1. Tucker3 Decomposition.
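The decomposition in Equation 1 can be computed with off-the-shelf tensor libraries. The sketch below uses the Python library tensorly on a randomly generated tensor with the Users × Features × Hours layout used later in the paper; the library choice and the random data are assumptions for illustration only (the paper itself uses a MATLAB three-mode component analysis package [10]).

import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker  # third-party tensor library, assumed available

# Random stand-in for a Users x Features x Hours tensor (102 x 10 x 720).
rng = np.random.default_rng(0)
X = tl.tensor(rng.random((102, 10, 720)))

# Tucker3 decomposition with P-Q-R = 3-2-2 components.
core, factors = tucker(X, rank=[3, 2, 2])
A, B, C = factors        # loading matrices for users, features and hours
G = core                 # core tensor of size 3 x 2 x 2

# Goodness of fit (share of explained variation), comparable in spirit
# to the fit percentages reported later in Table 3.
X_hat = tl.tucker_to_tensor((core, factors))
fit = 1.0 - (tl.norm(X - X_hat) / tl.norm(X)) ** 2
print(A.shape, B.shape, C.shape, G.shape, round(float(fit), 3))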
2 PROBLEM DEFINITION

One of the significant issues in telecommunication systems, such as IP/TV, is to detect anomalies at both the network and the user level. In order to study this, target users are usually equipped with a facility in their modem which sends an automatic notification message to the central server when the connection of a client in the network is lost or reestablished. These modems are ubiquitous and geographically dispersed.

The modeling of such behavior is not straightforward, because the number of notification messages is not equal for each user during the time period under analysis. For instance, one user may face 40 connection problems in an hour, hence generating 40 messages, while others may face 5 or even no problems at all.

In standard event detection problems, there is a measurement for each time point via one or multiple sensors. In the context of our application, such measurements do not take place at regular time points, since user modems (or sensors) only send messages to the server when something unexpected occurs. Figure 2 illustrates two sample users. Each circle represents the time stamp at which a notification relative to the given user is received, while ∆T represents the inter-arrival time between two consecutive messages. As can be seen, 2 messages are related to user 1 in that period, while 4 are related to user 2 during the same period. Also, the ∆T between messages is larger for user 1 than for user 2, which means that user 2 sent messages more frequently than user 1.

Figure 2. Two sample users with different numbers of messages and different intervals.

As in many other event detection problems, we could simply use the number of events per hour (measurement) at different users (sensors) to detect the events, but this way we would lose the information content provided by the ∆T's. Since the number of ∆T values is not the same for each user, this feature cannot be directly integrated in our model: it would cause some vectors to have different lengths, which is not supported by the Tucker3 analysis. To solve this, every time series of ∆T relative to a given user is modeled by a 2-state HMM obtained with the Baum-Welch algorithm [11]. Six parameters are extracted from the HMM and are used to describe the time series of ∆T of each user. Using this approach we obtain the same number of features for each user and can then include this information in our feature vectors.
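A possible way to assemble the per-user, per-hour feature vector is sketched below in Python. The paper states only that six HMM-derived parameters are combined with the mean, variance, entropy and number of messages per hour (see Section 3); which six HMM quantities are used is not specified, so the choice of the two state means, two state variances and two self-transition probabilities, as well as the helper name and the handling of quiet hours, are assumptions.

import numpy as np
from scipy.stats import entropy
from hmmlearn import hmm  # third-party HMM library, assumed available

def hour_features(delta_t):
    """Build a 10-dimensional feature vector for one user and one hour.

    delta_t: 1-D array of message inter-arrival times within that hour.
    """
    delta_t = np.asarray(delta_t, dtype=float)
    if delta_t.size < 2:
        # Quiet hour: no usable inter-arrival information (assumed handling).
        return np.zeros(10)

    # 2-state HMM fitted with Baum-Welch, as described in Section 2.
    model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
    model.fit(delta_t.reshape(-1, 1))

    means = model.means_.ravel()                     # 2 state means
    variances = model.covars_.reshape(2, -1)[:, 0]   # 2 state variances
    self_trans = np.diag(model.transmat_)            # 2 self-transition probabilities

    # Additional summary statistics mentioned in Section 3.
    hist, _ = np.histogram(delta_t, bins=5, density=True)
    summary = [delta_t.mean(), delta_t.var(), entropy(hist + 1e-12), delta_t.size]

    return np.concatenate([means, variances, self_trans, summary])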
3 DATASET

The dataset is extracted from the usage log of a European IP/TV service provider. Each line of the raw dataset contains a notification message of a user, including its occurrence time. As previously mentioned, it is not possible to use this data directly in our modeling approach, so some pre-processing steps were performed. In addition to the HMM parameters obtained for each hour and for each user, we included additional features, such as the mean, variance, entropy and number of messages per hour, in our feature vector. We generated two separate datasets, each one spanning a time period of one month, which is equivalent to 720 hours. In one set we selected 102 users and in the other we selected 909 users; the latter dataset is an extended version of the former. We then transformed both datasets into tensor format. These datasets are shown in the format of the Tucker3 input tensor (Figure 1) in Table 2, where I, J and K represent the users, features and hours modes, respectively.

Table 2. Datasets in tensor format
Data   1st mode (I): Users   2nd mode (J): Features   3rd mode (K): Hours
X102   102                   10                       720
X909   909                   10                       720

4 EXPERIMENTS

This section is divided into three subsections, according to the steps mentioned in the Introduction. In the first subsection we explain how we detect abnormal users. In the next subsection we describe how we generate user trajectories, and in the last subsection we explain how we cluster the trajectories using hierarchical clustering and detect events from the user trajectories.

4.1 Abnormal Users

We applied the Tucker3 model to both datasets, X102 and X909, using a MATLAB package called Three-mode component analysis (Tucker3) [10]. Before that, we performed an ANOVA test [10] to assess the significance of the three-way and two-way interactions in the data. The results of this test are presented in Table 3. ANOVA max 2D represents the maximum value obtained via the different combinations of two-way modeling (e.g. I-J, J-K, I-K). As can be seen, larger values are obtained for the three-way interaction (ANOVA 3D), which reveals that there is a mutual interaction between the three dimensions in both datasets that can be explained better with three-way modeling such as Tucker3 than with two-way modeling such as PCA.

Table 3. ANOVA test and selected model parameters P-Q-R
Data   ANOVA max 2D   ANOVA 3D   Selected model P-Q-R   Fit
X102   26.18%         38.90%     3-2-2                  42.00%
X909   17.02%         78.04%     40-2-4                 51.01%

The next step is to estimate the best parameters P, Q and R of Equation 1. Choosing P-Q-R is similar to what we do in PCA; in PCA we determine the number of principal components for one dimension only, but here we need to determine the number of components for each of the three modes. P, Q and R can assume values between 1 and the number of entities in the corresponding mode. For example, for X102, P-Q-R can go from 1-1-1 to 102-10-720. These parameters are chosen based on a trade-off between model parsimony, or complexity, and goodness of fit. For instance, for this dataset, 1-1-1 gives about a 28% fit (least complete and least complex) and the model 102-10-720 gives a 100% fit (most complete and most complex). If we try the parameters 3-2-2, the model has a 42% fit, so it is a more reasonable choice because it finds a good compromise between complexity and fit. In [10] the scree test is proposed as a guideline to choose these parameters. We used this test to determine the best model for both datasets. The selected model parameters and their corresponding fits are presented in Table 3. This means that, for example, for dataset X102, if we choose a Tucker3 model with 3, 2 and 2 components to summarize the users, features and hours modes, respectively, the model is able to explain 42% of the total variance contained in the raw data. After the estimation of the model parameters, we used the selected model to decompose the raw data into a lower-dimensional subspace, as illustrated in Figure 1 and Equation 1. After the decomposition we obtained the matrices A, B and C, a small core tensor G and a tensor of residual errors.

In order to detect the abnormal users we simply projected the users onto the component space given by matrix A. This projection is presented in Figure 3 for dataset X102. The three-dimensional subspace is given by the three components obtained by the model for the 1st mode (users). As mentioned earlier, this number of components is one of the parameters of the model, namely P = 3, which corresponds to the first mode.

Figure 3. Projection of Users on Matrix A for dataset X102.
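To make the ranking step concrete, the following Python sketch orders users by their Euclidean distance to the centre of the user-component space spanned by the columns of A. The function name and the top-k convention are illustrative; only the distance-to-centre criterion comes from the paper.

import numpy as np

def rank_abnormal_users(A, top_k=10):
    """Rank users by Euclidean distance to the centre (origin) of the
    component space given by the rows of the user loading matrix A
    (for X102, A has shape 102 x 3)."""
    distances = np.linalg.norm(A, axis=1)
    order = np.argsort(distances)[::-1]          # most distant users first
    return order[:top_k], distances[order[:top_k]]

# Example (using the matrix A from the Tucker3 sketch above):
# top_users, top_distances = rank_abnormal_users(A)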
In order to evaluate the reliability of the model, we applied the same procedure with a Tucker3 model to dataset X909, which includes all users of X102. Our idea was to see how this model identifies abnormal users in both datasets. For this purpose, we computed the Euclidean distance between each user in the projection space (see Figure 3) and the corresponding center (0, 0, 0), for both datasets X102 and X909. Then we normalized the distances for each dataset and computed the Pearson correlation, for the users common to both datasets, between their distances to the center of the subspace. We obtained a correlation of 68.44%. Although for X909 we took only 3 of its 40 user-mode components, and the model fit was different for the two datasets (42% for X102 and 51.01% for X909), abnormal and normal users in X102 appeared in approximately the same way in X909, with a correlation of 68.44%. This indicates that Tucker3 is a robust model for detecting abnormal users.

4.2 User Trajectories

Visualization methods like the one presented in Figure 3 are not able to show the evolving behavior of users over time. We need another solution that enables us to understand the behavior of users over time. One solution is to project the users onto the decomposed feature space (matrix B of Figure 1) for each time point. Since both of our selected models have Q equal to 2, after projecting the users onto the feature space we obtain a coordinate (x, y) for each time point and for each user. The process of generating these coordinates is presented in Figure 4. b_{:,1} and b_{:,2} represent the two components that summarize the original entities of the features mode, and X represents the three-way tensor (see Figure 1). The rows of the frontal slices are the users, the columns correspond to the features, and the third mode (z-axis) represents the hours. If we compute the dot product between each row of the tensor and the columns of the component matrix B yielded by the Tucker3 model, we obtain the coordinate (x, y) for a given time point. If we repeat this procedure for all time points (e.g. hours), we are able to generate the coordinates of each user for the 720 hours. The user trajectories are obtained by sequentially connecting these coordinates.

Figure 4. Generation process of user trajectories.

Formally, we define a user trajectory as:

Definition 1 (User Trajectory): A sequence of time-stamped points, Traj_u = p_0 → p_1 → ... → p_i → ... → p_n, where p_i = (x_i, y_i, t_i), i = 0, 1, ..., n, and t_i is a time point.
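The coordinate generation just described can be written compactly as a matrix product. The sketch below is a minimal Python version, assuming the data tensor is stored as a users × features × hours NumPy array and B is the features-mode loading matrix with Q = 2 columns; the function name is hypothetical.

import numpy as np

def user_trajectory(X, B, user):
    """(x, y) trajectory of one user, with one point per hour.

    X    : users x features x hours data tensor (e.g. 102 x 10 x 720)
    B    : features x Q loading matrix from the Tucker3 model (Q = 2)
    user : index of the user in the first mode
    """
    slice_u = X[user, :, :]           # features x hours frontal slice
    coords = slice_u.T @ B[:, :2]     # hours x 2 array of (x, y) points
    return coords

# traj = user_trajectory(X, B, user=10)   # 720 time-stamped (x, y) points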
Figure 5 shows two abnormal users that appear in the top-10 users ranked by abnormality. These abnormal users were ranked by decreasing distance to the center, as explained in subsection 4.1: user 10 (right) is ranked 2nd and user 95 (left) is ranked 4th. However, as is clear from the figure, their behavior over time is completely different. User 95 shows only two abnormal behaviors, corresponding to two time points, while user 10 shows abnormal behavior at almost all time points. This means that user 10 is dealing with a persistent problem, while user 95 only has problems at specific points in time. This type of interpretation was not possible based only on the ranking list of abnormal users obtained in subsection 4.1. User trajectories thus provide richer insights into the different kinds of problems a user can experience. For instance, what made user 95 be identified as abnormal could be something that suddenly happened in the network and was quickly solved, while for user 10 some problems occurred but were not solved until the end of the time period under analysis.

Figure 5. Trajectories of two sample users in X909. Left: 4th ranked abnormal user. Right: 2nd ranked abnormal user.

4.3 Event Detection from user trajectories

Even though user trajectories can be useful, when the number of users is too large the individual analysis of each trajectory becomes a cumbersome task. If a group of user trajectories behaves similarly, this can be interpreted as something abnormal happening at the network level, and prevention or surveillance operations can then be conducted more quickly.

To explore this, we employed the Agglomerative Hierarchical Clustering toolbox from MATLAB to cluster the user trajectories. We used the Euclidean distance between corresponding points of the trajectories as our distance function and Ward's criterion as the linkage criterion. We tested different cut-off values, from 0.6 to 1.2, to examine the clustering structure. The most suitable clustering structure was obtained for a dendrogram distance of 1, which cuts the tree at a level that corresponds to three clusters. The average trajectories of these clusters are shown in Figure 6. The red cluster has 1 user (0.1%), the blue cluster comprises 866 users (97.4%) and the green cluster includes 22 users (2.5%). As can be seen, no specific pattern can be recognized in the green and red clusters: the users in these two clusters show abnormal behavior at almost all time points. Such an event can be due to a persistent, specific problem, such as a problem in the user's device. Regarding the blue cluster, it is possible to detect three events. The first significant event occurs between hours 350 and 400; the second and third events occur between hours 450 and 480 and hours 520 and 560, respectively. However, the occurrence of the second and third events should be assessed with hypothesis testing, since they could be due to accidental changes.

Figure 6. Center Trajectory of clusters. Left: 1 user; Center: 866 users; Right: 22 users.
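For reference, the same clustering step can be reproduced with SciPy's hierarchical clustering, as in the Python sketch below. The flattened-trajectory layout, the function name and the default cut-off value are assumptions; the Ward linkage, the Euclidean distance and the role of the cut-off follow the description above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_trajectories(trajectories, cutoff=1.0):
    """Agglomerative clustering of user trajectories.

    trajectories : users x (2 * hours) matrix, each row a flattened
                   (x, y) trajectory (assumed layout).
    cutoff       : dendrogram distance at which the tree is cut.
    """
    Z = linkage(trajectories, method="ward", metric="euclidean")
    labels = fcluster(Z, t=cutoff, criterion="distance")
    return labels

# Example: stack the per-user trajectories produced earlier and count
# the cluster sizes (the paper reports clusters of 1, 866 and 22 users).
# T = np.vstack([user_trajectory(X, B, u).ravel() for u in range(X.shape[0])])
# labels = cluster_trajectories(T)
# print(np.bincount(labels)[1:])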
5 CONCLUSIONS AND FUTURE WORK

In this paper, we present a study on using the Tucker3 decomposition to discover abnormal users in an IP/TV network. Our results indicate that Tucker3 is a robust method for detecting abnormal users in situations where interactions between the three dimensions are present. From the tensor decomposition, we can define user trajectories, which allow us to observe the behavior of these users over time. We were able to identify two kinds of abnormal users: those who show frequent abnormal behavior over the whole time period and those who are associated with one or a few severe abnormal behaviors during the time period. Without resorting to the analysis of the users' temporal trajectories, it would have been harder to uncover such facts. Furthermore, from the clusters of the users' trajectories, we identified three events that occurred during three time periods in the network. The result of this work can be used in a real network surveillance system to identify failures as quickly as possible.

In this work, we did not consider the spatial relations of users. Taking into account the spatial relationships between network nodes could lead to a better clustering of users. Since some users might show similar behavior with some delay, other distance measures for clustering should also be tested. We are currently employing another distance function, based on dynamic time warping, which assigns two users with the same behavior but shifted in time to the same cluster. The solution we presented for the detection of events was based on clustering of trajectories. We are also going to apply a sliding window on the trajectories to find the time periods with the most compact trajectories, which would lead to the discovery of events in a more accurate and reliable way.

ACKNOWLEDGEMENTS

This work was supported by the Institute for Systems and Computer Engineering of Porto (INESC TEC) under project TNET with reference BI/120033/PEst/LIAAD and project KDUS with reference PTDC/EIA-EIA/098355/2008. The authors are grateful for the financial support. This work is also financed by the ERDF European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project CMU-PT/RNQ/0029/2009. We would like to acknowledge the support of J.A. at the European IP/TV service provider.
REFERENCES

[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: a survey," ACM Computing Surveys, vol. 41, no. 3, pp. 1–58, 2009.
[2] W. Wang and R. Battiti, "Identifying intrusions in computer networks with principal component analysis," in International Conference on Availability, Reliability and Security, Vienna, Austria, 2006, pp. 270–279.
[3] W. Wang, X. Guan, and X. Zhang, "A novel intrusion detection method based on principal component analysis," in IEEE Symposium on Neural Networks, 2004.
[4] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang, "A novel anomaly detection scheme based on principal component classifier," in 3rd IEEE International Conference on Data Mining, Melbourne, Florida, 2003, pp. 353–365.
[5] C. Callegari, "A novel PCA-based network anomaly detection," in IEEE International Conference on Communications (ICC), 2011, pp. 1–5.
[6] M. C. Kerman et al., "Event detection challenges, methods, and applications in natural and artificial systems," in 14th International Command and Control Research and Technology Symposium, Washington, DC, 2009.
[7] G. Shmueli, "Wavelet-based monitoring in modern biosurveillance," University of Maryland, College Park, Working Paper RHS-06-002, 2005.
[8] W. Zucchini and I. L. MacDonald, Hidden Markov Models for Time Series, Chapman & Hall/CRC, FL, USA, 2009.
[9] M. Stamp, "A revealing introduction to hidden Markov models," Department of Computer Science, San Jose State University, Jan. 2004. [Online]. Available: http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
[10] H. A. L. Kiers and I. van Mechelen, "Three-way component analysis: principles and illustrative application," Psychological Methods, vol. 6, pp. 84–110, 2001.
[11] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.