=Paper=
{{Paper
|id=None
|storemode=property
|title=A Model to Summarize User Action Log Files
|pdfUrl=https://ceur-ws.org/Vol-926/paper3.pdf
|volume=Vol-926
|dblpUrl=https://dblp.org/rec/conf/aiia/GentiliM12
}}
==A Model to Summarize User Action Log Files==
13 A Model to Summarize User Action Log Files Eleonora Gentili Supervisor: Alfredo Milani Dipartimento di Matematica e Informatica, Università degli Studi di Perugia via Vanvitelli 1 – 06123 Perugia, Italy eleonora.gentili@dmi.unipg.it Abstract. Social networks, web portals or e-learning platforms produce in general a large amount of data everyday, normally stored in its raw format in log file systems and databases. Such data can be condensed and summarized to improve reporting performance and reduce the sys- tem load. This data summarization reduces the amount of space that is required to store software data but produces, as a side effect, a decrease of their informative capability due to an information loss. In this work we study the problem of summarizing data obtained by the log systems of chronology-dependent applications with a lot of users. In particular, we present a method to reduce the data size, collapsing the descriptions of more events in a unique descriptor or in a smaller set of descriptors and pose the optimal summarization problem. 1 Introduction During last years we have seen an impressive growth and diffusion of applications shared and used by a huge amount of users around the world. Network traffic data log systems, alarms in telecommunication networks and web portals which records the user activities are examples of chronology-dependent applications (CDA) producing in general large amount of data in the form of log sequences. But log files are usually big and noisy, and the difficulty of finding patterns is very high as well as the number of patterns discovered can be very large [6]. For this reason, data in log files can be condensed and summarized to improve reporting performance and analyze the system behavior. In this work, a new method to produce a concise summary of sequences of events related to time is presented, which is based on the data size reduction obtained merging time intervals and collapsing the descriptions of more events in a smaller set of descriptors. Moreover, in order to obtain a data representation as compact as possible, an abstraction operation allowing an event generalization process (as in [5]) is defined. The summarized time sequence can substitute the original time sequence in the data analysis. The reduction of the amount of data produces also as a side effect, a decrease of their informative capability due to an information loss. For this reason, we formally define the summarization problem as an optimization problem that balances between shortness of the summary and accuracy of the data description. 14 1.1 Related Works In [5], Pham et al. propose an algorithm that achieves time sequence summariza- tion based on a generalization, grouping and concept formation process, main- taining the overall chronology of events. Our summarization method overcomes the time limitation of this procedure, using time intervals, instead only time instants. In [6], [4], [3], authors propose methods to produce a comprehensive summary for the input sequence, focused on the frequency changes of event types across adjacent segments. In particular, in [4], Kiernan and Terzi rely to the Maximum Description Length (MDL) principle to produce the summarized time sequence balancing the shortness of the summary and accuracy of the data description. Moreover, in [3], the presented framework summarizes an event sequence using inter-arrival histograms to capture the temporal relationships among events us- ing the MDL. Unlike these works, where are presented methods which partition the time sequence into segments and study global relationships on each segment, and among segments, our method takes into account an overview of the time sequence to solve the summarization problem. In [1], Chandola et al. formulates the problem of summarization as an op- timal summarization problem involving two measures, Compaction Gain and Information Loss, which assess the quality of the summary. 2 The model In this work we assume that each event is described by a set of users which made actions over objects either at a particular instant or during an interval. Moreover, the assumed time model is discrete, where events are instantaneous and are representable by a tuple (u, a, o, t), with which we describe users, actions, objects and time of the actions. In order to represent more general situations and potentially aggregate in- formation of similar events, we define events as tuples involving sets of users, actions and objects possibly occurred during an interval. Definition 1. An event descriptor is a t-uple X = (U, A, O, I, δ) representing a set of actions A made by a set of users U over a set of objects O, during a given time interval I according to the covering index δ that is defined as the ratio by the number of points in which the actions in A are actually executed and all the points of I. Example 1. X = ({admin}, {login, logout}, {IT }, [10, 50], 0.30) represents that admin made login and logout in ITcourse in the 30% of the points of I = [10, 50]. We assume that the labels in the sets U , A and O can be organized in taxonomies, which are organized in hierarchies with multiple inheritance: each taxonomy is associated to an abstraction operator (↑), allowing to climb the hierarchy. 15 The abstraction operator applied to a node of the taxonomy returns all the fathers of the node, while, when it is applied to a set of nodes S = {s1 , . . . , sn }, the result is a set ↑ (S) = S ′ where at least a si is substituted with ↑ (si ). Let’s two different sets S1 and S2 , the minimal abstracting set S is defined as the first not null set of common ancestor of S1 and S2 computed by climbing the taxonomy graph associated to S1 and S2 , i.e. such that S = ↑ (S1 ) = ↑ (S2 ). Definition 2. A time sequence X = (X1 , . . . , Xm ) is a sequence of m event de- scriptors Xi ; m is called size of the time sequence, or data size. Given a time sequence, we aim to provide methods to reduce its data size, collapsing the descriptions of more events in a smaller set of event descriptors. Definition 3. Let’s Ω the set of all event descriptors, X1 = (U, A, O, I1 , δ1 ) and X2 = (U, A, O, I2 , δ2 ) two event descriptors, I1 = [t′1 , t′′1 ] and I2 = [t′2 , t′′2 ], we define the merging operator as ⊕:Ω×Ω →Ω (1) ((U, A, O, I1 , δ1 ), (U, A, O, I2 , δ2 )) 7→ (U, A, O, I, δ) such that I = [min(t′1 , t′2 ), max(t′′1 , t′′2 )] and δ1 |I1 | + δ2 |I2 | − min(δ1 |I1 |, δ2 |I2 |, |I1 ∩ I2 |) δ= (2) |I| The merging operator collapses intervals of event descriptors with identical label sets. δ is computed considering that events happening in both I1 and I2 coincide as much as possible; it is simple to prove that δ ≤ max(δ1 , δ2 ). Example 2. Given the time sequence X = {X1 , X2 , X3 , X4 }, where X1 = ({user1 , user2 }, {login}, {objA}, [1, 1], 1.0), X2 = ({user2 }, {send}, {objA}, [2, 2], 1.0), X3 = ({user1 }, {read}, {objA}, [4, 4], 1.0), X4 = ({user2 }, {send}, {objA}, [5, 5], 1.0). We can apply the merging operator to X2 and X4 obtaining X24 = X2 ⊕ X4 = ({user2 }, {send}, {objA}, [2, 4], 0.67). The new time sequence X = {X1 , X24 , X3 } has a smaller size but less information about events. And no more merging operation can be applied. Definition 4. Let’s Ω the set of all event descriptors and given an event de- scriptor X = (U, A, O, I, δ), the abstraction operator ↑S is defined as ↑S : Ω → Ω (3) (U, A, O, I, δ) 7→ (U ′ , A′ , O′ , I, δ) where S ∈ {U, A, O} and S ′ =↑ (S) is obtained applying the abstraction operator. The abstraction operator will make mergeable event descriptors having dif- ferent label sets, generalizing labels in the sets U , A, O. Definition 5. Let’s {Xi : Xi = (Ui , Ai , Oi , Ii , δi ), i = 1, . . . , n} a set of event descriptors, Xi∗ = (U, A, O, Ii , δi ) is the minimal abstracting event for Xi if each label set U , A, O is the minimal abstracting set respectively for {Ui }, {Ai }, {Oi }. 16 For instance, let consider the taxonomy graphs depicted in Fig.1, and given X1 = ({user1 }, {Create.f older, Save}, {log1 }, I1 , δ1 ), X2 = ({user1 , user2 }, {Disk.op, W rite.mail}, {log1 }, I2 , δ2 ), the two minimal abstracting events for X1 and X2 are X1∗ = ({user}, {U ser.op}, {log1 }, I1 , δ1 ) and X2∗ = ({user}, {U ser.op}, {log1 }, I2 , δ2 ) respectively. Fig. 1. An example of taxonomy graphs respectively over the sets U and A. Considering that creating a summary produces information loss, the optimal summarization problem aims to maximize the reduction of data size minimizing the information loss. It is clear that the optimal summarization is a question of tradeoff between application of the merging operator and the abstraction operator to the event descriptors. 3 The optimal summarization problem Let’s X0 the time sequence of the initial volume of data and X a summarized time sequence, we define some metrics to assess the quality of X with respect to X0 . Definition 6. The compaction gain of X is define as C(X, X0 ) = |X 0| |X| . Definition 7. Given Ii = [t′i , t′′i ] and Gi = [t′′i , t′i+1 ], the covering accuracy of X is Pn Pm δi |Ii | + i=1 |Gi | µ(X) = Pi=1 n P m . (4) i=1 |Ii | + i=1 |Gi | The gap intervals Gi are considered as intervals with δGi = 1.0, because there are no events happening in each Gi . It easy to prove that 0 ≤ µ(X) ≤ 1. In particular, µ(X) = 1 is verified if and only if δi = 1, ∀i = 1, . . . , n. Definition 8. Given X, the description accuracy of X is defined as η(X) = min (min(ωU η(U ), ωA η(A), ωO η(O))) , (5) X∈X where ωU , ωA , ωO ≥ 0 are the weights of the label sets, and d(r, n) η(S) = min , n∈S h(n) where S ∈ {U, A, O}, r is the root of the taxonomy TS and h(n) is the longest distance from n to a leaf. 17 Note that 0 ≤ η(S) ≤ 1. In particular, η(S) = 1 is verified when n is a leaf, and η(S) = 0 when n coincides with the root. Definition 9. Given X and X0 , the information loss of the summarization pro- cess is defined as I(X, X0 ) = α(µ(X0 ) − µ(X)) + β(η(X0 ) − η(X)). (6) Definition 10. Given X0 and a real number γ > 0, we define the Optimal Sum- marized Time Sequence X such that the parameterized ratio between I(X, X0 ) and C(X, X0 ) is minimal, i.e. I(X, X0 ) X = argmin . (7) X [C(X, X0 )]γ 4 Conclusions and future works In this work the problem of summarizing data obtained by the log systems of applications with a lot of users is studied. We have presented a new method to produce a concise summary of sequences of events related to time, based on the data size reduction obtained merging time intervals and collapsing the descrip- tions of more events in a unique descriptor or in a smaller set of descriptors. Moreover, in order to obtain a data representation as compact as possible, an abstraction operation allowing an event generalization process is defined. Moreover, we are studying about the formalization and the implementation of an optimal algorithm for the Optimal Summarization Problem. The idea is to use suboptimal algorithms to obtain the best summarization algorithm for our method. References 1. V. Chandola and V. Kumar. Summarization–compressing data into an informative representation. Knowledge and Information Systems, 12(3):355–378, 2007. 2. E. Gentili, A. Milani and V.Poggioni. Data summarization model for user action log files. In Proceedings of ICCSA 2012, Part III, LNCS 7335, pp. 539–549. Springer, Heidelberg, 2012. 3. Y. Jiang, C.S. Perng, and T. Li. Natural event summarization. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 765–774. ACM, 2011. 4. J. Kiernan and E. Terzi. Constructing comprehensive summaries of large event se- quences. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 417–425. ACM, 2008. 5. Q.K. Pham, G. Raschia, N. Mouaddib, R. Saint-Paul, and B. Benatallah. Time sequence summarization to scale up chronology-dependent applications. In Proceed- ing of the 18th ACM conference on Information and knowledge management, pages 1137–1146, 2009. 6. P. Wang, H. Wang, M. Liu, and W. Wang. An algorithmic approach to event sum- marization. In Proceedings of the 2010 international conference on Management of data, pages 183–194. ACM, 2010.