=Paper=
{{Paper
|id=None
|storemode=property
|title=A Model to Summarize User Action Log Files
|pdfUrl=https://ceur-ws.org/Vol-926/paper3.pdf
|volume=Vol-926
|dblpUrl=https://dblp.org/rec/conf/aiia/GentiliM12
}}
==A Model to Summarize User Action Log Files==
<pdf width="1500px">https://ceur-ws.org/Vol-926/paper3.pdf</pdf>
<pre>
                                                                                   13

    A Model to Summarize User Action Log Files

                                 Eleonora Gentili
                             Supervisor: Alfredo Milani

    Dipartimento di Matematica e Informatica, Università degli Studi di Perugia
                      via Vanvitelli 1 – 06123 Perugia, Italy
                        eleonora.gentili@dmi.unipg.it


      Abstract. Social networks, web portals or e-learning platforms produce
      in general a large amount of data everyday, normally stored in its raw
      format in log file systems and databases. Such data can be condensed
      and summarized to improve reporting performance and reduce the sys-
      tem load. This data summarization reduces the amount of space that is
      required to store software data but produces, as a side effect, a decrease
      of their informative capability due to an information loss. In this work
      we study the problem of summarizing data obtained by the log systems
      of chronology-dependent applications with a lot of users. In particular,
      we present a method to reduce the data size, collapsing the descriptions
      of more events in a unique descriptor or in a smaller set of descriptors
      and pose the optimal summarization problem.


1    Introduction

During last years we have seen an impressive growth and diffusion of applications
shared and used by a huge amount of users around the world. Network traffic
data log systems, alarms in telecommunication networks and web portals which
records the user activities are examples of chronology-dependent applications
(CDA) producing in general large amount of data in the form of log sequences.
    But log files are usually big and noisy, and the difficulty of finding patterns
is very high as well as the number of patterns discovered can be very large [6].
For this reason, data in log files can be condensed and summarized to improve
reporting performance and analyze the system behavior.
    In this work, a new method to produce a concise summary of sequences of
events related to time is presented, which is based on the data size reduction
obtained merging time intervals and collapsing the descriptions of more events in
a smaller set of descriptors. Moreover, in order to obtain a data representation as
compact as possible, an abstraction operation allowing an event generalization
process (as in [5]) is defined. The summarized time sequence can substitute the
original time sequence in the data analysis.
    The reduction of the amount of data produces also as a side effect, a decrease
of their informative capability due to an information loss. For this reason, we
formally define the summarization problem as an optimization problem that
balances between shortness of the summary and accuracy of the data description.
14

1.1   Related Works

In [5], Pham et al. propose an algorithm that achieves time sequence summariza-
tion based on a generalization, grouping and concept formation process, main-
taining the overall chronology of events. Our summarization method overcomes
the time limitation of this procedure, using time intervals, instead only time
instants.
    In [6], [4], [3], authors propose methods to produce a comprehensive summary
for the input sequence, focused on the frequency changes of event types across
adjacent segments. In particular, in [4], Kiernan and Terzi rely to the Maximum
Description Length (MDL) principle to produce the summarized time sequence
balancing the shortness of the summary and accuracy of the data description.
Moreover, in [3], the presented framework summarizes an event sequence using
inter-arrival histograms to capture the temporal relationships among events us-
ing the MDL. Unlike these works, where are presented methods which partition
the time sequence into segments and study global relationships on each segment,
and among segments, our method takes into account an overview of the time
sequence to solve the summarization problem.
    In [1], Chandola et al. formulates the problem of summarization as an op-
timal summarization problem involving two measures, Compaction Gain and
Information Loss, which assess the quality of the summary.


2     The model

In this work we assume that each event is described by a set of users which
made actions over objects either at a particular instant or during an interval.
Moreover, the assumed time model is discrete, where events are instantaneous
and are representable by a tuple (u, a, o, t), with which we describe users, actions,
objects and time of the actions.
    In order to represent more general situations and potentially aggregate in-
formation of similar events, we define events as tuples involving sets of users,
actions and objects possibly occurred during an interval.

Definition 1. An event descriptor is a t-uple X = (U, A, O, I, δ) representing a
set of actions A made by a set of users U over a set of objects O, during a given
time interval I according to the covering index δ that is defined as the ratio by
the number of points in which the actions in A are actually executed and all the
points of I.

Example 1. X = ({admin}, {login, logout}, {IT }, [10, 50], 0.30) represents that
admin made login and logout in ITcourse in the 30% of the points of I = [10, 50].

    We assume that the labels in the sets U , A and O can be organized in
taxonomies, which are organized in hierarchies with multiple inheritance: each
taxonomy is associated to an abstraction operator (↑), allowing to climb the
hierarchy.
                                                                                           15

    The abstraction operator applied to a node of the taxonomy returns all the
fathers of the node, while, when it is applied to a set of nodes S = {s1 , . . . , sn },
the result is a set ↑ (S) = S ′ where at least a si is substituted with ↑ (si ). Let’s
two different sets S1 and S2 , the minimal abstracting set S is defined as the
first not null set of common ancestor of S1 and S2 computed by climbing the
taxonomy graph associated to S1 and S2 , i.e. such that S = ↑ (S1 ) = ↑ (S2 ).
Definition 2. A time sequence X = (X1 , . . . , Xm ) is a sequence of m event de-
scriptors Xi ; m is called size of the time sequence, or data size.
    Given a time sequence, we aim to provide methods to reduce its data size,
collapsing the descriptions of more events in a smaller set of event descriptors.
Definition 3. Let’s Ω the set of all event descriptors, X1 = (U, A, O, I1 , δ1 )
and X2 = (U, A, O, I2 , δ2 ) two event descriptors, I1 = [t′1 , t′′1 ] and I2 = [t′2 , t′′2 ],
we define the merging operator as
                                     ⊕:Ω×Ω →Ω
                                                                                       (1)
               ((U, A, O, I1 , δ1 ), (U, A, O, I2 , δ2 )) 7→ (U, A, O, I, δ)

such that I = [min(t′1 , t′2 ), max(t′′1 , t′′2 )] and
                         δ1 |I1 | + δ2 |I2 | − min(δ1 |I1 |, δ2 |I2 |, |I1 ∩ I2 |)
                    δ=                                                                    (2)
                                                  |I|
    The merging operator collapses intervals of event descriptors with identical
label sets. δ is computed considering that events happening in both I1 and I2
coincide as much as possible; it is simple to prove that δ ≤ max(δ1 , δ2 ).
Example 2. Given the time sequence X = {X1 , X2 , X3 , X4 }, where
X1 = ({user1 , user2 }, {login}, {objA}, [1, 1], 1.0), X2 = ({user2 }, {send}, {objA}, [2, 2], 1.0),
   X3 = ({user1 }, {read}, {objA}, [4, 4], 1.0), X4 = ({user2 }, {send}, {objA}, [5, 5], 1.0).
We can apply the merging operator to X2 and X4 obtaining
              X24 = X2 ⊕ X4 = ({user2 }, {send}, {objA}, [2, 4], 0.67).
The new time sequence X = {X1 , X24 , X3 } has a smaller size but less information
about events. And no more merging operation can be applied.
Definition 4. Let’s Ω the set of all event descriptors and given an event de-
scriptor X = (U, A, O, I, δ), the abstraction operator ↑S is defined as
                                    ↑S : Ω → Ω
                                                                              (3)
                        (U, A, O, I, δ) 7→ (U ′ , A′ , O′ , I, δ)
where S ∈ {U, A, O} and S ′ =↑ (S) is obtained applying the abstraction operator.
    The abstraction operator will make mergeable event descriptors having dif-
ferent label sets, generalizing labels in the sets U , A, O.
Definition 5. Let’s {Xi : Xi = (Ui , Ai , Oi , Ii , δi ), i = 1, . . . , n} a set of event
descriptors, Xi∗ = (U, A, O, Ii , δi ) is the minimal abstracting event for Xi if each
label set U , A, O is the minimal abstracting set respectively for {Ui }, {Ai }, {Oi }.
16

     For instance, let consider the taxonomy graphs depicted in Fig.1, and given
              X1 = ({user1 }, {Create.f older, Save}, {log1 }, I1 , δ1 ),
           X2 = ({user1 , user2 }, {Disk.op, W rite.mail}, {log1 }, I2 , δ2 ),
the two minimal abstracting events for X1 and X2 are X1∗ = ({user}, {U ser.op}, {log1 }, I1 , δ1 )
and X2∗ = ({user}, {U ser.op}, {log1 }, I2 , δ2 ) respectively.


      Fig. 1. An example of taxonomy graphs respectively over the sets U and A.

    Considering that creating a summary produces information loss, the optimal
summarization problem aims to maximize the reduction of data size minimizing
the information loss. It is clear that the optimal summarization is a question of
tradeoff between application of the merging operator and the abstraction operator
to the event descriptors.

3     The optimal summarization problem
Let’s X0 the time sequence of the initial volume of data and X a summarized
time sequence, we define some metrics to assess the quality of X with respect to
X0 .
Definition 6. The compaction gain of X is define as C(X, X0 ) = |X 0|
                                                                 |X| .

Definition 7. Given Ii = [t′i , t′′i ] and Gi = [t′′i , t′i+1 ], the covering accuracy of
X is                             Pn                  Pm
                                           δi |Ii | + i=1 |Gi |
                     µ(X) = Pi=1        n            P m          .                   (4)
                                        i=1 |Ii | +    i=1 |Gi |
   The gap intervals Gi are considered as intervals with δGi = 1.0, because
there are no events happening in each Gi . It easy to prove that 0 ≤ µ(X) ≤ 1.
In particular, µ(X) = 1 is verified if and only if δi = 1, ∀i = 1, . . . , n.
Definition 8. Given X, the description accuracy of X is defined as
                  η(X) = min (min(ωU η(U ), ωA η(A), ωO η(O))) ,                     (5)
                           X∈X
where ωU , ωA , ωO ≥ 0 are the weights of the label sets, and
                                            d(r, n)
                               η(S) = min           ,
                                       n∈S h(n)

where S ∈ {U, A, O}, r is the root of the taxonomy TS and h(n) is the longest
distance from n to a leaf.
                                                                                   17

   Note that 0 ≤ η(S) ≤ 1. In particular, η(S) = 1 is verified when n is a leaf,
and η(S) = 0 when n coincides with the root.
Definition 9. Given X and X0 , the information loss of the summarization pro-
cess is defined as
                I(X, X0 ) = α(µ(X0 ) − µ(X)) + β(η(X0 ) − η(X)).          (6)
Definition 10. Given X0 and a real number γ > 0, we define the Optimal Sum-
marized Time Sequence X such that the parameterized ratio between I(X, X0 )
and C(X, X0 ) is minimal, i.e.
                                       I(X, X0 )
                           X = argmin              .                     (7)
                                  X   [C(X, X0 )]γ

4    Conclusions and future works
In this work the problem of summarizing data obtained by the log systems of
applications with a lot of users is studied. We have presented a new method to
produce a concise summary of sequences of events related to time, based on the
data size reduction obtained merging time intervals and collapsing the descrip-
tions of more events in a unique descriptor or in a smaller set of descriptors.
Moreover, in order to obtain a data representation as compact as possible, an
abstraction operation allowing an event generalization process is defined.
    Moreover, we are studying about the formalization and the implementation
of an optimal algorithm for the Optimal Summarization Problem. The idea is to
use suboptimal algorithms to obtain the best summarization algorithm for our
method.

References
1. V. Chandola and V. Kumar. Summarization–compressing data into an informative
   representation. Knowledge and Information Systems, 12(3):355–378, 2007.
2. E. Gentili, A. Milani and V.Poggioni. Data summarization model for user action log
   files. In Proceedings of ICCSA 2012, Part III, LNCS 7335, pp. 539–549. Springer,
   Heidelberg, 2012.
3. Y. Jiang, C.S. Perng, and T. Li. Natural event summarization. In Proceedings of the
   20th ACM International Conference on Information and Knowledge Management,
   pages 765–774. ACM, 2011.
4. J. Kiernan and E. Terzi. Constructing comprehensive summaries of large event se-
   quences. In Proceeding of the 14th ACM SIGKDD International Conference on
   Knowledge Discovery and Data Mining, pages 417–425. ACM, 2008.
5. Q.K. Pham, G. Raschia, N. Mouaddib, R. Saint-Paul, and B. Benatallah. Time
   sequence summarization to scale up chronology-dependent applications. In Proceed-
   ing of the 18th ACM conference on Information and knowledge management, pages
   1137–1146, 2009.
6. P. Wang, H. Wang, M. Liu, and W. Wang. An algorithmic approach to event sum-
   marization. In Proceedings of the 2010 international conference on Management of
   data, pages 183–194. ACM, 2010.

</pre>