<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal Support in Sequential Pattern Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leticia I. G´omez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bart Kuijpers</string-name>
          <email>bart.kuijpers@uhasselt.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro A. Vaisman</string-name>
          <email>alejandro.vaisman@uhasselt.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Buenos Aires University, Hasselt University &amp; Transnational University of Limburg</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hasselt University &amp; Transnational University of Limburg</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto Tecn ́ologico de Buenos Aires</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In sequential pattern discovery, the support of a sequence is computed as the number of data-sequences satisfying a pattern with respect to the total number of data-sequences in the database. When the items are frequently updated, the traditional way of counting support in sequential pattern mining may lead to incorrect (or, at least incomplete), conclusions. For example, if we are looking for the support of the sequence A.B, where A and B are two items such that A was created after B, all sequences in the database that were completed before A was created, can never produce a match. Therefore, accounting for them would underestimate the support of the sequence A.B. In this paper we propose to revise the classic notion of support in sequential pattern mining, introducing the concept of temporal support of a sequential expression(SE), intuitively defined as the number of sequences satisfying a target pattern, out of the total number of sequences that could have possibly matched such pattern. We then generalize this notion to regular expressions (RE) which encapsulate the definition of a collection of SEs. We present and discuss a theoretical framework for these novel notion of support, and present an algorithm to compute it.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Traditional sequential patterns algorithms are founded on the assumption that
items in databases are static, and that they existed throughout the whole
lifespan of the world modeled by the database. There are many real-world situations
(where items are created or deleted dynamically) where these assumptions are
not valid. Consider for example SPM in the context of the World Wide Web,
where Web pages are frequently added or deleted. Data Mining techniques have
been applied for discovering interaction patterns of WWW users, typically
analyzing the URLs visited during a session, recorded in a Web server log. Figure
1 depicts a portion of a (simplified) Web log. In classic SPM, the support of a
sequence S is defined as the fraction of sessions that support S. Thus, all sessions
are considered as having the same probability to support a given sequence. For
example, the support of the sequence CBC, counted in the classical way, would
be 66%, since CBC is present in two of the three sessions. Analogously, the
support of the sequence CB would also be 66%. However, if not all these Web pages
existed all the time, would it be reasonable to ignore the evolution of the items
(URLs) across time? We discuss these issues in this paper. The right hand side
session ID interaction</p>
      <p>
        time URL
s1 tt==31 CA
t=5 B
t=6 C
time URL
s2 tt==43 AB
t=8 A
time URL
s3 tt==150 CA
t=20 B
t=23 C
of Figure 1 shows how URLs A, B, and C in the left hand side have evolved,
and the time intervals when each URL has been available. For example, URL A
was available during the interval [
        <xref ref-type="bibr" rid="ref1 ref8">1, 8</xref>
        ], and URL B in intervals [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] and [11,
now ]. (We use the term now to refer to the current time instant). During session
s2, URL C did not exist at t=8, when the user clicked URL A. Thus, session s2
did not have the possibility of producing a sequence that finishes with the URL
C. Sessions s1 and s3, instead, support the sequence CBC. Then, ignoring the
evolution of these URLs, the support of sequence CBC would be 66%, but, if
we do not count session s2, we would obtain a support of 100%. Analogously,
s1 and s3 support the sequence CB but s2 does not. However, C was available
during session s2, when the user clicked URLs A (t=3) and B (t=4). Thus, the
user could have produced the sequence CB, although she decided to follow a
different path. Session s2 must then be counted for computing the support of
the sequence CB, which would be 66%.
      </p>
      <p>In addition to the above, suppose now that a user is interested not only in the
support of a sequence, but in the support of an RE as a whole. Let us analyze
a simple example. The expression (A|B).C is satisfied by sequences like A.C
or B.C. Even though the semantics of this RE suggests that both of them are
equally interesting to the user, if neither of them verifies a minimum support
(although together they do), they would not be retrieved. The problem gets more
involved if we are interested in categorical sequential patterns, i.e., patterns like
Science.Sports, where Science and Sports are, for instance, categories of Web
pages in an ontology.</p>
      <p>In light of the above, we propose to revise, in different ways, the classic notion
of support for sequential pattern mining, introducing the concept of temporal
support of regular expressions, intuitively defined as the number of sequences
satisfying a target pattern, out of the total number of sequences that could
have possibly matched such pattern, where the pattern is defined as a RE over
complex items. After reviewing related work (Section 2), we introduce the data
model (Section 3). Then we present and discuss a theoretical framework for this
novel notion of support, and an RE-based language (Sections 4 and 5). Finally,
we adapt classic Generalized Sequential Pattern (GSP) algorithms to our setting,
introducing appropriate pruning strategies (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Classic algorithms for Sequential Pattern discovery [
        <xref ref-type="bibr" rid="ref1 ref8">1, 8</xref>
        ] return all frequent
sequences present in a database. However, more often than not, only a few ones are
interesting from a user’s point of view. Thus, post-processing tasks are required
in order to discard uninteresting sequences. To avoid this drawback, languages
based on regular expressions (RE) were proposed to restrict frequent sequences
to the ones that satisfy user-specified constraints. Garofalakis et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] address
this problem by pruning the candidate patterns obtained during the mining
process by adding user-specified constraints in the form of regular expressions over
items. The algorithm returns only the frequent patterns that satisfy these
regular expressions. Recently, the data mining community started to discuss new
notions of support in SPM, that account for changes of the items database across
time. Although this problem has already been addressed for Association Rule
mining, where the concept of temporal support has already been introduced [
        <xref ref-type="bibr" rid="ref10 ref4 ref5">4,
5, 10</xref>
        ], this has been overlooked in SPM. To the best of our knowledge, the works
we comment below are the only ones partially addressing the issue. Masseglia
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and Parthasarathy et al.[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], study the so-called incremental sequential
pattern mining problem, focusing on avoiding re-scanning the entire database
when new items appear. Recently, Huang et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] addressed the problem of
detecting frequent patterns valid during a defined period of interest, called POI.
For example, if new items appear, and no new transactions were generated, old
frequent sequences would still be frequent.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data Model</title>
      <p>Depending on the application domain, the items to be mined can be
characterized by different attributes. Throughout the paper we refer to an example where
each Web page is characterized by the following attributes: (a) catName, which
represents the name of the category of the item; (b) keyword, which
summarizes the page contents; (c) filter, specifying a list of URLs that cannot appear
together with the URL of the item. Finally, ID is a distinguished, mandatory
attribute, in this case containing the URL that univocally identifies a Web page.
Definition 1. [Category Schema] We have a set of attribute names A, and a set
of identifier names I. Each attribute a ∈ A is associated with a set of values in
dom(a), and each identifier ID ∈ I is associated with a set of values in dom(ID ).</p>
      <p>A category schema S is a tuple (ID , A), where ID ∈ I is a distinguished
attribute denoted identifier, and A is a set of attributes in A. Without loss of
generality, and for simplicity, in what follows we consider the set A ordered.
Thus, S has the form [ID, attr1, ..., attrn]. ⊓⊔</p>
      <p>
        In many real-world applications, assuming that the values of attributes for
a category occurrence do not change, could not be realistic. Thus, we introduce
the time dimension into our data model, timestamping category occurrences. We
assume that the category schema is constant across time. Formally, if T is a set,
and &lt; a discrete linear order without endpoints on T, the structure TP = (T , &lt;)
is the Point-based Temporal Domain. The elements in the carrier of T model
the individual time instants, and the linear order &lt; models the succession of
time. We consider the set T to be N (standing for the natural numbers). We can
map individual time instants t ∈ N to calendar instants, assuming a reference
point and a granularity. In what follows we use calendar time, and granularity
“minute”. In temporal databases terminology [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], in this paper we assume valid
time support for categories, and transaction time support for items (see below).
Definition 2. [Category Occurrence and Category Instance] Given a category
schema S , a category occurrence for S is the tuple [hID, idi, P , t], where ID
is the ID attribute of Definition 1 above, id ∈ dom(ID), P is the structure
[hattr1, v1i..., hattrn, vni], t is a point in the temporal domain TP , and: (a)
attri = A(i) in S (remember that A is considered ordered); (b) vi ∈ dom(attri),
∀i, i = 1..n; (c) All the occurrences of the same category have the same set of
attributes, at any given time; (d) At any instant t, the pair hID , ti is unique for
a category occurrence, meaning that no two occurrences of the same category can
have the same value for ID at the same time; (e) t is the time instant when the
information in the category occurrence is valid.
      </p>
      <p>A set of occurrences of the same category is denoted a category instance.
We extend the fourth condition in Definition 2 to hold for the whole set: no two
occurrences of categories in the set can have the same value for ID at the same
instant t (in other words, the pair (ID, t) is unique for the whole instance). ⊓⊔</p>
      <p>In what follows, for clarity, we assume that attr1 stands for ID . Thus, a
category occurrence is the set of pairs [h attr1,v1i,..., ... , h attrn,vni, t]. ⊓⊔
Definition 3. [Interval Encoding] Let G be a time granularity, and g a time
unit for G (e.g., one minute). Given a set of k ≥ 0 category occurrences,
[hattr1, v1i, . . . , hattrn, vni, t1], [hattr1, v1i, . . . , hattrn, vni, t2],. . ., [hattr1, v1i . . . ,
hattrn, vni, tk], if ∀ i, i = 1..k − 1, it holds that ti+1 = ti + g, we encode all these
occurrences in a single tuple [hattr1, v1i, . . . , hattrn, vni, [t1, tk]]. ⊓⊔
Definition 4. [Encoded Category Occurrence and Items] Given a category
instance C with time granularity G, and a partition P of C such that the
number of sets pi ∈ P is minimal. Each set pi is obtained encoding the
occurrences in C as in Definition 3, i.e., each pi contains a set of tuples that can
be encoded into a single tuple. Thus, associated to pi there is a tuple tpi =
(hID, idi, hattr1, v1i, . . . hattrn, vni, tv, te), where (a) ID, attr1, ..., attrn are the
attributes of the occurrences in pi; (b) id, v1, .... vn are the values for the
attributes in (a); (c) ts is the smallest t of the occurrences in pi; (d) te is the
largest t of the occurrences in pi. We denote tpi an encoded category occurrence
(ECO) of the set of occurrences in pi. Given an ECO ei we denote Interval (ei)
its associated interval [ts, te].</p>
      <p>Adding a time instant to an ECO, produces an Item. Let eo = (hID, vi, hattr1,
v1i, . . . hattrn, vni, ts, te) be an ECO for some category instance. An item I
associated to eo is the set: (ht, vti, hID, vi, hattr1, v1i, . . . hattrn, vni, ts, tei), such
that vt ∈ [ts, te] holds. We denote t the transaction time of the item. ⊓⊔
ecoA1 [(ID,‘A’), (catName,‘WebPage’), (filter,‘’), (keyword,‘Books’), [‘11/29/2007 15:45’, ‘11/29/2007 17:10’] ]
ecoA2 [(ID,‘A’), (catName,‘WebPage’), (filter,‘P’), (keyword,‘Computers’), [‘11/29/2007 18:00’, ‘11/29/2007 19:30’]]
eeeeecccccoooooPCCCM13121 [[[[[(((((IIIIIDDDDD,,,,,‘‘‘‘‘PCCCM’’’’’))))),,,,, (((((cccccaaaaatttttNNNNNaaaaammmmmeeeee,,,,,‘‘‘‘‘WWWWWeeeeebbbbbPPPPPaaaaagggggeeeee’’’’’))))),,,,, (((((fffffiiiiilllllttttteeeeerrrrr,,,,,‘‘‘‘‘’’’AM))),,,’’)),, (((((kkkkkeeeeeyyyyywwwwwooooorrrrrddddd,,,,,‘‘‘‘‘BCGGGooaaaommmmkpeeessssu’’’’)t))),e,,,rs’), [[[[[‘‘‘‘‘1111111111/////2222299999/////22222000000000077777 1111188966:::::5034000050’’’’’,,,,, ‘‘‘‘‘111N11111o////w2222’9999]////]2222000000007777 21110886::::05540005’’’’]]]]]]]]
ecoP2 [(ID,‘P’), (catName,‘WebPage’), (filter,‘A’), (keyword,‘Computers’), [‘11/29/2007 18:51’, ‘Now’]]
Definition 5. [Sequential Expression] A Sequential Expression (SE) of length
n is an ordered list of n sub-expressions SE1.SE2....SEn, where each SEi is a
constraint, ∀i, i = 1..n.</p>
      <p>A constraint is a formula enclosed in squared brackets. Given a constraint
C = [F ], we denote F (C) the formula of C. A formula is built as a conjunction of
atoms of the form term1 = term2, where term1 is an attribute (temporal or
nontemporal) or a function, and term2 is a constant. More precisely, the terms of the
language are: (a) Constants: a literal enclosed by simple quotes; (b) Non temporal
Attributes: an attribute in the category schema (e.g. filter, url). (c) Temporal
Attributes: t, the temporal attribute of and item; (d) Functions of n arguments: Let
fn be a function symbol, the expression fn(attribute, ‘ct1’, ‘ct2’, ..., ‘ctn−1’), n ≥
1, is a function where the first parameter is an attribute (temporal or
nontemporal), and all the other ones are constants. ⊓⊔</p>
      <p>Given two intervals Ii = [tsi, tei] and Ij = [tsj , tej ] we say that Ii follows
Ij if tsi ≥ tej . For example, in Figure 2 we can see that Interval(ecoC3) follows
Interval(ecoA2) and Interval(ecoC2). We can also see that Interval(ecoC3) does
not follow Interval(ecoM1).</p>
      <p>Definition 6. [ECO-Satisfability of a Constraint] Given a constraint C and an
ECO E, we say that E satisfies C if one of the following conditions hold: (a) If
OID Items</p>
      <p>[(t,‘11/29/2007 16:30’), (ID,‘C’) ]
Session1 [(t,‘11/29/2007 17:00’), (ID,‘P’) ]
[(t,‘11/29/2007 19:45’), (ID,‘C’),]
[(t,‘11/29/2007 18:20’), (ID,‘C’)]
Session2 [(t,‘11/29/2007 18:50’), (ID,‘P’)]
[(t,‘11/29/2007 18:51’), (ID,‘M’)]
[(t,‘11/29/2007 19:31’), (ID,‘C’)]
Session3 [(t,‘11/29/2007 19:32’), (ID,‘M’)]
[(t,‘11/29/2007 20:00’), (ID,‘C’)]
F (C) is of the form attr = ‘ct’ where attr is an attribute, ‘ct ′ is a constant, and
the instantiation of attr with its value in E, verifies the equality. (b) If F (C)
is of the form fn(attr , ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’, attr and ‘ct ′ are defined
as in (a), and the instantiation of attr in fn with its value in E, makes the
equality true. (c) If F (C) is of the form t = ‘ct’ where t is a temporal attribute,
‘ct’ is a temporal constant, and ‘ct’ ∈ Interval (E). (d) If F (C) is of the form
fn(t, ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’, where t and ‘ct’ are defined as in (c), and
∃tu ∈ Interval (E) such that the equality is true. (e) If F (C) is a formula F 1∧F 2,
and F 1 and F 2 are satisfied by E. ⊓⊔
Definition 7. [ECO-Satisfability of SE] Let EO = (EO1, EO2, ...EOn) be a list
of ECOs such that ∀ i, j, i &lt; j ⇒ Interval (Ei) does not follow Interval (Ej ). We
denote EO a t-ordered list of ECOs. A sequential expression SE=SE1.SE2....SEn
is satisfied by EO if EOi satisfies SEi, ∀ i, i = 1..n. We denote SLk (SE) the
set composed of the n lists of ECOs that satisfy an SE of length k. ⊓⊔
Definition 8. [ToI and Normalized ToI] Let I be a finite set of items. A Table
Of Items (ToI) for I is a table with schema T = (OID, Items), where Items
is the name of an attribute whose instances are items, and an instance of T
is a finite set of tuples of the form hOj , iki where ik ∈ I is an item associated
to the object Oj . Moreover, given hOj , iki and hOj , imi, two tuples
corresponding to the same object, and tk and tm the transaction times of the items, then
tk 6= tm holds. A normalized ToI is a database containing a table with schema
(OID, t, ID) (the Normalized ToI), and one table per category, each one with
schema (ID, attr1, ..., attrn, ts, te). ⊓⊔</p>
      <p>Figure 3 shows an instance of a normalized ToI where items are related to the
category instances of Figure 2. There are three sessions (sequences), Session1,
Session2 and Session3, each one with an associated list of items. The three
sessions clicked on URL C, but only Session1 would satisfy the constraint [ID =
‘C’ ∧ catN ame = ‘Books’] (see Figure 2).</p>
      <p>Definition 9. [Temporal Matching of a S.E] Consider a normalized ToI (from
now on, nToI), with schema (OID, t, ID). An object identified by OIDm
temporally matches a SE of length k, if there exist k tuples in nToI, hOIDm, t1, ID1i,
hOIDm, t2, ID2i, . . . , hOIDm, tk, IDki, where for at least one Lp ∈ SLk (SE),
Lp = (eco1, eco2, . . . , ecok), ti ∈ Interval (ecoi), ∀ i = 1..k. ⊓⊔
Example 1. Let us analyze SE = [ID = ‘P’].[filter = ‘M’], using Figures 2 and
3. The first constraint is satisfied by ecoP 1 and ecoP 2; the second one, by ecoC3.
Thus, the t-ordered lists of ECOs that satisfy SE are L1 = {ecoP 1, ecoC3}
and L2 = {ecoP 2, ecoC3}. The object Session 1 temporally matches SE, since
there exist two different tuples in Session 1 whose transaction times belong
to Interval (ecoP 1) and Interval (ecoC3), respectively. With a similar analysis,
Session 2 does not match the SE (ecoC3 did not exist when the user in this
session clicked the last two URLs). Finally, Session3 temporally matches SE. ⊓⊔
Definition 10. [Temporal Satisfability of a Constraint] Given a constraint C
and a normalized ToI, with schema (OID, t, ID), we say that a tuple in nToI
μ = hOIDm, tm, IDmi temporally satisfies C if at least one of the following
conditions hold: (a) if F (C) is of the form t = ‘ct’ where t is a temporal
attribute, ‘ct’ is a temporal constant, and ‘ct’ = tm; (b) if F (C) is of the form
fn(t, ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’, and fn(tm, ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’ ; (c)
if F (C) does not contain a temporal attribute; (d) if F (C) is a formula F 1 ∧ F 2
and F1 and F2 are satisfied by μ. ⊓⊔
Definition 11. [Total Matching of a Sequential Expression] Given a sequential
expression SE=SE1.SE2...SEk of length k, and a normalized ToI with schema
(OID, t, ID), we say that an object identified by OIDm totally matches SE, if
there exists k different tuples μ1, . . . , μk in nToI, of the form μ1 = hOIDm, t1, ID1i,
μ2 = hOIDm, t2, ID2i,...,μk = hOIDm, tk, IDki, and there is at least one list
Lp = (eco1, eco2, ...ecok), Lp ∈ SLk (SE), where the following conditions hold:
(a) ti ∈ Interval (ecoi), ∀ i, i = 1..k; (b) IDi is the identifier of the
encoded category occurrence (ecoi), ∀ i, i = 1..k; (c) SEi is temporally satisfied by
μi, ∀ i, = 1..k. We denote each Lp a list of interest for SE. ⊓⊔</p>
      <p>From Definition 9, if a list of ECOs does not satisfy a sequential expression
SE, no object in the nToI can use this list to temporally match SE. Thus, given
that the lists in SLk (SE) are computed over the category occurrences, which
usually fit in main memory, unnecessary database scans can be avoided. In
addition, from Definitions 11 and 10, if an object OIDj in T does not temporally
match SE, then it cannot totally match SE, and, if an object OIDj in T that
totally matches SE, then OIDj temporally matches SE.</p>
      <p>Example 2. Object Session1 in Example 1, totally matches SE, using the second
and third tuples, together with list L1. On the other hand, Session2 does not
totally match SE, since it does not temporally match the expression. Finally,
Session3 temporally matches SE, but it does not totally match it, because L2
does not satisfy the second condition in Definition 11. ⊓⊔
Definition 12 (Temporal Support of SE). The temporal support of a
sequential expression SE, denoted Ts(SE), is the quotient between the number of
different objects that totally match SE and the number of different objects that
temporally match SE, if the latter is different than zero. Otherwise Ts(SE) = 0.
⊓⊔</p>
    </sec>
    <sec id="sec-4">
      <title>Temporal Support of Regular Expressions</title>
      <p>Definition 13. [R.E. over constraints] A regular expression over constraints is
an expression generated by the grammar</p>
      <p>E ←− C | E|E | E? | E∗ | E+ | E.E | E | ǫ
where C is a constraint, ǫ is the symbol representing the empty expression,
‘ |’ means disjunction, ‘ .’ means concatenation, ‘ ?’ “zero or one occurrence”,
‘ + ’ “one or more occurrences”, and ‘ ∗ ’ “zero or more occurrences”. ⊓⊔</p>
      <p>Since REs produce SEs, extending the definitions of Section 4 using the
grammar of Definition 13 is straightforward.</p>
      <p>Definition 14 (Temporal Support of a RE). Given a regular expression
R generated by the grammar of Definition 13, the DFA AR that accepts R,
and a normalized ToI with schema (OID, t, ID), we say that OIDm temporally
matches R, if there exists some n ∈ N such that there exists at least one string of
length n accepted by AR, and OIDm temporally matches this string. Moreover,
OIDm totally matches RE, if there exists some n ∈ N such that there is at least
one string of length n accepted by AR and OIDm totally matches this string.</p>
      <p>The temporal support of a regular expression R, denoted Tr(R), is the
quotient between the number of different objects that totally match R and the number
of different objects that temporally match R, if the latter is different than zero.
Otherwise Tr(R) = 0. ⊓⊔
Example 3. Consider the following SEs: SE1=[keyword=‘Games’], SE2 =
[keyword=‘Games’].[filter=‘’], and SE3 = [keyword=‘Games’].[filter=‘’]. [filter=‘’]
(i.e., for each SEi, i &gt; 3, a condition [filter=‘’] is added). They are produced
by the RE R = [keyword = ‘Games’].([f ilter = ‘’])∗. To compute the
Temporal Support of R, we need to check the satisfability of all the SEs that the
RE can produce. Since no session has four tuples, in our running example we
only need to compute the support of SE1 through SE3, proceeding like in
Example 2. For example, for SE1, SL1 (SE) = {L1 = {ecoC2}, L2 = {ecoC3},
L3 = {ecoM1}}. From the third tuple in Session1, and L2 = {ecoC3}, we
conclude that Session1 totally (and hence, temporally) matches SE1. Analogously,
Session2 and Session3 totally match SE1. Finally, the temporal support of SE1
is 3/3 = 1. Repeating this procedure for all the possible SEs that the RE can
produce, we can compute the temporal support of the RE. ⊓⊔
6</p>
    </sec>
    <sec id="sec-5">
      <title>Computing the Temporal Support of RE</title>
      <p>
        We propose an algorithm, based on GSP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and SPIRIT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], that generates
sequential expressions of incremental length using AR. The general idea of the
algorithm is the following. At each step k, candidate lists of ECOs of length k are
produced, using the lists of length k-1 produced in the previous step. Accessing
a
[keyword=‘Games’]
[filter=‘’]
      </p>
      <p>b</p>
      <p>Fig. 4. DFA for R
the ToI, we could check which of these lists of ECOs are temporally and/or
totally matched by some object in that table. To prune as early as possible
the ECOs that will not produce a match, we use the DFA AR. Building and
early pruning these lists of ECOs are the keys of the algorithm. Each step k is
composed of three phases: Phase 1 consists in building a set CSk of candidate
sequences of ECOs of length k; in Phase 2, these CSk’s are pruned using the
automaton; in Phase 3, the sequences in CSk that are not temporally matched
by any object in nToI are pruned. A sequence ecoik of ECOs (of length k) is
pruned if it could not be temporally matched by any object. If at a step p,
CSp = ∅, or there is no nToI instance with at least p tuples for the same object
(i.e., ecop cannot be matched by any object in the nToI), the algorithm exits the
loop. Two further steps are needed. The first one uses AR again, this time to
prune all lists obtained in intermediate steps that it does not accept. The final
step uses nToI to compute the temporal support of R.</p>
      <p>The DFA AR, which recognizes the RE R, generates sequential expressions
of different lengths, such that each constraint in an SE labels an edge of AR. We
use AR to prune the lists of ECOs produced at each step. Since this automaton
typically fits in main memory, the number of accesses to the nToI is reduced.
The algorithm accepts, at iteration k, a sequence that matches any sub-path in
the automaton (thus, a kind of “relaxed” constraint is used). For example, for
SE = [keyword = ‘Games’].([f ilter = ‘’])∗, AR is depicted in Figure 4. At k = 1
we accept any sequence that satisfies the constraints [keyword = ‘Games’] or
[f ilter = ‘’] (respectively, paths a-b and b-b in the automaton of Figure 4. In
the final phase AR is used for strict verification (i.e., to match a path starting
and ending in a initial and final state, respectively).</p>
      <p>Example 4. Consider the regular expression R=[keyword=‘Games’].([filter=‘’])∗.
The algorithm first builds AR, the automaton that accepts R, shown in Figure
4. The first phase populates CS1 with the lists of candidate ECOs that could
satisfy an SE of length one (i.e., all the ecos). In the second phase, we use AR to
generate words of length 1, in this case [keyword=‘Games’] and [filter=‘’]. Then,
we prune CS1 keeping only the lists of ECOs that satisfy any of these words.
Note that L2, L3 and L8 are dropped, because neither of them has a ‘Games’
keyword, nor an empty filter. In the third phase, we access nToI to check which lists
of ECOs are not temporally matched by any object. In this example there is no
pruning. For example, L1 is temporally matched by the first tuple of Session1.
Figure 5 shows the results at each phase of iteration k = 1. The other iterations
proceed analogously, with the exception that an apriori verification is performed,
that is whether or not the lists of ECOs in CSj are t − ordered . For example,
for k = 2, we check if given two joining lists L1 = {ecoi} and L2 = {ecoj},
L12 = {ecoi, ecoj } is t − ordered . All the non-t-ordered lists are pruned, since
they cannot produce a match. ⊓⊔
We expect to extend our work in two ways. On the one hand, the theoretical
framework introduced here allows to think in a more general definition of
support, with different semantics (not only temporal), that may enhance current
data mining tools. On the other hand, we will develop an optimized
implementation of the algorithm that can support massive amounts of data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Mining sequential patterns</article-title>
          .
          <source>In Proc. of the Int'l Conference on Data Engineering (ICDE)</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Shim</surname>
          </string-name>
          . Spirit:
          <article-title>Sequential pattern mining with regular expression constraints</article-title>
          .
          <source>In Proceedings of the 25th VLDB Conference</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>A general model for sequential pattern mining with a progressive database</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>20</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1153</fpage>
          -
          <lpage>1167</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>On mining general temporal association rules in a publication database</article-title>
          .
          <source>In ICDM</source>
          , pages
          <fpage>337</fpage>
          -
          <lpage>344</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.S.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Jajodia</surname>
          </string-name>
          .
          <article-title>Discovering calendar-based temporal association rules</article-title>
          .
          <source>Data Knowl. Eng.</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ):
          <fpage>193</fpage>
          -
          <lpage>218</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>F.</given-names>
            <surname>Masseglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Poncelet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Teisseire</surname>
          </string-name>
          .
          <article-title>Incremental mining of sequential patterns in large databases</article-title>
          .
          <source>Data Knowl. Eng.</source>
          ,
          <volume>46</volume>
          (
          <issue>1</issue>
          ):
          <fpage>97</fpage>
          -
          <lpage>121</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ogihara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dwarkadas</surname>
          </string-name>
          .
          <article-title>Incremental and interactive sequence mining</article-title>
          .
          <source>In CIKM '99</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>258</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Mining sequential patterns: Generalizations and performance improvements</article-title>
          .
          <source>In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT)</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Tansel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clifford</surname>
          </string-name>
          , and S. Gadia (eds.).
          <source>Temporal Databases: Theory, Design and Implementation</source>
          . Benjamin/Cummings,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Abdullah Uz Tansel and
          <string-name>
            <given-names>Susan P.</given-names>
            <surname>Imberman</surname>
          </string-name>
          .
          <article-title>Discovery of association rules in temporal databases</article-title>
          .
          <source>In ITNG</source>
          , pages
          <fpage>371</fpage>
          -
          <lpage>376</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>