-

Temporal Support in Sequential Pattern Mining

Leticia I. G´omez

Bart Kuijpers

bart.kuijpers@uhasselt.be 0

Alejandro A. Vaisman

alejandro.vaisman@uhasselt.be 1 0 Buenos Aires University, Hasselt University & Transnational University of Limburg 1 Hasselt University & Transnational University of Limburg 2 Instituto Tecn ́ologico de Buenos Aires

In sequential pattern discovery, the support of a sequence is computed as the number of data-sequences satisfying a pattern with respect to the total number of data-sequences in the database. When the items are frequently updated, the traditional way of counting support in sequential pattern mining may lead to incorrect (or, at least incomplete), conclusions. For example, if we are looking for the support of the sequence A.B, where A and B are two items such that A was created after B, all sequences in the database that were completed before A was created, can never produce a match. Therefore, accounting for them would underestimate the support of the sequence A.B. In this paper we propose to revise the classic notion of support in sequential pattern mining, introducing the concept of temporal support of a sequential expression(SE), intuitively defined as the number of sequences satisfying a target pattern, out of the total number of sequences that could have possibly matched such pattern. We then generalize this notion to regular expressions (RE) which encapsulate the definition of a collection of SEs. We present and discuss a theoretical framework for these novel notion of support, and present an algorithm to compute it.

Traditional sequential patterns algorithms are founded on the assumption that items in databases are static, and that they existed throughout the whole lifespan of the world modeled by the database. There are many real-world situations (where items are created or deleted dynamically) where these assumptions are not valid. Consider for example SPM in the context of the World Wide Web, where Web pages are frequently added or deleted. Data Mining techniques have been applied for discovering interaction patterns of WWW users, typically analyzing the URLs visited during a session, recorded in a Web server log. Figure 1 depicts a portion of a (simplified) Web log. In classic SPM, the support of a sequence S is defined as the fraction of sessions that support S. Thus, all sessions are considered as having the same probability to support a given sequence. For example, the support of the sequence CBC, counted in the classical way, would be 66%, since CBC is present in two of the three sessions. Analogously, the support of the sequence CB would also be 66%. However, if not all these Web pages existed all the time, would it be reasonable to ignore the evolution of the items (URLs) across time? We discuss these issues in this paper. The right hand side session ID interaction

time URL s1 tt==31 CA t=5 B t=6 C time URL s2 tt==43 AB t=8 A time URL s3 tt==150 CA t=20 B t=23 C of Figure 1 shows how URLs A, B, and C in the left hand side have evolved, and the time intervals when each URL has been available. For example, URL A was available during the interval [ 1, 8 ], and URL B in intervals [ 4, 5 ] and [11, now ]. (We use the term now to refer to the current time instant). During session s2, URL C did not exist at t=8, when the user clicked URL A. Thus, session s2 did not have the possibility of producing a sequence that finishes with the URL C. Sessions s1 and s3, instead, support the sequence CBC. Then, ignoring the evolution of these URLs, the support of sequence CBC would be 66%, but, if we do not count session s2, we would obtain a support of 100%. Analogously, s1 and s3 support the sequence CB but s2 does not. However, C was available during session s2, when the user clicked URLs A (t=3) and B (t=4). Thus, the user could have produced the sequence CB, although she decided to follow a different path. Session s2 must then be counted for computing the support of the sequence CB, which would be 66%.

In addition to the above, suppose now that a user is interested not only in the support of a sequence, but in the support of an RE as a whole. Let us analyze a simple example. The expression (A|B).C is satisfied by sequences like A.C or B.C. Even though the semantics of this RE suggests that both of them are equally interesting to the user, if neither of them verifies a minimum support (although together they do), they would not be retrieved. The problem gets more involved if we are interested in categorical sequential patterns, i.e., patterns like Science.Sports, where Science and Sports are, for instance, categories of Web pages in an ontology.

In light of the above, we propose to revise, in different ways, the classic notion of support for sequential pattern mining, introducing the concept of temporal support of regular expressions, intuitively defined as the number of sequences satisfying a target pattern, out of the total number of sequences that could have possibly matched such pattern, where the pattern is defined as a RE over complex items. After reviewing related work (Section 2), we introduce the data model (Section 3). Then we present and discuss a theoretical framework for this novel notion of support, and an RE-based language (Sections 4 and 5). Finally, we adapt classic Generalized Sequential Pattern (GSP) algorithms to our setting, introducing appropriate pruning strategies (Section 6).

Related Work

Classic algorithms for Sequential Pattern discovery [ 1, 8 ] return all frequent sequences present in a database. However, more often than not, only a few ones are interesting from a user’s point of view. Thus, post-processing tasks are required in order to discard uninteresting sequences. To avoid this drawback, languages based on regular expressions (RE) were proposed to restrict frequent sequences to the ones that satisfy user-specified constraints. Garofalakis et al. [ 2 ] address this problem by pruning the candidate patterns obtained during the mining process by adding user-specified constraints in the form of regular expressions over items. The algorithm returns only the frequent patterns that satisfy these regular expressions. Recently, the data mining community started to discuss new notions of support in SPM, that account for changes of the items database across time. Although this problem has already been addressed for Association Rule mining, where the concept of temporal support has already been introduced [ 4, 5, 10 ], this has been overlooked in SPM. To the best of our knowledge, the works we comment below are the only ones partially addressing the issue. Masseglia et al. [ 6 ], and Parthasarathy et al.[ 7 ], study the so-called incremental sequential pattern mining problem, focusing on avoiding re-scanning the entire database when new items appear. Recently, Huang et al. [ 3 ] addressed the problem of detecting frequent patterns valid during a defined period of interest, called POI. For example, if new items appear, and no new transactions were generated, old frequent sequences would still be frequent. 3

Data Model

Depending on the application domain, the items to be mined can be characterized by different attributes. Throughout the paper we refer to an example where each Web page is characterized by the following attributes: (a) catName, which represents the name of the category of the item; (b) keyword, which summarizes the page contents; (c) filter, specifying a list of URLs that cannot appear together with the URL of the item. Finally, ID is a distinguished, mandatory attribute, in this case containing the URL that univocally identifies a Web page. Definition 1. [Category Schema] We have a set of attribute names A, and a set of identifier names I. Each attribute a ∈ A is associated with a set of values in dom(a), and each identifier ID ∈ I is associated with a set of values in dom(ID ).

A category schema S is a tuple (ID , A), where ID ∈ I is a distinguished attribute denoted identifier, and A is a set of attributes in A. Without loss of generality, and for simplicity, in what follows we consider the set A ordered. Thus, S has the form [ID, attr1, ..., attrn]. ⊓⊔

In many real-world applications, assuming that the values of attributes for a category occurrence do not change, could not be realistic. Thus, we introduce the time dimension into our data model, timestamping category occurrences. We assume that the category schema is constant across time. Formally, if T is a set, and < a discrete linear order without endpoints on T, the structure TP = (T , <) is the Point-based Temporal Domain. The elements in the carrier of T model the individual time instants, and the linear order < models the succession of time. We consider the set T to be N (standing for the natural numbers). We can map individual time instants t ∈ N to calendar instants, assuming a reference point and a granularity. In what follows we use calendar time, and granularity “minute”. In temporal databases terminology [ 9 ], in this paper we assume valid time support for categories, and transaction time support for items (see below). Definition 2. [Category Occurrence and Category Instance] Given a category schema S , a category occurrence for S is the tuple [hID, idi, P , t], where ID is the ID attribute of Definition 1 above, id ∈ dom(ID), P is the structure [hattr1, v1i..., hattrn, vni], t is a point in the temporal domain TP , and: (a) attri = A(i) in S (remember that A is considered ordered); (b) vi ∈ dom(attri), ∀i, i = 1..n; (c) All the occurrences of the same category have the same set of attributes, at any given time; (d) At any instant t, the pair hID , ti is unique for a category occurrence, meaning that no two occurrences of the same category can have the same value for ID at the same time; (e) t is the time instant when the information in the category occurrence is valid.

A set of occurrences of the same category is denoted a category instance. We extend the fourth condition in Definition 2 to hold for the whole set: no two occurrences of categories in the set can have the same value for ID at the same instant t (in other words, the pair (ID, t) is unique for the whole instance). ⊓⊔

In what follows, for clarity, we assume that attr1 stands for ID . Thus, a category occurrence is the set of pairs [h attr1,v1i,..., ... , h attrn,vni, t]. ⊓⊔ Definition 3. [Interval Encoding] Let G be a time granularity, and g a time unit for G (e.g., one minute). Given a set of k ≥ 0 category occurrences, [hattr1, v1i, . . . , hattrn, vni, t1], [hattr1, v1i, . . . , hattrn, vni, t2],. . ., [hattr1, v1i . . . , hattrn, vni, tk], if ∀ i, i = 1..k − 1, it holds that ti+1 = ti + g, we encode all these occurrences in a single tuple [hattr1, v1i, . . . , hattrn, vni, [t1, tk]]. ⊓⊔ Definition 4. [Encoded Category Occurrence and Items] Given a category instance C with time granularity G, and a partition P of C such that the number of sets pi ∈ P is minimal. Each set pi is obtained encoding the occurrences in C as in Definition 3, i.e., each pi contains a set of tuples that can be encoded into a single tuple. Thus, associated to pi there is a tuple tpi = (hID, idi, hattr1, v1i, . . . hattrn, vni, tv, te), where (a) ID, attr1, ..., attrn are the attributes of the occurrences in pi; (b) id, v1, .... vn are the values for the attributes in (a); (c) ts is the smallest t of the occurrences in pi; (d) te is the largest t of the occurrences in pi. We denote tpi an encoded category occurrence (ECO) of the set of occurrences in pi. Given an ECO ei we denote Interval (ei) its associated interval [ts, te].

Adding a time instant to an ECO, produces an Item. Let eo = (hID, vi, hattr1, v1i, . . . hattrn, vni, ts, te) be an ECO for some category instance. An item I associated to eo is the set: (ht, vti, hID, vi, hattr1, v1i, . . . hattrn, vni, ts, tei), such that vt ∈ [ts, te] holds. We denote t the transaction time of the item. ⊓⊔ ecoA1 [(ID,‘A’), (catName,‘WebPage’), (filter,‘’), (keyword,‘Books’), [‘11/29/2007 15:45’, ‘11/29/2007 17:10’] ] ecoA2 [(ID,‘A’), (catName,‘WebPage’), (filter,‘P’), (keyword,‘Computers’), [‘11/29/2007 18:00’, ‘11/29/2007 19:30’]] eeeeecccccoooooPCCCM13121 [[[[[(((((IIIIIDDDDD,,,,,‘‘‘‘‘PCCCM’’’’’))))),,,,, (((((cccccaaaaatttttNNNNNaaaaammmmmeeeee,,,,,‘‘‘‘‘WWWWWeeeeebbbbbPPPPPaaaaagggggeeeee’’’’’))))),,,,, (((((fffffiiiiilllllttttteeeeerrrrr,,,,,‘‘‘‘‘’’’AM))),,,’’)),, (((((kkkkkeeeeeyyyyywwwwwooooorrrrrddddd,,,,,‘‘‘‘‘BCGGGooaaaommmmkpeeessssu’’’’)t))),e,,,rs’), [[[[[‘‘‘‘‘1111111111/////2222299999/////22222000000000077777 1111188966:::::5034000050’’’’’,,,,, ‘‘‘‘‘111N11111o////w2222’9999]////]2222000000007777 21110886::::05540005’’’’]]]]]]]] ecoP2 [(ID,‘P’), (catName,‘WebPage’), (filter,‘A’), (keyword,‘Computers’), [‘11/29/2007 18:51’, ‘Now’]] Definition 5. [Sequential Expression] A Sequential Expression (SE) of length n is an ordered list of n sub-expressions SE1.SE2....SEn, where each SEi is a constraint, ∀i, i = 1..n.

A constraint is a formula enclosed in squared brackets. Given a constraint C = [F ], we denote F (C) the formula of C. A formula is built as a conjunction of atoms of the form term1 = term2, where term1 is an attribute (temporal or nontemporal) or a function, and term2 is a constant. More precisely, the terms of the language are: (a) Constants: a literal enclosed by simple quotes; (b) Non temporal Attributes: an attribute in the category schema (e.g. filter, url). (c) Temporal Attributes: t, the temporal attribute of and item; (d) Functions of n arguments: Let fn be a function symbol, the expression fn(attribute, ‘ct1’, ‘ct2’, ..., ‘ctn−1’), n ≥ 1, is a function where the first parameter is an attribute (temporal or nontemporal), and all the other ones are constants. ⊓⊔

Given two intervals Ii = [tsi, tei] and Ij = [tsj , tej ] we say that Ii follows Ij if tsi ≥ tej . For example, in Figure 2 we can see that Interval(ecoC3) follows Interval(ecoA2) and Interval(ecoC2). We can also see that Interval(ecoC3) does not follow Interval(ecoM1).

Definition 6. [ECO-Satisfability of a Constraint] Given a constraint C and an ECO E, we say that E satisfies C if one of the following conditions hold: (a) If OID Items

[(t,‘11/29/2007 16:30’), (ID,‘C’) ] Session1 [(t,‘11/29/2007 17:00’), (ID,‘P’) ] [(t,‘11/29/2007 19:45’), (ID,‘C’),] [(t,‘11/29/2007 18:20’), (ID,‘C’)] Session2 [(t,‘11/29/2007 18:50’), (ID,‘P’)] [(t,‘11/29/2007 18:51’), (ID,‘M’)] [(t,‘11/29/2007 19:31’), (ID,‘C’)] Session3 [(t,‘11/29/2007 19:32’), (ID,‘M’)] [(t,‘11/29/2007 20:00’), (ID,‘C’)] F (C) is of the form attr = ‘ct’ where attr is an attribute, ‘ct ′ is a constant, and the instantiation of attr with its value in E, verifies the equality. (b) If F (C) is of the form fn(attr , ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’, attr and ‘ct ′ are defined as in (a), and the instantiation of attr in fn with its value in E, makes the equality true. (c) If F (C) is of the form t = ‘ct’ where t is a temporal attribute, ‘ct’ is a temporal constant, and ‘ct’ ∈ Interval (E). (d) If F (C) is of the form fn(t, ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’, where t and ‘ct’ are defined as in (c), and ∃tu ∈ Interval (E) such that the equality is true. (e) If F (C) is a formula F 1∧F 2, and F 1 and F 2 are satisfied by E. ⊓⊔ Definition 7. [ECO-Satisfability of SE] Let EO = (EO1, EO2, ...EOn) be a list of ECOs such that ∀ i, j, i < j ⇒ Interval (Ei) does not follow Interval (Ej ). We denote EO a t-ordered list of ECOs. A sequential expression SE=SE1.SE2....SEn is satisfied by EO if EOi satisfies SEi, ∀ i, i = 1..n. We denote SLk (SE) the set composed of the n lists of ECOs that satisfy an SE of length k. ⊓⊔ Definition 8. [ToI and Normalized ToI] Let I be a finite set of items. A Table Of Items (ToI) for I is a table with schema T = (OID, Items), where Items is the name of an attribute whose instances are items, and an instance of T is a finite set of tuples of the form hOj , iki where ik ∈ I is an item associated to the object Oj . Moreover, given hOj , iki and hOj , imi, two tuples corresponding to the same object, and tk and tm the transaction times of the items, then tk 6= tm holds. A normalized ToI is a database containing a table with schema (OID, t, ID) (the Normalized ToI), and one table per category, each one with schema (ID, attr1, ..., attrn, ts, te). ⊓⊔

Figure 3 shows an instance of a normalized ToI where items are related to the category instances of Figure 2. There are three sessions (sequences), Session1, Session2 and Session3, each one with an associated list of items. The three sessions clicked on URL C, but only Session1 would satisfy the constraint [ID = ‘C’ ∧ catN ame = ‘Books’] (see Figure 2).

Definition 9. [Temporal Matching of a S.E] Consider a normalized ToI (from now on, nToI), with schema (OID, t, ID). An object identified by OIDm temporally matches a SE of length k, if there exist k tuples in nToI, hOIDm, t1, ID1i, hOIDm, t2, ID2i, . . . , hOIDm, tk, IDki, where for at least one Lp ∈ SLk (SE), Lp = (eco1, eco2, . . . , ecok), ti ∈ Interval (ecoi), ∀ i = 1..k. ⊓⊔ Example 1. Let us analyze SE = [ID = ‘P’].[filter = ‘M’], using Figures 2 and 3. The first constraint is satisfied by ecoP 1 and ecoP 2; the second one, by ecoC3. Thus, the t-ordered lists of ECOs that satisfy SE are L1 = {ecoP 1, ecoC3} and L2 = {ecoP 2, ecoC3}. The object Session 1 temporally matches SE, since there exist two different tuples in Session 1 whose transaction times belong to Interval (ecoP 1) and Interval (ecoC3), respectively. With a similar analysis, Session 2 does not match the SE (ecoC3 did not exist when the user in this session clicked the last two URLs). Finally, Session3 temporally matches SE. ⊓⊔ Definition 10. [Temporal Satisfability of a Constraint] Given a constraint C and a normalized ToI, with schema (OID, t, ID), we say that a tuple in nToI μ = hOIDm, tm, IDmi temporally satisfies C if at least one of the following conditions hold: (a) if F (C) is of the form t = ‘ct’ where t is a temporal attribute, ‘ct’ is a temporal constant, and ‘ct’ = tm; (b) if F (C) is of the form fn(t, ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’, and fn(tm, ‘ct1’, ‘ct2’, ..., ‘ctn−1’) = ‘ct’ ; (c) if F (C) does not contain a temporal attribute; (d) if F (C) is a formula F 1 ∧ F 2 and F1 and F2 are satisfied by μ. ⊓⊔ Definition 11. [Total Matching of a Sequential Expression] Given a sequential expression SE=SE1.SE2...SEk of length k, and a normalized ToI with schema (OID, t, ID), we say that an object identified by OIDm totally matches SE, if there exists k different tuples μ1, . . . , μk in nToI, of the form μ1 = hOIDm, t1, ID1i, μ2 = hOIDm, t2, ID2i,...,μk = hOIDm, tk, IDki, and there is at least one list Lp = (eco1, eco2, ...ecok), Lp ∈ SLk (SE), where the following conditions hold: (a) ti ∈ Interval (ecoi), ∀ i, i = 1..k; (b) IDi is the identifier of the encoded category occurrence (ecoi), ∀ i, i = 1..k; (c) SEi is temporally satisfied by μi, ∀ i, = 1..k. We denote each Lp a list of interest for SE. ⊓⊔

From Definition 9, if a list of ECOs does not satisfy a sequential expression SE, no object in the nToI can use this list to temporally match SE. Thus, given that the lists in SLk (SE) are computed over the category occurrences, which usually fit in main memory, unnecessary database scans can be avoided. In addition, from Definitions 11 and 10, if an object OIDj in T does not temporally match SE, then it cannot totally match SE, and, if an object OIDj in T that totally matches SE, then OIDj temporally matches SE.

Example 2. Object Session1 in Example 1, totally matches SE, using the second and third tuples, together with list L1. On the other hand, Session2 does not totally match SE, since it does not temporally match the expression. Finally, Session3 temporally matches SE, but it does not totally match it, because L2 does not satisfy the second condition in Definition 11. ⊓⊔ Definition 12 (Temporal Support of SE). The temporal support of a sequential expression SE, denoted Ts(SE), is the quotient between the number of different objects that totally match SE and the number of different objects that temporally match SE, if the latter is different than zero. Otherwise Ts(SE) = 0. ⊓⊔

Temporal Support of Regular Expressions

Definition 13. [R.E. over constraints] A regular expression over constraints is an expression generated by the grammar

E ←− C | E|E | E? | E∗ | E+ | E.E | E | ǫ where C is a constraint, ǫ is the symbol representing the empty expression, ‘ |’ means disjunction, ‘ .’ means concatenation, ‘ ?’ “zero or one occurrence”, ‘ + ’ “one or more occurrences”, and ‘ ∗ ’ “zero or more occurrences”. ⊓⊔

Since REs produce SEs, extending the definitions of Section 4 using the grammar of Definition 13 is straightforward.

Definition 14 (Temporal Support of a RE). Given a regular expression R generated by the grammar of Definition 13, the DFA AR that accepts R, and a normalized ToI with schema (OID, t, ID), we say that OIDm temporally matches R, if there exists some n ∈ N such that there exists at least one string of length n accepted by AR, and OIDm temporally matches this string. Moreover, OIDm totally matches RE, if there exists some n ∈ N such that there is at least one string of length n accepted by AR and OIDm totally matches this string.

The temporal support of a regular expression R, denoted Tr(R), is the quotient between the number of different objects that totally match R and the number of different objects that temporally match R, if the latter is different than zero. Otherwise Tr(R) = 0. ⊓⊔ Example 3. Consider the following SEs: SE1=[keyword=‘Games’], SE2 = [keyword=‘Games’].[filter=‘’], and SE3 = [keyword=‘Games’].[filter=‘’]. [filter=‘’] (i.e., for each SEi, i > 3, a condition [filter=‘’] is added). They are produced by the RE R = [keyword = ‘Games’].([f ilter = ‘’])∗. To compute the Temporal Support of R, we need to check the satisfability of all the SEs that the RE can produce. Since no session has four tuples, in our running example we only need to compute the support of SE1 through SE3, proceeding like in Example 2. For example, for SE1, SL1 (SE) = {L1 = {ecoC2}, L2 = {ecoC3}, L3 = {ecoM1}}. From the third tuple in Session1, and L2 = {ecoC3}, we conclude that Session1 totally (and hence, temporally) matches SE1. Analogously, Session2 and Session3 totally match SE1. Finally, the temporal support of SE1 is 3/3 = 1. Repeating this procedure for all the possible SEs that the RE can produce, we can compute the temporal support of the RE. ⊓⊔ 6

Computing the Temporal Support of RE

We propose an algorithm, based on GSP [ 8 ] and SPIRIT [ 2 ], that generates sequential expressions of incremental length using AR. The general idea of the algorithm is the following. At each step k, candidate lists of ECOs of length k are produced, using the lists of length k-1 produced in the previous step. Accessing a [keyword=‘Games’] [filter=‘’]

Fig. 4. DFA for R the ToI, we could check which of these lists of ECOs are temporally and/or totally matched by some object in that table. To prune as early as possible the ECOs that will not produce a match, we use the DFA AR. Building and early pruning these lists of ECOs are the keys of the algorithm. Each step k is composed of three phases: Phase 1 consists in building a set CSk of candidate sequences of ECOs of length k; in Phase 2, these CSk’s are pruned using the automaton; in Phase 3, the sequences in CSk that are not temporally matched by any object in nToI are pruned. A sequence ecoik of ECOs (of length k) is pruned if it could not be temporally matched by any object. If at a step p, CSp = ∅, or there is no nToI instance with at least p tuples for the same object (i.e., ecop cannot be matched by any object in the nToI), the algorithm exits the loop. Two further steps are needed. The first one uses AR again, this time to prune all lists obtained in intermediate steps that it does not accept. The final step uses nToI to compute the temporal support of R.

The DFA AR, which recognizes the RE R, generates sequential expressions of different lengths, such that each constraint in an SE labels an edge of AR. We use AR to prune the lists of ECOs produced at each step. Since this automaton typically fits in main memory, the number of accesses to the nToI is reduced. The algorithm accepts, at iteration k, a sequence that matches any sub-path in the automaton (thus, a kind of “relaxed” constraint is used). For example, for SE = [keyword = ‘Games’].([f ilter = ‘’])∗, AR is depicted in Figure 4. At k = 1 we accept any sequence that satisfies the constraints [keyword = ‘Games’] or [f ilter = ‘’] (respectively, paths a-b and b-b in the automaton of Figure 4. In the final phase AR is used for strict verification (i.e., to match a path starting and ending in a initial and final state, respectively).

Example 4. Consider the regular expression R=[keyword=‘Games’].([filter=‘’])∗. The algorithm first builds AR, the automaton that accepts R, shown in Figure 4. The first phase populates CS1 with the lists of candidate ECOs that could satisfy an SE of length one (i.e., all the ecos). In the second phase, we use AR to generate words of length 1, in this case [keyword=‘Games’] and [filter=‘’]. Then, we prune CS1 keeping only the lists of ECOs that satisfy any of these words. Note that L2, L3 and L8 are dropped, because neither of them has a ‘Games’ keyword, nor an empty filter. In the third phase, we access nToI to check which lists of ECOs are not temporally matched by any object. In this example there is no pruning. For example, L1 is temporally matched by the first tuple of Session1. Figure 5 shows the results at each phase of iteration k = 1. The other iterations proceed analogously, with the exception that an apriori verification is performed, that is whether or not the lists of ECOs in CSj are t − ordered . For example, for k = 2, we check if given two joining lists L1 = {ecoi} and L2 = {ecoj}, L12 = {ecoi, ecoj } is t − ordered . All the non-t-ordered lists are pruned, since they cannot produce a match. ⊓⊔ We expect to extend our work in two ways. On the one hand, the theoretical framework introduced here allows to think in a more general definition of support, with different semantics (not only temporal), that may enhance current data mining tools. On the other hand, we will develop an optimized implementation of the algorithm that can support massive amounts of data.

Agrawal and

Srikant . Mining sequential patterns . In Proc. of the Int'l Conference on Data Engineering (ICDE) , 1995 .

M. N.

Garofalakis ,

Rastogi , and

Shim . Spirit: Sequential pattern mining with regular expression constraints . In Proceedings of the 25th VLDB Conference , 1999 .

Huang ,

Tseng ,

Ou , and

Chen . A general model for sequential pattern mining with a progressive database . IEEE Transactions on Knowledge and Data Engineering , 20 ( 9 ): 1153 - 1167 , 2008 .

Lee ,

Lin , and

Chen . On mining general temporal association rules in a publication database . In ICDM , pages 337 - 344 , 2001 .

Li ,

Ning ,

X.S.

Wang , and

Jajodia . Discovering calendar-based temporal association rules . Data Knowl. Eng. , 44 ( 2 ): 193 - 218 , 2003 .

Masseglia ,

Poncelet , and

Teisseire . Incremental mining of sequential patterns in large databases . Data Knowl. Eng. , 46 ( 1 ): 97 - 121 , 2003 .

Parthasarathy ,

M. J.

Zaki ,

Ogihara , and

Dwarkadas . Incremental and interactive sequence mining . In CIKM '99 , pages 251 - 258 , 1999 .

Srikant and

Agrawal . Mining sequential patterns: Generalizations and performance improvements . In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT) , 1996 .

Tansel ,

Clifford , and S. Gadia (eds.). Temporal Databases: Theory, Design and Implementation . Benjamin/Cummings, 1993 .

10. Abdullah Uz Tansel and

Susan P.

Imberman . Discovery of association rules in temporal databases . In ITNG , pages 371 - 376 , 2007 .