Mining Long, Sharable Patterns in Trajectories of Moving Objects Gyozo Gidófalvi / Torben Bach Pedersen Geomatic aps / Aalborg University gyg@geomatic.dk / tbp@cs.aau.dk Abstract management of trajectories. Second, they can be used to facilitate various Location–Based Services (LBS). The efficient analysis of spatio–temporal data, One LBS example is an intelligent rideshare applica- generated by moving objects, is an es- tion, which finds sharable routes for a set of commuters sential requirement for intelligent location– and suggests rideshare possibilities to them, is consid- based services. Spatio-temporal rules can be ered. Such a rideshare application can be one possible found by constructing spatio–temporal bas- solution to the ever increasing congestion problems of kets, from which traditional association rule urban transportation networks. mining methods can discover spatio–temporal Patterns in trajectories for an intelligent rideshare rules. When the items in the baskets are application are only interesting if those patterns are spatio–temporal identifiers and are derived sharable by multiple commuters, are reoccurring fre- from trajectories of moving objects, the dis- quently, and are worthwhile pursuing, i.e., are long covered rules represent frequently travelled enough for the savings to compensate for the coordi- routes. For some applications, e.g., an intel- nation efforts. The discovery of Long, Sharable Pat- ligent ridesharing application, these frequent terns (LSP) in trajectories is difficult for several rea- routes are only interesting if they are long sons. Patterns do not usually exist along the whole and sharable, i.e., can potentially be shared trajectory. As a example, consider two commuters A by several users. This paper presents a data- and B living in the same area of town, leaving for base projection based method for efficiently work approximately the same time, and working in extracting such long, sharable frequent routes. the same part of town. Given the underlying road The method prunes the search space by mak- network and traffic conditions, for a given support ing use of the minimum length and sharable threshold the middle part of the trips of the two com- requirements and avoids the generation of muters may be frequent, the initial and final parts may the exponential number of sub–routes of long not. In recent work [4] a general problem transfor- routes. A SQL–based implementation is de- mation method, called pivoting, was proposed for the scribed, and experiments on real life data analysis of spatio–temporal data. Pivoting is the pro- show the effectiveness of the method. cess of grouping a set of records based on a set of attributes and assigning the values of likely another 1 Introduction set of attributes to groups or baskets. Pivoting ap- In recent years Global Positioning Systems (GPS) have plied to spatio–temporal data allows the construction become increasingly available and accurate in mobile of spatio–temporal baskets, which can be mined with devices. As a result large amounts of spatio–temporal traditional association rule mining algorithms. When data is being generated by users of such mobile devices, the items in the baskets are spatio–temporal identifiers referred to as moving objects in the following. Tra- and are derived from trajectories, the discovered rules jectories of moving objects, or trajectories for short, represent frequently travelled routes. While there ex- contain regularities or patterns. For example, a per- ist several efficient association rule mining methods [6], son tends to drive almost every weekday to work ap- the straight–forward application of these algorithms to proximately at the same time using the same route. spatio–temporal baskets representing trajectories is in- The benefits of finding such regularities or patterns is feasible for two reasons. First, all sub–patterns of fre- many–fold. First, such patterns can help the efficient quent patterns are also frequent, but not interesting, as longer patterns are preferred. Second, the support Proceedings of the third Workshop on STDBM criterion used in association rule mining algorithms Seoul, Korea, September 11, 2006 is inadequate for a rideshare application, i.e., a fre- 49 quent itemset representing a frequent trajectory pat- and mining CFIs are described. While CFI mining tern, may be supported by a single commuter on many methods can be modified to find the desired solution occasions and hence presents no rideshare opportunity. that meets the sharable criterion, they employ com- In this paper, to overcome the above difficulties of plex data structures and their implementation is quite finding LSPs in trajectories, a novel method is given. involved; hence their augmentation is difficult. In par- According to a new support criterion, the proposed ticular, a projection-based CFI mining algorithm that method first efficiently filters the trajectories to con- employs an in–memory FP-tree to represent itemsets, tain only sub-trajectories that are frequent. Next, it would need to be modified at every node to maintain removes trajectories that do not meet the minimum a set of distinct objects at that have transactions as- length criterion. Then it alternates two steps until sociated with them that support the itemset that is there are undiscovered LSPs. The first step entails represented by the node. In comparison, the herein the discovery of a LSP. The second step entails the fil- presented method –building on work presented in [11]– tering of trajectories by the previously discovered pat- exploits the power of commercial RDBMSs, yielding a tern. An advantage of the proposed method is the ease simple, but effective solution. of implementation in commercial Relational Database Since trajectories are temporally ordered sequences Management Systems (RDBMSes). To demonstrate of locations, sequential pattern mining [3] naturally this, a SQL–based implementation is described. The comes to mind. However, a straight forward inter- effectiveness of the method is demonstrated on the pretation of trips as transactions and application of publicly available INFATI data, which contains tra- a state–of–the–art closed frequent sequential pattern jectories of cars driving on a road network. mining algorithm [15] does not yield the desired so- The herein presented work is novel in several as- lution, since in this case sequences of frequent sub- pects. It is the first to consider the problem of mining trajectories would be found. Furthermore, since the LSPs in trajectories. It describes a novel transforma- trajectories can contain hundreds of items, closedness tion, and the relationship between the problem of min- checking of frequent itemsets even for prominent meth- ing LSPs in trajectories and mining frequent itemsets. ods would be computationally expensive. Interpret- Finally, it describes an effective method with a simple ing single elements of trajectories as transactions and SQL–implementation to mine such LSPs in trajecto- applying closed sequential pattern mining could find ries. frequent sub–trajectories. However a number of prob- The remainder of the paper is organized as follows. lems arise. First, to meet the sharable criterion, the Section 2 reviews related work. Section 3 describes the in–memory data structures would need similar, non- transformation, the use of the framework in frequent trivial augmentation as described above. Second, since itemset mining, and formally defines the task of min- patterns in trajectories could be extremely long, even ing LSPs in trajectories. he proposed algorithm and a state–of–the–art sequential mining methods [12, 15] SQL–based implementation for mining such patterns would have a difficulties handling patterns of such is described in Section 4. Section 5 presents experi- lengths. Third, patterns in trajectories repeat them- mental results. Finally Section 6 concludes and points selves, which cannot be handled by traditional se- to future research. quential pattern mining algorithms. The extraction of spatio–temporal periodic patterns from trajectories 2 Related work is studied in [9], where a bottom–up, level–wise, and a faster top–down mining algorithm is presented. Al- Frequent pattern mining is a core field in data min- though the technique is effective, the patterns found ing research. Since the first solution to the problem are within the trajectory of a single moving object. In of frequent itemset mining [1, 2], various specialized comparison, the herein presented method effectively in–memory data structures have been proposed to im- discovers long, sharable, periodic patterns. prove the mining efficiency, see [6] for an overview. It has been recognized that the set of all frequent item- Moving objects databases are particular cases of sets is too large for analytical purposes and the infor- spatio–temporal databases that represent and man- mation they contain is redundant. To remedy this, two age changes related to the movement of objects. A modification to the task have been proposed: mining necessary component to such databases are specialized of Closed Frequent Itemsets (CFI) and mining of max- spatio–temporal indices such as the Spatio–Temporal imal frequent itemsets. A frequent itemset X is closed R–tree (STR–tree) and Trajectory–Bundle tree (TB– if no itemset Y exists with the same support as X such tree) [7]. An STR-tree organizes line segments of a tra- that X ⊂ Y . A frequent itemset X is maximal if no jectory according to both their spatial properties and frequent itemset Y exists such that X ⊂ Y . Prominent the trajectories they belong to, while a TB–tree only methods that efficiently exploit these modifications to preserves trajectories. If trajectories are projected to the problem are MAFIA, GenMax, CLOSET(+), and the time–of–day domain, STR–tree index values on CHARM [6]. Later in the paper, a relationship be- the projected trajectories could be used as an alter- tween the problems of mining LSPs in trajectories native representation of trajectories. While this ap- 50 proach would reduce the size of the problem of mining time−dimension / time−of−day domain trip 1 time−dimension / date−time domain trip 2 LSPs in trajectories, it would not solve it. In compar- trip 3 8:35 8:30 ison, the herein presented method solves the problem Wed 8:00 8:00 8:00 of mining LSPs in trajectories, which is orthogonal, Tu 8:00 d < δ → end of trip 1 k 8:15 8:10 but not unrelated to indexing of trajectories. 8:05 8:00 15 Mon 8:00 spatio−temporal region In [13] a way to effectively retrieve trajectories in 15 10 d > δ → start of trip 1 k 10 5 40 the presence of noise is presented. Similarity func- y−dimension 5 35 40 30 35 25 30 20 25 0 5 10 15 20 y−dimension 0 5 10 15 0 0 x−dimension x−dimension tions, based on the longest sharable subsequence, are (a) Identification of trips in (b) Time–of–day projection and defined, facilitating an intuitive notion of similarity raw trajectories spatio–temporal region substi- between trajectories. While such an efficient similar- tution ity search between the trajectories will discover similar Figure 1: From trajectories to transactions trajectories, the usefulness of this similarity in terms of length and support would not be explicit. In compar- Projection of the temporal dimension ison, there herein proposed method returns only pat- Since frequent patterns within a single object’s tra- terns that meet the user–specified support and length jectory are expected to repeat themselves daily, the constraints. Furthermore, the trajectory patterns re- temporal dimension of the so identified trips is pro- turned by our method are explicit, as opposed to the jected down to the time–of–day domain. This pro- only implicit patterns contained in similar trajectories. jection is essential to discover the daily periodic na- ture of patterns in trajectories. Mining patterns with 3 Long, sharable patterns in trajecto- other periodicity can be facilitated by projections of ries the temporal domain to appropriate finer, or coarser levels of granularity. Finer levels of granularity can be The following section describes a novel transformation used to detect patterns with shorter periodicity. For of raw trajectories. This transformation allows (1) the example, a delivery person might use a different route formulation of the problem of mining LSPs in trajec- depending on the time–of–hour knowing that at the tories in a framework similar to that used in frequent given time of the hour certain traffic conditions arise, itemset mining, (2) to establish a relationship between which make an otherwise optimal delivery route sub– the two problems. optimal. The detection of these patterns in delivery routes requires the projection of the temporal dimen- 3.1 From trajectories to transactions sion to the time–of–hour domain. Conversely, coarser The proposed transformation of raw trajectories con- levels of granularity can be used to detect patterns sists of three steps: identification of trips, projection of with longer periodicity. For example, a person might the temporal dimension, and spatio–temporal region visit his bank only at the end of pay periods. The de- substitution. It is assumed that locations of moving tection of this pattern requires the projection of the objects are sampled over a long history. That is, a temporal dimension to the day–of–month domain. Fi- raw trajectory is a long sequence of (x, y, t) measure- nally, to discover the pattern that the above mentioned ments at regular time intervals. person makes these visits to his bank Saturday morn- ings following the end of pay periods, requires the pro- Identification of trips jection of the temporal domain to a combination of the A trip is a temporally consecutive set or se- day–of-month, the day–of-week, and the part–of-day quence of measurements such that for any measure- domains. Performing different projections is part of ment mi in the sequence, the sum of spatial dis- the inherently iterative and only semi-automatic pro- placement during the k measurements immediately cess of doing data mining when the exact format of the following mi , denoted dk , is larger than some user– patterns searched for is not known beforehand. Figure defined displacement, δ. Trips can be identified 1(b) shows the projection of the temporal dimension in a straight–forward manner by linearly scanning to the time–of–day domain for the three trips identi- through a trajectory, and calculating dk using a look– fied in Figure 1(a). Since the projection of a single ahead window of k measurements. That is, scanning database record is a constant operation, the total pro- through the total trajectory from the beginning, the cessing time of this transformation step is optimal and first measurement for which dk > δ, signals the begin- linear in the number of database records. ning of the first trip. Consecutive measurements are part of this trip until a measurement is reached for Spatio–temporal region substitution which dk ≤ δ, which signals the end of the first tra- Trajectories are noisy. One source of this noise is jectory. Trips following the first trip are detected in due to imprecise GPS measurements. From the point the same fashion from the remaining part of the total of view of patterns in such trajectories, slight devia- trajectory. Figure 1(a) shows three example trips that tion of trajectories from the patterns can be viewed as are derived from the total trajectory of one moving noise. Examples of such deviations could be due to a object. few minute delay, or to the usage of different lanes on 51 OBJ. 1 (2) OBJ. 2 (3) OBJ. 3 (4) OBJ. 4 (5) OBJ. 5 (6) 8:25 ± 2.5 minutes, respectively. In the following a spatio-temporal region will be referred to by its con- catenated values of the cell identifiers along the x– and 8:20 8:35 A 8:05 y–axis, and the corresponding time instance denoting the center of the time interval of the spatio–temporal 8:10 8:15 8:25 B 8:30 region. Hence, trips associated with object 3 will be 8:20 denoted by the a sequence {HD8:05, HC8:10, HB8:15, 8:00 8:10 IB8:20, IC8:25}. Furthermore, the trajectory database C 8:05 8:25 8:25 8:35 T is assumed to be in a relational format with schema hoid, tid, itemi, where item is a single item, that is D 8:00 8:05 8:05 part of the transaction tid associated with object oid. Hence, each of the four trips of object 3 is represented E F G H I J K L by 5 unique rows in T . Figure 2: Illustration of the sample trajectory DB 3.3 Problem statement the route. Hence, while a person might be driving from home to work at approximately the same time of day After performing the three above transformation steps, using approximately the same route, the chance of two the data set can be represented in a database T con- identical trajectories is highly unlikely. Consequently, taining tuples hoid, tid, si, where oid is an object iden- patterns in raw trajectories are few and certainly not tifier, tid is a trip identifier, and s is a sequence long. Thus, patterns have to be mined in trajecto- of spatio–temporal region identifiers. Since spatio– ries represented in a generalized way, yielding general temporal region identifiers contain a temporal com- patterns in trajectories. To achieve this generalization ponent, the sequence s can, without loss of informa- of trajectories, individual (x, y, t) measurements of a tion, be represented as a set of spatio–temporal region trajectory are discretized and mapped to the spatio– identifiers. Conforming to the naming convention used temporal regions they fall into. Thus, a generalized in the frequent itemset mining framework, a spatio– trajectory is constructed by substituting (x, y, t) mea- temporal region identifier will be equivalently referred surements with the spatio–temporal regions they map to as an item, and a sequence of spatio–temporal re- to. If within a trajectory multiple (x, y, t) measure- gion identifiers will be equivalently referred to as a ments map to the same spatio–temporal region, they transaction. Let X be a set of items, called an item- are substituted with a single instance of the corre- set. A transaction t satisfies an itemset X iff X ⊆ t. sponding spatio–temporal region. The box in Fig- Let STX denote the set of transactions that satisfy X. ure 1(b) represents such a spatio–temporal region. The following definitions are emphasized to point out Since spatio–temporal substitution of a single data- the differences between the frequent itemset mining base record can be achieved using simple arithmetics framework and the one established here. from the spatial and temporal coordinates, the pro- Definition 1 The n–support of an itemset X in cessing time of this transformation step is optimal and T , denoted as X.supp(n), is defined as the number of linear in the number of database records. transactions in STX if the number of distinct oids as- sociated with the transactions in STX is greater than 3.2 Example trajectory database or equal to n, and 0 otherwise. The n–support of an item i in T , denoted as i.supp(n), is equal to the Figure 2 visualizes a sample trajectory database. It n–support of the itemset that contains only i. shows the trajectories of trips of 5 moving objects, which were derived using the three transformation Definition 2 The length of an itemset X, denoted steps described in Section 3.1. For clarity, the tem- as |X|, is defined as the number of items in X. poral dimension is projected down to the 2D–plane. Spatio–temporal regions are defined by the square cells Definition 3 An itemset X is n–frequent in T if and a five minute interval centered around time in- X.supp(n) ≥ MinSupp, and X is long if |X| ≥ stances written inside the square. Each connected line MinLength, where MinLength, MinSupp, and n are represents specific trips of a particular object. The user–defined values. number of times that trip was performed by the ob- Definition 4 An itemset X is n–closed if there ex- ject is represented in the width of the line, and is also ists no itemset Y such that X ⊂ Y and X.supp(n) = written in parenthesis next to the object name in the Y.supp(n). legend. For example, the trip trajectory associated with object 3 was performed 4 times by the object. The task of mining LSPs in trajectories can be de- The object was in spatial regions HD, HC, HB, IB, and fined as finding all long, n–closed, n–frequent itemsets. IC during time intervals 8:05 ± 2.5 minutes, 8:10 ± 2.5 Itemsets that meet these requirements are also referred minutes, 8:15 ± 2.5 minutes, 8:20 ± 2.5 minutes, and to as LSPs, or just patterns. 52 OBJ. 1 (2) OBJ. 2 (3) OBJ. 3 (4) OBJ. 4 (5) OBJ. 5 (6) OBJ. 1 (2) OBJ. 2 (3) OBJ. 3 (4) OBJ. 4 (5) OBJ. 5 (6) OBJ. 1 (2) OBJ. 2 (3) OBJ. 3 (4) OBJ. 4 (5) OBJ. 5 (6) 8:35 8:35 8:35 A A A (0,8) 8:10 8:15 8:25 8:10 8:15 8:20 8:25 8:10 8:15 8:20 8:25 8:30 B 8:20 8:30 B 8:30 B (5,11) (5,11) (5,5) (5,11) (5,11) 8:05 8:10 C 8:05 C 8:05 C (5,5) E F G H I J K L E F G H I J K L E F G H I J K L (a) The sample DB after STEP 1 (b) The sample DB after STEP 2 (c) STEP 3: Item–conditional sample DB TF |FC8:05 and pattern discovery Figure 3: Illustration of steps 1, 2, and 3 4 Projection–based LSP mining the second step of the method further filters TFV and constructs TF that only contain transactions that Now let us turn to the description of the proposed me- have at least MinLength number of items. thod for mining LSPs in trajectories. This description The second step can be formulated in one SQL is based on a number of observations, each of which statement: is associated with a particular step in the method. In related technical report [5], these observations are also INSERT INTO TF (tid, oid, item) stated as lemmas, and their corresponding proofs show SELECT tid, oid, item FROM TFV the correctness and completeness of the method. To WHERE tid IN demonstrate the simplicity of the implementation in (SELECT tid FROM TFV GROUP BY tid a RDBMS, for each step a simple SQL–statement is HAVING COUNT(item) >= MinLength) given. The effect of each step is also illustrated on The sub–select is used to find trip identifiers that have the previously introduced sample trajectory database at least MinLength number of items. The outer part assuming MinLength = 4, MinSupp = 2, and n = 2. of the statement selects all records belonging to these STEP 1: Filtering infrequent items trip identifiers and inserts them into TF. Items, i.e., spatio–temporal regions that are not fre- The effects of the second step are illustrated in Fig- quent in T cannot be part of a LSP. Hence as first step ure 3(b). In particular, the remaining sharable parts of the method, T is filtered such that it contains items of trips belonging to objects 3 and 5 are deleted, be- with n–support larger than or equal to MinSupp. cause the length of them is not greater than or equal The first step can be formulated in two SQL state- to MinLength, which is 4 in the example. Also, note ments: that although in this case items HB8:15 and IB8:20 did not become infrequent in TF, they lost n–support. INSERT INTO F (item, i_cnt) Before stating further observations and continuing SELECT item, count(*) i_cnt FROM T with the development of the proposed method it is im- GROUP BY item HAVING COUNT(DISTINCT oid) >= n portant to note the following. The set of discoverable AND COUNT(*) >= MinSupp LSPs from T is equivalent to the set of discoverable LSPs from TF. This is ensured by first two observa- CREATE VIEW TFV AS tions. Since further steps of the proposed method will SELECT T.oid, T.tid, T.item discover LSPs from TF, these two observation ensure FROM T, F the correctness of the method so far. However, it is WHERE T.item = F.item also important to note that not all transactions in TF The first statement finds items that meet the unique necessarily satisfy a LSP. This is due to the sequen- support criterion. The second statement constructs a tiality of the first two steps. After the first step all filtered view of T , called TFV, in which transactions the remaining items in transactions are frequent items. only contain the items found by the previous state- Then, in the second step, some of these transactions, ment. which are not long, are deleted. Due to this deletion a The effects of the first step are illustrated in Fig- frequent item in the remaining long transactions may ure 3(a). Spatio–temporal regions, which are part of become non–frequent, which in turn may cause some trajectories that belong to less than 2 distinct objects, transactions to become short again. While there is no are removed from trajectories. From the point of view simple solution to break this circle, note that the cor- of an intelligent rideshare application these spatio– rectness of the first and second steps are not violated temporal regions are uninteresting, since these parts since the deleted items and transactions could not have of the trajectories cannot be shared by any objects, satisfied a LSP. i.e., are not sharable. STEP 3: Item–conditional DB projection STEP 2: Filtering of short transactions For the following discussion, adopted from [10], let Transactions, i.e., trip trajectories, having less than an item–conditional database of transactions, equiva- MinLength frequent items cannot satisfy a LSP. Hence, lently referred to as an an item–projected database, be 53 OBJ. 1 (2) OBJ. 2 (3) OBJ. 3 (4) OBJ. 4 (5) OBJ. 5 (6) OBJ. 1 (2) OBJ. 2 (3) OBJ. 3 (4) OBJ. 4 (5) OBJ. 5 (6) OBJ. 1 (2) OBJ. 2 (3) OBJ. 3 (4) OBJ. 4 (5) OBJ. 5 (6) 8:35 8:35 A A A (8,8) 8:10 8:15 8:25 8:10 8:15 8:25 8:30 8:10 8:15 8:25 8:30 B 8:30 B (8,11) (8,11) (8,11) (8,11) B C C C E F G H I J K L E F G H I J K L E F G H I J K L (a) The sample DB after STEP 5 (b) Item–conditional DB TF |LA8:35 and (c) Sample DB after the second PDD PDD Figure 4: Illustration of Pattern Discovery and Deletion phases defined as: JB8:25, KB8:30}. LA8:35 is the only item that is in Definition 5 Let T be a database of transactions, and TF |FC8:05 , but is not in the discovered SMFCI. Since i an item in T . Then, the item–conditional database further projecting TF |FC8:05 on LA8:35 yields a data- of transactions, is denoted as T|i and contains all the base of transactions where no item meets the minimum items from the transactions containing i. n–support criterion, the discovered SMFCI is the only The construction of an item–conditional database of CFI present in TF |FC8:05 . Since the discovered SMFCI transactions can be formulated in a single SQL state- meets both the minimum length and an minimum n– ment as: support criteria it is a pattern. INSERT INTO T_i (oid, tid, item) STEP 5: Deletion of unnecessary items SELECT t1.oid, t1.tid, t1.item The subproblems that are recursively solved by the FROM TF t1, TF t2 method presented so far are overlapping. That is to WHERE t1.tid = t2.tid and t2.item = i say, viewed from a top level, a CFI that has n items is Given n frequent items in T , the problem of finding at least once discovered in each of the n corresponding CFIs can be divided into n subproblems of finding the item–projected databases. To eliminate this redun- CFIs in each of the n item–projected databases [10]. dancy, both in the mining process and the result set, Using the divide–and–conquer paradigm, each of these observe that an item j can be deleted from TF if it has n subproblems can be solved by recursively mining the the same n–support in TF |i as in TF. The intuition item-projected databases as necessary. behind the observation is the following. If j has the same n–support in TF |i as in TF, it implies that all STEP 4: Discovery of the single most frequent the transactions in TF that satisfy j are also present closed itemset in TF |i . Thus, the set of patterns containing j, which Since i is in every transaction of the item–projected can be discovered from TF, can also be discovered from database T|i , and hence has maximum n–support, the TF |i . items in T|i can be grouped in two: items that have the The fifth step can be formulated in two SQL state- same n–support as i, and items that have n–support ments: less than that of i. The set of items that have the same INSERT INTO FT_i (item, i_cnt) n–support in the T|i as i is the Single Most Frequent SELECT item, count(*) i_cnt FROM T_i Closed Itemset (SMFCI) in T|i . The fourth step of the GROUP BY item method discovers this SMFCI. HAVING COUNT(DISTINCT oid) >= n The fourth step can be formulated in two SQL state- DELETE FROM TF WHERE TF.item IN ments: (SELECT F.item FROM F, FT_i INSERT INTO FT_i (item, i_cnt) WHERE F.item = FT_i.item SELECT item, COUNT(*) i_cnt FROM T_i AND F.i_cnt = FT_i.i_cnt) GROUP BY item The first statement counts the n–support of items in HAVING COUNT(DISTINCT oid) >= n TF |i . The second statement deletes all items in TF SELECT item FROM FT_i that have the same n–support in TF as in TF |i . WHERE i_cnt = (SELECT MAX(i_cnt) FROM FT_i) Figure 4(a) shows the effects of deleting the un- The first statement derives n–support of n–frequent necessary items after the mining of TF |FC8:05 . Since items FC:8:05 and IB8:20 have the same n–support items in the item–projected database, while the second in TF |FC8:05 as in TF, shown in Figure 3(c), they are statement selects those items from these n–frequent deleted from TF. Items remaining in TF are shown in items that have maximum n–support. Figure 4(a). Figure 3(c) shows the effects of projecting TF based on the item FC8:05. The numbers in parentheses show Item-projection ordered by increasing n– the n–support of the items in TF |FC8:05 and TF re- Support spectively. The SMFCI that is immediately discovered A LSP p in T , containing items i1 . . . ik , can be dis- from TF |FC8:05 is {FC8:05, GB8:10, HB8:15, IB8:20, covered from any one of the item–projected databases 54 T|i1 , . . . , T|ik . Steps 4 and 5 of the proposed method PAT. 1: O(1,2), S(5), L(6) PAT. 2: O(1,5), S(8), L(5) PAT. 3: O(1,2,5), S(11), L(4) assure that p will be discovered from exactly one of these item–projected databases, but the method pre- A 8:35 sented so far does not specify which one. While this point is irrelevant from the point of view of correctness, 8:10 8:15 8:20 8:25 it is crucial from the point of view of effectiveness. B 8:30 To illustrate this, assume that i1 .supp(n) < i2 .supp(n) < . . . < ik .supp(n). If projections are per- C 8:05 formed in decreasing order of item n–support, then, first T|ik is constructed, then T|ik |ik−1 is constructed E F G H I J K L from it, and so on, all the way to T|ik |ik−1 |...|i1 , from which finally p is discovered. If on the other hand, Figure 5: Three patterns in the sample DB projections are performed in increasing order of item JB8:25, KB8:30} is discovered. Since this closed item- n–support, then p is discovered from the first item– set meets both the minimum length and n–support re- projected database that is constructed, namely T|i1 . quirements, it is recorded as a pattern. Finally, items Assume that p and its qualifying (k − l + 1) (at least having the same n–support in TF |i as in TF, which in l–long) sub–patterns are the only LSPs in T . Then this case means all the items in TF |i , are deleted from during the whole mining process, the total number of TF. After this deletion part of the final PDD phase, projections in the decreasing processing order is Pdec = TF becomes empty and the mining terminates. Figure k, whereas in the increasing processing order the total 5 shows the three patterns that are discovered during number of projections is only Pinc = k − l + 1. If k the mining. Supporting oids, n–supports, and length and l are comparable and large, then Pdec ≫ Pinc . for each discovered patterns are shown in the legend. Similar statements can be made about the total size of the projected databases in both cases. Hence, item– LSP mining algorithm projection should be performed in increasing order of Using the observations and the associated steps, item n-support. the complete algorithm for mining LSPs in trajecto- Alternating pattern discovery and deletion ries is given in Figure 6. Since item–projected data- Alternating steps 3, 4 and 5, all patterns can be bases are constructed at every level of the recursion discovered in a recursive fashion. The sequential ap- and are modified across levels when deleting unnec- plication of these steps is referred to as a Pattern Dis- essary items, the level of recursion L is passed as an covery and Deletion phase (PDD). Mining terminates argument in the recursive procedure, and is used as when all items have been deleted from TF. a superscript to associate databases to the levels they were constructed in. Figures 3(c) and 4(a) shows the effects of the first of these PDD phases. Figures 4(b) and 4(c) show Lines 2 and 3 in the MineLSP procedure represent the effect of the next pattern PDD phase. Since af- step 1 and 2 of the method, and they construct the fil- ter the first PDD phase LA8:35 has the lowest n– tered database of transactions at the initial level, level support in TF, namely 8, it is chosen as the next 0. Line 4 processes frequent items in TF 0 in ascending item to base the database projection on. Figure 4(b) order of n–support. Line 5 represent step 3 of the me- show TF |LA8:35 with the corresponding n–support of thod, and for each such frequent item i, it constructs the items in TF |FC8:05 and TF respectively. Since the item–conditional database of transactions TF 0|i at all the items have the same n–support in TF |FC8:05 level 0. Line 6 calls procedure FindLSP to extract all as LA8:35, namely 8, the closed itemset {GB8:10, LSPs from TF 0|i recursively. HB8:15, JB8:25, KB8:30, LA8:35} is discovered. Since Lines 2 and 3 in the FindLSP procedure represent this closed itemset both meets the minimum length steps 1 and 2 of the method, and they construct the and n–support requirements it is recorded as a pat- filtered database of transactions at the current level tern. In the deletion part of this PDD phase, item L. Line 4 represents step 4 of the method, and it finds LA8:35 is deleted from TF as the only item that have the SMFCI P in TF L long . Line 5 represents step 5 of the same n–support in TF |LA8:35 as in TF. The results the method, and it deletes all items from the filtered L−1 of this deletion are shown on Figure 4(c). database of transactions of the previous level, TF long , L−1 L The third and final PDD phase is implicitly shown that have the same n-support in TF long as in TF freq , in Figure 4(c). Since after the second PDD phase all the current level. Lines 6 and 7 check if the single the items in TF have the same n–support, the next most frequent closed itemset P meets the minimum projection is performed on any one of the items, i, requirements and store it accordingly. Lines 8 and 9 and the resulting item–projected database, TF |i , is processes frequent items in TF L long , which are not in identical to the current state of TF, depicted on Fig- P , in ascending order of n–support. Line 10 represent ure 4(c). Since all the items in TF |i have the same step 3 of the method, and for each such frequent item n–support as i, the closed itemset {GB8:10, HB8:15, i it constructs the item–conditional database of trans- 55 (1) procedure MineLSP (T , MinSupp, MinLength, n) 5 Experimental evaluation (2) TF 0freq ← MinSupportFilter (T , MinSupp, n) (3) TF 0long ← MinLengthFilter (TF 0freq , MinLength) The proposed method was implemented using MS- (4) for each freq item i in TF 0long ordered by asc n–supp SQL Server 2000 running on Windows XP on a 3.6GHz (5) TF 0|i ← ConstructConditionalDB (TF 0long , i) Pentium 4 processor with 2GB main memory. The me- (6) FindLSP (TF 0|i , 1, MinSupp, MinLength, n) thod was tested on the publicly available INFATI data (7) end for each set, which comes from intelligent speed adaptation ex- periments conducted at Aalborg University. This data (1) procedure FindLSP (T ,L,MinSupp,MinLength,n) set records cars moving around in the road network of (2) TF L Aalborg, Denmark over a period of several months. freq ← MinSupportFilter (T , MinSupp, n) (3) TF L L 20 distinct test cars and families participated in the long ← MinLengthFilter (TF freq , MinLength) (4) (P, P .supp(n)) ← FindSMFCI (TF L INFATI experiment; Team–1 consisting of 11 cars op- long ) (5) TF L−1 ← DeleteUnnecessaryItems (TF L−1 , TF L erated between December 2000 and January 2001 and freq ) (6) long long if P .supp(n) ≥ MinSupp and |P | ≥ MinLength Team–2 consisting of 9 cars operated between Febru- (7) StorePattern (P , P .supp(n)) ary and March 2001. Each car was equipped with a (8) for each freq item i in TF L GPS receiver, from which GPS positions were sampled long ordered by asc n–supp (9) if i is not in P every second whenever the car was operated. Addi- (10) TF L L tional information about the experiment can be found |i ← ConstructConditionalDB (TF long , i) (11) FindLSP (TF L in [8]. |i , L + 1, MinSupp, MinLength, n) (12) end for each The method presented in Section 3.1 identifies trips from continuous GPS measurements, which is not the case in the INFATI data. Hence in this case, a trip Figure 6: The LSP algorithm was defined as sequence of GPS readings where the time difference between two consecutive readings is less than 5 minutes. Using the definition, the INFATI data actions TF L |i at the current level L. Finally, line 11 contains 3699 trips. After projecting the temporal di- recursively calls procedure FindLSP to find LSPs in mension to the time–of–day domain and substituting TF L|i at the next level. the noisy GPS readings with 100 meter by 100 me- The structure and functionality of procedures ter by 5 minutes spatio-temporal regions, the resulting MineLSP and FindLSP have a significant overlap. trajectory database has the following characteristics. While the two functions can be merged into one, the There are 200929 unique items in the 3699 transac- separation of the two is used to emphasize the facts tions. The average number of items in a transaction is that (1) DeleteUnnecessaryItems presumes the exis- approximately 102. The average n–support of 1–, 2–, tence of databases constructed at the previous level, and 3–frequent items is 1.88, 4.2 and 6.4 respectively. and (2) FindSMFCI correctly operates only on an Notice that the averages only include the n–supports item–projected database, and hence it can only be ap- of 1–, 2–, and 3– frequent items. plied at level 1 and above. Two sets of experiments were performed, each varying one of the two parameters of the algorithm, Several implementation details are worth mention- MinSupp and MinLength. The performance of the al- ing. First, DeleteUnnecessaryItems deletes items from L−1 gorithms was measured in terms of processing time TF long based on the n–support of items in TF L freq , L and working space required, where the working space not TF long . This is important, as it was noted that required by the algorithm was approximated by the MinLengthFilter decreases the n–support of items in sum of the rows in the projected tables that were con- TF L freq , thereby possibly making an unnecessary item structed by the algorithm. Both measures were evalu- appear to be necessary. Second, arguments to func- ated in an absolute and a relative, per pattern, sense. tions and operands in statements are logical, i.e., the Figures 7(a), 7(c), and 7(e) show the results of the functions and statements can be more efficiently imple- first set of experiments, where MinSupp = 2, n = 2 mented using previously derived tables. For example, and MinLength is varied between 120 and 530. Lower both FindSMFCI and DeleteUnnecessaryItems are im- settings for MinLength were also tested, but due to plemented using previously derived n–support count the very low MinSupp value these measurement were tables not the actual trajectory tables. Third, simple terminated after exceeding the 2 hour processing time shortcuts can significantly improve the efficiency of the limit. Noting the logarithmic scale in Figure 7(a) it is method. For example, during the derivation of TF L freq , evident that both the running time and the working if the number of unique frequent items in TF L freq is less space required by the algorithm exponentially increase than MinLength, no further processing is required at as the MinLength parameter is decreased. Examining that level, since none of the CFIs that can be derived the same quantities in a relative, per pattern sense, from TF L freq are long. To preserve clarity, these simple Figure 7(e) reveals that the average running time and shortcuts are omitted from Figure 6. average working space required per pattern is approx- 56 Trajectories of 20 moving objects in INFATI Absolute Time and Space Performace vs MinLength 8 Absolute Time and Space Performace vs MinSupp 10 4 4 10 10 10 10 4 x 10 processing time processing time 4 total working space (# of rows) 2000 total working space (# of rows) working space working space total processing time (sec) total processing time (sec) Time−of−day (5min) 1800 3 Time (5min) 3 6 10 10 1600 2 1400 2 5 10 10 1200 1 1000 2 4 10 10 0 800 6.4 6.4 6500 6.35 6000 6.3 6000 5800 4 4 6.3 5600 x 10 6.2 5500 x 10 6.25 5400 5000 5200 1 2 0 0 6.1 4500 6.2 5000 10 10 10 10 Y−coor (100m) X−coor (100m) Y−coor (100m) X−coor (100m) 100 150 200 250 300 350 400 450 500 5 10 15 20 25 30 35 MinLength MinSupp (a) Absolute time and space (b) Absolute time and space (a) Moving object trajectories (b) 28 discovered LSPs Number of Patterns and Ineffective Projections vs MinLength 10 4 10 4 Number of Patterns and Ineffective Projections vs MinSupp 6 3 Figure 8: LSP discovery results in INFATI 10 10 patterns patterns ineffective projections ineffectibe projections of the trajectories of the 20 moving objects in the IN- # of ineffective projections # of ineffective projections 3 3 10 10 4 2 10 10 FATI data set. While some regularities are apparent # of patterns # of patterns 2 2 10 10 2 1 in it to the human eye, to find LSPs in it seems like a 10 1 10 1 10 10 daunting task. Figure 8(b) shows 28 LSPs in it that 0 0 0 0 are at least 200 long, sharable for at least 2 distinct 10 100 150 200 250 300 350 400 450 500 MinLength 10 10 5 10 15 20 MinSupp 25 30 10 35 objects, and have a support of at least 2. (c) Number of patterns (d) Number of patterns In conclusion, the experiments show that the me- Relative Time and Space Performace vs MinLength Relative Time and Space Performace vs MinSupp thod is effective and robust to changes in the user– 10000 40 processing time 4 4000 defined parameter settings, MinLength and MinSupp, working space per pattern (# of rows) processing time working space per pattern (# of rows) processing time per pattern (sec) processing time per pattern (sec) working space working space and is a useful analysis tool for finding LSPs in moving object trajectories. 20 5000 2 2000 6 Conclusions and future work 0 0 100 150 200 250 300 350 400 450 500 MinLength 0 5 10 15 20 MinSupp 25 30 0 35 The herein presented work, for the first time, considers (e) Relative time and space (f) Relative time and space the problem of mining LSPs in trajectories, and trans- forms it to a framework similar to that used in frequent Figure 7: Performance evaluation for various itemset mining. A database projection based method, MinLength (a,c,e) and MinSupp (b,d,f) settings and its simple SQL–implementation is presented for imately linearly decreasing as the MinLength param- mining LSPs in trajectories. The effectiveness of the eter is decreased. The presence of the two irregular method is demonstrated on a real world data set. bumps in Figure 7(e) can be explained in relation to The large number of patterns discovered are diffi- the number of patterns found, and the number of in- cult to analyze. To reduce this number, future work effective projections that yield no patterns, shown in will consider the mining of a compressed patterns in Figure 7(c). The sharp increases in relative process- trajectories [14]. ing time and working space are due to the fact that In future work, we also plan to perform additional the algorithm is unable to take some of the shortcuts experiments on larger real world data sets when such and it performs relatively more ineffective projections become available. These experiments will include the yielding no pattern discovery. The sharp decreases can investigation of scale–up properties of the algorithm as be explained by the presence of an increasing number the number of moving objects are increasing and/or of patterns that share the total pattern discovery cost. as the granularity of the spatio–temporal regions is Similar observations can be made about the second varied. set of experiments, shown in Figures 7(b), 7(d), and 7(f), where MinLength = 50, n = 2 and MinSupp Acknowledgments is varied between 7 and 33. For example, the sharp decrease in relative processing time in Figure 7(f) when This work was supported in part by the Danish Min- going from MinSupp = 33 to MinSupp = 32 is simply istry of Science, Technology, and Innovation under due to the sudden appearance of patterns in the data grant number 61480. for the given parameters. While there is only 1 pattern for MinSupp = 33, and an order of magnitude more References number of patterns for MinSupp = 32, the projections performed and hence the absolute processing time to [1] R. Agrawal, T. Imilienski, and A. Swami. Mining Asso- ciation Rules between Sets of Items in Large Databases. discover these patterns is approximately the same in In Proc. of SIGMOD, pp. 207–216, 1993. both cases. Hence, the relative processing time for [2] R. Agrawal and R. Srikant. Fast algorithms for mining MinSupp = 33 is an order of magnitude larger than association rules. In Proc. of VLDB, pp. 487–499, 1994. that for MinSupp = 32. [3] R. Agrawal and R. Srikant. Mining Sequential Pat- Figure 8(a) shows a 50–fold down–sampled version terns. In Proc. of ICDE, pp. 3–14, 1995. 57 [4] G. Gidofalvi and T. B. Pedersen. Spatio-Temporal Rule Mining: Issues and Techniques. In Proc. of DaWaK, pp. 275–284, 2005. [5] G. Gidofalvi and T. B. Pedersen. Mining Long, Common Patterns in Trajectories of Moving Objects. A DB Technical Report (15), 2006: www.cs.auc.dk/DBTR/DBPublications/DBTR-15.pdf [6] B. Goethals. Survey on frequent pattern mining. citeseer.ist.psu.edu/goethals03survey.html [7] C. S. Jensen, D. Pfoser, and Y. Theodoridis. Novel Ap- proaches to the Indexing of Moving Object Trajecto- ries. In Proc. of VLDB, pp. 395-406, 2000. [8] C. S. Jensen, H. Lahrmann, S. Pakalnis, and S. Runge. The INFATI data. Time Center TR–79, 2004: www.cs.aau.dk/TimeCenter [9] N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. W. Cheung. Mining, Indexing, and Querying Historical Spatiotemporal Data. In Proc. of KDD, pp. 236–245, 2004. [10] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. of DMKD, pp. 11-20, 2000. [11] X. Shang, K.-U. Sattler, and I. Geist. Efficient Fre- quent Pattern Mining in Relational Databases. In Proc. of LWA, pp. 84–91, 2004. [12] I. Tsoukatos and D. Gunopulos. Efficient Mining of Spatiotemporal Patterns. In Proc. of SSTD, pp. 425– 442, 2001. [13] M. Vlachos, D. Gunopoulos, and G. Kollios. Discover- ing Similar Multidimensional Trajectories. In Proc. of ICDE, pp. 673–685, 2002. [14] D. Xin, J. Han, X. Yan, and H. Cheng. Mining Com- pressed Frequent-Pattern Sets. In Proc. of VLDB, pp. 709–720, 2005. [15] X. Yan, J. Han, and R. Afshar. CloSpan: Mining closed sequential patterns in large datasets. In Proc. of SDM, pp. 166–177, 2003. 58