On Projections of Sequential Pattern Structures (with an application on care trajectories)

Aleksey Buzmakov^{1,2}, Elias Egho^1, Nicolas Jay^1, Sergei O. Kuznetsov^2, Amedeo Napoli^1, and Chedy Raïssi^1

1 LORIA (CNRS – Inria NGE – U. de Lorraine), Vandœuvre-lès-Nancy, France
2 National Research University Higher School of Economics, Moscow, Russia
{aleksey.buzmakov, chedy.raissi}@inria.fr, {elias.egho, nicolas.jay, amedeo.napoli}@loria.fr, skuznetsov@hse.ru

Abstract. In this paper, we are interested in the analysis of sequential data and we propose an original framework based on FCA. For that, we introduce sequential pattern structures, an original specification of pattern structures for dealing with sequential data. Sequential pattern structures are given by a subsumption operation between sets of sequences, based on subsequence matching. To avoid a huge number of resulting concepts, domain knowledge projections can be applied. The original definition of projections is revised in order to operate on sequential pattern structures in a meaningful way. Based on the introduced definition, several projections of sequential pattern structures involving domain or expert knowledge are defined and discussed. These projections are evaluated on a real dataset on care trajectories, where every hospitalization is described by a heterogeneous tuple with different fields. The evaluation reveals interesting concepts and justifies the usage of the introduced projections of sequential pattern structures. This research work provides a new and efficient extension of FCA for dealing with complex data, which can be an alternative for the analysis of sequential datasets.

Keywords: formal concept analysis, pattern structures, projections, sequential pattern structures, sequences

Introduction

Analysis of sequential data is a challenging task. In the last two decades, the main emphasis has been on developing efficient mining algorithms with effective pattern representations for sequences of itemsets [1-4]. Traditional sequential pattern mining algorithms generate a large number of frequent sequences while only a few of them are truly relevant. Moreover, in some particular cases, only sequential patterns of a certain type are of interest and should be mined first. Are we able to develop a framework taking into account only patterns of the required types? Furthermore, in many cases sequential data are described by sequences with complex elements, e.g. a text is a sequence of syntactic trees. To process such kind of data with existing algorithms, the elements of sequences can be scaled into itemsets, as it is done in the case of multilevel multidimensional data [5]. However, in this case it is rather difficult to introduce expert requirements within a sequence, which leads to an even larger set of resulting patterns.

We approach this problem with FCA and pattern structures [6, 7]. FCA has been successfully used for the analysis of sequential data [8, 9]. Moreover, it allows one to use different measures of interestingness for the resulting patterns (concepts). Pattern structures allow one to process sequential data directly, without a scaling step.
Furthermore, there are projections of pattern structures, which were introduced in order to simplify the computation of pattern lattices by simplifying descriptions. Moreover, projections can be efficiently used to encode special domain knowledge requirements, allowing one to reduce the number of irrelevant patterns. We generalize the original definition of projections in order to deal with projections respecting domain knowledge. For example, sequences of length 1 are rarely useful, but they cannot be excluded by the original definition of projections.

The rest of the paper is organized as follows. In Section 1 we recall FCA, pattern structures and measures of concept interestingness. Section 2 states the problem of complex sequence analysis and introduces sequential pattern structures. In Section 3, first, the generalization of projections is defined, and, second, some projections specific to sequential pattern structures are introduced and analyzed. Finally, before concluding the paper, we discuss an experimental evaluation in Section 4.

1 FCA and Pattern Structures

FCA [6] is a mathematical formalism having many applications in data analysis. Pattern structures are a generalization of FCA for dealing with complex structures, such as sequences or graphs [7].

Definition 1. A pattern structure is a triple (G, (D, ⊓), δ), where G is a set of objects, (D, ⊓) is a complete meet-semilattice of descriptions and δ : G → D maps an object to a description. The lattice operation of the semilattice (⊓) corresponds to the similarity between two descriptions.

Standard FCA can be presented in terms of pattern structures. In this case, given a formal context (G, M, I), G is the set of objects and the semilattice of descriptions is (℘(M), ⊓), where a description is a set of attributes and the ⊓ operation corresponds to set intersection (℘(M) denotes the powerset of M). If x = {a, b, c} and y = {a, c, d} then x ⊓ y = x ∩ y = {a, c}. The mapping δ : G → ℘(M) is given by δ(g) = {m ∈ M | (g, m) ∈ I} and returns the description of a given object as a set of attributes.

The Galois connection for (G, (D, ⊓), δ) is defined as follows:

  A⋄ := ⊓_{g ∈ A} δ(g),          for A ⊆ G
  d⋄ := {g ∈ G | d ⊑ δ(g)},      for d ∈ D

The Galois connection makes a correspondence between sets of objects and descriptions. Given a set of objects A, A⋄ returns the description which is common to all objects in A. Given a description d, d⋄ is the set of all objects whose description subsumes d. More precisely, the partial order (or subsumption order) ⊑ on D is defined w.r.t. the similarity operation ⊓: c ⊑ d ⇔ c ⊓ d = c, and c is said to be subsumed by d.

Definition 2. A pattern concept of a pattern structure (G, (D, ⊓), δ) is a pair (A, d), where A ⊆ G and d ∈ D, such that A⋄ = d and d⋄ = A; A is called the concept extent and d is called the concept intent.

As in standard FCA, a pattern concept corresponds to the maximal set of objects A whose descriptions subsume the description d, where d is the maximal common description of the objects in A. The set of all concepts can be partially ordered w.r.t. the partial order on extents (dually, on intents, i.e. ⊑), forming a concept lattice. An example of a pattern structure is given and described in the next sections. It can be noticed that Table 1 defines a pattern structure, while the corresponding lattice is depicted in Figure 1.
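As a concrete illustration of these operators, the following small sketch (ours, not taken from the paper) instantiates the Galois connection with the standard FCA case, where descriptions are attribute sets, ⊓ is set intersection and ⊑ is set inclusion; the object names and the toy context are invented for the example.

```python
from functools import reduce

# Illustrative toy sketch of the Galois connection of Definitions 1-2,
# instantiated with the standard FCA case: descriptions are attribute sets,
# the meet is set intersection and subsumption is set inclusion.

def meet(d1, d2):
    """Similarity of two descriptions (here: set intersection)."""
    return d1 & d2

def subsumed(c, d):
    """c is subsumed by d  iff  meet(c, d) = c."""
    return meet(c, d) == c

def extent_to_intent(A, delta):
    """A diamond: the common description of all objects in A."""
    return reduce(meet, (delta[g] for g in A))

def intent_to_extent(d, G, delta):
    """d diamond: all objects whose description subsumes d."""
    return {g for g in G if subsumed(d, delta[g])}

# invented toy context
G = {"g1", "g2", "g3"}
delta = {"g1": {"a", "b", "c"}, "g2": {"a", "c", "d"}, "g3": {"a", "b"}}

d = extent_to_intent({"g1", "g2"}, delta)   # {'a', 'c'}
A = intent_to_extent(d, G, delta)           # {'g1', 'g2'}
print(A, d)                                 # a pattern concept: A diamond = d and d diamond = A
```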
It is worth mentioning that the size of the concept lattice can be exponential in the number of objects; thus, we need a special ranking method to select the most interesting concepts for further analysis. Several techniques are considered in [10], where it is shown that the stability index [11] is more reliable on noisy data. Thus, we use this index in our current work.

Definition 3. Given a concept c, the concept stability Stab(c) is the proportion of subsets of the concept extent (denoted Ext(c)) whose description is equal to the concept intent (denoted Int(c)):

  Stab(c) := |{s ∈ ℘(Ext(c)) | s⋄ = Int(c)}| / |℘(Ext(c))|        (1)

Stability measures how much the concept depends on the initial dataset. The larger the stability, the more objects can be deleted from the context without affecting the intent of the concept, i.e. the intents of the most stable concepts are likely to be characteristic patterns of the studied phenomenon rather than artifacts of the dataset.

After this brief general description of the analysis with pattern structures, the analysis of sequential data can be specified.

2 Sequential Pattern Structures

2.1 An Example of Sequential Data

Imagine that we have medical trajectories of patients, i.e. sequences of hospitalizations, where every hospitalization is described by a hospital name and a set of procedures. An example of sequential data on medical trajectories with three patients is given in Table 1. There is a set of procedures P = {a, b, c, d} and a set of hospital names T_H = {H1, H2, H3, H4, CL, CH, *}, where hospital names are hierarchically organized (by level of generality): H1 and H2 are central hospitals (CH), H3 and H4 are clinics (CL), and * denotes the root of this hierarchy. For the sake of simplicity, we use the ⊓ operator to denote the least common ancestor in T_H, i.e. H1 ⊓ H2 = CH. Every hospitalization is described by one hospital name and may contain several procedures. The order of procedures within a hospitalization is not important in our case. For example, the first hospitalization [H2, {c, d}] of the second patient (p2) took place in hospital H2, and during this hospitalization the patient underwent procedures c and d.

Patient   Trajectory
p1        ⟨[H1, {a}]; [H1, {c, d}]; [H1, {a, b}]; [H1, {d}]⟩
p2        ⟨[H2, {c, d}]; [H3, {b, d}]; [H3, {a, d}]⟩
p3        ⟨[H4, {c, d}]; [H4, {b}]; [H4, {a}]; [H4, {a, d}]⟩

Table 1: Toy sequential data on patient medical trajectories.

An important task is to find the "characteristic" sequences of procedures and associated hospitals in order to improve hospitalization planning, optimize clinical processes or detect anomalies. These sequences can be found by searching for the most stable concepts in the lattice corresponding to a pattern structure.

2.2 Partial Order on Complex Sequences

A sequence is constituted of elements from an alphabet. The classical subsequence matching task requires no special properties of the alphabet. Several generalizations of the classical case were made by introducing a subsequence relation based on an itemset alphabet [8] or on a multidimensional and multilevel alphabet [5], scaled to an itemset alphabet as well. Both of these alphabets are certain semilattices, and, thus, we generalize the previous cases by requiring the alphabet to form a general semilattice (E, ⊓_E).¹ Thanks to the pattern structure formalism, we are able to process in a unified way all types of sequential datasets with a poset-shaped alphabet. However, some sequential data can have connections between elements, e.g. [12], and thus cannot be immediately processed by our approach.

¹ In this paper we consider two semilattices: the first one is related to the characters of the alphabet, (E, ⊓_E), and the second one is related to pattern structures, (D, ⊓).
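To make the alphabet of the running example concrete, the following sketch (ours, not part of the paper) implements the similarity of alphabet elements as pairs (hospital, set of procedures), with the meet of hospitals taken as their least common ancestor in the toy taxonomy T_H; the helper names (PARENT, lca, meet_element) are invented for this illustration.

```python
# Illustrative sketch of the alphabet semilattice (E, meet_E) of the running
# example: an element is a pair (hospital, set of procedures); its meet takes
# the least common ancestor of the hospitals in T_H and intersects the procedures.

# parent links of the toy hospital taxonomy T_H (root '*')
PARENT = {"H1": "CH", "H2": "CH", "H3": "CL", "H4": "CL", "CH": "*", "CL": "*", "*": None}

def ancestors(h):
    """Path from a hospital name up to the root '*' (inclusive)."""
    path = []
    while h is not None:
        path.append(h)
        h = PARENT[h]
    return path

def lca(h1, h2):
    """Least common ancestor of two hospital names in T_H."""
    up = set(ancestors(h1))
    return next(a for a in ancestors(h2) if a in up)

def meet_element(e1, e2):
    """meet_E on alphabet elements: (h1, P1) meet (h2, P2) = (lca(h1, h2), P1 & P2)."""
    (h1, p1), (h2, p2) = e1, e2
    return (lca(h1, h2), p1 & p2)

print(meet_element(("H1", {"c", "d"}), ("H2", {"c", "d"})))  # ('CH', {'c', 'd'})
print(meet_element(("H1", {"a"}), ("H4", {"b"})))            # ('*', set()), i.e. the bottom element
```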
Definition 4. A sequence is an ordered list of elements e ∈ (E, ⊓_E) such that e ≠ ⊥_E. Here, ⊥_E is the element such that ∀e ∈ E, ⊥_E = ⊥_E ⊓_E e.

The bottom element is required by the lattice structure but provides us with no useful information (it matches any other element); thus, it is excluded from sequences. In the same way, in the mining of sequences of itemsets the empty itemset cannot be a proper element [2].

Definition 5. A sequence t = ⟨t1; ...; tk⟩ is a subsequence of a sequence s = ⟨s1; ...; sn⟩, denoted t ≤ s, iff k ≤ n and there exist j_1, ..., j_k such that 1 ≤ j_1 < j_2 < ... < j_k ≤ n and for all i ∈ {1, 2, ..., k}, t_i ⊑_E s_{j_i} (⇔ t_i ⊓_E s_{j_i} = t_i).

[Fig. 1: The concept lattice for the pattern structure given by Table 1. Concept intents refer to the sequences in Tables 1 and 2. Its non-trivial concepts are ({p1, p2, p3}; {ss4, ss5}), ({p1, p2}; {ss2, ss3}), ({p1, p3}; {ss11, ss12}) and ({p2, p3}; {ss6, ss7, ss8}), above the object concepts ({p1}; {p1}), ({p2}; {p2}), ({p3}; {p3}) and the bottom concept (∅; *).]

ss1    ⟨[CH, {c, d}]; [H1, {b}]; [*, {d}]⟩
ss2    ⟨[CH, {c, d}]; [*, {b}]; [*, {d}]⟩
ss3    ⟨[CH, {}]; [*, {d}]; [*, {a}]⟩
ss4    ⟨[*, {c, d}]; [*, {b}]⟩
ss5    ⟨[*, {a}]⟩
ss6    ⟨[*, {c, d}]; [CL, {b}]; [CL, {a}]⟩
ss7    ⟨[CL, {d}]; [CL, {}]⟩
ss8    ⟨[CL, {}]; [CL, {a, d}]⟩
ss9    ⟨[CH, {c, d}]⟩
ss10   ⟨[CL, {b}]; [CL, {a}]⟩
ss11   ⟨[*, {c, d}]; [*, {b}]⟩
ss12   ⟨[*, {a}]; [*, {d}]⟩

Table 2: Subsequences of the patient sequences in Table 1.

With complex sequences and this kind of subsequence relation the computational procedure can be expensive; thus, to simplify the procedure, only "contiguous" subsequences are considered, where only the order of consecutive elements is taken into account, i.e. given j_1 in Definition 5, j_i = j_{i-1} + 1 for all i ∈ {2, 3, ..., k}. Such a restriction makes sense for our data, because a hospitalization is a discrete event and it is likely that the next hospitalization is related to the previous one, for example, hospitalizations for treating the aftereffects of chemotherapy. Below, the word "subsequence" refers to a "contiguous" subsequence.

Example 1. In the running example (Section 2.1), the alphabet is E = T_H × ℘(P) with the similarity operation (h1, P1) ⊓ (h2, P2) = (h1 ⊓ h2, P1 ∩ P2), where h1, h2 ∈ T_H are hospitals and P1, P2 ∈ ℘(P) are sets of procedures. Thus, the sequence ss1 in Table 2 is a subsequence of p1 in Table 1, because if we set j_i = i + 1 (Definition 5), then ss1_1 ⊑_E p1_{j_1} ('CH' is an ancestor of H1 and {c, d} ⊆ {c, d}), ss1_2 ⊑_E p1_{j_2} (the same hospital and {b} ⊆ {a, b}) and ss1_3 ⊑_E p1_{j_3} ('*' is an ancestor of anything and {d} ⊆ {d}).
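The following sketch (illustrative only, not the authors' code) implements the contiguous subsequence test of Definition 5 with the restriction j_i = j_{i-1} + 1, assuming the element order ⊑_E of Example 1; the hard-coded ancestor table for the toy taxonomy T_H and the function names are ours.

```python
# Illustrative sketch of the contiguous subsequence relation of Definition 5.

# hard-coded ancestor sets of the toy taxonomy T_H (root '*')
ANC = {"H1": {"H1", "CH", "*"}, "H2": {"H2", "CH", "*"},
       "H3": {"H3", "CL", "*"}, "H4": {"H4", "CL", "*"},
       "CH": {"CH", "*"}, "CL": {"CL", "*"}, "*": {"*"}}

def leq_E(e1, e2):
    """e1 is subsumed by e2: the hospital of e1 is an ancestor of (or equal to)
    the hospital of e2, and the procedures of e1 are contained in those of e2."""
    (h1, p1), (h2, p2) = e1, e2
    return h1 in ANC[h2] and p1 <= p2

def is_contiguous_subseq(t, s):
    """t <= s: t matches a contiguous window of s, element-wise w.r.t. leq_E."""
    return any(all(leq_E(t[i], s[start + i]) for i in range(len(t)))
               for start in range(len(s) - len(t) + 1))

p1  = [("H1", {"a"}), ("H1", {"c", "d"}), ("H1", {"a", "b"}), ("H1", {"d"})]
ss1 = [("CH", {"c", "d"}), ("H1", {"b"}), ("*", {"d"})]
print(is_contiguous_subseq(ss1, p1))   # True: ss1 matches p1 at offset 1, as in Example 1
```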
2.3 Meet-semilattice of Sequences

Using the previous definitions, we can precisely define the sequential pattern structures that are used for representing and managing sequences. For that, we make an analogy with pattern structures for graphs, where the meet-semilattice operation ⊓ respects subgraph isomorphism [13]. Thus, we introduce a sequential meet-semilattice respecting the subsequence relation. Let S be the set of all sequences based on an alphabet (E, ⊓_E); S is partially ordered w.r.t. Definition 5. Then (D, ⊓) is a semilattice on sequences from S, where D ⊆ ℘(S) is such that if d ∈ D contains a sequence s then all subsequences of s are included in d, i.e. ∀s ∈ d, ∄s̃ ≤ s : s̃ ∉ d, and the similarity operation is set intersection between two sets of sequences. Given two patterns d1, d2 ∈ D, the set intersection operation ensures that if a sequence s belongs to d1 ⊓ d2 then any subsequence of s belongs to d1 ⊓ d2, and thus d1 ⊓ d2 ∈ D. As the set intersection operation is idempotent, commutative and associative, (D, ⊓) is a valid semilattice.

Example 2. The sequential pattern structure for our example (Subsection 2.1) is (G, (D, ⊓), δ), where G = {p1, p2, p3} is the set of patients, (D, ⊓) is the semilattice of sequential descriptions, and δ is the mapping (shown in Table 1) associating a patient in G with a description in D. Figure 1 shows the resulting lattice of sequential pattern concepts for this particular pattern structure.

The set of all possible subsequences of a given sequence can be rather large. Thus, it is more efficient and more readable to keep a pattern d ∈ D as the set of only its maximal sequences, d̃ = {s ∈ d | ∄s* ∈ d : s* > s}. In the rest of the paper, every pattern is given only by the set of its maximal sequences. For example, {p2} ⊓ {p3} = {ss6, ss7, ss8} (see Tables 1 and 2), i.e. {ss6, ss7, ss8} is the set of all maximal sequences specifying the intersection of the two sets of sequences {p2} and {p3}; in the same way, {ss6, ss7, ss8} ⊓ {p1} = {ss4, ss5}. Note that representing a pattern by the set of all its maximal sequences allows for an efficient implementation of the intersection ⊓ of two patterns.

The next proposition follows from this subsection and Definition 5.

Proposition 1. Given (G, (D, ⊓), δ) and x, y ∈ D, x ⊑ y if and only if for every sequence sx ∈ x there is a sequence sy ∈ y such that sx ≤ sy.
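The two pattern-level operations used in this subsection, keeping only the maximal sequences of a pattern and the subsumption test of Proposition 1, can be sketched as follows. The code is an illustration under the assumption that a subsequence test `subseq` (such as the one sketched after Definition 5) is supplied; the tiny demo at the end uses a flat alphabet for brevity.

```python
# Illustrative sketch of two operations from Section 2.3 (not the authors' code).
# `subseq(a, b)` is an assumed test for a <= b.

def maximal_sequences(pattern, subseq):
    """Keep a sequence only if no other sequence of the pattern strictly contains it."""
    return [s for s in pattern
            if not any(s != t and subseq(s, t) for t in pattern)]

def pattern_subsumed(x, y, subseq):
    """x is subsumed by y (Proposition 1): every sequence of x is a subsequence
    of some sequence of y."""
    return all(any(subseq(sx, sy) for sy in y) for sx in x)

# tiny demo over a flat alphabet where the element order is plain equality
def flat_subseq(t, s):
    return any(list(t) == list(s[i:i + len(t)]) for i in range(len(s) - len(t) + 1))

print(maximal_sequences([['a'], ['a', 'b'], ['c']], flat_subseq))       # [['a', 'b'], ['c']]
print(pattern_subsumed([['a', 'b']], [['a', 'b', 'c']], flat_subseq))   # True
```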
3 Projections of Sequential Pattern Structures

Pattern structures can be hard to process due to the usually large number of concepts in the concept lattice and the complexity of the involved similarity operation (consider the parallel with the graph isomorphism problem). Moreover, a given pattern structure can produce a lattice with many patterns which are not interesting for an expert. Can we save computational time by avoiding the construction of unnecessary patterns? Projections of pattern structures "simplify" to some degree the computation and allow one to work with reduced descriptions. In fact, projections can be considered as constraints (or filters) on patterns respecting certain mathematical properties. These mathematical properties ensure that the projection of a lattice is a lattice whose projected concepts have a certain correspondence to the original ones. Moreover, the stability measure of projected concepts never decreases w.r.t. the corresponding original concepts. We introduce projections on sequential patterns, revising them from [7]. An extended definition of projections w.r.t. the definition in [7] should be provided in order to deal with interesting projections for real-life sequential datasets.

Definition 6. A projection ψ : D → D is an interior operator, i.e. it is (1) monotone (x ⊑ y ⇒ ψ(x) ⊑ ψ(y)), (2) contractive (ψ(x) ⊑ x) and (3) idempotent (ψ(ψ(x)) = ψ(x)).

Under a projection ψ, a pattern structure (G, (D, ⊓), δ) becomes the projected pattern structure ψ((G, (D, ⊓), δ)) = (G, (D_ψ, ⊓_ψ), ψ ∘ δ), where D_ψ = ψ(D) = {d ∈ D | ∃d* ∈ D : ψ(d*) = d} and, for all x, y ∈ D, x ⊓_ψ y := ψ(x ⊓ y). Note that in [7] ψ((G, (D, ⊓), δ)) = (G, (D, ⊓), ψ ∘ δ).

Now we should show that (D_ψ, ⊓_ψ) is a semilattice.

Proposition 2. Given a semilattice (D, ⊓) and a projection ψ, for all x, y ∈ D, ψ(x ⊓ y) = ψ(ψ(x) ⊓ y).

Proof.
1. Since ψ(x) ⊑ x, we have x, y ⊒ (x ⊓ y) ⊒ (ψ(x) ⊓ y) ⊒ ψ(ψ(x) ⊓ y).
2. Since x ⊑ y implies ψ(x) ⊑ ψ(y), it follows that ψ(x ⊓ y) ⊒ ψ(ψ(x) ⊓ y).
3. ψ(x ⊓ y) ⊓ ψ(x) ⊓ y = ψ(x ⊓ y) ⊓ y = ψ(x ⊓ y), where the first equality uses ψ(x ⊓ y) ⊑ ψ(x) and the second uses ψ(x ⊓ y) ⊑ y; hence (ψ(x) ⊓ y) ⊒ ψ(x ⊓ y), and so ψ(ψ(x) ⊓ y) ⊒ ψ(ψ(x ⊓ y)) = ψ(x ⊓ y).
4. From (2) and (3) it follows that ψ(x ⊓ y) = ψ(ψ(x) ⊓ y).

Corollary 1. X_1 ⊓_ψ X_2 ⊓_ψ ··· ⊓_ψ X_N = ψ(X_1 ⊓ X_2 ⊓ ··· ⊓ X_N).

Corollary 2. Given a semilattice (D, ⊓) and a projection ψ, (D_ψ, ⊓_ψ) is a semilattice, i.e. ⊓_ψ is commutative, associative and idempotent.

The concepts of a pattern structure and of its projected pattern structure are connected by the next proposition, which follows from Corollary 1:

Proposition 3. An extent in ψ((G, (D, ⊓), δ)) is an extent in (G, (D, ⊓), δ). An intent in ψ((G, (D, ⊓), δ)) is of the form ψ(d), where d is the intent of the concept with the same extent in (G, (D, ⊓), δ).

Moreover, since projections preserve the extents of some concepts, they cannot decrease the stability of the projected concepts, i.e. if a projection preserves a stable concept, then its stability (Definition 3) can only increase.

Proposition 4. Given a pattern structure (G, (D, ⊓), δ), its concept c, a projected pattern structure (G, (D_ψ, ⊓_ψ), ψ ∘ δ) and the projected concept c̃, if the concept extents are equal (Ext(c) = Ext(c̃)) then Stab(c) ≤ Stab(c̃).

Proof. The concepts c and c̃ have the same extent. Thus, according to Definition 3, in order to prove the statement it is enough to prove that for any subset A ⊆ Ext(c), if A⋄ = Int(c) in the original pattern structure, then A⋄ = Int(c̃) in the projected one. We prove this by contradiction. Suppose that there is A ⊂ Ext(c) such that A⋄ = Int(c) in the original pattern structure and A⋄ ≠ Int(c̃) in the projected one. Then there is a descendant concept d̃ of c̃ in the projected pattern structure such that A⋄ = Int(d̃) in the projected lattice. Then there is an original concept d corresponding to the projected concept d̃ with the same extent Ext(d). Then, in the original pattern structure, A⋄ ⊒ Int(d) ⊐ Int(c), and so A⋄ cannot be equal to Int(c) in the original lattice. Contradiction.

Now we are going to present two projections of sequential pattern structures. The first projection comes from the following observation: in many cases it may be more interesting to analyze rather long subsequences than short ones. This kind of projection is called Minimal Length Projection (MLP), and it depends on the minimal allowed length l of the sequences in a pattern. The corresponding function ψ maps a pattern without short sequences to itself, and a pattern with short sequences to the pattern containing only its long sequences, ψ(d) = {s ∈ d | length(s) ≥ l}. Proposition 5 below (relying on Proposition 1) states that MLP is coherent with Definition 6.

Example 3. If we prefer common subsequences of length ≥ 3, then between p2 and p3 in Table 1 there is only one maximal common subsequence, ss6 in Table 2, while ss7 and ss8 are too short to be considered. Figure 2a shows the lattice corresponding to the projected pattern structure (Table 1) with patterns of length greater than or equal to 3.
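A sketch of MLP (ours, for illustration): the projection simply filters out the sequences of a pattern that are shorter than the parameter l.

```python
# Illustrative sketch of the Minimal Length Projection (MLP):
# the projection keeps only the sequences of a pattern whose length is at least l.

def mlp(pattern, l):
    """psi(d) = {s in d | length(s) >= l}."""
    return [s for s in pattern if len(s) >= l]

# with l = 3, only a length-3 sequence survives, mirroring Example 3 where only
# ss6 remains among {ss6, ss7, ss8}
print(mlp([["e1", "e2", "e3"], ["f1", "f2"]], 3))   # [['e1', 'e2', 'e3']]
```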
Proposition 5. MLP is a monotone, contractive and idempotent function on the semilattice (D, ⊓).

Proof. Contractivity and idempotency are clear from the definition. It remains to prove monotonicity. If X ⊑ Y, where X and Y are sets of sequences, then for every sequence x ∈ X there is a sequence y ∈ Y such that x ≤ y (Proposition 1). We should show that ψ(X) ⊑ ψ(Y), or in other words, that for every sequence x ∈ ψ(X) there is a sequence y ∈ ψ(Y) such that x ≤ y. Given x ∈ ψ(X), since ψ(X) is a subset of X and X ⊑ Y, there is a sequence y ∈ Y such that x ≤ y, with |y| ≥ |x| ≥ l (l being the parameter of MLP), and thus y ∈ ψ(Y).

The second projection of a sequential pattern structure is induced by a projection of the alphabet semilattice (E, ⊓_E).

Example 4. An expert is interested in finding sequential patterns describing how a patient changes hospitals, but has little interest in procedures. Thus, any element of the alphabet lattice containing a non-empty set of procedures can be projected to the element with the same hospital but with the empty set of procedures.

Example 5. An expert is interested in finding sequential patterns containing some information about the hospital in every hospitalization, together with the corresponding procedures, i.e. the hospital field of a pattern cannot be equal to the element "any hospital", denoted *; e.g., ss5 is an invalid pattern, while ss6 is a valid pattern in Table 2. Thus, any element of the alphabet semilattice with the * hospital can be projected to ⊥_E. Figure 2b shows the lattice corresponding to the projected pattern structure (Table 1), where the projection comes from this projection of the alphabet semilattice.

Below we formally define how the alphabet projection of a sequential pattern structure should be processed. Intuitively, every sequence in a pattern should be substituted with another sequence, obtained by applying the alphabet projection to all of its elements. However, the result can be an incorrect sequence, because ⊥_E is forbidden to appear in a sequence; thus, the sequences in a pattern should be "developed" w.r.t. ⊥_E, as explained below.

[Fig. 2: The projected concept lattices for the pattern structure given by Table 1. Concept intents refer to the sequences in Tables 1 and 2. (a) MLP projection, l = 3: the non-trivial concepts are ({p1, p2, p3}; ∅), ({p1, p2}; {ss2, ss3}) and ({p2, p3}; {ss6}). (b) Projection removing '*' hospitals: the non-trivial concepts are ({p1, p2, p3}; ∅), ({p1, p2}; {ss9}) and ({p2, p3}; {ss7, ss8, ss10}).]

Definition 7. Given an alphabet (E, ⊓_E), an alphabet projection ψ and a sequence s based on E, the projection of the sequence, ψ(s), is the sequence s̃ such that s̃_i = ψ(s_i) (s_i being the i-th element of the sequence s). It should be noted that s̃ can be incoherent with Definition 4, since it allows ⊥_E to be an element. For simplicity, we allow this incoherence here.

Definition 8. Given an alphabet (E, ⊓_E), an alphabet projection ψ and a pattern d ∈ D, the alphabet-projected pattern d̃ = ψ(d) is the set of sequences obtained by the following procedure. For every sequence s ∈ d, the projection of s is computed (Definition 7), and then the projected sequence is substituted by the set of its maximal subsequences containing no ⊥_E. All the resulting sequences constitute the set d̂, and d̃ is the set of maximal sequences in d̂.

Example 6. {ss10} is the alphabet-projected pattern for the pattern {ss6}, where the alphabet lattice projection is given in Example 5. {⟨[CH, {c, d}]⟩} is the alphabet-projected pattern for the pattern {ss2}, where the alphabet lattice projection is given by projecting every element containing the medical procedure b to the element with the same hospital and the same set of procedures excluding b. The projection of the sequence ss2 is ⟨[CH, {c, d}]; [*, {}]; [*, {d}]⟩, but [*, {}] = ⊥_E, and thus, in order to project the pattern {ss2}, the projected sequence is substituted by its maximal subsequences, i.e. ψ({⟨[CH, {c, d}]; [*, {b}]; [*, {d}]⟩}) = {⟨[CH, {c, d}]⟩}.
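A possible reading of Definition 8 as code (an illustration under our assumptions, not the authors' implementation): with contiguous subsequences, the maximal subsequences of a projected sequence that contain no ⊥_E are exactly the maximal runs between occurrences of ⊥_E. The element projection `psi_E`, the bottom test `is_bottom` and the subsequence test `subseq` are assumed to be supplied by the concrete alphabet projection (e.g. Example 4 or Example 5).

```python
# Illustrative sketch of Definition 8 under the contiguous-subsequence reading.

def project_pattern(pattern, psi_E, is_bottom, subseq):
    """Project every element, cut the projected sequences at bottom elements,
    and keep only the maximal resulting pieces."""
    pieces = []
    for s in pattern:
        run = []
        for elem in [psi_E(e) for e in s] + [None]:   # None marks the end of the sequence
            if elem is None or is_bottom(elem):
                if run:                               # close the current bottom-free run
                    pieces.append(run)
                run = []
            else:
                run.append(elem)
    # keep only sequences that are maximal w.r.t. the subsequence relation
    return [p for p in pieces if not any(p != q and subseq(p, q) for q in pieces)]

# e.g. with the projection of Example 6 (removing procedure b), the pattern {ss2}
# should project to {<[CH,{c,d}]>}: the piece <[*,{d}]> is discarded because it is
# a subsequence of <[CH,{c,d}]>.
```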
Proposition 6. Given an alphabet (E, ⊓_E), a projection ψ of the alphabet and a sequential pattern structure (G, (D, ⊓), δ), the procedure given by Definition 8 is monotone, contractive and idempotent.

Proof. The procedure is idempotent, since the projection of the alphabet is idempotent. It is contractive because, for a pattern d and any sequence s ∈ d, the projection of the sequence, s̃ = ψ(s), is a subsequence of s. In Definition 8 the projected sequences are substituted by their maximal subsequences in order to avoid ⊥_E, building the sets {s̃_i}. Thus, s is a supersequence of every s̃_i, and so the projected pattern d̃ = ψ(d) is subsumed by the pattern d. Finally, we should show monotonicity. Given two patterns x, y ∈ D such that x ⊑ y, i.e. ∀s^x ∈ x, ∃s^y ∈ y : s^x ≤ s^y, consider the projected sequence ψ(s^x). As s^x ≤ s^y for some s^y, then for some j_1 < j_2 < ... < j_{|s^x|} (see Definition 5) we have s^x_i ⊑_E s^y_{j_i} (i ∈ {1, 2, ..., |s^x|}), and hence ψ(s^x_i) ⊑_E ψ(s^y_{j_i}) (by the monotonicity of the alphabet projection), i.e. projected sequences preserve the subsequence relation. Thus, the alphabet projection of a pattern preserves the pattern subsumption relation, ψ(x) ⊑ ψ(y) (Proposition 1), i.e. the alphabet projection is monotone.

4 Sequential Pattern Structure Evaluation

4.1 Implementation

Nearly all state-of-the-art FCA algorithms can be adapted to process pattern structures. We adapted the AddIntent algorithm [14], as the lattice structure is important for us to calculate stability (see the algorithm for calculating stability in [15]). To compute the semilattice operations (⊓, ⊑) between two sets of sequences S = {s_1, ..., s_n} and T = {t_1, ..., t_m}, S ⊓ T is calculated according to Section 2.3, i.e. as the maximal sequences among all maximal common subsequences of any pair s_i and t_j. To find all common subsequences of two sequences, the following observation is useful: if ss = ⟨ss_1; ...; ss_l⟩ is a subsequence of s = ⟨s_1; ...; s_n⟩ with j^s_i = k^s + i (Definition 5; k^s is the index offset at which ss matches s) and a subsequence of t = ⟨t_1; ...; t_m⟩ with j^t_i = k^t + i (likewise), then for any index i ∈ {1, 2, ..., l}, ss_i ⊑_E (s_{j^s_i} ⊓ t_{j^t_i}). Thus, to find the maximal common subsequences of s and t, we first align s and t in all possible ways, and then for every alignment we compute the resulting intersection and keep only the maximal results.
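The alignment scheme just described can be sketched as follows (illustrative Python, not the authors' C++ implementation); `meet_E`, `is_bottom` and `subseq` are assumed helpers such as those sketched in the earlier snippets.

```python
# Illustrative sketch of the alignment-based computation of maximal common
# subsequences described in Section 4.1.

def common_subsequences(s, t, meet_E, is_bottom, subseq):
    pieces = []
    for shift in range(-(len(t) - 1), len(s)):        # every possible alignment of t against s
        run = []
        for i in range(len(s)):
            j = i - shift
            if 0 <= j < len(t):
                e = meet_E(s[i], t[j])
                if is_bottom(e):                      # a bottom element breaks the contiguous run
                    if run:
                        pieces.append(run)
                    run = []
                else:
                    run.append(e)
        if run:
            pieces.append(run)
    # keep only the maximal common subsequences
    return [p for p in pieces if not any(p != q and subseq(p, q) for q in pieces)]

# With the helpers of the earlier sketches, common_subsequences(p2, p3, ...) should
# yield exactly {ss6, ss7, ss8}, matching the intersection of p2 and p3 in Section 2.3.
```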
4.2 Experiments and Discussion

The experiments are carried out on an "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz" computer with 8 GB of memory under the Ubuntu 12.04 operating system. The algorithms are not parallelized and are coded in C++.

The dataset considered here comes from the French healthcare system [16]. Each element of a sequence has a "complex" nature. The dataset contains 2400 patients suffering from cancer. Every patient is described as a sequence of hospitalizations without any timestamps. A hospitalization is a tuple with three fields: (i) the healthcare institution (e.g. the University Hospital of Paris, CHU_Paris), (ii) the reason for the hospitalization (e.g. a cancer disease), and (iii) the set of medical procedures that the patient underwent. An example of a medical trajectory of a patient is

  ⟨[CHU_Paris, Cancer, {P1, P2}]; [CH_Nancy, Chemo, {}]; [CH_Nancy, Chemo, {}]⟩.

This sequence represents a patient trajectory with three hospitalizations. It expresses that the patient was first admitted to the University Hospital of Paris (CHU_Paris) with cancer as the reason and underwent procedures P1 and P2. Then he had two consecutive hospitalizations in the Central Hospital of Nancy (CH_Nancy) in order to undergo chemotherapy, with no additional procedures.

For this dataset the computation of the whole lattice is infeasible. However, a medical expert is not interested in all possible patterns, but rather in patterns which answer his analysis question(s). First of all, patterns of length 1 are unlikely to be of interest for him. Thus, we use the MLP projection with length 2 or 3, taking into account the small average length of the sequences in the dataset.

To search for patterns containing only information about reasons and medical procedures, we project every alphabet element onto the element with the same reason and the same set of procedures, but substitute the hospitalization institution by the most general element of the corresponding taxonomy. Moreover, we do not allow the reason to be empty, i.e. all such elements are projected onto ⊥_E. In this case the computation takes 18 seconds, producing a lattice with around 34,700 concepts. One of the stable concepts has the intent ⟨[Cancer, {App.}]; [Ch.Prep, {}]; [Chemo, {}]⟩, specifying that a cancer was found during an appendix removal surgery, followed by a chemotherapy. This pattern highlights the documented fact that acute appendicitis has been shown to occur antecedent to cancer within three years, because of a carcinoma in the colon or rectum [17].

To find patterns revealing dependencies between hospitals and reasons, all procedures are removed from each alphabet element, and elements with the most general hospital and/or the most general reason are projected onto ⊥_E. The computation of the corresponding lattice takes 10 seconds, producing around 4200 concepts. ⟨[Region Lorraine, Cancer]; [Clinic in Lorraine, Chemo]⟩ is among the stable concepts, and it is rather interesting, because the patients had their cancer detected somewhere in the region but then went to exactly the same clinic for chemotherapy. It suggests that the department may lack clinics for chemotherapy, or that the quality of this clinic is high.

Conclusion

In this paper, we present sequential pattern structures, an original specification of pattern structures able to deal with complex sequences. Projections of sequential pattern structures allow us to efficiently build concept lattices by taking expert requirements into account. To be able to introduce interesting projections, their classical definition is extended. This extension allows us to introduce special projections for sequential pattern structures. The introduced projections are efficiently used for the analysis of a dataset on care trajectories. There are two main directions for future work. First, a study of the properties of generalized projections within the overall framework of FCA should be carried out.
Second, projections of sequential pattern structures can be analyzed more deeply, in order to produce even more interesting and readable patterns.

Acknowledgements: this research received funding from the Basic Research Program at the National Research University Higher School of Economics (Russia) and from the BioIntelligence project (France).

References

1. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C.: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In: 17th International Conference on Data Engineering. (2001) 215–226
2. Yan, X., Han, J., Afshar, R.: CloSpan: Mining Closed Sequential Patterns in Large Databases. In: Proc. of SIAM Int'l Conf. on Data Mining (SDM'03). (2003) 166–177
3. Raïssi, C., Calders, T., Poncelet, P.: Mining conjunctive sequential patterns. Data Min. Knowl. Discov. 17(1) (2008) 77–93
4. Ding, B., Lo, D., Han, J., Khoo, S.C.: Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database. In: Proc. of the IEEE 25th International Conference on Data Engineering, IEEE (March 2009) 1024–1035
5. Plantevit, M., Laurent, A., Laurent, D., Teisseire, M., Choong, Y.W.: Mining multidimensional and multilevel sequential patterns. ACM Transactions on Knowledge Discovery from Data 4(1) (January 2010) 1–37
6. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. 1st edn. Springer, Secaucus, NJ, USA (1997)
7. Ganter, B., Kuznetsov, S.O.: Pattern Structures and Their Projections. In Delugach, H., Stumme, G., eds.: Conceptual Structures: Broadening the Base. Volume 2120 of LNCS. Springer Berlin Heidelberg (2001) 129–142
8. Casas-Garriga, G.: Summarizing Sequential Data with Closed Partial Orders. In: Proc. of the 5th SIAM Int'l Conf. on Data Mining (SDM'05). (2005)
9. Ferré, S.: The Efficient Computation of Complete and Concise Substring Scales with Suffix Trees. In Kuznetsov, S.O., Schmidt, S., eds.: Formal Concept Analysis. Volume 4390 of LNCS. Springer (2007) 98–113
10. Klimushkin, M., Obiedkov, S.A., Roth, C.: Approaches to the Selection of Relevant Concepts in the Case of Noisy Data. In: Proc. of the 8th International Conference on Formal Concept Analysis. ICFCA'10, Springer (2010) 255–266
11. Kuznetsov, S.O.: On stability of a formal concept. Annals of Mathematics and Artificial Intelligence 49(1-4) (2007) 101–115
12. Adda, M., Valtchev, P., Missaoui, R., Djeraba, C.: A framework for mining meaningful usage patterns within a semantically enhanced web portal. In: Proceedings of the 3rd C* Conference on Computer Science and Software Engineering. C3S2E '10, New York, NY, USA, ACM (2010) 138–147
13. Kuznetsov, S.O.: Learning of Simple Conceptual Graphs from Positive and Negative Examples. In Żytkow, J., Rauch, J., eds.: Principles of Data Mining and Knowledge Discovery. Volume 1704 of LNCS. Springer Berlin Heidelberg (1999) 384–391
14. van der Merwe, D., Obiedkov, S., Kourie, D.: AddIntent: A new incremental algorithm for constructing concept lattices. In Eklund, P., ed.: Concept Lattices. Volume 2961 of LNCS. Springer (2004) 372–385
15. Roth, C., Obiedkov, S., Kourie, D.G.: On succinct representation of knowledge community taxonomies with formal concept analysis. International Journal of Foundations of Computer Science 19(2) (April 2008) 383–404
16. Fetter, R.B., Shin, Y., Freeman, J.L., Averill, R.F., Thompson, J.D.: Case mix definition by diagnosis-related groups. Med Care 18(2) (February 1980) 1–53
17. Arnbjörnsson, E.: Acute appendicitis as a sign of a colorectal carcinoma. Journal of Surgical Oncology 20(1) (May 1982) 17–20