On Projections of Sequential Pattern Structures (with an application on care trajectories)

Aleksey Buzmakov^{1,2}, Elias Egho^1, Nicolas Jay^1, Sergei O. Kuznetsov^2, Amedeo Napoli^1, and Chedy Raïssi^1

1 LORIA (CNRS – Inria NGE – U. de Lorraine), Vandœuvre-lès-Nancy, France
2 National Research University Higher School of Economics, Moscow, Russia
{aleksey.buzmakov, chedy.raissi}@inria.fr, {elias.egho, nicolas.jay, amedeo.napoli}@loria.fr, skuznetsov@hse.ru

Abstract. In this paper, we are interested in the analysis of sequential data and we propose an original framework based on FCA. For that, we introduce sequential pattern structures, an original specification of pattern structures for dealing with sequential data. Sequential pattern structures are given by a subsumption operation between sets of sequences, based on subsequence matching. To avoid a huge number of resulting concepts, domain knowledge projections can be applied. The original definition of projections is revised in order to operate on sequential pattern structures in a meaningful way. Based on the introduced definition, several projections of sequential pattern structures involving domain or expert knowledge are defined and discussed. These projections are evaluated on a real dataset on care trajectories, where every hospitalization is described by a heterogeneous tuple with different fields. The evaluation reveals interesting concepts and justifies the usage of the introduced projections of sequential pattern structures. This research work provides a new and efficient extension of FCA for dealing with complex data, which can be an alternative for the analysis of sequential datasets.

Keywords: formal concept analysis, pattern structures, projections, sequential pattern structures, sequences

Introduction

Analysis of sequential data is a challenging task. In the last two decades, the main emphasis has been on developing efficient mining algorithms with effective pattern representations for sequences of itemsets [1-4]. Traditional sequential pattern mining algorithms generate a large number of frequent sequences while only a few of them are truly relevant. Moreover, in some particular cases, only sequential patterns of a certain type are of interest and should be mined first. Are we able to develop a framework taking into account only patterns of the required types? Furthermore, in many cases sequential data are described by sequences with complex elements, e.g. a text is a sequence of syntactic trees. To process such kind of data with existing algorithms, the elements of sequences can be scaled into itemsets, as it is done in the case of multilevel multidimensional data [5]. However, in this case it is rather difficult to introduce expert requirements within a sequence, which leads to an even larger set of resulting patterns.

We approach this problem with FCA and pattern structures [6, 7]. FCA has been successfully used for the analysis of sequential data [8, 9]. Moreover, it allows one to use different measures of interestingness for the resulting patterns (concepts). Pattern structures allow one to process sequential data directly, without a scaling step.
Furthermore, there are projections of pattern structures, which were introduced in order to simplify the computation of pattern lattices by simplifying descriptions. Moreover, projections can be efficiently used to encode special domain knowledge requirements, allowing one to reduce the number of irrelevant patterns. We generalize the original definition of projections in order to deal with projections respecting domain knowledge. For example, sequences of length 1 are rarely useful, but they cannot be excluded by the original definition of projections.

The rest of the paper is organized as follows. In Section 1 we recall FCA, pattern structures and measures of concept interestingness. Section 2 states the problem of complex sequence analysis and introduces sequential pattern structures. In Section 3, first, the generalization of projections is defined, and, second, some projections specific to sequential pattern structures are introduced and analyzed. Finally, before concluding the paper, we discuss an experimental evaluation in Section 4.

1 FCA and Pattern Structures

FCA [6] is a mathematical formalism having many applications in data analysis. Pattern structures are a generalization of FCA for dealing with complex structures, such as sequences or graphs [7].

Definition 1. A pattern structure is a triple (G, (D, ⊓), δ), where G is a set of objects, (D, ⊓) is a complete meet-semilattice of descriptions and δ : G → D maps an object to a description. The lattice operation of the semilattice (⊓) corresponds to the similarity between two descriptions.

Standard FCA can be presented in terms of pattern structures. In this case, given a formal context (G, M, I), G is the set of objects and the semilattice of descriptions is (℘(M), ⊓), where a description is a set of attributes and the ⊓ operation corresponds to set intersection (℘(M) denotes the powerset of M). If x = {a, b, c} and y = {a, c, d} then x ⊓ y = x ∩ y = {a, c}. The mapping δ : G → ℘(M) is given by δ(g) = {m ∈ M | (g, m) ∈ I} and returns the description of a given object as a set of attributes.

The Galois connection for (G, (D, ⊓), δ) is defined as follows:

  A⋄ := ⊓_{g ∈ A} δ(g),          for A ⊆ G
  d⋄ := {g ∈ G | d ⊑ δ(g)},      for d ∈ D

The Galois connection makes a correspondence between sets of objects and descriptions. Given a set of objects A, A⋄ returns the description which is common to all objects in A. Given a description d, d⋄ is the set of all objects whose description subsumes d. More precisely, the partial order (or subsumption order) ⊑ on D is defined w.r.t. the similarity operation ⊓: c ⊑ d ⇔ c ⊓ d = c, and c is said to be subsumed by d.

Definition 2. A pattern concept of a pattern structure (G, (D, ⊓), δ) is a pair (A, d), where A ⊆ G and d ∈ D, such that A⋄ = d and d⋄ = A; A is called the concept extent and d is called the concept intent.

As in standard FCA, a pattern concept corresponds to the maximal set of objects A whose descriptions subsume the description d, where d is the maximal common description of the objects in A. The set of all concepts can be partially ordered w.r.t. the partial order on extents (dually, on intents, i.e. ⊑), forming a concept lattice. An example of a pattern structure is given and described in the next sections. It can be noticed that Table 1 defines a pattern structure, while the corresponding lattice is depicted in Figure 1.
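As a concrete illustration of these operators, the following small sketch (ours, not taken from the paper) instantiates the Galois connection with the standard FCA case, where descriptions are attribute sets, ⊓ is set intersection and ⊑ is set inclusion; the object names and the toy context are invented for the example.

```python
from functools import reduce

# Illustrative toy sketch of the Galois connection of Definitions 1-2,
# instantiated with the standard FCA case: descriptions are attribute sets,
# the meet is set intersection and subsumption is set inclusion.

def meet(d1, d2):
    """Similarity of two descriptions (here: set intersection)."""
    return d1 & d2

def subsumed(c, d):
    """c is subsumed by d  iff  meet(c, d) = c."""
    return meet(c, d) == c

def extent_to_intent(A, delta):
    """A diamond: the common description of all objects in A."""
    return reduce(meet, (delta[g] for g in A))

def intent_to_extent(d, G, delta):
    """d diamond: all objects whose description subsumes d."""
    return {g for g in G if subsumed(d, delta[g])}

# invented toy context
G = {"g1", "g2", "g3"}
delta = {"g1": {"a", "b", "c"}, "g2": {"a", "c", "d"}, "g3": {"a", "b"}}

d = extent_to_intent({"g1", "g2"}, delta)   # {'a', 'c'}
A = intent_to_extent(d, G, delta)           # {'g1', 'g2'}
print(A, d)                                 # a pattern concept: A diamond = d and d diamond = A
```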
It is worth mentioning that the size of the concept lattice can be exponential in the number of objects; thus, we need a special ranking method to select the most interesting concepts for further analysis. Several techniques are considered in [10], where it is shown that the stability index [11] is more reliable on noisy data. Thus, we use this index in our current work.

Definition 3. Given a concept c, the concept stability Stab(c) is the proportion of subsets of the concept extent (denoted Ext(c)) whose description is equal to the concept intent (denoted Int(c)):

  Stab(c) := |{s ∈ ℘(Ext(c)) | s⋄ = Int(c)}| / |℘(Ext(c))|        (1)

Stability measures how much the concept depends on the initial dataset. The larger the stability, the more objects can be deleted from the context without affecting the intent of the concept, i.e. the intents of the most stable concepts are likely to be characteristic patterns of the studied phenomenon rather than artifacts of the dataset.

After this brief general description of the analysis with pattern structures, the analysis of sequential data can be specified.

2 Sequential Pattern Structures

2.1 An Example of Sequential Data

Imagine that we have medical trajectories of patients, i.e. sequences of hospitalizations, where every hospitalization is described by a hospital name and a set of procedures. An example of sequential data on medical trajectories with three patients is given in Table 1. There is a set of procedures P = {a, b, c, d} and a set of hospital names T_H = {H1, H2, H3, H4, CL, CH, *}, where hospital names are hierarchically organized (by level of generality): H1 and H2 are central hospitals (CH), H3 and H4 are clinics (CL), and * denotes the root of this hierarchy. For the sake of simplicity, we use the ⊓ operator to denote the least common ancestor in T_H, i.e. H1 ⊓ H2 = CH. Every hospitalization is described by one hospital name and may contain several procedures. The order of procedures within a hospitalization is not important in our case. For example, the first hospitalization [H2, {c, d}] of the second patient (p2) took place in hospital H2, and during this hospitalization the patient underwent procedures c and d.

Patient   Trajectory
p1        ⟨[H1, {a}]; [H1, {c, d}]; [H1, {a, b}]; [H1, {d}]⟩
p2        ⟨[H2, {c, d}]; [H3, {b, d}]; [H3, {a, d}]⟩
p3        ⟨[H4, {c, d}]; [H4, {b}]; [H4, {a}]; [H4, {a, d}]⟩

Table 1: Toy sequential data on patient medical trajectories.

An important task is to find the "characteristic" sequences of procedures and associated hospitals in order to improve hospitalization planning, optimize clinical processes or detect anomalies. These sequences can be found by searching for the most stable concepts in the lattice corresponding to a pattern structure.

2.2 Partial Order on Complex Sequences

A sequence is constituted of elements from an alphabet. The classical subsequence matching task requires no special properties of the alphabet. Several generalizations of the classical case were made by introducing a subsequence relation based on an itemset alphabet [8] or on a multidimensional and multilevel alphabet [5], scaled to an itemset alphabet as well. Both of these alphabets are certain semilattices, and, thus, we generalize the previous cases by requiring the alphabet to form a general semilattice (E, ⊓_E).¹ Thanks to the pattern structure formalism, we are able to process in a unified way all types of sequential datasets with a poset-shaped alphabet. However, some sequential data can have connections between elements, e.g. [12], and thus cannot be immediately processed by our approach.

¹ In this paper we consider two semilattices: the first one is related to the characters of the alphabet, (E, ⊓_E), and the second one is related to pattern structures, (D, ⊓).
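To make the alphabet of the running example concrete, the following sketch (ours, not part of the paper) implements the similarity of alphabet elements as pairs (hospital, set of procedures), with the meet of hospitals taken as their least common ancestor in the toy taxonomy T_H; the helper names (PARENT, lca, meet_element) are invented for this illustration.

```python
# Illustrative sketch of the alphabet semilattice (E, meet_E) of the running
# example: an element is a pair (hospital, set of procedures); its meet takes
# the least common ancestor of the hospitals in T_H and intersects the procedures.

# parent links of the toy hospital taxonomy T_H (root '*')
PARENT = {"H1": "CH", "H2": "CH", "H3": "CL", "H4": "CL", "CH": "*", "CL": "*", "*": None}

def ancestors(h):
    """Path from a hospital name up to the root '*' (inclusive)."""
    path = []
    while h is not None:
        path.append(h)
        h = PARENT[h]
    return path

def lca(h1, h2):
    """Least common ancestor of two hospital names in T_H."""
    up = set(ancestors(h1))
    return next(a for a in ancestors(h2) if a in up)

def meet_element(e1, e2):
    """meet_E on alphabet elements: (h1, P1) meet (h2, P2) = (lca(h1, h2), P1 & P2)."""
    (h1, p1), (h2, p2) = e1, e2
    return (lca(h1, h2), p1 & p2)

print(meet_element(("H1", {"c", "d"}), ("H2", {"c", "d"})))  # ('CH', {'c', 'd'})
print(meet_element(("H1", {"a"}), ("H4", {"b"})))            # ('*', set()), i.e. the bottom element
```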
Definition 4. A sequence is an ordered list of elements e ∈ (E, ⊓_E) such that e ≠ ⊥_E. Here, ⊥_E is the element such that ∀e ∈ E, ⊥_E = ⊥_E ⊓_E e.

The bottom element is required by the lattice structure but provides us with no useful information (it matches any other element); thus, it is excluded from sequences. In the same way, in the mining of sequences of itemsets the empty itemset cannot be a proper element [2].

Definition 5. A sequence t = ⟨t1; ...; tk⟩ is a subsequence of a sequence s = ⟨s1; ...; sn⟩, denoted t ≤ s, iff k ≤ n and there exist j_1, ..., j_k such that 1 ≤ j_1 < j_2 < ... < j_k ≤ n and for all i ∈ {1, 2, ..., k}, t_i ⊑_E s_{j_i} (⇔ t_i ⊓_E s_{j_i} = t_i).

[Fig. 1: The concept lattice for the pattern structure given by Table 1. Concept intents refer to the sequences in Tables 1 and 2. Its non-trivial concepts are ({p1, p2, p3}; {ss4, ss5}), ({p1, p2}; {ss2, ss3}), ({p1, p3}; {ss11, ss12}) and ({p2, p3}; {ss6, ss7, ss8}), above the object concepts ({p1}; {p1}), ({p2}; {p2}), ({p3}; {p3}) and the bottom concept (∅; *).]

ss1    ⟨[CH, {c, d}]; [H1, {b}]; [*, {d}]⟩
ss2    ⟨[CH, {c, d}]; [*, {b}]; [*, {d}]⟩
ss3    ⟨[CH, {}]; [*, {d}]; [*, {a}]⟩
ss4    ⟨[*, {c, d}]; [*, {b}]⟩
ss5    ⟨[*, {a}]⟩
ss6    ⟨[*, {c, d}]; [CL, {b}]; [CL, {a}]⟩
ss7    ⟨[CL, {d}]; [CL, {}]⟩
ss8    ⟨[CL, {}]; [CL, {a, d}]⟩
ss9    ⟨[CH, {c, d}]⟩
ss10   ⟨[CL, {b}]; [CL, {a}]⟩
ss11   ⟨[*, {c, d}]; [*, {b}]⟩
ss12   ⟨[*, {a}]; [*, {d}]⟩

Table 2: Subsequences of the patient sequences in Table 1.

With complex sequences and this kind of subsequence relation the computational procedure can be expensive; thus, to simplify the procedure, only "contiguous" subsequences are considered, where only the order of consecutive elements is taken into account, i.e. given j_1 in Definition 5, j_i = j_{i-1} + 1 for all i ∈ {2, 3, ..., k}. Such a restriction makes sense for our data, because a hospitalization is a discrete event and it is likely that the next hospitalization is related to the previous one, for example, hospitalizations for treating the aftereffects of chemotherapy. Below, the word "subsequence" refers to a "contiguous" subsequence.

Example 1. In the running example (Section 2.1), the alphabet is E = T_H × ℘(P) with the similarity operation (h1, P1) ⊓ (h2, P2) = (h1 ⊓ h2, P1 ∩ P2), where h1, h2 ∈ T_H are hospitals and P1, P2 ∈ ℘(P) are sets of procedures. Thus, the sequence ss1 in Table 2 is a subsequence of p1 in Table 1, because if we set j_i = i + 1 (Definition 5), then ss1_1 ⊑_E p1_{j_1} ('CH' is an ancestor of H1 and {c, d} ⊆ {c, d}), ss1_2 ⊑_E p1_{j_2} (the same hospital and {b} ⊆ {a, b}) and ss1_3 ⊑_E p1_{j_3} ('*' is an ancestor of anything and {d} ⊆ {d}).
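The following sketch (illustrative only, not the authors' code) implements the contiguous subsequence test of Definition 5 with the restriction j_i = j_{i-1} + 1, assuming the element order ⊑_E of Example 1; the hard-coded ancestor table for the toy taxonomy T_H and the function names are ours.

```python
# Illustrative sketch of the contiguous subsequence relation of Definition 5.

# hard-coded ancestor sets of the toy taxonomy T_H (root '*')
ANC = {"H1": {"H1", "CH", "*"}, "H2": {"H2", "CH", "*"},
       "H3": {"H3", "CL", "*"}, "H4": {"H4", "CL", "*"},
       "CH": {"CH", "*"}, "CL": {"CL", "*"}, "*": {"*"}}

def leq_E(e1, e2):
    """e1 is subsumed by e2: the hospital of e1 is an ancestor of (or equal to)
    the hospital of e2, and the procedures of e1 are contained in those of e2."""
    (h1, p1), (h2, p2) = e1, e2
    return h1 in ANC[h2] and p1 <= p2

def is_contiguous_subseq(t, s):
    """t <= s: t matches a contiguous window of s, element-wise w.r.t. leq_E."""
    return any(all(leq_E(t[i], s[start + i]) for i in range(len(t)))
               for start in range(len(s) - len(t) + 1))

p1  = [("H1", {"a"}), ("H1", {"c", "d"}), ("H1", {"a", "b"}), ("H1", {"d"})]
ss1 = [("CH", {"c", "d"}), ("H1", {"b"}), ("*", {"d"})]
print(is_contiguous_subseq(ss1, p1))   # True: ss1 matches p1 at offset 1, as in Example 1
```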
2.3 Meet-semilattice of Sequences

Using the previous definitions, we can precisely define the sequential pattern structures that are used for representing and managing sequences. For that, we make an analogy with pattern structures for graphs, where the meet-semilattice operation ⊓ respects subgraph isomorphism [13]. Thus, we introduce a sequential meet-semilattice respecting the subsequence relation. Let S be the set of all sequences based on an alphabet (E, ⊓_E); S is partially ordered w.r.t. Definition 5. Then (D, ⊓) is a semilattice on sequences from S, where D ⊆ ℘(S) is such that if d ∈ D contains a sequence s then all subsequences of s are included in d, i.e. ∀s ∈ d, ∄s̃ ≤ s : s̃ ∉ d, and the similarity operation is set intersection between two sets of sequences. Given two patterns d1, d2 ∈ D, the set intersection operation ensures that if a sequence s belongs to d1 ⊓ d2 then any subsequence of s belongs to d1 ⊓ d2, and thus d1 ⊓ d2 ∈ D. As the set intersection operation is idempotent, commutative and associative, (D, ⊓) is a valid semilattice.

Example 2. The sequential pattern structure for our example (Subsection 2.1) is (G, (D, ⊓), δ), where G = {p1, p2, p3} is the set of patients, (D, ⊓) is the semilattice of sequential descriptions, and δ is the mapping (shown in Table 1) associating a patient in G with a description in D. Figure 1 shows the resulting lattice of sequential pattern concepts for this particular pattern structure.

The set of all possible subsequences of a given sequence can be rather large. Thus, it is more efficient and more readable to keep a pattern d ∈ D as the set of only its maximal sequences, d̃ = {s ∈ d | ∄s* ∈ d : s* > s}. In the rest of the paper, every pattern is given only by the set of its maximal sequences. For example, {p2} ⊓ {p3} = {ss6, ss7, ss8} (see Tables 1 and 2), i.e. {ss6, ss7, ss8} is the set of all maximal sequences specifying the intersection of the two sets of sequences {p2} and {p3}; in the same way, {ss6, ss7, ss8} ⊓ {p1} = {ss4, ss5}. Note that representing a pattern by the set of all its maximal sequences allows for an efficient implementation of the intersection ⊓ of two patterns.

The next proposition follows from this subsection and Definition 5.

Proposition 1. Given (G, (D, ⊓), δ) and x, y ∈ D, x ⊑ y if and only if for every sequence sx ∈ x there is a sequence sy ∈ y such that sx ≤ sy.
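The two pattern-level operations used in this subsection, keeping only the maximal sequences of a pattern and the subsumption test of Proposition 1, can be sketched as follows. The code is an illustration under the assumption that a subsequence test `subseq` (such as the one sketched after Definition 5) is supplied; the tiny demo at the end uses a flat alphabet for brevity.

```python
# Illustrative sketch of two operations from Section 2.3 (not the authors' code).
# `subseq(a, b)` is an assumed test for a <= b.

def maximal_sequences(pattern, subseq):
    """Keep a sequence only if no other sequence of the pattern strictly contains it."""
    return [s for s in pattern
            if not any(s != t and subseq(s, t) for t in pattern)]

def pattern_subsumed(x, y, subseq):
    """x is subsumed by y (Proposition 1): every sequence of x is a subsequence
    of some sequence of y."""
    return all(any(subseq(sx, sy) for sy in y) for sx in x)

# tiny demo over a flat alphabet where the element order is plain equality
def flat_subseq(t, s):
    return any(list(t) == list(s[i:i + len(t)]) for i in range(len(s) - len(t) + 1))

print(maximal_sequences([['a'], ['a', 'b'], ['c']], flat_subseq))       # [['a', 'b'], ['c']]
print(pattern_subsumed([['a', 'b']], [['a', 'b', 'c']], flat_subseq))   # True
```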
3 Projections of Sequential Pattern Structures

Pattern structures can be hard to process due to the usually large number of concepts in the concept lattice and the complexity of the involved similarity operation (consider the parallel with the graph isomorphism problem). Moreover, a given pattern structure can produce a lattice with many patterns which are not interesting for an expert. Can we save computational time by avoiding the construction of unnecessary patterns? Projections of pattern structures "simplify" to some degree the computation and allow one to work with reduced descriptions. In fact, projections can be considered as constraints (or filters) on patterns respecting certain mathematical properties. These mathematical properties ensure that the projection of a lattice is a lattice whose projected concepts have a certain correspondence to the original ones. Moreover, the stability measure of projected concepts never decreases w.r.t. the corresponding original concepts. We introduce projections on sequential patterns, revising them from [7]. An extended definition of projections w.r.t. the definition in [7] should be provided in order to deal with interesting projections for real-life sequential datasets.

Definition 6. A projection ψ : D → D is an interior operator, i.e. it is (1) monotone (x ⊑ y ⇒ ψ(x) ⊑ ψ(y)), (2) contractive (ψ(x) ⊑ x) and (3) idempotent (ψ(ψ(x)) = ψ(x)).

Under a projection ψ, a pattern structure (G, (D, ⊓), δ) becomes the projected pattern structure ψ((G, (D, ⊓), δ)) = (G, (D_ψ, ⊓_ψ), ψ ∘ δ), where D_ψ = ψ(D) = {d ∈ D | ∃d* ∈ D : ψ(d*) = d} and, for all x, y ∈ D, x ⊓_ψ y := ψ(x ⊓ y). Note that in [7] ψ((G, (D, ⊓), δ)) = (G, (D, ⊓), ψ ∘ δ).

Now we should show that (D_ψ, ⊓_ψ) is a semilattice.

Proposition 2. Given a semilattice (D, ⊓) and a projection ψ, for all x, y ∈ D, ψ(x ⊓ y) = ψ(ψ(x) ⊓ y).

Proof.
1. Since ψ(x) ⊑ x, we have x, y ⊒ (x ⊓ y) ⊒ (ψ(x) ⊓ y) ⊒ ψ(ψ(x) ⊓ y).
2. Since x ⊑ y implies ψ(x) ⊑ ψ(y), it follows that ψ(x ⊓ y) ⊒ ψ(ψ(x) ⊓ y).
3. ψ(x ⊓ y) ⊓ ψ(x) ⊓ y = ψ(x ⊓ y) ⊓ y = ψ(x ⊓ y), where the first equality uses ψ(x ⊓ y) ⊑ ψ(x) and the second uses ψ(x ⊓ y) ⊑ y; hence (ψ(x) ⊓ y) ⊒ ψ(x ⊓ y), and so ψ(ψ(x) ⊓ y) ⊒ ψ(ψ(x ⊓ y)) = ψ(x ⊓ y).
4. From (2) and (3) it follows that ψ(x ⊓ y) = ψ(ψ(x) ⊓ y).

Corollary 1. X_1 ⊓_ψ X_2 ⊓_ψ ··· ⊓_ψ X_N = ψ(X_1 ⊓ X_2 ⊓ ··· ⊓ X_N).

Corollary 2. Given a semilattice (D, ⊓) and a projection ψ, (D_ψ, ⊓_ψ) is a semilattice, i.e. ⊓_ψ is commutative, associative and idempotent.

The concepts of a pattern structure and of its projected pattern structure are connected by the next proposition, which follows from Corollary 1:

Proposition 3. An extent in ψ((G, (D, ⊓), δ)) is an extent in (G, (D, ⊓), δ). An intent in ψ((G, (D, ⊓), δ)) is of the form ψ(d), where d is the intent of the concept with the same extent in (G, (D, ⊓), δ).

Moreover, since projections preserve the extents of some concepts, they cannot decrease the stability of the projected concepts, i.e. if a projection preserves a stable concept, then its stability (Definition 3) can only increase.

Proposition 4. Given a pattern structure (G, (D, ⊓), δ), its concept c, a projected pattern structure (G, (D_ψ, ⊓_ψ), ψ ∘ δ) and the projected concept c̃, if the concept extents are equal (Ext(c) = Ext(c̃)) then Stab(c) ≤ Stab(c̃).

Proof. The concepts c and c̃ have the same extent. Thus, according to Definition 3, in order to prove the statement it is enough to prove that for any subset A ⊆ Ext(c), if A⋄ = Int(c) in the original pattern structure, then A⋄ = Int(c̃) in the projected one. We prove this by contradiction. Suppose that there is A ⊂ Ext(c) such that A⋄ = Int(c) in the original pattern structure and A⋄ ≠ Int(c̃) in the projected one. Then there is a descendant concept d̃ of c̃ in the projected pattern structure such that A⋄ = Int(d̃) in the projected lattice. Then there is an original concept d corresponding to the projected concept d̃ with the same extent Ext(d). Then, in the original pattern structure, A⋄ ⊒ Int(d) ⊐ Int(c), and so A⋄ cannot be equal to Int(c) in the original lattice. Contradiction.

Now we are going to present two projections of sequential pattern structures. The first projection comes from the following observation: in many cases it may be more interesting to analyze rather long subsequences than short ones. This kind of projection is called Minimal Length Projection (MLP), and it depends on the minimal allowed length l of the sequences in a pattern. The corresponding function ψ maps a pattern without short sequences to itself, and a pattern with short sequences to the pattern containing only its long sequences, ψ(d) = {s ∈ d | length(s) ≥ l}. Proposition 5 below (relying on Proposition 1) states that MLP is coherent with Definition 6.

Example 3. If we prefer common subsequences of length ≥ 3, then between p2 and p3 in Table 1 there is only one maximal common subsequence, ss6 in Table 2, while ss7 and ss8 are too short to be considered. Figure 2a shows the lattice corresponding to the projected pattern structure (Table 1) with patterns of length greater than or equal to 3.
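A sketch of MLP (ours, for illustration): the projection simply filters out the sequences of a pattern that are shorter than the parameter l.

```python
# Illustrative sketch of the Minimal Length Projection (MLP):
# the projection keeps only the sequences of a pattern whose length is at least l.

def mlp(pattern, l):
    """psi(d) = {s in d | length(s) >= l}."""
    return [s for s in pattern if len(s) >= l]

# with l = 3, only a length-3 sequence survives, mirroring Example 3 where only
# ss6 remains among {ss6, ss7, ss8}
print(mlp([["e1", "e2", "e3"], ["f1", "f2"]], 3))   # [['e1', 'e2', 'e3']]
```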
Proposition 5. MLP is a monotone, contractive and idempotent function on the semilattice (D, ⊓).

Proof. Contractivity and idempotency are clear from the definition. It remains to prove monotonicity. If X ⊑ Y, where X and Y are sets of sequences, then for every sequence x ∈ X there is a sequence y ∈ Y such that x ≤ y (Proposition 1). We should show that ψ(X) ⊑ ψ(Y), or in other words, that for every sequence x ∈ ψ(X) there is a sequence y ∈ ψ(Y) such that x ≤ y. Given x ∈ ψ(X), since ψ(X) is a subset of X and X ⊑ Y, there is a sequence y ∈ Y such that x ≤ y, with |y| ≥ |x| ≥ l (l being the parameter of MLP), and thus y ∈ ψ(Y).

The second projection of a sequential pattern structure is induced by a projection of the alphabet semilattice (E, ⊓_E).

Example 4. An expert is interested in finding sequential patterns describing how a patient changes hospitals, but has little interest in procedures. Thus, any element of the alphabet lattice containing a non-empty set of procedures can be projected to the element with the same hospital but with the empty set of procedures.

Example 5. An expert is interested in finding sequential patterns containing some information about the hospital in every hospitalization, together with the corresponding procedures, i.e. the hospital field of a pattern cannot be equal to the element "any hospital", denoted *; e.g., ss5 is an invalid pattern, while ss6 is a valid pattern in Table 2. Thus, any element of the alphabet semilattice with the * hospital can be projected to ⊥_E. Figure 2b shows the lattice corresponding to the projected pattern structure (Table 1), where the projection comes from this projection of the alphabet semilattice.

Below we formally define how the alphabet projection of a sequential pattern structure should be processed. Intuitively, every sequence in a pattern should be substituted with another sequence, obtained by applying the alphabet projection to all of its elements. However, the result can be an incorrect sequence, because ⊥_E is forbidden to appear in a sequence; thus, the sequences in a pattern should be "developed" w.r.t. ⊥_E, as explained below.

[Fig. 2: The projected concept lattices for the pattern structure given by Table 1. Concept intents refer to the sequences in Tables 1 and 2. (a) MLP projection, l = 3: the non-trivial concepts are ({p1, p2, p3}; ∅), ({p1, p2}; {ss2, ss3}) and ({p2, p3}; {ss6}). (b) Projection removing '*' hospitals: the non-trivial concepts are ({p1, p2, p3}; ∅), ({p1, p2}; {ss9}) and ({p2, p3}; {ss7, ss8, ss10}).]

Definition 7. Given an alphabet (E, ⊓_E), an alphabet projection ψ and a sequence s based on E, the projection of the sequence, ψ(s), is the sequence s̃ such that s̃_i = ψ(s_i) (s_i being the i-th element of the sequence s). It should be noted that s̃ can be incoherent with Definition 4, since it allows ⊥_E to be an element. For simplicity, we allow this incoherence here.

Definition 8. Given an alphabet (E, ⊓_E), an alphabet projection ψ and a pattern d ∈ D, the alphabet-projected pattern d̃ = ψ(d) is the set of sequences obtained by the following procedure. For every sequence s ∈ d, the projection of s is computed (Definition 7), and then the projected sequence is substituted by the set of its maximal subsequences containing no ⊥_E. All the resulting sequences constitute the set d̂, and d̃ is the set of maximal sequences in d̂.

Example 6. {ss10} is the alphabet-projected pattern for the pattern {ss6}, where the alphabet lattice projection is given in Example 5. {⟨[CH, {c, d}]⟩} is the alphabet-projected pattern for the pattern {ss2}, where the alphabet lattice projection is given by projecting every element containing the medical procedure b to the element with the same hospital and the same set of procedures excluding b. The projection of the sequence ss2 is ⟨[CH, {c, d}]; [*, {}]; [*, {d}]⟩, but [*, {}] = ⊥_E, and thus, in order to project the pattern {ss2}, the projected sequence is substituted by its maximal subsequences, i.e. ψ({⟨[CH, {c, d}]; [*, {b}]; [*, {d}]⟩}) = {⟨[CH, {c, d}]⟩}.
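A possible reading of Definition 8 as code (an illustration under our assumptions, not the authors' implementation): with contiguous subsequences, the maximal subsequences of a projected sequence that contain no ⊥_E are exactly the maximal runs between occurrences of ⊥_E. The element projection `psi_E`, the bottom test `is_bottom` and the subsequence test `subseq` are assumed to be supplied by the concrete alphabet projection (e.g. Example 4 or Example 5).

```python
# Illustrative sketch of Definition 8 under the contiguous-subsequence reading.

def project_pattern(pattern, psi_E, is_bottom, subseq):
    """Project every element, cut the projected sequences at bottom elements,
    and keep only the maximal resulting pieces."""
    pieces = []
    for s in pattern:
        run = []
        for elem in [psi_E(e) for e in s] + [None]:   # None marks the end of the sequence
            if elem is None or is_bottom(elem):
                if run:                               # close the current bottom-free run
                    pieces.append(run)
                run = []
            else:
                run.append(elem)
    # keep only sequences that are maximal w.r.t. the subsequence relation
    return [p for p in pieces if not any(p != q and subseq(p, q) for q in pieces)]

# e.g. with the projection of Example 6 (removing procedure b), the pattern {ss2}
# should project to {<[CH,{c,d}]>}: the piece <[*,{d}]> is discarded because it is
# a subsequence of <[CH,{c,d}]>.
```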
Proposition 6. Given an alphabet (E, ⊓_E), a projection ψ of the alphabet and a sequential pattern structure (G, (D, ⊓), δ), the procedure given by Definition 8 is monotone, contractive and idempotent.

Proof. The procedure is idempotent, since the projection of the alphabet is idempotent. It is contractive because, for a pattern d and any sequence s ∈ d, the projection of the sequence, s̃ = ψ(s), is a subsequence of s. In Definition 8 the projected sequences are substituted by their maximal subsequences in order to avoid ⊥_E, building the sets {s̃_i}. Thus, s is a supersequence of every s̃_i, and so the projected pattern d̃ = ψ(d) is subsumed by the pattern d. Finally, we should show monotonicity. Given two patterns x, y ∈ D such that x ⊑ y, i.e. ∀s^x ∈ x, ∃s^y ∈ y : s^x ≤ s^y, consider the projected sequence ψ(s^x). As s^x ≤ s^y for some s^y, then for some j_1 < j_2 < ... < j_{|s^x|} (see Definition 5) we have s^x_i ⊑_E s^y_{j_i} (i ∈ {1, 2, ..., |s^x|}), and hence ψ(s^x_i) ⊑_E ψ(s^y_{j_i}) (by the monotonicity of the alphabet projection), i.e. projected sequences preserve the subsequence relation. Thus, the alphabet projection of a pattern preserves the pattern subsumption relation, ψ(x) ⊑ ψ(y) (Proposition 1), i.e. the alphabet projection is monotone.

4 Sequential Pattern Structure Evaluation

4.1 Implementation

Nearly all state-of-the-art FCA algorithms can be adapted to process pattern structures. We adapted the AddIntent algorithm [14], as the lattice structure is important for us to calculate stability (see the algorithm for calculating stability in [15]). To compute the semilattice operations (⊓, ⊑) between two sets of sequences S = {s_1, ..., s_n} and T = {t_1, ..., t_m}, S ⊓ T is calculated according to Section 2.3, i.e. as the maximal sequences among all maximal common subsequences of any pair s_i and t_j. To find all common subsequences of two sequences, the following observation is useful: if ss = ⟨ss_1; ...; ss_l⟩ is a subsequence of s = ⟨s_1; ...; s_n⟩ with j^s_i = k^s + i (Definition 5; k^s is the index offset at which ss matches s) and a subsequence of t = ⟨t_1; ...; t_m⟩ with j^t_i = k^t + i (likewise), then for any index i ∈ {1, 2, ..., l}, ss_i ⊑_E (s_{j^s_i} ⊓ t_{j^t_i}). Thus, to find the maximal common subsequences of s and t, we first align s and t in all possible ways, and then for every alignment we compute the resulting intersection and keep only the maximal results.
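The alignment scheme just described can be sketched as follows (illustrative Python, not the authors' C++ implementation); `meet_E`, `is_bottom` and `subseq` are assumed helpers such as those sketched in the earlier snippets.

```python
# Illustrative sketch of the alignment-based computation of maximal common
# subsequences described in Section 4.1.

def common_subsequences(s, t, meet_E, is_bottom, subseq):
    pieces = []
    for shift in range(-(len(t) - 1), len(s)):        # every possible alignment of t against s
        run = []
        for i in range(len(s)):
            j = i - shift
            if 0 <= j < len(t):
                e = meet_E(s[i], t[j])
                if is_bottom(e):                      # a bottom element breaks the contiguous run
                    if run:
                        pieces.append(run)
                    run = []
                else:
                    run.append(e)
        if run:
            pieces.append(run)
    # keep only the maximal common subsequences
    return [p for p in pieces if not any(p != q and subseq(p, q) for q in pieces)]

# With the helpers of the earlier sketches, common_subsequences(p2, p3, ...) should
# yield exactly {ss6, ss7, ss8}, matching the intersection of p2 and p3 in Section 2.3.
```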
4.2 Experiments and Discussion

The experiments are carried out on an "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz" computer with 8 GB of memory under the Ubuntu 12.04 operating system. The algorithms are not parallelized and are coded in C++.

The dataset considered here comes from the French healthcare system [16]. Each element of a sequence has a "complex" nature. The dataset contains 2400 patients suffering from cancer. Every patient is described as a sequence of hospitalizations without any timestamps. A hospitalization is a tuple with three fields: (i) the healthcare institution (e.g. the University Hospital of Paris, CHU_Paris), (ii) the reason for the hospitalization (e.g. a cancer disease), and (iii) the set of medical procedures that the patient underwent. An example of a medical trajectory of a patient is

  ⟨[CHU_Paris, Cancer, {P1, P2}]; [CH_Nancy, Chemo, {}]; [CH_Nancy, Chemo, {}]⟩.

This sequence represents a patient trajectory with three hospitalizations. It expresses that the patient was first admitted to the University Hospital of Paris (CHU_Paris) with cancer as the reason and underwent procedures P1 and P2. Then he had two consecutive hospitalizations in the Central Hospital of Nancy (CH_Nancy) in order to undergo chemotherapy, with no additional procedures.

For this dataset the computation of the whole lattice is infeasible. However, a medical expert is not interested in all possible patterns, but rather in patterns which answer his analysis question(s). First of all, patterns of length 1 are unlikely to be of interest for him. Thus, we use the MLP projection with length 2 or 3, taking into account the small average length of the sequences in the dataset.

To search for patterns containing only information about reasons and medical procedures, we project every alphabet element onto the element with the same reason and the same set of procedures, but substitute the hospitalization institution by the most general element of the corresponding taxonomy. Moreover, we do not allow the reason to be empty, i.e. all such elements are projected onto ⊥_E. In this case the computation takes 18 seconds, producing a lattice with around 34,700 concepts. One of the stable concepts has the intent ⟨[Cancer, {App.}]; [Ch.Prep, {}]; [Chemo, {}]⟩, specifying that a cancer was found during an appendix removal surgery, followed by a chemotherapy. This pattern highlights the documented fact that acute appendicitis has been shown to occur antecedent to cancer within three years, because of a carcinoma in the colon or rectum [17].

To find patterns revealing dependencies between hospitals and reasons, all procedures are removed from each alphabet element, and elements with the most general hospital and/or the most general reason are projected onto ⊥_E. The computation of the corresponding lattice takes 10 seconds, producing around 4200 concepts. ⟨[Region Lorraine, Cancer]; [Clinic in Lorraine, Chemo]⟩ is among the stable concepts, and it is rather interesting, because the patients had their cancer detected somewhere in the region but then went to exactly the same clinic for chemotherapy. It suggests that the department may lack clinics for chemotherapy, or that the quality of this clinic is high.

Conclusion

In this paper, we present sequential pattern structures, an original specification of pattern structures able to deal with complex sequences. Projections of sequential pattern structures allow us to efficiently build concept lattices by taking expert requirements into account. To be able to introduce interesting projections, their classical definition is extended. This extension allows us to introduce special projections for sequential pattern structures. The introduced projections are efficiently used for the analysis of a dataset on care trajectories. There are two main directions for future work. First, a study of the properties of generalized projections within the overall framework of FCA should be carried out.
Second, projections of sequential pattern structures can be analyzed more deeply, in order to produce even more interesting and readable patterns.

Acknowledgements: this research received funding from the Basic Research Program at the National Research University Higher School of Economics (Russia) and from the BioIntelligence project (France).

References

1. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C.: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In: 17th International Conference on Data Engineering. (2001) 215–226
2. Yan, X., Han, J., Afshar, R.: CloSpan: Mining Closed Sequential Patterns in Large Databases. In: Proc. of SIAM Int'l Conf. on Data Mining (SDM'03). (2003) 166–177
3. Raïssi, C., Calders, T., Poncelet, P.: Mining conjunctive sequential patterns. Data Min. Knowl. Discov. 17(1) (2008) 77–93
4. Ding, B., Lo, D., Han, J., Khoo, S.C.: Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database. In: Proc. of the IEEE 25th International Conference on Data Engineering, IEEE (March 2009) 1024–1035
5. Plantevit, M., Laurent, A., Laurent, D., Teisseire, M., Choong, Y.W.: Mining multidimensional and multilevel sequential patterns. ACM Transactions on Knowledge Discovery from Data 4(1) (January 2010) 1–37
6. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. 1st edn. Springer, Secaucus, NJ, USA (1997)
7. Ganter, B., Kuznetsov, S.O.: Pattern Structures and Their Projections. In Delugach, H., Stumme, G., eds.: Conceptual Structures: Broadening the Base. Volume 2120 of LNCS. Springer Berlin Heidelberg (2001) 129–142
8. Casas-Garriga, G.: Summarizing Sequential Data with Closed Partial Orders. In: Proc. of the 5th SIAM Int'l Conf. on Data Mining (SDM'05). (2005)
9. Ferré, S.: The Efficient Computation of Complete and Concise Substring Scales with Suffix Trees. In Kuznetsov, S.O., Schmidt, S., eds.: Formal Concept Analysis. Volume 4390 of LNCS. Springer (2007) 98–113
10. Klimushkin, M., Obiedkov, S.A., Roth, C.: Approaches to the Selection of Relevant Concepts in the Case of Noisy Data. In: Proc. of the 8th International Conference on Formal Concept Analysis. ICFCA'10, Springer (2010) 255–266
11. Kuznetsov, S.O.: On stability of a formal concept. Annals of Mathematics and Artificial Intelligence 49(1-4) (2007) 101–115
12. Adda, M., Valtchev, P., Missaoui, R., Djeraba, C.: A framework for mining meaningful usage patterns within a semantically enhanced web portal. In: Proceedings of the 3rd C* Conference on Computer Science and Software Engineering. C3S2E '10, New York, NY, USA, ACM (2010) 138–147
13. Kuznetsov, S.O.: Learning of Simple Conceptual Graphs from Positive and Negative Examples. In Żytkow, J., Rauch, J., eds.: Principles of Data Mining and Knowledge Discovery. Volume 1704 of LNCS. Springer Berlin Heidelberg (1999) 384–391
14. van der Merwe, D., Obiedkov, S., Kourie, D.: AddIntent: A new incremental algorithm for constructing concept lattices. In Eklund, P., ed.: Concept Lattices. Volume 2961 of LNCS. Springer (2004) 372–385
15. Roth, C., Obiedkov, S., Kourie, D.G.: On succinct representation of knowledge community taxonomies with formal concept analysis. International Journal of Foundations of Computer Science 19(2) (April 2008) 383–404
16. Fetter, R.B., Shin, Y., Freeman, J.L., Averill, R.F., Thompson, J.D.: Case mix definition by diagnosis-related groups. Med Care 18(2) (February 1980) 1–53
17. Arnbjörnsson, E.: Acute appendicitis as a sign of a colorectal carcinoma. Journal of Surgical Oncology 20(1) (May 1982) 17–20