<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sequence mining using FCA and the NextPriorityConcept algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salah Eddine Boukhetta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christophe Demko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J´er´emy Richard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karell Bertet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory L3i, La Rochelle University</institution>
          ,
          <addr-line>La Rochelle</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>209</fpage>
      <lpage>222</lpage>
      <abstract>
        <p>In this paper, we are interested in sequence mining using GALACTIC, a new library based on Formal Concept Analysis (FCA) for generating a concept lattice from heterogeneous and complex data. Inspired from pattern structure theory, data in Galactic are described by predicates according to their types and a system of plugins which allows an easy integration of new characteristics, descriptions and strategies. We present new plugins to describe and analyze a sequential dataset with predicates: the KCS description (where a set of sequences are described by the set of K-Common Subsequences), the MCS description (where Maximal Common Subsequences are used to describe a sequential dataset) and the PCS description (where the Prefix Common Subsequence is used). Experimentation on both real and synthetic datasets shows the effectiveness of our plugins in terms of the number of generated predicates and concepts.</p>
      </abstract>
      <kwd-group>
        <kwd>Formal Concept Analysis</kwd>
        <kwd>Lattice</kwd>
        <kwd>Pattern structure</kwd>
        <kwd>Sequences</kwd>
        <kwd>Sequence mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Sequences appear in many areas: sequences of words in a text, trajectories,
sequences of life stage, sequences of surfing on the internet, or sequences of buying
products in a supermarket. A sequence is an ordered list of symbols, sets or
events. Sequence mining is a topic of data mining which aims at finding patterns
in a database of sequences that appears more frequently, where patterns can be
subsequences, prefixes, suffixes, subsequences according to a sliding window [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        The first generation of algorithms emerged in the 90s with the article of
Agrawal and Srikant [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that extends the well-known Apriori algorithm to
sequence mining. The Apriori algorithm can be used to discover frequent
subsequences such as h{M ilk, Cof f ee}, {Sugar}i in a customer transaction database,
indicating that {Milk, Coffee}, followed by {Sugar}, are frequently purchased
together by most customers. Many algorithms were proposed to improve the time
computation of Apriori algorithm, and could be categorized in three classes
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]: horizontal formatting methods (GSP [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], PSP [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]), vertical formatting
methods (Spade [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], Spam [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and projection-based methods (FreeSpan [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
      </p>
      <p>
        PrefixSpan [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]). All these algorithms take as input a database of sequences and
a minimum support threshold, and generate the set of frequent subsequences.
Algorithms with horizontal formatting databases require multiple scans of the
database, while using vertical formatting needs at most three database scans for
the construction of the ID-list tables, which allows a fast calculation of the
support [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. Projection-based methods use projected databases by means of many
types of extensions [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>For big databases and a short minimum support, these algorithms take
a huge computing time, and generate a too large number of patterns that
are difficult to interpret and contain redundancy. For example, subsequences
ha, b, ci, ha, bi, ha, ci, hb, ci, h i</p>
      <p>a , hbi, and hci, all with the same support, are
redondant while ha, b, ci clearly represents all the others.</p>
      <p>
        A second generation of algorithms focuses on maximal patterns also denoted
closed patterns because they verify a well-known property of closure. Many
algorithms directly address this problem (CloSpan [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], ClaSP [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and BIDE [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]).
Some algorithms also appear within
      </p>
      <p>
        Formal Concept Analysis (FCA) frameworks [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and their extensions to
pattern structures[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], where the lattice property of closed patterns is promoted.
We can mention here an article for analysing titles of papers using suffix tree
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], mining medical care trajectories using pattern structures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], sequence
mining to discover rare patterns [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Also other studies on demographic sequences
with FCA and pattern structures, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Formal Concept Analysis (FCA) appears in 1982 [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], then in the Ganter
and Wille’s 1999 work [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. It is based on mathematical ordered sets theory,
and the theory of lattices introduced by Birkhoff in 1940 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Starting from a
binary relation between a set of objects and a set of attributes, formal concepts
are built as maximal sets of objects in relation with maximal sets of attributes.
The whole set of concepts equipped with a specialization/generalization relation
forms a partially ordered set with the lattice property called the concept lattice.
This lattice represents the initial data where concepts are clusters of similar
objects, and the concept lattice is a hierarchy of clusters.
      </p>
      <p>
        FCA is issued from a branch of applied lattice theory that first appeared in
the Barbut and Monjardet’s 1970 work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. By means of derivation operators
between objects and attributes forming a Galois connection and whose
composition is a closure operator [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], nice bijections between (reduced) binary data,
lattices and bases of rules have been stated, providing efficient methods of data
analysis based on these canonical structures, thus with few tuning. The lattice
property guarantees both a hierarchy of clusters, and a complete and consistent
navigation structure for interactive approaches [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        The formalism of pattern structures [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and abstract conceptual
navigation [
        <xref ref-type="bibr" rid="ref11 ref13">11,13</xref>
        ] extend FCA to deal with non binary data, where data is described
by patterns such that the patterns space must be organised as a semi-lattice in
order to maintain a Galois connection between objects and their descriptions. By
FCA framework, pattern lattice and bases of rules are defined, where a concept
is composed of a subset of objects together with their common patterns, and
a rule possesses patterns in premises and conclusions. However, pattern lattices
are huge, often untractable [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and the need for approaches to drive the search
towards the most interesting patterns is a current challenge.
      </p>
      <p>
        Inspired from pattern structures, the NextPriorityConcept algorithm,
introduced in a recent article [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], proposes a user-driven patterns mining
approach for heterogeneous and complex data as input. This algorithm allows a
generic pattern computation through specific descriptions of objects by
predicates. It also proposes to reduce predecessors of a concept by refinement of a set
of objects into a fewer one through specific user exploration strategies, thus a
reduction of the number of generated patterns.
      </p>
      <p>In this article, we propose to specialize this NextPriorityConcept
algorithm for new sequence mining approaches with three descriptions and two
strategies dedicated to sequences. As descriptions, we propose the classical
maximal common subsequences (MCS), the maximal prefix (PCS), and the k-subsequences
(KCS). Therefore a set of sequences can be mined by one of these three
descriptions in a generic way. We also propose two strategies of pattern exploration in
order to reduce a given set of sequences into their predecessors in the pattern
lattice. The first strategy (simple) considers all the possible subsets of sequences
as in classical approaches, while the second one (AAS) proposes a refinement of
a set of sequences by adding only one item to their common description.</p>
      <p>Section 2 introduces a short description of the NextPriorityConcept
algorithm and basic definitions related to sequence mining. Section 3 will be
dedicated to our new sequence descriptions and strategies. Experimental results are
presented in Section 4.
2
2.1</p>
      <p>Preliminaries</p>
    </sec>
    <sec id="sec-2">
      <title>Description of the NextPriorityConcept algorithm</title>
      <p>
        The NextPriorityConcept algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] computes concepts for
heterogeneous and complex data G as inputand its main characteristics are:
Heterogeneous data as input, described by specific predicates. The
algorithm introduces the notion of description δ as an application to provide
predicates describing a set of objects A ⊆ G according to their
characteristics, each predicate being specific to one type of characteristic. Predicates
describing the objects A are locally computed by a specific treatment for
each type of characteristics, depending on whether it is digital, discrete or
more complex, and the final description δ is the union of these predicates.
Each concept (A, δ(A)) is then composed of a subset of objects A together
with a set of predicates δ(A) describing them. Such generic use of
predicates makes it possible to consider heterogeneous data as input, i.e. digital,
discrete or more complex data. As in pattern structutres, the descriptions
must verify δ(A) v δ(A0) for A0 ⊆ A in order to obtain a lattice. However,
unlike classical pattern structures, predicates are not globally computed in
a preprocessing step, but locally for each concept.
      </p>
      <p>
        Concept lattice generation. NextPriorityConcept algorithm is inspired
from Bordat’s algorithm[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], also found in Linding’s work [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], that
recursively computes the immediates successors of a concept, starting with the
bottom concept. It is a dual version that computes the immediate
predecessors of a concept, starting with the top concept (G, δ(G)) containing the
whole set of objects, until no more concepts can be generated. The use of a
priority queue ensures that each concept is generated before its predecessors,
and a mechanism of propagation of constraints ensures that meets will be
computed. NextPriorityConcept computes a concept lattice and
therefore is positioned in FCA framework, with the possibility of extraction of
rules, closure computations or navigation in the lattice, that can be useful
in many fields of pattern mining and discovery.
      </p>
      <p>Predecessors selection by specific strategies. The algorithm also introduces
the notion of strategy σ to provide predicates (called selectors) generating
the predecessors of a concept (A, δ(A)) according to each characterisitc. A
selector proposes a way to refine the description δ(A) to a reduced set A0 ⊆ A
of objects. Selectors can be defined either by a restriction on the description
δ(A) or on the set A. Several strategies are possible to generate predecessors
of a concept, going from the naive strategy classically used in FCA that
considers all the possible predecessors, to strategies reducing the number of
predecessors in order to obtain smaller lattices. Let us observe that
selectors are only used for the predecessors generation, they are not kept either
in the description or in the final set of predicates. Therefore, choosing or
testing several strategies at each iteration in a user-driven pattern discovery
approach would be interesting. It is also possible to introduce a filter (or
meta-strategy) on the selector as for example maximal support (number of
objects) of a concept, or entropy (according to a class information).
Galactic. The use of predicates regardless of the data allows a generic
implementation of our algorithm which, mixed with a system of plugins, makes it
possible easy integration of new kinds of data (description and strategies).
Galactic1 (GAlois LAttices, Concept Theory, Implicational systems and
Closures) is a development platform of our algorithm allowing easy
integration of new plugins.</p>
      <p>
        The main result in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] states that the NextPriorityConcept algorithm
computes the formal context hG, P, IP i and its concept lattice (where P is the
set of predicates describing the characteristics, and IP = {(a, p), a ∈ G, p ∈ P :
p(a)} is the relation between objects and predicates) if the description δ verifies
δ(A) v δ(A0) for A0 ⊆ A. The run-time of NextPriorityConcept algorithm
has a complexity O(|B| |G| |P |2 (cσ + cδ)) (where B is the number of concepts,
cσ is the cost of the strategy and cδ is the cost of the description), and a space
memory in O(w |P |2) (where w is the width of the concept lattice).
      </p>
      <sec id="sec-2-1">
        <title>1 https://galactic.univ-lr.fr</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Sequences</title>
      <p>A sequence s = hs1, s2, . . . sni is a list of ordered events belonging to an alphabet
Σ of event types. The length |s| of a sequence s is the number of its elements,
here |s| = n.</p>
      <p>Let us denote G a database of sequences.</p>
      <p>Example 1 in Table 1 represents a database G which contains 4 sequences
with the alphabet Σ = {A, C, G, T }.</p>
      <p>A sequence a = ha1a2 . . . ani is a subsequence of a sequence b = hb1b2 . . . bmi,
and we write2 a vs b if there are integers 1 ≤ i1 &lt; i2 &lt; . . . &lt; in ≤ m such that
aj = bij with j ≤ n. It is also said that b is a super-sequence of a, or b contains
a. From the example in Table 1, we have 3 vs 4.</p>
      <p>A sequence p = hp1, p2 . . . pmi, is a prefix of a sequence s = hs1, s2, . . . sni if
m &lt; n and pi = si for any i ≤ m. For example, hA, Gi is a prefix of sequence 1.
A k-subsequence is a subsequence of length k. For a subset A ⊆ G of sequences,
s is a common subsequence of A if, ∀x ∈ A, we have s vs x.</p>
      <p>The support of a subsequence s is support(s) = |{x ∈ G : s vs x}|.
A closed subsequence c is a subsequence that has no super-sequences with the
same support, i.e if ∃c0 ∈ Σ∗ such that support(c) = support(c0) then c0 @s c or
c0 = c.</p>
      <p>1 hA, G, T i
2 hG, C, Gi
3 hT, C, Ai
4 hT, G, C, Ai
NextPriorityConcept is a generic algorithm that considers any characteristic
as input, that must be supplied with a description δ and a strategy σ. In order to
mine sequences with this algorithm, we have to define descriptions and strategies
for a set G of input sequences defined on an alphabet Σ.</p>
      <p>A description δ is a mapping δ : 2G → 2P which defines a set of predicates
δ(A) describing any subset A ⊆ G of sequences.</p>
      <p>A strategy σ is an mapping σ : 2G → 2P which defines a set of selectors
σ(A) to select strict subset A0 of A corresponding to the predecessors of any
concept (A, δ(A)).
2 The symbol vs denotes a subsequence, to distinguish with the symbol v classicaly
used for the subsumption</p>
      <p>For sequences, predicates are based on subsequences, and are of the form ”a
matches x” meaning x vs a. For better readability, the sets δ(A) and σ(A) will
be treated either as sets of predicates or as sets of subsequences in the rest of this
document, they can reciprocally be deduced from each other with a predicate
on the form ”matches x” for each subsequence x. These predicates could be
specialized according to the type of subsequences given by the description. For
example, for a description that computes prefixes, predicates are ”starts with”.
3.1</p>
    </sec>
    <sec id="sec-4">
      <title>Descriptions for sequences</title>
      <p>We define three descriptions for a set A of sequences. The first one generates the
classical maximal common subsequences of A. The second one is the restriction
to prefix sub-sequences. The third one offers the possibility to limit the size of
subsequences according to a given length:</p>
    </sec>
    <sec id="sec-5">
      <title>The Maximal Common Subsequences (MCS) description is defined, for</title>
      <p>a subset A ⊆ G of sequences by:</p>
      <p>
        δmcs(A) = {x ∈ Σ∗ : ∀a ∈ A, x vs a and 6 ∃x0 ∈ δmcs(A) : x vs x0} (1)
For this description we use the BIDE algorithm [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] to generate all maximal
common subsequences, with a time complexity cδmcs = O(2|A|) ≤ O(2|G|)
      </p>
    </sec>
    <sec id="sec-6">
      <title>The Prefix Common Subsequence (PCS) description is defined for a sub</title>
      <p>set A ⊆ G of sequences by:
δpcs(A) = {x ∈ Σ∗ : ∀a ∈ A, x prefix of a and 6 ∃x0 ∈ δpcs(A) : x prefix of x’}
(2)
The complexity of this description is cδpcs = O(|G|.|Σ|.lmax), where lmax is
the length of the maximal sequence in A.</p>
    </sec>
    <sec id="sec-7">
      <title>The K-Common Subsequence (KCS) description is defined for a subset</title>
      <p>A ⊆ G and a length k by:
δkcs(A, k) = {x ∈ Σ∗ : ∀a ∈ A, x vs a and |x| = k}
(3)
This description has the same time complexity of cδmcs because maximal
common subsequences have to be computed, thus cδkcs = O(2|A|) ≤ O(2|G|).</p>
      <p>Some examples of descriptions for the sequences in Table 1 are given in
Table 2.</p>
      <p>To ensure that NextPriorityConcept will generate a concept lattice,
these three descriptions must verify δ(A) v δ(A0) for A0 ⊆ A where v is the
subsumption relation between sets of subsequences defined by:</p>
      <p>δ(A) v δ(A0) ⇐⇒ ∀x ∈ δ(A), ∃x0 ∈ δ(A0) such that x vs x0
Proposition 1 For A0 ⊆ A ⊆ G and a length k, we have the three following
properties:
1. δmcs(A) v δmcs(A0)
2. δpcs(A) v δpcs(A0)
3. δkcs(A, k) v δkcs(A0, k)
Proof: Let A and A0 be two subsets of G such that A0 ⊆ A.
1. Let c ∈ δmcs(A), i.e. c is a maximal subsequence of A. By A0 ⊆ A we
can deduce that c is also a subsequence of sequences in A0, but c is not
necessarily a maximal subsequence for A0. If c is a maximal subsequence in
A0 then c ∈ δmcs(A0). Otherwise, there exists c0 ∈ δmcs(A0) such that c is a
subsequence of c0. In these two cases, we can deduce that, δmcs(A) v δmcs(A0)
2. The proof is similar for the prefix common subsequence description δpcs.
3. For δkcs, the subsumption relation can be refined by the inclusion relation
because it is obvious that k-subsequences of a set A of sequences also are
k-subsequences of any subset A0 of A. Then δkcs(A) ⊆ δkcs(A0) and therefore
δkcs(A) v δkcs(A0).
3.2</p>
    </sec>
    <sec id="sec-8">
      <title>Strategies and selectors for sequences</title>
      <p>NextPriorityConcept refines each concept (A, δ(A)) into predecessors with
fewer sequences and more specific descriptions using an input strategy that gives
a means of reducing the set of objects from the concept to its predecessors. In
this section, we introduce two strategies for sequences. The first one considers all
the possible subsets of A, whereas the second one consists in adding one element
of the alphabet to the current common subsequences of the description.
The Simple strategy σsimple is defined for a subset A of sequences and a
description δ by:
σSimple(A, δ) = {x : x ∈ δ(A0), A0 = A\{a}, for all a ∈ A}
(4)
The cost for this strategy depends on the cost of the description δ : cσSimple =
O(cδ.|G|).</p>
    </sec>
    <sec id="sec-9">
      <title>The Augmented Alphabet Subsequence (AAS) strategy σaas is defined</title>
      <p>for a subset A and a descirption δ of sequences by:
σaas(A, δ) = {x + a : x ∈ δ(A), a ∈ Σ}
(5)
Note that the x + a notation means the concatenation of the item a to the
subsequence x. The cost for this strategy is caσas = O(|δ(A)|.|Σ|)</p>
      <p>Table 3 gives some generated predicates for A = {2, 4} ( concept 5 in Figure 1,
concept 2 in Figure 3) and with these two strategies. For our example in Table 2,
those which generated predecessors are mentioned by adding the $ symbol with
the concept number, from Figures 1 and 3.</p>
      <p>Figure 2 displays the concept lattice generated with the AAS strategy and the
PCS description. Figure 1 displays the concept lattice generated with the Simple
strategy and the MCS description. Figures 3 and 4 display the two concept
lattices generated with the simple strategy and the KCS description with resp.
k = 2 and k = 3.</p>
      <p>The concept $1 in Figure 2 shows that sequences 3 and 4 both start with the
prefix T . We can observe that we obtain fewer concepts with k = 3 than with
k = 2. In Figure 3, concept $2 contains the subsequence hG, Ci which is the
only 2-common subsequence of sequences 2 and 4.</p>
      <p>Clearly the predicates in the concept lattice in Figure 1 corresponds to all the
closed subsequences since the MCS description generates maximal subsequences,
and the simple strategy considers all the possible subsets A of sequences. We can
observe that the other lattices contain fewer concepts.</p>
      <p>Fig. 2. Hasse diagram of the reduced concept lattice
Fig. 1. Hasse diagram of the reduced con- for the AAS strategy and the PCS description
cept lattice for the Simple strategy and
the MCS description
To experimentally evaluate the effectiveness of our descriptions and strategies, we
use Galactic3 (GAlois LAttices, Concept Theory, Implicational systems and
Closures), a development platform of the NextPriorityConcept algorithm
which, mixed with a system of plugins, makes it possible easy integration of new
kinds of data (description and strategies). We have implemented new plugins for
the sequences: MCS Match, PCS Match and KCS Match as description plugins, and
Simple Match and AAS Match as strategies plugins. We consider the following
synthetic and real datasets that are summarized in Table 4:
3 groups of synthetic datasets. They are generated by the IBM Quest Dataset
Generator software hosted in the SPMF Website4. These datasets are
generated by varying three parameters; the dataset size |G|, the alphabet size
|A|, and the sequences size | |</p>
      <p>S . Generated datasets are named as follows,
G100A30S7 for the dataset where |G| = 100, | | = 30 and |S| = 7. In
A
total, 10 datasets were generated : G200A30S7, G100A30S7, G75A30S7,
G50A30S7, G25A30S7, G100A20S7, G100A10S7, G100A30S5, G100A30S6,
G100A30S8
2 real datasets. The Daily-actions dataset is a small database of sequences
that represents daily actions of 25 individuals of L3i laboratory5, where
daily actions can be [W akeup, Breakf ast, W ork, Dinner]. The Wine city
dataset is issued from the museum data ”La cit´e du vin” in Bordeaux,
France6, gathered from the visits on a period of one year (May 2016 to</p>
      <sec id="sec-9-1">
        <title>3 https://galactic.univ-lr.fr</title>
        <p>4 http://www.philippe-fournier-viger.com/spmf/
5 https://l3i.univ-larochelle.fr/
6 https://www.laciteduvin.com/en</p>
        <p>May 2017). The museum is a large ”open-space”, where visitors are free to
explore the museum the way they want, without predetermined path. When
they arrive at the museum, they receive a small personal device to detect
whenever a visitor is close to an animation spot called a module. The
museum contains 20 modules, so we can say that extracted sequences for this
data is quite short. By extracting the sequences of activation of each module,
we end up with a precise enough idea of what the visit looked like for each
visitor of the museum.</p>
        <p>datasets dataset size alphabet size average sequence size
Group 1 {25, 50, 75, 100, 200} 30 7
Group 2 100 {10, 20, 30} 7
Group 3 100 30 {5, 6, 7, 8}
Daily-actions 25 12 8
Wine city {25, 50, 75, 100, 200} 30</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Closed subsequences with the MCS description</title>
      <p>
        Table 5 shows the number of predicates and concepts generated by δmcs/σsimple
and δmcs/σaas for real and synthetic datasets, and the number of closed
subsequences generated by Clospan[
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
      </p>
      <p>Using δmcs/σsimple, we clearly generate all the closed subsequences since the
MCS description generate closed subsequences, and the simple strategy
considers all the possible subsets of sequences. We can indeed observe that the number
of predicates is the same as the number of closed subsequences with Clospan.
The number of concepts is greater than the number of predicates/closed
subsequences because many concepts are generated from their successors without new
predicates.</p>
      <p>We can also observe that the AAS strategy generates less closed subsequences
since only the right augmented subsequences are considered. But this difference
is only observed with more than 100 sequences in G and more than 20 elements
in A, meaning that the AAS strategy is close to the simple strategy, but with
a better time complexity. This illustrates the possibility of reducing the closed
subsequences generated according to a specific strategy.</p>
    </sec>
    <sec id="sec-11">
      <title>KCS description and the Daily-actions dataset</title>
      <p>Figure 5 and Table 6 show a comparison between MCS and KCS descriptions
for synthetic and the Daily-actions datasets.</p>
      <p>Clearly, the KCS description does not generate closed subsequences because
we set the size of the subsequences that are not maximal. Therefore we obtain
more predicates, but fewer concepts. Sequences are likely to share more
subsequences of small length, but when we increase the length, concepts sharing
the same subsequences became fewer. We can also observe that we get fewer
concepts as the k parameter increases. The KCS description is clearly adapted
to the Daily-actions dataset because small subsequences make more sense to
analyze daily actions than the maximal / closed subsequences. The possibility
to fix k allows to avoid the generation of all possible subsequences.
prefixes since we were focused on the flow of visitors at the entrance of the
museum, that is which paths were visited first. Here we can observe that we
obtain fewer concepts and predicates with the PCS description compared to the
MCS description that generates a very large number of closed subsequences.
In this paper, we present a sequence mining approach using the
NextPriorityConcept algorithm that generated pattern concepts in a user-driven patterns
mining approach for heterogeneous and complex data as input. This algorithm
allows a generic patterns computation through specific descriptions and
strategies by predicates.</p>
      <p>We present three descriptions and two strategies dedicated to the analysis
of sequential data. As descriptions we propose the classical maximal common
subsequences (MCS), the maximal prefix (PCS), or the k-subsequences (KCS).
Therefore a set of sequences can be mined by one of these three descriptions
in a generic way. We also propose two strategies of pattern exploration. The
first strategy (simple) considers all the possible subsets of sequences as in
classical approaches, while the second one (AAS) proposes a refinement of a set of
sequences by adding only one item to their common description.</p>
      <p>Sequence analysis of real world data may require a specific way of describing
the sequences, and a specific strategy of exploration of the predecessors, and the
NextPriorityConcept algorithm could be very useful since it allows several
descriptions and strategies in a generic way. For example the prefix match
description was triggered by the need to know which spaces are visited first in a
museum, which common paths are chosen by visitors, or which spaces may be
forgotten within the visit.</p>
      <p>For future work, we are interested in finding new ways of describing sequences
to make it more specific to the application, one of which is the use of distances
between elements in the sequence, or specify a given window parameter to deal
with long sequences.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikant</surname>
          </string-name>
          , R.:
          <article-title>Mining sequential patterns</article-title>
          .
          <source>In: Data Engineering</source>
          ,
          <year>1995</year>
          . Proceedings of the Eleventh International Conference on. pp.
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          . IEEE (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ayres</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flannick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yiu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Sequential pattern mining using a bitmap representation</article-title>
          .
          <source>In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>429</fpage>
          -
          <lpage>435</lpage>
          . ACM (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barbut</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monjardet</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Ordres et classifications : Alg`ebre et combinatoire</article-title>
          . Hachette, Paris (
          <year>1970</year>
          ), 2 tomes
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bertet</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demko</surname>
          </string-name>
          , Ch.,
          <string-name>
            <surname>Viaud</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          , Gu´erin, C.:
          <article-title>Lattices, closure systems and implication bases: A survey of structural aspects and algorithms</article-title>
          .
          <source>Theoretical Computer Science</source>
          <volume>743</volume>
          ,
          <fpage>93</fpage>
          -
          <lpage>109</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Birkhoff</surname>
          </string-name>
          , G.:
          <article-title>Lattice theory</article-title>
          , vol.
          <volume>25</volume>
          . American Mathematical Soc. (
          <year>1940</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bordat</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Calcul pratique du treillis de Galois d'une correspondance</article-title>
          .
          <source>Math´ematiques et Sciences humaines 96</source>
          ,
          <fpage>31</fpage>
          -
          <lpage>47</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Egho</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jay</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Ra¨ıssi, C.:
          <article-title>Fca and pattern structures for mining care trajectories (</article-title>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Codocedo</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bosc</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boulicaut</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A proposition for sequence mining using pattern structures</article-title>
          .
          <source>In: International Conference on Formal Concept Analysis</source>
          . pp.
          <fpage>106</fpage>
          -
          <lpage>121</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Demko</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertet</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faucher</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viaud</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Next priority concept: A new and generic algorithm computing concepts from complex and heterogeneous data</article-title>
          .
          <source>arXiv preprint arXiv:1912</source>
          .
          <volume>11038</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Ferr´e,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>The efficient computation of complete and concise substring scales with suffix trees</article-title>
          .
          <source>In: International Conference on Formal Concept Analysis</source>
          . pp.
          <fpage>98</fpage>
          -
          <lpage>113</lpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Ferr´e, S.:
          <article-title>Reconciling Expressivity and Usability in Information Access - From Filesystems to the Semantic Web</article-title>
          . Habilitation, University of Rennes 1, France (november
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Ferr´e, S.:
          <article-title>Reconciling expressivity and usability in information access from file systems to the semantic web (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Ferr´e, S.:
          <article-title>Syst`emes d'information logiques : un paradigme logico-contextuel pour interroger, naviguer et apprendre</article-title>
          . Doctorat, University of Rennes 1, France (Oct
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>Pattern structures and their projections</article-title>
          .
          <source>In: LNCS of International Conference on Conceptual Structures (ICCS'01)</source>
          . pp.
          <fpage>129</fpage>
          -
          <lpage>142</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal Concept Analysis, Mathematical foundations</source>
          . Springer Verlag, Berlin (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Gizdatullin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baixeries</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitrofanova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muratova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Espy</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          :
          <article-title>Learning interpretable prefix-based patterns from demographic sequences</article-title>
          .
          <source>In: International Conference on Intelligent Data Processing: Theory and Applications</source>
          . pp.
          <fpage>74</fpage>
          -
          <lpage>91</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Gizdatullin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitrofanova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muratova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Classification of demographic sequences based on pattern structures and emerging patterns</article-title>
          .
          <source>In: Supplementary Proceedings of 14th International Conference on Formal Concept Analysis, ICFCA</source>
          . pp.
          <fpage>49</fpage>
          -
          <lpage>66</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Gomariz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goethals</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Clasp: An efficient algorithm for mining frequent closed sequences</article-title>
          .
          <source>In: Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>50</fpage>
          -
          <lpage>61</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          .,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mortazavi-Asl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dayal</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Freespan: frequent pattern-projected sequential pattern mining</article-title>
          .
          <source>In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>355</fpage>
          -
          <lpage>359</lpage>
          . ACM (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Contributions to Pattern Discovery. Habilitation, University of Lyon, France (february
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Codocedo</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baixeries</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Pattern structures and concept lattices for data mining and knowledge processing</article-title>
          .
          <source>In: In Proceedings of ECML-PKDDl</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Linding</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Fast concept analysis</article-title>
          . In:
          <article-title>Working with Conceptual StructuresContributions to ICC</article-title>
          . pp.
          <fpage>235</fpage>
          -
          <lpage>248</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verkamo</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          :
          <article-title>Discovery of frequent episodes in event sequences. Data mining and knowledge discovery 1(3</article-title>
          ),
          <fpage>259</fpage>
          -
          <lpage>289</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Masseglia</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cathala</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poncelet</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>The psp approach for mining sequential patterns</article-title>
          .
          <source>In: European Symposium on Principles of Data Mining and Knowledge Discovery</source>
          . pp.
          <fpage>176</fpage>
          -
          <lpage>184</lpage>
          . Springer (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J</given-names>
            ., Han, J
          </string-name>
          .,
          <string-name>
            <surname>Mortazavi-Asl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dayal</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          : Prefixspan:
          <article-title>Mining sequential patterns efficiently by prefix-projected pattern growth</article-title>
          . In: icccn. p.
          <fpage>0215</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J</given-names>
            ., Han, J
          </string-name>
          .,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Mining sequential patterns with constraints in large databases</article-title>
          .
          <source>In: Proceedings of the eleventh international conference on Information and knowledge management</source>
          . pp.
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Srikant</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
          </string-name>
          , R.:
          <article-title>Mining sequential patterns: Generalizations and performance improvements, edbt (</article-title>
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J</given-names>
            ., Han, J
          </string-name>
          .: Bide:
          <article-title>Efficient mining of frequent closed sequences</article-title>
          .
          <source>In: Proceedings. 20th international conference on data engineering</source>
          . pp.
          <fpage>79</fpage>
          -
          <lpage>90</lpage>
          . IEEE (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <article-title>Restructuring lattice theory : an approach based on hierarchies of concepts</article-title>
          .
          <source>Ordered</source>
          sets pp.
          <fpage>445</fpage>
          -
          <lpage>470</lpage>
          (
          <year>1982</year>
          ), i. Rival (ed.), Dordrecht-Boston, Reidel.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , Han,
          <string-name>
            <given-names>J</given-names>
            .,
            <surname>Afshar</surname>
          </string-name>
          , R.: Clospan: Mining:
          <article-title>Closed sequential patterns in large datasets</article-title>
          .
          <source>In: Proceedings of the 2003 SIAM international conference on data mining</source>
          . pp.
          <fpage>166</fpage>
          -
          <lpage>177</lpage>
          . SIAM (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.:</given-names>
          </string-name>
          <article-title>Spade: An efficient algorithm for mining frequent sequences</article-title>
          .
          <source>Machine learning 42(1-2)</source>
          ,
          <fpage>31</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>