Interval-based sequence mining using FCA and the NextPriorityConcept algorithm Salah Eddine Boukhetta, Jérémy Richard, Christophe Demko, and Karell Bertet Laboratory L3i, La Rochelle University, La Rochelle, France Abstract. In this paper, we are interested in sequential data analy- sis using GALACTIC, a new library based on Formal Concept Analysis (FCA) for calculating a concept lattice from heterogeneous and com- plex data. Inspired by the pattern structure theory, data in GALACTIC are described by predicates according to their types and a system of plu- gins allows an easy integration of new characteristics and new descrip- tions. We present new ways to analyse interval-based sequences, where items persist in time. Here we address the question of mining relevant sequential patterns, describing a set of sequences, by maximal common subsequences, or shortest supersequences. Experimentation on two real sequential datasets shows the effectiveness of our plugins in term of size of the lattice and of running time. Keywords: Formal concept analysis · Lattice · Pattern structures · Interval- based sequences · Maximal common subsequences · Shortest common superse- quences 1 Introduction Sequences appear in many areas: sequences of words in a text, trajectories, surf- ing on the internet, or buying products in a supermarket. A sequence is a suc- cession hxi i of symbols, sets or events. Sequence mining is a topic of data mining which aims at finding frequent patterns in a dataset of sequences. Many algo- rithms have been proposed for mining sequential patterns, such as GSP [25], PrefixSpan [24], CloSpan [29], etc. These algorithms take as input a dataset of sequences and a minimum support threshold, and generate all frequent subse- quences. Some algorithms mine time-point sequences h(ti , xi )i, where an item xi occurred at a timestamp ti , for example for discovering episodes in a long time- point sequence [23, 26]. In real world applications, events may persist in time, or in an interval of time (ti , ti ), we call these sequences, interval-based sequences h(ti , ti , Xi )i, where Xi is an itemset. They are mostly analysed using Allen’s in- terval relations [1]. To quote from Kam and Fu’s work on discovering temporal interval sequences [18], the patterns discovered are of type ”event A’s occurrence time overlaps with that of event B and both of these events occur before event C appears”. Other works also used Allen’s relations to discover interval based patterns [17, 28]. Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Formal Concept Analysis (FCA) appears in 1982 [27], then in the Ganter and Wille’s 1999 work [14], it is issued from a branch of applied lattice theory that first appeared in the book of Barbut and Monjardet in 1970 [2]. The lattice property guarantees both a hierarchy of clusters, and a complete and consistent navigation structure for interactive approaches [11]. The formalism of pattern structures [13, 20] and abstract conceptual navigation [10, 9] extend FCA to deal with non-binary data, where data is described by patterns such that the pat- tern space must be organised as a semi-lattice in order to maintain a Galois connection between objects and their descriptions. By FCA framework, pattern lattice and bases of rules are defined, where a concept is composed of a subset of objects together with their common patterns, and a rule possesses patterns in premises and conclusions. However, pattern lattices are huge, often untractable [19], and the need for approaches to drive the search towards the most relevant patterns is a current challenge. Logical Concept Analysis [12] is a generaliza- tion of FCA in which sets of attributes are replaced by logical expressions. The power set of attributes mentioned by the Galois connection is replaced by an arbitrary set of formulas to which are associated a deduction relation (i.e., sub- sumption), and conjunctive and disjunctive operations, and therefore forms a lattice. Inspired by pattern structures, the NextPriorityConcept algorithm, introduced in a recent article [8] proposes a user-driven pattern mining approach for heterogeneous and complex data as input. This algorithm allows a generic pattern computation through specific descriptions of objects by predicates. It also proposes to reduce predecessors of a concept by the refinement of a set of objects into a fewer one through specific user exploration strategies, resulting in a reduction of the number of generated patterns. Some algorithms appear within FCA framework for analysing sequence data; we can mention works for mining medical care trajectories using pattern structures [5, 6], sequence mining to discover rare patterns [7], and other studies on demographic sequences [15, 16]. But for discovering interval-based sequence using FCA methods, we found fewer works. We can cite Kaytoue et al.’s work on gene expression data [21]. In this article, we propose a new sequence mining approaches using the NextPriorityConcept algorithm, with descriptions and strategies dedicated to interval-based sequences. We propose two different descriptions that describe a subset of objects by subsequences or supersequences. We also propose five strategies of pattern exploration in order to generate a reduction of a cluster of interval-based sequences, i.e., its predecessors in the pattern lattice. Section 2 introduces basic definitions related to interval sequence mining and a short description of the NextPriorityConcept algorithm. Section 3 will be dedicated to our new interval-based sequence descriptions and strategies. Experimental results are presented in Section 4. 2 Preliminaries 2.1 Interval-based Sequences A sequence s is a succession of itemsets from a dictionary Σ, often in the form of s =< Xi >i≤n , where Xi ⊆ Σ is a subset of items i.e., itemset. A temporal sequence is a sequence where each itemset Xi must have an associated timestamp ti . An Event (or Time frame) E, is a triple E = (t, t, X) where X ⊆ Σ is an itemset, t is the starting time and t is the ending time, t ≤ t. For better readability we refer to (t, t) by T . Interval-based sequence. An interval-based sequence (or Time frame sequence) s = h(Ti , Xi )ii≤n is a list of events (or time frames), verifying ti < ti+1 , thus an interval-based sequence is a list of separate intervals containing itemsets. The size of the interval-based sequence is the number of its time frames. We refer to the interval-based sequence by sequence. Consider the example in Figure 1 for an alphabet Σ = {C, M, P, H} (where C stands for Castle, M for Museum, P for Public Garden and H for Historical Place), the sequences represent trajectories of visits of three tourists s1, s2 and s3. In this example, visitors may be in two or more different locations at the same interval as the intervals are large enough and we may don’t have the exact interval of each location. 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 t s1 P H, M s2 P, H M s3 P M, C Fig. 1. Example of interval-based sequences Subinterval. For two intervals, T = (t, t) and T 0 = (t0 , t0 ), we say that T is subinterval T 0 , if : t ≥ t0 and t ≤ t0 and we write T  T 0 , that corresponds to the containing relation from Allen’s relations [1]. Projections. We introduce the projection operator Φ of a sequence s, over a given interval T , that selects all the itemsets of the sequence included in this interval : ΦT (s) = {X 0 : T 0  T and (T 0 , X 0 ) ∈ s}. Dually, the projection operator Φ, over an itemset X ⊆ Σ selects all the intervals where the items of X may occur: ΦX (s) = {T 0 : X 0 ⊆ X and (T 0 , X 0 ) ∈ s}. ΦΣ (s) represents a set of all the intervals in s. Subsequence. A sequence s, is subsequence of another sequence s0 , s b s0 if for all (T, X) ∈ s, there exists (T 0 , X 0 ) ∈ s0 such that T  T 0 and X ⊆ X 0 . We also say that s0 is supersequence of s. Affix. A prefix/suffix of a sequence s = h(Ti , Xi )ii≤n according to a window w, is the subsequence of s composed by the first/last w time frames of s, prefix(s, w) = h(Ti , Xi )i1≤i≤w , suffix(s, w) = h(Ti , Xi )i(n−w)