Mining rare sequential patterns with ASP

Ahmed Samet (Université Rennes 1 / IRISA-UMR6074), Thomas Guyet (AGROCAMPUS-OUEST / IRISA-UMR6074), Benjamin Negrevergne (Université Paris-Dauphine, PSL Research University, CNRS, LAMSADE)

Abstract

This article presents an approach to mining meaningful rare sequential patterns based on the declarative programming paradigm of Answer Set Programming (ASP). The setting of rare sequential pattern mining is introduced. Our ASP approach provides an easy way to encode expert constraints on the expected patterns, in order to cope with the huge number of meaningless rare patterns. Encodings are presented and quantitatively compared to a procedural baseline. An application to care pathway analysis illustrates the interest of encoding expert constraints.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Nicolas Lachiche, Christel Vrain (eds.): Late Breaking Papers of ILP 2017, Orléans, France, September 4-6, 2017, published at http://ceur-ws.org

1 Introduction

Pattern mining aims at extracting meaningful structured knowledge hidden in large amounts of raw data. Building on the hypothesis that an interesting pattern occurs frequently in the data, most research on pattern mining has focused on frequent pattern mining. However, in many cases, rare patterns can also be meaningful. For example, this is true when physicians want to identify dangerous outcomes of ordinary procedures from care pathway data. Fortunately for patients, such outcomes are rare but, unfortunately for data analysts, such events are difficult to extract using standard approaches. Mining rare patterns has been studied in the context of itemsets [1]. But to the best of our knowledge, the problem of mining rare sequential patterns has not been addressed yet. The lack of work on this topic has recently been identified by Hu et al. [2] as an important matter for future research.

The main problem with rare patterns, known from experiments on rare itemsets, is their huge number. Condensed rare patterns, called minimal rare patterns [3], have been proposed to reduce the number of patterns to extract without losing information. Yet, it is desirable to further reduce the number of patterns and to improve the pattern significance. The approach we propose in this paper is to let the expert express extra constraints to specify the most interesting patterns. To achieve this goal, we need a method to extract condensed rare sequential patterns that is versatile enough to support the easy addition of expert constraints. However, most approaches based on procedural programs would require specific and lengthy developments to integrate extra constraints. Instead, declarative programming appears to be an interesting alternative to extract knowledge from data under complex expert constraints.

Declarative programming, and more especially logic programming, has received much attention in the past decades, and even more recently thanks to the progress in solver efficiency. The state of the art can be organized along two axes: the data analysis task and the declarative programming paradigm. Data analysis tasks are classically separated into supervised and unsupervised learning. More precisely, inductive logic programming (ILP) [4] belongs to the field of supervised machine learning, while the more recent field of declarative pattern mining [5], which consists in mining patterns with declarative encodings, belongs to the field of unsupervised learning. For these two tasks, several declarative paradigms have been proposed: logic programming (Prolog or Answer Set Programming, ASP), Constraint Programming (CP) and Satisfiability Solving (SAT).
For each approach, the objective is to propose a uniform representation for examples, background knowledge and hypotheses. ILP started with encodings in Prolog [4], and alternative systems have been implemented based on CP [6] or on ASP [7]. In declarative pattern mining, most of the approaches have been implemented using SAT [8] or CP [9, 10], and some with ASP [11]. In this work, we choose the ASP paradigm to implement our rare sequential pattern mining, in order to benefit from a first-order logic programming language. It makes the encoding of extra constraints easier and its solvers have proved their efficiency [11].

The contributions of this paper are threefold:
1. We formalize the general problem of rare sequential pattern mining and model it in ASP, together with two important variations of this problem (non-zero-rare and minimal rare patterns). Thanks to the flexibility of ASP, we were able to solve all three problems with the same ASP solver, avoiding the tedious work of designing and implementing an algorithm for each new problem.
2. We provide important insights on modelling efficiency by comparing several alternative models. The experimental comparison shows that general ASP-based approaches can compete with ad hoc specialized algorithms.
3. We apply rare sequential pattern mining to our target application of analyzing care pathway data. In particular, we demonstrate the benefit of additional constraints to extract meaningful results in this domain.

2 Rare sequential patterns: definitions and properties

This section introduces the basic concepts of rare sequential pattern mining. Let I = {i_1, i_2, ..., i_|I|} be a set of items. A sequence s, denoted ⟨s_j⟩_{1≤j≤n}, is an ordered list of items s_j ∈ I, where n denotes the length of the sequence. Given two sequences s = ⟨s_j⟩_{1≤j≤n} and s′ = ⟨s′_j⟩_{1≤j≤m} with n ≤ m, we say that s is a subsequence of s′, denoted s ⪯ s′, iff there exist n integers i_1 < ... < i_n such that s_k = s′_{i_k} for all k ∈ {1, ..., n}.

Let D = {s^1, s^2, ..., s^N} be a dataset of N sequences. The support of any sequence s, denoted supp(s), is the number of sequences s^i ∈ D that contain s:

    supp(s) = |{ s^i ∈ D | s ⪯ s^i }|    (1)

Let σ ∈ N+ be a frequency threshold given by the data analyst. A sequence s is rare iff supp(s) < σ. In that case, the sequence s is called a rare sequential pattern, or a rare pattern for short. It is noteworthy that rare patterns are strongly related to frequent patterns, i.e. the sequences s such that supp(s) ≥ σ: the set of rare patterns is the dual of the set of frequent patterns, since a pattern that is not rare is frequent and vice versa. Moreover, the rarity constraint is monotone, i.e. for two sequences s and s′, if s ⪯ s′ and s is rare, then s′ is rare. This property comes from the anti-monotonicity of the support measure.

  s1: ⟨d, a, b, c⟩
  s2: ⟨a, c, b, c⟩
  s3: ⟨a, b, c⟩
  s4: ⟨a, b, c⟩
  s5: ⟨a, c⟩
  s6: ⟨b⟩
  s7: ⟨c⟩
Table 1: Example of an artificial dataset of 7 sequences.

Example 2.1 (Sequences and rare patterns) Table 1 presents a simple dataset D defined over the set of items I = {a, b, c, d}. The support of sequence p1 = ⟨a, b⟩ is 4 and the support of p2 = ⟨c, c⟩ is 1. For a support threshold σ = 2, p2 is a rare pattern but p1 is not. Due to the monotonicity of rarity, every extension of p2 is also rare (e.g. ⟨a, c, c⟩).
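To make these definitions concrete, the following small ASP program is a standalone sketch (it is not the encoding proposed in this paper, which is introduced in Section 3): it represents the dataset of Table 1 as seq(S,P,I) facts, hard-codes the pattern p2 = ⟨c, c⟩ of Example 2.1, and derives whether p2 is rare for th = σ = 2. The emb/3 predicate is an ad hoc, naive subsequence test introduced only for this illustration.

% dataset of Table 1 (same representation as Listing 1 in Section 3)
seq(1,1,d). seq(1,2,a). seq(1,3,b). seq(1,4,c).
seq(2,1,a). seq(2,2,c). seq(2,3,b). seq(2,4,c).
seq(3,1,a). seq(3,2,b). seq(3,3,c).
seq(4,1,a). seq(4,2,b). seq(4,3,c).
seq(5,1,a). seq(5,2,c).
seq(6,1,b). seq(7,1,c).

% fixed candidate pattern p2 = <c,c> and rarity threshold sigma
pat(1,c). pat(2,c).
#const th = 2.

% emb(S,L,P): the first L pattern items occur in sequence S, the L-th one at position P
emb(S,1,P) :- seq(S,P,I), pat(1,I).
emb(S,L,P) :- emb(S,L-1,Q), seq(S,P,I), pat(L,I), P > Q.

support(S) :- emb(S,2,_).                      % p2 has length 2
rare       :- #count{ S : support(S) } < th.   % supp(p2) < sigma

#show support/1.
#show rare/0.

Solving this program (e.g. with clingo) yields a single answer set containing support(2) and rare, which matches Example 2.1: only s2 contains ⟨c, c⟩, so supp(p2) = 1 < 2.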
We now introduce two additional definitions: zero-rare patterns and minimal rare patterns. A zero-rare pattern is a rare pattern with a null support, which means that the pattern does not occur in any sequence of the dataset. A non-zero-rare pattern is a rare pattern that is not zero-rare. Finally, a sequence s is a minimal rare pattern (mRP) iff it is rare and all its proper subsequences are frequent:

    mRP = { s | supp(s) < σ ∧ ∀s′ ≺ s, supp(s′) ≥ σ }    (2)

where ≺ denotes the proper (strict) subsequence relation.

Figure 1: Lattice of sequential patterns of dataset D. Zero-rare patterns are omitted for the sake of clarity. In white: frequent patterns; in grey: rare patterns; surrounded: minimal rare patterns, for σ = 2.

Example 2.2 (minimal rare patterns) Figure 1 illustrates the lattice of non-zero-rare patterns in the dataset D of Table 1 for σ = 2. For example, ⟨a, c, b⟩ is a non-zero-rare pattern, but it is not an mRP since one of its subpatterns is rare (i.e. ⟨c, b⟩ ≺ ⟨a, c, b⟩). ⟨b, a⟩ (omitted in Figure 1) is a zero-rare pattern as it does not appear in D.

3 Mining rare sequential patterns with ASP

In this section, we detail ASP encodings for mining rare and minimal rare sequential patterns. ASP is a declarative programming paradigm founded on the semantics of stable models [12]. From a general point of view, declarative programming gives a description of what the problem is instead of specifying how to solve it: a program describes the problem and a dedicated solver finds its solutions. An ASP program is a set of rules of the form a_0 :- a_1, ..., a_m, not a_{m+1}, ..., not a_n, where each a_i is a propositional atom for 0 ≤ i ≤ n and not stands for default negation. Atoms can be encoded using a first-order logic syntax. If n = 0, such a rule is called a fact. If a_0 is omitted, the rule represents an integrity constraint. The reader can refer to Janhunen and Niemelä [12] for a more detailed introduction to ASP. Dedicated solvers, such as clingo [13], ensure the efficient solving of ASP encodings.

In the following, Section 3.1 introduces the ASP generation of candidate sequential patterns, then the rarity constraint is introduced in Section 3.2 and, finally, Section 3.3 introduces additional constraints to efficiently extract only the minimal rare patterns.

First of all, the dataset has to be encoded as facts. It is represented by atoms seq(s,t,i) stating that item i occurs at time t in sequence s. Listing 1 illustrates the facts encoding the sequence dataset of Table 1.

seq(1,1,d). seq(1,2,a). seq(1,3,b). seq(1,4,c).
seq(2,1,a). seq(2,2,c). seq(2,3,b). seq(2,4,c).
seq(3,1,a). seq(3,2,b). seq(3,3,c).
seq(4,1,a). seq(4,2,b). seq(4,3,c).
seq(5,1,a). seq(5,2,c).
seq(6,1,b). seq(7,1,c).
Listing 1: Facts specifying the sequence dataset of Table 1

3.1 Enumerating candidate sequential patterns in ASP

Listing 2 specifies the candidate generation. It reuses several notations and principles introduced by Gebser et al. [11]. Line 1 lists all items present in the dataset: for each item i ∈ I, a unique atom item(i) is generated. The token "_" stands for an anonymous variable that does not recur anywhere. Line 2 defines the length of each sequence of the dataset: the length of a sequence is an item position after which there is no item. Lines 4 to 7 enumerate all possible patterns.
A pattern is a sequence of positions to fill with exactly one item each (we do not consider sequences of itemsets, for the sake of encoding simplicity). It is encoded by atoms pat(x,i) where x is a position in the pattern and i ∈ I. The predicate patpos/1 defines the positions available for pattern items. Rules with curly brackets in the head, such as those of lines 5 and 7, are choice rules. The rule at line 5 chooses the pattern length, which cannot be larger than maxlen. Line 7 chooses exactly one atom pat/2 for each position X and thus generates the combinatorics of candidate patterns.

Lines 9 to 13 evaluate the pattern support. This encoding uses the fill-gaps strategy proposed in [11] (see Figure 2). The idea consists in mapping each sequence position to one (and exactly one) pattern position. In addition, some sequence positions are associated with an item that does not match the item of the mapped pattern position. Such positions denote "filling" positions and indicate which part of the pattern has already been "read". Lines 9 to 11 traverse each sequence s = ⟨s_j⟩_{1≤j≤n} of D to yield atoms of the form occ(s,l,p_l), where l is a position in the pattern and p_l is a position in the sequence s: occ(s,l,p_l) indicates that the l first items of the pattern have been read at position p_l of the sequence. Line 9 searches for the positions P that are compatible with the first pattern item. Line 11 yields a new atom occ(s,l,p_l) when the (l − 1) first pattern items have been read at position p_l − 1 and the p_l-th sequence item matches the l-th pattern item. In any case, line 10 propagates this information along the sequence: it associates position p_l with the pattern positions already read at position p_l − 1. The atom support(s) is yielded at line 13 when the whole pattern has been read by the end of sequence s, i.e. when s contains the current pattern.

Figure 2: Illustration of the fill-gaps embedding strategy (see Listing 2) for pattern ⟨a, b, c⟩ in the sequence ⟨... a ... b ... b d c ... b ...⟩. Blank boxes can contain any item except a, b or c. Each black circle illustrates one occ/3 atom in a model; the ASP atoms of this predicate are detailed for three circles.

1  item(I)     :- seq(_,_,I).
2  seqlen(S,L) :- seq(S,L,_), not seq(S,L+1,_).

4  patpos(1).
5  { patpos(X+1) } :- patpos(X), X < maxlen.
6  patlen(L) :- patpos(L), not patpos(L+1).
7  1 { pat(X,I) : item(I) } 1 :- patpos(X).

9  occ(S,1,P) :- pat(1,I), seq(S,P,I).
10 occ(S,L,P) :- occ(S,L,P-1), seq(S,P,_).
11 occ(S,L,P) :- occ(S,L-1,P-1), seq(S,P,C), pat(L,C).

13 support(S) :- occ(S,L,LS), patlen(L), seqlen(S,LS).
Listing 2: Encoding of candidate pattern generation

For more details about the fill-gaps strategy and its alternatives, we refer the reader to the article by Gebser et al. [11].

3.2 Efficient encoding of the rarity constraint (ASP-RSM)

The modularity of ASP encodings makes it easy to add constraints to the previous program. Where Gebser et al. added constraints to select only the frequent patterns, we select the answer sets that correspond to non-zero-rare patterns. The following constraint states that the number of supported sequences must be strictly below the given threshold th (it eliminates any model in which at least th sequences are supported); th is a program constant that can be specified at solving time.

:- #count{ S : support(S) } >= th.

Although this is a correct formulation, we propose an alternative encoding that explicitly expresses the monotonicity property, to help the solver prune the search space. This encoding introduces a rare/1 predicate (see Listing 3).
Instead of evaluating the support of the pattern ⟨p_j⟩_{1≤j≤n} as a whole, it evaluates independently the support of each of its prefixes. For any l, 1 ≤ l ≤ n, rare(l) means that the subpattern ⟨p_j⟩_{1≤j≤l} is rare. Line 18 imposes that the atom rare(n) holds, where n is the pattern length. Line 17 gives a simpler way to yield such an atom: if a prefix-pattern of length l is rare, then all prefix-patterns of length l′, with l′ > l, are rare. This rule prevents the solver from evaluating all the instances of the rule of line 15, and it potentially avoids evaluating occ/3 atoms. Finally, line 19 prunes zero-rare patterns by imposing that at least one sequence is supported.

15 rare(IL) :- IL=1..L, patlen(L),
16             #count{ S : occ(S,IL,LS), seqlen(S,LS) } < th.
17 rare(L)  :- rare(L-1), L <= PL, patlen(PL).
18 :- not rare(L), patlen(L).
19 :- not support(S) : seq(S,_,_).
Listing 3: Encoding of rare sequence mining

As the number of non-zero-rare patterns may be very high, the following section presents ASP-MRSM, an extension of ASP-RSM that extracts the condensed set of minimal rare patterns, mRP.

3.3 ASP Minimal Rare Sequence Miner (ASP-MRSM)

The encoding in Listing 4 can be used to mine minimal rare sequences. It is based on the observation that it is sufficient to evaluate the support of all sub-patterns of size n − 1 to determine whether a pattern p of size n is minimal. Let p = ⟨p_i⟩_{1≤i≤n} be a non-zero-rare pattern. p is an mRP iff, for all u ∈ {1, ..., n}, the sub-pattern p^u = ⟨p_i⟩_{i∈{1,...,u−1,u+1,...,n}} is frequent (i.e. not rare). Indeed, according to Equation 2, p is an mRP iff every s′ ≺ p is frequent. It is then clear that if p is an mRP, then every p^u is frequent. Conversely, for every sequence s such that s ≺ p, there exists u such that s ⪯ p^u; then, as p^u is frequent and by anti-monotonicity of the support, s is frequent too.

Furthermore, the encoding takes advantage of two properties of sequential patterns:
1. a sequence that supports the pattern p supports each pattern p^u;
2. if a sequence s is supported by p^u but not by p, then for all l ∈ [1, u − 1], occ(s,l,p_l) is part of the ASP model computed by Listing 2, where (p_l)_{l∈[1,n−1]} is an occurrence of p^u. Indeed, as there is no difference between p^u and p before position u, there is no difference between their occurrences.

Figure 3: mRP occurrences for pattern p = ⟨a, b, c, d⟩ and u = 3; on the left, the case of a sequence that supports p^u but not p. Grey disks illustrate occ/3 atoms and black circles illustrate spocc/4 atoms.

These properties are helpful to decide whether p^u is rare. In Listing 4, line 20 enumerates all subpatterns, identified by U. Similarly to the occ/3 predicate, atoms spocc(s,u,l,p_l) describe the occurrences of p^u in sequence s (see Figure 3). Lines 22 to 26 compute the occurrences of pattern p^u according to the fill-gaps principle (see Section 3.1), with two differences: (1) spocc/4 atoms are yielded only if sequence s is not supported, since otherwise the sequence obviously supports p^u, knowing that p^u ≺ p; and (2) spocc/4 starts from pattern position u, reusing occ(s,u−1,_) atoms. It thus avoids redundant computation, according to the second property above. Lines 28 to 33 determine whether the subpattern p^u is rare. This encoding differs from Section 3.2 mainly at lines 30-32: the total number of sequences supported by p^u is the number of sequences supported by p (line 32) plus the number of sequences supported by p^u but not by p (line 31), i.e. sequences for which spocc/4 atoms reached the last pattern item.
Finally, line 33 eliminates the answer sets for which a subpattern is rare.

It is worth noticing that in ASP two answer sets cannot share information during the solving process. Gebser et al. [11] proposed to use a solver extension, called asprin, to encode the mining of closed patterns, i.e. patterns that have no super-pattern with the same support. A similar approach could be used for mRPs. Our approach prefers a purely declarative, and more efficient, encoding.

20 suppat(U) :- U=1..L, patlen(L), L>1.

22 spocc(S,1,2,P)     :- seq(S,P,I), pat(2,I), not support(S).
23 spocc(S,U,U,P)     :- suppat(U), occ(S,U-1,P), not support(S).
24 spocc(S,U,L,P+1)   :- spocc(S,U,L,P), seq(S,P+1,_).
25 spocc(S,U,L+1,P+1) :- spocc(S,U,L,P), seq(S,P+1,I),
26                       pat(L+1,I).

28 sprare(U,U-1) :- sprare(U-1), suppat(U).
29 sprare(U,L)   :- sprare(U,L-1), L <= LP, patlen(LP).
30 sprare(U,IL)  :- suppat(U), IL=U+1..L, patlen(L),
31                  #count{ S : spocc(S,U,IL,LS), seqlen(S,LS);
32                          S : support(S) } < th.
33 :- sprare(U,L), patlen(L).
Listing 4: Encoding part for minimal rare patterns

4 Experiments and results

In this section, we evaluate our approach on synthetic and real datasets. First, we use synthetic datasets to compare the performance of our ASP encodings with procedural algorithms. Then, we demonstrate the flexibility of our declarative pattern mining approach on a case study.

In all experiments, we used clingo, version 5.0. All programs were run on a computing server with 16 GB of RAM. The multi-threading capabilities of clingo were disabled (1) to make a fair comparison with the procedural algorithms, which are not parallel, and (2) to avoid a high variance in the runtime measurements due to the multi-threading strategies of clingo.
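As an indication, the encodings above expect two program constants, maxlen and th. A minimal sketch of how they can be set and how a mining task can be launched with clingo is given below; the file names and constant values are illustrative and this invocation is not described in the paper.

% the constants can be fixed in the program ...
#const maxlen = 3.     % maximal pattern length
#const th     = 2.     % rarity threshold (sigma)
% ... or, if not fixed in the program, supplied on the command line, e.g.
%   clingo data.lp asp-rsm.lp -c maxlen=3 -c th=2 0
% where data.lp would contain the seq/3 facts, asp-rsm.lp would gather
% Listings 2 and 3, and the trailing 0 asks for all answer sets, each one
% encoding a non-zero-rare pattern through its pat/2 atoms.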
4.1 Runtime and memory efficiency on synthetic datasets

In this first set of experiments, we use synthetic datasets to compare the efficiency of our ASP encodings with procedural algorithms. Four approaches are compared: ASP-RSM and Apriori-Rare extract non-zero-rare patterns, while ASP-MRSM and MRG-EXP extract minimal rare patterns. Since Apriori-Rare [3] and MRG-Exp [14] were originally developed for rare itemset mining, they were adapted to mine rare sequential patterns (the source code of the ASP programs, of the procedural algorithms and of the dataset generator is available at http://www.ilp-paper-1.co.nf/).

We use several synthetic datasets to measure the impact of various parameters, such as the vocabulary size and the mean sequence length, on the runtime and on the memory usage. Our sequence generator produces datasets following a 3-step retro-engineering procedure inspired by the IBM Quest Synthetic Data Generator: 1) a set of random patterns is generated, 2) for each pattern, a list of occurrences is generated, and 3) synthetic sequences are built so that each sequence contains a pattern if the sequence is in the pattern's occurrence list. The sequences are then completed with random items so that the mean sequence length of the resulting dataset reaches the input parameter λ.

Figure 4 presents the runtimes with respect to the rarity threshold (a), the mean sequence length (b) and the vocabulary size (c). Table 2 presents the number of extracted patterns and the memory footprint.

Figure 4: (a) computing times w.r.t. the threshold for synthetic databases of size 10^5 (with a mean sequence length of 15); (b) and (c) mean computing times w.r.t. sequence length and vocabulary size (with |D| = 250 and σ = 20%). The compared approaches are ASP-MRSM, ASP-RSM, MRG-EXP and Apriori-Rare.

Threshold (σ) | #Rare     | #mRP   | ASP-RSM  | ASP-MRSM | Apriori-Rare | MRG-EXP
Dataset - 1000 sequences
  5%          | 1 011 117 |    260 | 1.0×10^8 | 0.7×10^8 | 2.5×10^6     | 5.1×10^6
 10%          | 1 011 993 |    527 | 1.0×10^8 | 0.7×10^8 | 5.3×10^6     | 7.5×10^6
 20%          | 1 012 360 |  2 280 | 1.1×10^8 | 0.7×10^8 | 9.6×10^6     | 1.2×10^7
 30%          | 1 012 428 |  7 123 | 1.3×10^8 | 0.9×10^8 | 9.8×10^6     | 5.3×10^7
Dataset - 5000 sequences
  5%          |   105 907 |    477 | 1.0×10^8 | 3.5×10^8 | 5.2×10^5     | 5.8×10^5
 10%          |   109 861 |    957 | 1.0×10^8 | 3.5×10^8 | 5.2×10^6     | 7.3×10^5
 20%          |   110 879 |  4 754 | 1.0×10^8 | 3.7×10^8 | 5.6×10^6     | 3.7×10^6
 30%          |   111 022 | 16 530 | 1.2×10^8 | 4.8×10^8 | 7.8×10^6     | 3.8×10^7
Dataset - 10000 sequences
  5%          |   108 266 |    456 | 2.0×10^8 | 6.5×10^8 | 1.3×10^6     | 1.3×10^6
 10%          |   110 334 |    734 | 2.0×10^8 | 6.5×10^8 | 3.4×10^6     | 3.1×10^6
 20%          |   110 955 |  4 086 | 2.0×10^8 | 6.5×10^8 | 5.6×10^6     | 3.6×10^6
 30%          |   111 036 | 10 177 | 2.0×10^8 | 8.0×10^8 | 7.2×10^6     | 7.3×10^6
Table 2: Memory consumption (in bytes) and number of patterns of ASP-RSM, ASP-MRSM, Apriori-Rare and MRG-EXP w.r.t. dataset size and rarity threshold (with λ = 10).

First, we can see in Figure 4-(a) that ASP-MRSM is an order of magnitude faster than the alternative approaches. This is an important result, since previous work on declarative pattern mining has shown that declarative approaches are often slower than their procedural counterparts [11, 15]. This experiment demonstrates that the overhead induced by the declarative encoding is balanced by the more efficient solving strategy achieved by the ASP solver.

Second, we focus on ASP-RSM and Apriori-Rare (non-zero-rare patterns). In general, decreasing the threshold induces a larger search space and thus higher runtimes. However, we can see that ASP-RSM is far less sensitive to variations of the threshold than Apriori-Rare is. As a consequence, ASP-RSM is faster for very low thresholds. Apriori-Rare is greatly impacted by the size of the search space, even though the number of rare patterns remains almost constant, as can be seen in Table 2. In contrast, the runtime of ASP-RSM remains fairly low, even when the threshold is low. This demonstrates again that the ASP solver is able to explore the search space more efficiently than our procedural algorithm.

Figure 4-(b) shows that both the declarative and the procedural approaches are penalised by larger sequence lengths. The algorithm behaviour on this plot is consistent with the previous plots and ASP-MRSM remains the most efficient approach.

Figure 4-(c) shows the mean computing time with an increasing vocabulary size |I|. Understanding the impact of the vocabulary size is not trivial, since a smaller vocabulary means fewer items to process but also increases the dataset density. In Table 2, we can see that the fewer the items, the fewer the rare patterns (for a fixed dataset size |D| and mean length λ). Remark that the slope decreases with the vocabulary size: the number of patterns increases quickly for small vocabularies, but slowly for larger vocabularies. This behaviour is explained by the combinatorics of items: with many rare items, the mean length of the mRPs is smaller, because it is more likely to have patterns of length 1 that are non-zero-rare (with our simulated datasets). Such patterns are easy to extract and they eliminate many patterns from the set of mRPs (i.e. all patterns that contain the item). Indeed, the shorter an mRP is, the more rare patterns it represents. As a consequence, the computation time increases more slowly.

4.2 Application to rare care pathway analytics

We now apply our approach to care pathway analytics, to illustrate that the number of rare patterns can be reduced using additional expert constraints. This is not a new result in itself, but we show that extracting rare patterns becomes practically interesting with such constraints. The ASP approach was crucial here to incorporate the constraints easily.

Care pathway analytics is a research field that consists in exploring patient care pathways to understand care uses, more specifically drug consumption in our case, and their impact on patients' health. This case study analyzes the care pathways of patients exposed to anti-epileptic drugs who had epileptic seizures. Rare pattern mining aims at identifying (i) patients to exclude from the study because of their unexpected, rare care pathways, for instance due to specific heavy treatments (cancer, hepatitis C, AIDS, etc.), and (ii) rare adverse drug reactions.

The dataset is made of 500 care pathways with drug deliveries over one year. This represents a total amount of 101,793 events, with |I| = 3,671. Drug deliveries are encoded with seq(t,p,e) atoms, where e is the active component of the delivered drug.
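For illustration only, a fragment of one care pathway could thus be represented by facts of the following form; the patient identifier, the delivery order and the combination of drugs are hypothetical and not taken from the study data (the drug names are ones mentioned later in this section).

% hypothetical care pathway of one patient, encoded as in Listing 1:
% seq(T,P,E) = the P-th recorded delivery of patient T is active component E
seq(1,1,lamotrigine). seq(1,2,amoxicillin).
seq(1,3,lamotrigine). seq(1,4,ceftriaxone).
seq(1,5,lamotrigine).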
The rough mining of minimal rare sequential patterns with σ = 10% yields 7,758 mRPs of length at most 3 (maxlen = 3). This number of patterns makes the results difficult for clinicians to analyze. Two types of constraints have been added in order to illustrate the pattern reduction.

The first one is a relaxation of rarity: patterns that are too rare are poorly significant, but they are very numerous. We thus introduce a second parameter, fmin, such that the support of the extracted patterns must lie between fmin and σ. The second type of constraint concerns the maximum delay between two events of a pattern occurrence. This constraint, known as the maxgap constraint, avoids enumerating an occurrence when two of its events are too distant in time. It assumes that the first event may be causally related to the second one.

Figure 5 illustrates the efficiency of these constraints in reducing the number of rare patterns, with σ = 10% and maxlen = 3. The plot on the left shows the number of patterns w.r.t. the value of the maxgap constraint (from 3 to 25 events). Since a larger maxgap leaves more patterns with at least one admissible occurrence (i.e. fewer zero-rare patterns), this number increases with maxgap. The plot on the right shows the number of patterns w.r.t. the minimal frequency constraint fmin (from 0.4% to 6%). The closer fmin is to σ, the fewer patterns are generated; this number decreases exponentially with fmin. The number of patterns extracted with fmin = 6% falls to 87, which is sufficiently low for the patterns to be presented to clinicians. Moreover, the computing time is reduced by using expert constraints: this final experiment is solved in 425 s, which is significantly faster than our experiments on synthetic datasets.

Figure 5: Number of rare patterns extracted from care pathways w.r.t. the expert constraints (left: w.r.t. the maxgap value; right: w.r.t. the minimal frequency threshold, in %).

The two constraints can be combined. We chose to select patterns with fmin = 5% and maxgap = 3; in that case, the number of patterns falls to 133. A possible way of stating these two constraints in ASP is sketched below.
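The encodings of these two constraints are not detailed in the paper; the following sketch shows one plausible formulation on top of Listing 2. The minimal frequency is a single additional integrity constraint. The maxgap constraint, however, is assumed here to be enforced on an explicit, skip-gaps-style embedding predicate emb/3 that records the exact position of each matched item, replacing the occurrence rules of lines 9-13, because the fill-gaps occ/3 atoms do not retain the position at which each pattern item was matched.

% minimal frequency: keep only patterns supported by at least fmin sequences
% (fmin is given here as an absolute count; the value is illustrative)
#const fmin = 25.
:- #count{ S : support(S) } < fmin.

% maxgap sketch: emb(S,L,P) states that the first L pattern items can be matched
% in sequence S with the L-th item at position P and every pair of consecutive
% matched items at most maxgap positions apart; these rules would replace
% lines 9-13 of Listing 2
#const maxgap = 3.
emb(S,1,P) :- seq(S,P,I), pat(1,I).
emb(S,L,P) :- emb(S,L-1,Q), seq(S,P,I), pat(L,I), Q < P, P - Q <= maxgap.
support(S) :- emb(S,L,_), patlen(L).

With this change, the prefix-wise rules of Listings 3 and 4, which are written against occ/3, would have to be adapted accordingly; the sketch is only meant to convey the shape of the two constraints.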
One interesting pattern describes deliveries of lamotrigine, an anti-epileptic drug, co-occurring with two antibiotics, ceftriaxone and amoxicillin. This unexpected potential adverse drug reaction may lead to new epidemiological questions.

5 Conclusion

This article extends the work of Gebser et al. [11] on declarative sequential pattern mining with Answer Set Programming by proposing an efficient encoding for rare pattern mining. The paper shows the versatility of the declarative approach, which makes it easy to design a new data mining task. More specifically, we have shown that ASP encodings are an efficient solution compared to procedural approaches. Unfortunately, the memory consumption limits the analysis to mid-size databases. This problem is known in pattern mining with ASP [11], and a solution could be to use dedicated propagators [15] to improve the computing efficiency of the approach. We have also illustrated the benefit of our approach by extracting sequential patterns satisfying additional expert constraints, in the context of a care pathway analytics application. In this context, our approach can efficiently prune useless patterns and reduce computation times. As future work, we aim at finding and adding constraints through a human-machine interface in which the user expresses their needs and the machine converts them into ASP constraints.

Acknowledgements

This work is part of the PEPS project funded by the French national agency for medicines and health products safety (ANSM), and of the SePaDec project funded by Région Bretagne.

References

[1] Koh, Y.S., Ravana, S.D.: Unsupervised rare pattern mining: A survey. ACM Transactions on Knowledge Discovery from Data 10(4) (2016) 1–29

[2] Hu, Z., Wang, H., Zhu, J., Li, M., Qiao, Y., Deng, C.: Discovery of rare sequential topic patterns in document stream. In: Proceedings of the SIAM International Conference on Data Mining (2014) 533–541

[3] Szathmary, L.: Finding minimal rare itemsets with an extended version of the Apriori algorithm. In: Proceedings of the International Conference on Applied Informatics, Volume 1 (2014) 85–92

[4] Muggleton, S., de Raedt, L.: Inductive logic programming: Theory and methods. The Journal of Logic Programming 19 (1994) 629–679

[5] De Raedt, L., Guns, T., Nijssen, S.: Constraint programming for itemset mining. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2008) 204–212

[6] Sebag, M., Rouveirol, C.: Constraint inductive logic programming. In: Advances in Inductive Logic Programming. IOS Press (1996) 277–294

[7] Corapi, D., Russo, A., Lupu, E.: Inductive logic programming in answer set programming. In: International Conference on Inductive Logic Programming, Springer (2011) 91–97

[8] Jabbour, S., Sais, L., Salhi, Y.: Decomposition based SAT encodings for itemset mining problems. In: Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Part II (2015) 662–674

[9] De Raedt, L., Guns, T., Nijssen, S.: Constraint programming for data mining and machine learning. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) (2010) 1671–1675

[10] Guns, T., Dries, A., Nijssen, S., Tack, G., De Raedt, L.: MiningZinc: A declarative framework for constraint-based mining. Artificial Intelligence 244 (2017) 6–29

[11] Gebser, M., Guyet, T., Quiniou, R., Romero, J., Schaub, T.: Knowledge-based sequence mining with ASP. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI) (2016) 1497–1504
[12] Janhunen, T., Niemelä, I.: The answer set programming paradigm. AI Magazine 37 (2016) 13–24

[13] Gebser, M., Kaminski, R., Kaufmann, B., Ostrowski, M., Schaub, T., Schneider, M.: Potassco: The Potsdam answer set solving collection. AI Communications 24(2) (2011) 107–124

[14] Szathmary, L., Valtchev, P., Napoli, A.: Generating rare association rules using the minimal rare itemsets family. International Journal on Software and Informatics 4(3) (2010) 219–238

[15] Negrevergne, B., Guns, T.: Constraint-based sequence mining using constraint programming. In: Proceedings of the International Conference on Integration of AI and OR Techniques in Constraint Programming (2015) 288–305