Introduction

Mining rare sequential patterns with ASP

Ahmed Samet Thomas Guyet

0 0 UniversitØ Rennes 1/IRISA-UMR6074 AGROCAMPUS-OUEST/IRISA-UMR6074 Benjamin Negrevergne UniversitØ Paris-Dauphine, PSL Research University , CNRS, LAMSADE

51 60

This article presents an approach of meaningful rare sequential pattern mining based on the declarative programming paradigm of Answer Set Programming (ASP). The setting of rare sequential pattern mining is introduced. Our ASP approach provides an easy manner to encode expert constraints on expected patterns to cope with the huge amount of meaningless rare patterns. Encodings are presented and quantitatively compared to a procedural baseline. An application on care pathways analysis illustrates the interest of expert constraints encoding.

Introduction

In: Nicolas Lachiche, Christel Vrain (eds.): Late Breaking Papers of ILP 2017, OrlØans, France, September 4-6, 2017, published at http://ceur-ws.org Seq.

s2 ha; c; b; ci

s3 ha; b; ci

s4 ha; b; ci

s5 ha; ci s6 hbi s7 hci

Set Programming (ASP), Constraint Programming (CP) and Satisability Solving (SAT). For each approach, the objective is to propose a uniform representation for examples, background knowledge and hypotheses. ILP started with encodings in Prolog [ 4 ], and alternative systems have been implemented based on CP [ 6 ] or on ASP [ 7 ]. In declarative pattern mining, most of the approaches have been implemented using SAT [ 8 ], CP [ 9, 10 ] and some with ASP [ 11 ]. In this work, we choose the paradigm of ASP to implement our rare sequential pattern mining to benet from a rst-order logic programming language. It makes the encoding of extra constraints easier and its solvers have proved their eciency [ 11 ].

The contributions of this paper are threefold: 1. we formalize the general problem of rare sequential pattern mining and we model it in ASP, together with two important variations of this problem (non-zero-rare and minimal rare patterns). Thanks to the exibility of ASP, we were able to solve all three problems with the same ASP solver, avoiding the tedious work of designing and implementing an algorithm for each new problem. 2. We provide important insights on modelling eciency by comparing several alternative models. The experimental comparison shows that general ASP-based approaches can compete with ad-hoc specialized algorithms. 3. We apply rare sequential mining to our target application of analyzing care pathway data. In particular, we demonstrate the benet of additional constraints to extract meaningful results in this domain. 2

Rare sequential patterns: denitions and properties This section introduces the basic concepts of rare sequential pattern mining.

Let I = fi1; i2; : : : ; ijIjg be a set of items. A sequence s, denoted by hsj i1 j n is an ordered list of items sj 2 I, where n denotes the length of the sequence.

Given two sequences s = hsj i1 j n and s0 = hs0j i1 j m with n m, we say that s is a subsequence of s0, denoted s s0, i there exists n integers i1 < : : : < in such that sk = s0ik ; 8k 2 f1; : : : ; ng.

Let D = s1; s2; : : : ; sN , be a dataset of N sequences. The support of any sequence s, denoted supp(s), is the number of sequences si 2 D that contain s: supp(s) = si 2 Djs si

Let 2 N+ be a frequency threshold given by the data analyst. A sequence s is rare i supp(s) < . In such case, the sequence s is called a rare sequential pattern or a rare pattern for short.

It is noteworthy that rare patterns are strongly related to the frequent patterns, i.e. the sequence s such that supp(s) . The set of rare patterns is the dual of frequent patterns. A pattern that is not rare is frequent and vice versa.

Moreover, the rarity constraint is monotone, i.e. for two sequences s and s0, if s s0 and s is rare, then s0 is rare. This property comes from the anti-monotonicity of the support measure.

Example 2.1 (Sequences and rare patterns) Table 1 presents a simple dataset D, dened over a set of items I = fa; b; c; dg. The support of sequence p1 = ha; bi is 4 and the support of p2 = hc; ci is 1. For a support threshold = 2, p2 is a rare pattern but p1 is not. Due to the monotonicity of rarity, every extension of p2 is also rare (e.g. ha; c; ci).

We now introduce two additional denitions: zero-rare patterns and minimal rare patterns . A zero-rare pattern is a rare pattern with a null support. This means that this pattern does not occur in dataset sequences. A non-zero-rare pattern is a rare pattern which is not zero-rare.

Finally, a sequence s is a minimal rare pattern (mRP) i it is rare but all its proper subsequences are frequent. (1) (2) mRP = fsjsupp(s) < ^ 8s0 s; supp(s0)

g fg < c; c > < c > < d; a > < d; b >

< d; c > < d >

seq (1 ,1 , d ). seq (1 ,2 , a ). seq (1 ,3 , b ). seq (1 ,4 , c ). seq (2 ,1 , a ). seq (2 ,2 , c ). seq (2 ,3 , b ). seq (2 ,4 , c ). seq (3 ,1 , a ). seq (3 ,2 , b ). seq (3 ,3 , c ). seq (4 ,1 , a ). seq (4 ,2 , b ). seq (4 ,3 , c ). seq (5 ,1 , a ). seq (5 ,2 , c ). seq (6 ,1 , b ). seq (7 ,1 , c ).

Listing 1: Facts specifying the sequence dataset of Table 1

Example 2.2 (minimal rare patterns) Figure 1 illustrates the lattice of non-zero rare patterns in the dataset D of Table 1 for = 2. For example, ha; c; bi is a non-zero-rare pattern, but is not a mRP since one of its subpatterns is rare (i.e. hc; bi ha; c; bi). hb; ai (omitted in Figure 1) is a zero-rare pattern as it does not appear in D. 3

Mining rare sequential patterns with ASP

In this section, we detail ASP encodings for mining rare and minimal rare sequential patterns. ASP is a declarative programming paradigm founded on the semantic of stable models [ 12 ]. From a general point of view, declarative programming gives a description of what is a problem instead of specifying how to solve it. A program describes the problem and a dedicated solver nds out its solutions.

An ASP program is a set of rules of the form a0 :- a1; : : : ; am; not am+1; : : : ; not an, where each ai is a propositional atom for 0 i n and not stands for default negation. Atoms can be encoded using a rst-order logic syntax. If n = 0, such rule is called a fact. If a0 is omitted, such rule represents an integrity constraint. The reader can refer to Janhunen et al. [ 12 ] for a more detailed introduction to ASP. Dedicated solvers, such as clingo [ 13 ], ensure the ecient solving of ASP encodings.

In the following, Section 3.1 introduces the ASP generation of candidate sequential patterns, then the rarity constraints is introduced in Section 3.2 and nally, Section 3.3 introduces additional constraints to extract eciently only the minimal-rare patterns.

First of all, the dataset has to be encoded as facts. It is represented by atoms seq(s,t,i) stating that item i occurs at time t in sequence s. Listing 1 illustrates facts encoding of the sequence dataset of Table 1. 3.1

Enumerate candidate sequential patterns in ASP Listing 2 species the candidate generation. It reuses several notations and principles introduced by Gebser et al. [ 11 ]. Line 1 lists all atoms present in the dataset. For each item i 2 I a unique atoms item(i) is generated. The token "_" stands for an anonymous variable that does not recur anywhere. Line 2 denes the sequence lengths of each sequence from the dataset. The length of the sequence is an item position for which there is no items afterwards. Lines 4 to 7 enumerate all possible patterns. A pattern is a sequence of positions to ll up with exactly one item 1. It is encoded by atoms pat(x,i) where x is a position in the pattern and i 2 I. Predicate patpos/1 takes one variable and denes the sequence position for pattern items. Rules with curly brackets in the head are choice rules such as lines 5 and 7. Rule at line 5 makes the pattern length choice which can not be larger than maxlen. Line 7 chooses exactly one atom pat/2 for each position X and thus generates the candidate patterns combinatorics.

1We do not consider sequences of itemsets for sake of encoding simplication.

Lines 9 to 13 evaluate the pattern support. This encoding uses the ll-gaps strategy proposed in [ 11 ] (see Figure 2). The idea consists in mapping each sequence position to one (and exactly one) pattern position. In addition, some sequence positions are associated with an item that does not match the item in the mapped pattern position. This denotes lling positions and indicates which part of the pattern has already been read . Lines 9 to 11 traverse each sequence s = hsji1 j n in D to yield atoms of the form occ(s,l,pl) where pl is a position in the pattern and l is a position in the sequence s. occ(s,l,pl) indicates that the pl-th rst items of the patterns have been read at position l of the sequence. Line 9 searches for a compatible position P with the rst pattern item. Line 11 yields a new atom occ(s,l,pl) while the (l 1)-th item has been read at position (pl 1) and the pl-th sequence item matches the l-th pattern item. In any case, line 10 yields an atom associating position pl with already read positions of the pattern. The support(s) is yielded line 13 while sequence t holds the current pattern.

1 item (I) :- seq (_ , _ ,I ). 2 seqlen (S ,L) :- seq (S ,L ,_), not seq (S ,L +1 , _ ). 4 patpos (1). 5 { patpos (X +1) } :- patpos (X), X < maxlen . 6 patlen (L) :- patpos (L), not patpos (L +1). 7 1 { pat (X ,I) : item (I) } 1 :- patpos (X ). 9 occ (S ,1 , P) :- pat (1 , I), seq (S ,P ,I ). 10 occ (S ,L ,P) :- occ (S , L , P -1) , seq (S ,P ,_ ). 11 occ (S ,L ,P) :- occ (S , L -1 , P -1) , seq (S ,P ,C), pat (L ,C ). 13 support (S) :- occ (S , L , LS ), patlen (L), seqlen (S , LS ).

Listing 2: Encoding of candidate patterns generation For more details about ll-gaps strategy and its alternatives, we refer the readers to the article of Gebser et al. [11]. 3.2

Ecient encoding of rarity constraint (ASP-RSM) The modularity of ASP encoding enables to easily add constraints to the previous program. Where Gebser et al. added constraints to select only the frequent patterns, we select the answer sets that correspond to non-zerorare patterns.

The following constraint states that the number of supported sequences must be strictly below the given threshold th (i.e. not above th). th is a program constant that can be specied at solving time. :- # count { S : support (S) } >= th .

Although this is a correct formulation, we propose an alternative encoding that explicitly expresses the monotonicity property to help the solver pruning non-zero-rare patterns. This encoding introduces a rare/1 predicate (see listing 3). Instead of evaluating the support of pattern hpji1 j n, it evaluates independently the support of each of its prexes. For any l; 1 l n, rare(l) means that the subpattern hpj i1 j l is rare. Line 18 imposes to have the atom rare(n) where n is the pattern length. Line 17 gives a simpler manner to yield such atom: if a prex-pattern of length l is rare, then all patterns of length l0, l0 > l, are rare. This rule prevents the solver from evaluating all rules of line 15 and it potentially prevents from evaluating occ/3 atoms. Finally, line 19 prunes zero-rare patterns imposing to have at least one supported sequence. 15 rare ( IL ): - IL =1.. L , patlen ( L ) , 16 # count { S : occ (S , IL , LS ) , seqlen (S , LS ) } < th . 17 rare ( L ) : - rare (L -1) , L <= PL , patlen ( PL ). 18 : - not rare ( L ) , patlen ( L ). 19 : - not support ( S ) : seq (S ,_ , _ ).

Listing 3: Encoding of rare sequence mining As the number of non-zero-rare patterns may be very high, the following section presents ASP-MRSM, an extension of ASP-RSM that extracts the condensed set of minimal-rare patterns, mRP. 3.3

ASP Minimal Rare Sequence Miner (ASP-MRSM) The encoding in Listing 4 can be used to mine minimal rare sequences. It is based on the observation that it is sucient to evaluate the support of all sub-patterns of size n 1 to determine whether a pattern p of size n is a minimal. Let p = hpii1 i n be a non-zero-rare pattern. p is a mRP i 8u 2 f1; : : : ; ng, the sub-pattern pu = hpiii2f1;:::;u 1;u+1;:::;ng is frequent (i.e. not rare). According to Equation 2, p is a mRP i 8p0 p, p0 is frequent. It is then clear that if p is a mRP, then 8u; pu is frequent. Conversely, for all sequence s such that s p, 9u such that s pu. Then, as pu is frequent and by anti-monotonicity of the frequency constraint, s is frequent too.

Furthermore, the encoding takes advantage of two properties of sequential patterns: 1. a sequence that supports the pattern p supports each pattern pu, 2. if a sequence s is supported by pu but not by p, then for all l 2 [1; u 1], occ(s,l,pl) is part of the ASP model computed by Listing 2, where (pl)l2[1;(n 1)] is an occurrence of pu. In fact, as there is no dierence between pu and p before position u, there is no dierences between their occurrences.

These properties are helpful to decide whether pu is rare. In Listing 4, line 20 enumerates all subpatterns, identied by U. Similarly to occ/3 predicates, atoms spocc(s,u,l,pl) describe the occurrence of pu in sequence s (Figure 3).

Lines 22 to 26 compute occurrences of pattern pu according to the ll-gaps principle (see section 3.1) with two dierences: (1) spocc/4 are yielded only if sequence t is not supported, otherwise the sequence is obviously supported knowing that pu p; and (2) spocc/4 starts from pattern position u reusing occ(t,u 1,_) atoms. It thus avoids redundant computation according to the second property above. Lines 28 to 33 determine whether the subpattern pu is rare. This encoding diers from section 3.2 mainly at lines 30-32. The total number of sequences supported by pu is the number of sequences supported by p (line 32) plus the number of sequences supported by pu but not by p (line 31), i.e. sequences for which spocc/4 atoms reached the last pattern item. Finally, line 33 eliminates answer sets for which a subpattern is rare.

It is worth noticing that in ASP two answer sets can not share information during the solving process. Gebser et al. [ 11 ] proposed to use a solver extension, called asprin, to encode the mining of closed patterns, i.e. patterns which have any larger patterns with the same support. A similar approach may be used for mRP. Our approach prefers a pure declarative but more ecient encoding. 22 spocc (S ,1 ,2 , P) :- seq (S ,P ,I), pat (2 , I), not support (S ). 23 spocc (S ,U ,U ,P) :- suppat (U), occ (S ,U -1 , P), not support (S ). 24 spocc (S ,U ,L ,P +1) :- spocc (S ,U ,L ,P), seq (S ,P +1 , _ ). 25 spocc (S ,U ,L +1 , P +1) :- spocc (S ,U ,L ,P), seq (S ,P +1 , I), 26 pat (L +1 , I ). 28 sprare (U ,U -1): - sprare (U -1) , suppat (U ). 29 sprare (U ,L) :- sprare (U , L -1) , L <= LP , patlen ( LP ). 30 sprare (U , IL ): - suppat (U), IL =U +1.. L , patlen (L), 31 # count { S: spocc (S ,U ,IL , LS ), seqlen (S , LS ); 32 S: support (S) } < th . 33 :- sprare (U ,L), patlen (L ).

Listing 4: Encoding part for minimal rare patterns

Experiments and results

In this section, we evaluate our approach on synthetic and real datasets. First, we use synthetic datasets to compare performances of our ASP encodings with procedural algorithms. Then, we demonstrate the exibility of our declarative pattern mining approach on a case study.

In all experiments, we used clingo, version 5.0. All programs were run on a computing server with 16Gb RAM. Multi-threading capabilities of clingo have been disabled (1) to make a fair comparison with the procedural algorithms that is not parallel and (2) to prevent from having high variance in run times measurements due to multi-threading strategies of clingo. 4.1

Runtime and memory eciency on synthetic datasets In this rst set of experiments, we use synthetic datasets to compare the eciency of our ASP encodings with procedural algorithms. Four approaches have been compared: ASP-RSM and Apriori-Rare extract non-zero-rare patterns; and ASP-MRSM and MRG-EXP extract minimal rare patterns. Since Apriori-Rare [ 3 ] and MRG-Exp [ 14 ], were originally developed for rare itemset mining, they were adapted to mine rare sequential patterns 4.

We use several synthetic datasets to measure the impact of various parameters such as vocabulary size and mean sequence length on the runtime and on the memory usage. Our sequence generator 4 produces datasets following a 3-step retro-engineering procedure inspired from the IBM Quest Synthetic Data Generator: 1) a set of random patterns is generated, 2) for each pattern, a list of occurrences is generated, and 3) synthetic sequences are built so that each sequence contains a pattern if the sequence is in the pattern occurrence list. The sequences are then completed with random items so that the mean sequence length of the resulting dataset reaches the input parameter .

Figure 4 presents runtimes for with respect to dataset size (a), sequence length (b), and vocabulary size (c). Table 2 presents the number of extracted patterns and the memory footprint.

First, we can see in Figure 4-(a) that ASP-MRSM is an order of magnitude faster than alternatives approaches. This is an important result since previous work on declarative pattern mining have shown that declarative approaches are often slower than the procedural counterparts [ 11, 15 ]. This experiment demonstrates that the overhead induced by the declarative encoding is balanced by the more ecient solving strategy achieved by the ASP solver.

Second, we focus on ASP-RSM and Apriori-Rare (non-zero-rare-patterns). In general, decreasing the threshold induces a larger search space and thus higher runtimes. However, we can see that ASP-RSM is far less sensitive to variations of the threshold than Apriori-Rare is. As a consequence, ASP-RSM is faster for very low thresholds. Apriori-Rare is greatly impacted by the search space size, even though the number of rare patterns remains almost constant as it can be seen in Table 2. In contrast the runtime of ASP-RSM remains fairly low, even when the threshold is low. This demonstrates again that the ASP solver is able to explore the search space more eciently than our procedural algorithm.

4The source codes of the ASP programs, of procedural algorithms and of the dataset generator are available at http://www. ilp-paper-1.co.nf/ 0:5 1 1:5 Dictionary size ( jDj/1000)

=20 = 20%

ASP-MRSM ASP-RSM MRG-EXP Apriori-Rare

2 (b) 10 15 20 25 Mean sequence length ( ) jDj = 250 = 20% 30 (c) We now apply our approach to care pathway analytics to illustrate that the number of rare patterns can be reduced using additional expert constraints. This is not a new result but we show that extracting rare patterns becomes practically interesting with constraints. The ASP approach was here crucial to incorporate constraints easily. Care pathway analytics is a research eld that consists in exploring patient care pathways to understand care uses, more especially drugs consumption in our case, and their impacts on patients health.

This case study analyzes care pathways of patients exposed to anti-epileptic drugs who had epileptic seizures. Rare pattern mining aims at identifying (i) patients to exclude from the study because of their unexpected rare care pathways, for instance due to specic heavy treatments (cancer, hepatitis C, AIDS, etc.); (ii) rare adverse drugs reactions. The dataset is made of 500 care pathways with drugs deliveries during one year. This represents a total amount of 101; 793 events, with jIj = 3; 671. Drug deliveries are encoded with seq(t,p,e) atoms where e Threshold ( ) 5% 10% 20% 30% 5% 10% 20% 30% 5% 10% 20% 30% s n r e t t a P # 2;000 1;500 1;000

Memory usage (bytes) #Rare

#mRP is the active component of the delivered drug.

The rough mining of minimal rare sequential patterns with = 10% yields 7; 758 mRPs of length at most 3 (maxlen = 3). This pattern number makes these results dicult to analyze by clinicians.

Two types of constraints have been added in order to illustrate the pattern reduction. The rst one is a relaxation of rare patterns. Too rare patterns are poorly signicant but so numerous. We thus introduced a second parameters fmin such that the support of extracted patterns must be between fmin and . The second type of constraints concerns the maximum delay between two events of a pattern occurrence. This constraint, known as the maxgap constraint, avoids to enumerate an occurrence while two of its events are too far in time. It assumes that the rst event may causally be related to the second one.

Figure 5 illustrates the eciency of this constraint to reduce the number of rare patterns with = 10% and maxlen = 3. The plot on the left illustrates the number of patterns w.r.t. the maxgap constraints value (from 3 to 25 events). Considering that having larger max-gap reduces the number of non-zero patterns, this number increases with maxgap. The plot on the right illustrates the number of patterns w.r.t. the minimal frequency constraint, fmin (from 0.4% to 6%). The closer fmin is to , the fewer patterns will be generated. This number decreases exponentially with fmin. The number of patterns extracted with fmin = 6% fall down to 87. This number of patterns is suciently low to be presented to clinicians. Moreover, the computing time is reduced by using expert constraints. This nal experiment is solved in 425s which is signicantly faster than in our experiments on synthetic datasets.

The two constraints can be combined. We choose to select patterns with fmin = 5% and maxgap = 3. In such case, the number of patterns fall down to 133. One interesting pattern describes the deliveries of lamotrigine, an anti-epileptic drug, co-occurring with two antibiotics ceftriaxone and amoxicillin. This unexpected potential adverse drug reaction may lead to new epidemiological questions. 5

Conclusion

This article extends Gebser et al. [ 11 ] work, on declarative sequential pattern mining with Answer Set Programming by proposing an ecient encoding for rare patterns mining. The paper shows the versatility of the declarative approach to design easily a new data mining task. More specically, we have shown that ASP encodings is an ecient solution compared to procedural approaches. Unfortunatly, the memory consumption limits the analysis to mid-size databases. This problem is known in pattern mining with ASP [ 11 ] and the solution can be to use dedicated propagators [ 15 ] to improve computing eciency of the approach. We have also illustrated the benet of our approach by extracting sequential patterns satisfying additional expert constraints, in the context of a care pathway analytics application. In this context, our approach can eciently prune useless patterns and reduce computation times. As future work, we aim at nding and adding contraints through a human-machine interface in which the user expresses his needs and the machine converts them into ASP constraints.

Acknowledgements

This work is a part of the PEPS project funded by the French national agency for medicines and health products safety (ANSM), and of the SePaDec project funded by Region Bretagne. [1] Koh, Y.S., Ravana, S.D.: Unsupervised rare pattern mining: A survey. Transactions on Knowledge Discovery from Data 10(4) (2016) 129 37 (2016) 1324 Potassco: The

[2] Hu , Z. , Wang , H. , Zhu , J. , Li , M. , Qiao , Y. , Deng , C. : Discovery of rare sequential topic patterns in document stream . In: Proceedings of the SIAM International Conference on Data Mining . ( 2014 ) 533541

[3] Szathmary , L. : Finding minimal rare itemsets with an extended version of the Apriori algorithm . In: Proceedings of the International Conference on Applied Informatics . Volume 1 . ( 2014 ) 8592

[4] Muggleton , S. , de Raedt , L.: Inductive logic programming: Theory and methods . The Journal of Logic Programming 19 ( 1994 ) 629679

[5]

Raedt , L. , Guns , T. , Nijssen , S.: Constraint programming for itemset mining . In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining , ACM ( 2008 ) 204212

[6] Sebag , M. , Rouveirol , C. : Constraint inductive logic programming . In: Advances in Inductive Logic Programming . IOS Press ( 1996 ) 277294

[7] Corapi , D. , Russo , A. , Lupu , E.: Inductive logic programming in answer set programming . In: International Conference on Inductive Logic Programming , Springer ( 2011 ) 9197

[8] Jabbour , S. , Sais , L. , Salhi , Y. : Decomposition based SAT encodings for itemset mining problems . In: Proceeding of the Pacic-Asia Conference Advances on Knowledge Discovery and Data Mining , Part

. ( 2015 ) 662674

[9]

Raedt , L. , Guns , T. , Nijssen , S.: Constraint programming for data mining and machine learning . In: Proceedings of the Twenty-Fourth AAAI Conference on Articial Intelligence (AAAI-10) . ( 2010 ) 16711675

[10] Guns , T. , Dries , A. , Nijssen , S. , Tack , G. , Raedt , L.D.: Miningzinc: A declarative framework for constraintbased mining . Artif. Intell . 244 ( 2017 ) 629

[11] Gebser , M. , Guyet , T. , Quiniou , R. , Romero , J. , Schaub , T. : Knowledge-based sequence mining with ASP . In: Proceedings of the Twenty-Fifth International Joint Conference on Articial Intelligence (IJCAI) . ( 2016 ) 14971504

[12] Janhunen , T. , Niemel , I. : The answer set programming paradigm . AI Magazine

[13] Gebser , M. , Kaminski , R. , Kaufmann, B. , Ostrowski , M. , Schaub , T. , Schneider , M. : Potsdam answer set solving collection . AI Communications 24 ( 2 ) ( 2011 ) 107124

[14] Szathmary , L. , Valtchev , P. , Napoli , A. : Generating rare association rules using the minimal rare itemsets family . International Journal on Software and Informatics 4 ( 3 ) ( 2010 ) 219238

[15] Negrevergne , B. , Guns , T. : Constraint-based sequence mining using constraint programming . In: Proceedings of the International Conference on Integration of AI and OR Techniques in Constraint Programming . ( 2015 ) 288305