Introduction

Mining Correlated Association Rules from Multi-Relational Data using Interval Patterns

Hirohisa Seki

seki@nitech.ac.jp 1 2

Masahiro Nagao?

1 2 0 58, Department of Computer Science, Palacky University Olomouc , 2018. Copying permitted only for private and academic purposes 1 Dept. of Computer Science, Nagoya Institute of Technology Gokiso-cho , Showa-ku, Nagoya 466-8555 , Japan 2 c paper author(s), 2018. Proceedings volume published and copyrighted by its editors. Paper published in Dmitry I. Ignatov , Lhouari Nourine (Eds.): CLA 2018, pp. 47

47 58

In this paper, we consider the problem of mining numerical association rules (ARs) from a multi-relational database (MRDB). More specifically, we examine the effectiveness of numerical ARs with interval patterns (IPs) proposed by Kaytoue et al. in FCA (Formal Concept Analysis), and show that the MinIntChange algorithm by Kaytoue et al. can be readily extended to mine correlated interval-based ARs with the maximal significance in terms of the χ2 measure, by incorporating into the algorithm a pruning technique by Morishita et al. Moreover, since the search space for computing closed IPs becomes larger as the number of numerical attributes increases, we utilize Super CWC , an off the shelf feature selection algorithm to reduce the number of attributes to use. Our approach is experimentally evaluated and compared with the conventional methods such as a discretization-based approach or an optimization-based approach.

Introduction

Numerical data arise prevalently in databases, including business and scientific databases. Handling numerical (or quantitative) data in data mining has attracted much attention since the work on mining quantitative association rules by Srikant and Agrawal [ 16 ]. Conventionally, data discretization is commonly used to handle numerical data; for a quantitative attribute which can have continuous values, it reduces the number of values by dividing the range of the attribute into intervals. The other approaches to handling numerical data have also been proposed, including a statistical distribution-based approach and an optimization-based approach (see the survey in [ 14 ]).

Kaytoue et al. [ 8 ] proposed an FCA-based approach to handling quantitative attributes, and introduced the notions of closed interval patterns (CIPs) as well as generators. The notion of IPs is an instance of the general framework of pattern structures studied by Ganter and Kuznetsov [ 5 ]. Recently, some methods have been proposed to use CIPs for mining association rules [ 6,13 ]. In particular, the approach in [ 13 ] handles multi-relational data mining (MRDM); it uses CIPs for mining relational quantitative association rules of the form A → C, where A and C are relational patterns (i.e., logical conjunctions), and they consist of categorical attributes and quantitative ones from a given multi-relational data. In both work, although closed interval patterns allow us to represent intervals concisely, the numbers of generated patterns (i.e., rules) are still large, which makes the computation expensive and imposes a significant burden on the user’s understanding.

In this paper, we study the problem of mining optimal relational association rules that have the maximum χ2 value between the assumption and the conclusion of the rule. To find such rules, we use the original MinIntChange algorithm by Kaytoue et al., and incorporate into it a pruning technique by Morishita et al. [ 11 ]. Moreover, since the search space for computing closed IPs becomes larger as the number of numerical attributes increases, we utilize Super CWC [ 15 ], an off the shelf feature selection algorithm to reduce the number of attributes to use. We give some experimental results, which show the effectiveness of the proposed method.

The organization of the rest of this paper is as follows. We first summarize some basic notations and definitions of relational association rule mining and interval patterns in Sect. 2. We then explain our approach to mining quantitative association rules from multi-relational data in Sect. 3, and show some experimental results in Sect. 4. Finally, we give a summary of this work in Sect. 5. 2

Relational Association Rules with Quantitative Attributes 2.1

Relational Pattern Mining and Interval Patterns

We use some basic notions of MRDM in [ 3 ]. To represent data and patterns, we use a class of first-order logical formulas. An atom is an expression of the form p(t1, . . . .tn), where p is a predicate and each ti is a term (i.e., a constant or a variable). A substitution θ = {X1/t1, . . . , Xn/tn} is an assignment of terms to variables. The result of applying a substitution θ to a formula (i.e., an atom or a conjunction in this case) F is the formula F θ, where all occurrences of variables Vi have been simultaneously replaced by the corresponding terms ti in θ. The set of variables occurring in a formula F is denoted by Var (F ). A pattern is expressed as a conjunction l1 ∧ · · · ∧ ln of atoms, denoted simply by l1, . . . , ln.

A database DB is a set of ground atoms. For a pattern C, let answerset (C; DB ) be the set of substitutions θ such that Cθ is logically entailed by a database DB , denoted by DB |= Cθ.

In MRDM, we often specify one of the predicates as a key (e.g., [ 2,1 ]), which determines the entities of interest and what is to be counted. The key (target) is thus to be present in all patterns considered. Given a database DB and a conjunction C containing a key atom key (X), the support (or frequency ) of C, denoted by supp(C), is defined to be the number of different keys that answer C divided by the total number of keys. C is said to be frequent , if supp(C) is no less than some user defined threshold minsup. customerId (c1 ). customerId (c2 ). customerId (c3 ). . . . age(c1 , 30 ). age(c2 , 25 ). age(c3 , 55 ). . . .

marriedTo(c1 , c2 ). marriedTo(c2 , c1 ). marriedTo(c3 , c4 ). bigSpender (c1 ). bigSpender (c2 ).

income(c1 , 200 ). income(c2 , 120 ). income(c3 , 50 ).

An association rule we consider in this paper is an existentially quantified implication of the form: A → C, where A (resp., C) is a conjunction of the form: a1, . . . , am (resp., a single atom) (m ≥ 1). We call A (C) the antecedent (conclusion) of the rule, respectively. The support of a (relational) association rule is defined as the support of A ∧ C, while the confidence of an association rule is defined as the support of C divided by the support of the antecedent A. Following [ 7 ], we call a rule strong , if it satisfies both a minimum support threshold (minsup) and a minimum confidence (minconf ).

Example 1. Consider a toy example of a multi-relational database DB in Fig. 1, which is adapted and simplified from [ 3 ]. Predicate customerId is assumed to be a key. Let P be a pattern of the form: customerId (X), age(X, Q1), marriedTo(X, Y ), income(Y, Q2), whose meaning is obvious. Then, answerset (P ; DB ) contains substitutions {X/c1, Q1/30, Y /c2, Q2/120} and {X/c2, Q1/25, Y /c1, Q2/200}, for example.

The following rule is an example of association rules: customerId (X), age(X, Q1), marriedTo(X, Y ), income(Y, Q2) → bigSpender (X). (1) In the above, Q1 and Q2 are quantitative attributes, while the others are considered to be categorical ones. 2

We call a variable corresponding to a quantitative (resp., categorical) attribute a quantitative (resp., categorical ) variable. We also call a variable occurring in a key predicate a key variable.

Relational Association Rules with Interval Patterns We use interval

patterns to specify constraints on quantitative variables in an association rule. In the aforementioned association rule (1), for example, we consider the following association rule with constraints consisting of interval patterns: customerId (X), age(X, Q1), marriedTo(X, Y ), income(Y, Q2), hQ1, Q2i ∈ h[l1, u1], [l2, u2]i → bigSpender (X). (2) where li (ui) is a value of the domain of the attribute Qi (i = 1, 2), respectively.

Formally, let A be a conjunction such that A contains quantitative variables Q1, . . . , Qk (k ≥ 1). Then, we call an expression c of the form “hQ1, . . . , Qki ∈ hI1, . . . , Iki” an interval constraint of A, where Ii (1 ≤ i ≤ k) is an interval pattern for Qi.

Let θ be a substitution for Var (A) and Ii = [ei, fi] for some ei and fi. Then, DB |= (A, c)θ iff DB |= Aθ and Qiθ ∈ [ei, fi] for i = 1, . . . , k.

For simplicity, we write simply “A, I” instead of A, hQ1, . . . , Qki ∈ hI1, . . . , Iki, where I = hI1, . . . , Iki and we call I an interval pattern of A.

For a conjunction A which has no categorical variables except a key variable, we can define a closed pattern in the same way as [ 8 ]; let S = {θ1, · · · , θn} (n ≥ 0) be a set of substitutions for variables in A and let Q1, . . . , Qk be quantitative variables in A. Then, we define a mapping δ(·) as follows: for a substitution θ ∈ S such that θ ⊇ {Q1/a1, . . . , Qk/ak}, δ(θ) = h[a1, a1], . . . , [ak, ak]i. Namely, the mapping δ maps θ into an k-dimensional interval pattern h[a1, a1], . . . , [ak, ak]i.

Definition 1 (closed pattern).

For a conjunction A such that A has no categorical variables except a key variable, let S = {θ1, . . . , θn} (n ≥ 0) be a set of substitutions for variables in A, and I an interval pattern of A. We consider the following two operators (·)2: I2 = {θ | θ ∈ answerset (A, I; DB )}, S2 = hδ(θ1) u · · · u δ(θn)i.1.

Let I and J be an interval pattern of A. Then, I and J are equivalent if I2 = J 2 and we write it by I ≡ J . We call I closed if there does not exist any other interval pattern J such that I ≡ J and I v J . 2 2.2

Correlation Measures

Since the framework using the support/confidence only generates too many rules, we usually use another measure to find “interesting” ones among the generated rules. χ2-value is such a measure to find correlated rules; it is defined as a normalized derivation of observation from expectation. Given a contingency table in Table 1, where given m and n are both assumed to be constants, χ2 values are determined by x and y, and we thus denote it by χ2(x, y). The following property of χ2-value is shown by Morishita et al. [ 11 ].

Lemma 1 (Morishita et al.). [ 11 ] Let r(I0) be a rule with an interval pattern I0, and let a (b, resp.) be the number of (positive) tuples that satisfy the antecedent of r(I0). Let r(I) be a rule with an interval pattern I, and let p (q, resp.) be the number of (positive) tuples that satisfy the antecedent of r(I) such that 0 ≤ p ≤ a, 0 ≤ q ≤ b, q ≤ p and (p − q) ≤ (a − b). Then, we have 1 For I1 = h[ai, bi]ii∈{1,,˙k} and I2 = h[ei, fi]ii∈{1,,˙k}, u is the infimum operator defined by I1 u I2 = h[min(ai, ei), max(bi, fi)]ii∈{1,...,k}, and I2 v I1 ⇐⇒ [ei, fi] ⊆ [ai, bi], ∀i ∈ {1, . . . , k}.

χ2(p, q) ≤ max{χ2(b, b), χ2(a − b, 0)}. (3) 2

The right-hand side of (3) gives an upper bound of χ2(p, q), and we thus denote it by ub(r(I)).

QuantMiner [ 14 ], a GA-based algorithm, searches rules with high fitness function rules. The fitness function Fitness(·) is an evaluation measure for a rule, and it is based on the Gain measure proposed in [ 4 ]: Gain(A → C) = supp(A ∧ C) − min conf · supp(A). The Gain value is a measure giving a trade-off between support and confidence. Using x and y in Table 1, we write Gain(A → B) = G(x, y), and G(x, y) is also a convex function. 3

Mining Quantitative ARs with IPs from a MRDB Algorithm 1 shows the outline of our algorithm for mining correlated ARs with interval patterns from a MRDB.

Given a MRDB DB , the user first specifies a rule template of the form: A → C; it specifies conjunctions occurring in the left-hand side and the righthand side, and the right-hand side contains a single target atom C. The user also specifies values of categorical variables occurring in A and C. In case the values of the categorical variables in the rule template are not given, its possible values will be computed in the algorithm so that each of the categorical variable is instantiated to some value in its domain.

Next, we compute the answersets of A and A ∧ C, and we make the initial association rule rinit of the form: A, I⊥ → C, where I⊥ is the minimal interval constraint of A. In the aforementioned rule (2), for example, I⊥ = h[l1, u1], [l2, u2]i, where li (ui) (i = 1, 2) is the minimum (maximum) value of the domain of the attribute Qi, respectively.

If rinitis infrequent (i.e., its support supp(A ∧ C) is less than minsupp), then eDxBit,. bOythcearlwlinisge,awfuenccotmiopnuMteItCh+eps(eχt2)R(rionfits,t0r)o.ng rules with the best χ2 value on 8 9 10 11 12 13 14 15 16 17 18 19 end

end Algorithm 1: Correlated AR Mining from a MRDB input : a MRDB DB , minsupp, minconf . output: a set R of rules rb with the best χ2 value on DB .

MIC+p(χ2)(rinit, 0) ; 6 return R 1 A rule template of the form: A → C is specified by the user; // initial step 2 Compute answersets answerset (A; DB ) and answerset (A ∧ C; DB ); // mining step for categorical attributes 3 Make an initial association rule rinit of the form: A, I⊥ → C; 4 if A ∧ C is infrequent then return; 5 Initialize R ← ∅; τ ← −∞, and compute correlated ARs by calling 7 Function MIC+p(χ2)(a rule r(I) : A, I → C, an integer j) : a set R of rules with the best χ2 value on Database DB is

call MIC+p(χ2)(r(I1), i); A ← {μi,α | μi,α is applicable to I for some i ≥ j, α ∈ {l, r}}; foreach μi,α ∈ A do

I0 ← μi,α(I); if sup(r(I0)) < minsup or ub(r(I0)) < τ then continue I1 ← I022; if I1 fails the canonicity test then continue τ1 ← χ2 value of r(I1); if τ1 > τ and conf (r(I0)) ≥ minconf then τ ← τ1; R ← {r(I1)} else if τ1 = τ and conf (r(I0)) ≥ minconf then R ← R ∪ {r(I1)}; The function MIC+p(χ2)(r(I), j) is essentially the same as the MinIntChange algorithm by Kaytoue et al. [ 8 ]; the enumeration of closed IPs is done in the same way as the original MinIntChange. Namely, the algorithm generates its direct subsumers whose supports are strictly lower than its support. New interval patterns are generated by applying minimal changes to a given interval pattern (line 10). Since a closed interval pattern may be generated several times, we employ the canonicity test due to CloseByOne [ 10 ] (line 13).

Definition 2 (minimal change). [8]

Let I be an interval pattern of a conjunction A, where I = h[a1, b1], . . . , [ak, bk]i for some k ≥ 1.

A right minimal change μi,r(I) (1 ≤ i ≤ k) is defined as I0, where I0 is I with its i-th interval replaced by [ai, v] such that v = max{x ∈ Vi | x < bi} and Vi is the set of values which the quantitative variable Qi will take in a given database. A left minimal change μi,l(I) is defined dually.

A right minimal change μi,r is applicable to I if the resulting interval [ai, v] does not collapse, i.e., v − ai > 0. An applicable left minimal change μi,l(I) is defined dually. 2

MIC+p(χ2) incorporates into the original MinIntChange a pruning mechanism (line 11) based on Lemma 1 and a mechanism storing rules with currently best χ2 value (line 15–16). We then have the following properties:

Theorem 1 (Correctness of Algorithm 1). Let minsup be a given minimum

support and minconf a given minimum confidence. Let DB be a given database. Then, [Soundness] All output rules in R of Algorithm 1 give the best χ2 value on

DB , and satisfy both minsup and minconf . [Completeness] Let r(I) be an association rule of the form: A, I → C for some interval I. Then, if r gives the best χ2 value on DB and satisfies minsup and minconf , then Algorithm 1 outputs a rule r1 in R of the form: A, I1 → C such that I1 = I22.

Proof. Since the soundness is rather obvious, we omit its proof.

For the completeness, we have from the assumption and the completeness of MinIntChange that there exists a sequence s of minimal changes from the root to I1 = (I)22 such that I1 passes the canonicity test, i.e., it is not pruned in line 13. Furthermore, since the closure operator I022 (line 12) does not change its support, the computation corresponding to s is not pranced in line 11, either. Therefore, we have that r(I1) ∈ R. 2

We note that Algorithm 1 is generic in a sense that it works for another correlation measure, m, by replacing MIC+p(χ2) by MIC+p(m), provided that the correlation measure m has a property such as convexity so that it allows us to compute an upper bound ub(r(I)) (line 11). Such correlation measures include information gain, gini index and Gain, to mention a few.

Although the proposed pruning method makes the IP search space smaller, the search space becomes larger, when a given database has many quantitative attributes. To handle such cases, we utilize Super CWC [ 15 ], an off the shelf feature selection algorithm to reduce the number of attributes to use. We will show some experimental results in the following section. 4

Experimental Results

We show in Table 2 some datasets used in our experiments; one is from the UCI Machine Learning 2 and the others are from the CTU Prague Relational Learning Repository [ 12 ]3. We also show in Table 3 rule templates for those datasets used to compute correlated association rules, where Qi (i = 1, 2) are quantitative variables, while the other variables are categorical ones.

We have implemented our proposed method by using Java 8 on a PC with an Intel Core i7 processor running at 2.30GHz, 8GB of main memory, working under Windows 7 (64 bit). We have performed the following experiments varying the thresholds min sup at fixed min conf = 0.6.

Effects of the Pruning Method To see the effects of the pruning method, we present some results for the two datasets (mutagenesis and financial) in Figure 2. The figures (left) show the numbers of strong rules generated in computing correlated rules for the rule templates in Table 3, where those categorical attributes in each rule template take some values in their domains. The figures (right) show the corresponding execution times in milliseconds.

We have observed that the pruning method based on the branch-and-bound heuristics enables us to generate much less rules compared with the naive approach (i.e., without pruning). The execution time of both cases are also reduced accordingly.

Effects of the Number of Quantitative Attributes Next, to see the effects of the number of quantitative variables in mining correlated rules, we consider two more rule templates from the rule template, R2, for the Mondial dataset in Table 3, by varying the number of quantitative variables from 2 to 4; namely, one is a rule template R3 obtained by adding industry (X, Q3) to the antecedent of R2, while the other rule template, R4, is obtained similarly by adding inflation(X, Q4) to the antecedent of R3. 2 http://archive.ics.uci.edu/ml/datasets/statlog+(heart). 3 https://relational.fit.cvut.cz/. 0 0 800 ] 3^600 0 1 x [s400 n r e t tap200 # 0 Naive Pruning Naive Pruning 5

The number of generated strong rules and the execution time of computing the association rules are shown in Fig. 3. The number of different values in the domain of each quantitative variable Qi are shown in Table 4. The numbers of possible interval patternss made from the values of Q1, Q2, Q3 and Q4 thus could become very large. However, the figure shows that the numbers of generated rules using the pruning method only moderately increase. The pruning method thus works well also in this case. ]3600 ^ 0 1 [sx400 n r e t ta200 p # 0

Naive2 Naive3 Naive4 Pruning2 Pruning3 Pruning4 2000 ] s [m1500 e m iT1000 . c e xE 500 # 0

Naive2 Naive3 Naive4 Pruning2 Pruning3 Pruning4 2

The heart dataset in Table 2 contains 13 attributes; among them, 6 attributes take numerical values. In this case, the MinIntChange algorithm generates a large number of CIPs. In fact, we could not obtain outputs of our algorithm within a reasonable time when applying it directly to the complete dataset. We alleviate this problem by first choosing some of the numerical attributes from the dataset by using Super CWC mentioned in Sect. 3, and then applying our algorithm to this reduced dataset. Figure 4 shows the numbers of strong rules generated and the execution time in computing correlated rules for an initial association rule.

Naive Pruning 3000 In this paper, we have considered the problem of mining relational association rules, especially focusing on the use of closed interval patterns (CIPs) for finding correlated rules with the best χ2 values. Since the number of mined CIPs increases as the number of attributes and values in the domain of each attribute increases, we have examined the effectiveness of the original MinIntChange algorithm [ 8 ] with the pruning technique by Morishita et al. on the problem. We have also examined the effectiveness of the use of a feature selection algorithm, Super CWC , to reduce the search space of IPs.

Most of the work in the field of MRDM have handled numerical data by using data discretization. To the best of our knowledge, there has been no approach which uses closed interval patterns for mining correlated association rules in multi-relational data.

For future work, we will examine another correlation measure m in MIC+p(m), since our algorithm is generic and will work for another measure. Since the search space for CIPs is still large in general and the computation of CIPs is costly, we will need some method for reducing the computational time and space to a manageable size.

Acknowledgment The authors would like to thank anonymous reviewers for their useful comments on the previous version of the paper. This work was partially supported by JSPS Grant-in-Aid for Scientific Research (C) 15K00305 and 18K11432.

1. De Raedt , L. , Ramon , J.: Condensed representations for Inductive Logic Programming . In: Proc. KR'04 , pp. 438 - 446 ( 2004 )

2. Dehaspe , L. : Frequent pattern discovery in first-order logic . Ph.D. thesis , Dept. Computer Science, Katholieke Universiteit Leuven ( 1998 )

3. Dzeroski, S.: Multi-Relational Data Mining: An Introduction . SIGKDD Explorations Newsletter 5 ( 1 ), 1 - 16 ( 2003 )

4. Fukuda , T. , Morimoto , Y. , Morishita , S. , Tokuyama , T. : Data mining using twodimensional optimized association rules: Scheme, algorithms, and visualization . In: Proc. of ACM SIGMOD '96 . pp. 13 - 23 ( 1996 )

5. Ganter , B. , Kuznetsov , S. : Pattern Structures and Their Projections . In: ICCS-01, LNCS , vol. 2120 , pp. 129 - 142 . Springer ( 2001 )

6. Guyet , T. , Quiniou , R. , Masson, V.: Mining relevant interval rules . In: Supplementary Proceedings of ICFCA . pp. 79 - 82 ( 2017 )

7. Han, J ., Kamber , M. : Data Mining: Concepts and Techniques . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA ( 2006 )

8. Kaytoue , M. , Kuznetsov , S.O. , Napoli , A. : Revisiting numerical pattern mining with formal concept analysis . In: International Conference on Artificial Intelligence (IJCAI 2011 ). pp. 1342 - 1347 ( 2011 )

9. Kurgan , L.A. , Cios , K.J.: CAIM discretization algorithm . IEEE Transactions on Knowledge and Data Engineering 16 ( 2 ), 145 - 153 ( Feb 2004 )

10. Kuznetsov , S.O. , Obiedkov , S.A.: Comparing performance of algorithms for generating concept lattices . J. Exp. Theor. Artif. Intell . 14 ( 2-3 ), 189 - 216 ( 2002 )

11. Morishita , S. , Nakaya , A. : Parallel branch-and-bound graph search for correlated association rules . In: Proc. of ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems . pp. 265 - 276 ( 1999 )

12. Motl , J. , Schulte , O. : The CTU Prague Relational Learning Repository . ArXiv e-prints ( Nov 2015 )

13. Nagao , M. , Seki , H.: On mining quantitative association rules from multi-relational data with FCA . In: Proc. IEEE IWCIA2016 . pp. 81 - 86 ( 2016 )

14. Salleb-Aouissi , A. , Vrain , C. , Nortet , C. : QuantMiner: A genetic algorithm for mining quantitative association rules . In: IJCAI . pp. 1035 - 1040 ( 2007 )

15. Shin , K. , Kuboyama , T. , Hashimoto , T. , Shepard , D. : Super-cwc and super-lcc: Super fast feature selection algorithms . In: proceedings of Big Data . pp. 61 - 67 ( 2015 )

16. Srikant , R. , Agrawal , R.: Mining quantitative association rules in large relational tables . In: Proc. of ACM SIGMOD '96 . pp. 1 - 12 ( 1996 )