=Paper=
{{Paper
|id=Vol-1649/147
|storemode=property
|title=Evaluating Association Rules in Boolean Matrix Factorization
|pdfUrl=https://ceur-ws.org/Vol-1649/147.pdf
|volume=Vol-1649
|authors=Jan Outrata, Martin Trnecka
|dblpUrl=https://dblp.org/rec/conf/itat/OutrataT16
}}
==Evaluating Association Rules in Boolean Matrix Factorization==
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 147–154, http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 J. Outrata, M. Trnecka

Evaluating Association Rules in Boolean Matrix Factorization

Jan Outrata and Martin Trnecka*

Department of Computer Science, Palacký University Olomouc, Czech Republic
jan.outrata@upol.cz, martin.trnecka@gmail.com

* Support by grant No. GA15-17899S of the Czech Science Foundation is acknowledged. M. Trnecka also acknowledges partial support by grant No. PrF_2016_027 of IGA of Palacký University Olomouc.

Abstract: Association rules, or association rule mining, is a well-established and popular method of data mining and machine learning successfully applied in many different areas since the mid-nineties. Association rules form the ground of the Asso algorithm for the discovery of the first (presumably most important) factors in Boolean matrix factorization. In Asso, the confidence parameter of association rules heavily influences the quality of factorization. However, association rules, in a more general form, appear already in GUHA, a knowledge discovery method developed since the mid-sixties. In the paper, we evaluate the use of various (other) types of association rules from GUHA in Asso and, from the other side, a possible utilization of (particular) association rules in other Boolean matrix factorization algorithms not based on the rules. We compare the quality of factorization produced by the modified algorithms with that produced by the original algorithms.

1 Introduction

1.1 The problem and algorithms

Boolean matrix factorization (BMF), also called Boolean matrix decomposition, aims to find, for a given n × m object-attribute incidence Boolean matrix I, an n × k object-factor Boolean matrix A and a k × m factor-attribute Boolean matrix B such that the Boolean matrix product (see below) of A and B (approximately) equals I. A decomposition of I into A and B may be interpreted as a discovery of k factors exactly or approximately explaining the data. Interpreting the matrices I, A and B as the object-attribute, object-factor and factor-attribute incidence matrices, respectively, A and B explain I as follows: object i has attribute j (the entry I_ij corresponding to row i and column j is 1) if and only if there exists a factor l such that l applies to i and j is one of the particular manifestations of l. The optimization version of this basic BMF problem demands k to be as small as possible. The least k for which an exact decomposition of I exists is called the Boolean rank (or Schein rank) of I.

In the literature, two common variants of the optimization problem are defined and dealt with: the approximate factorization problem (AFP) [3], where the number k of factors is sought to be minimal for a prescribed maximal difference (error) of the Boolean product of A and B from I (usually zero), and the discrete basis problem (DBP) [11], where the error is sought to be minimal for a prescribed (fixed) k. The formal definitions of the problems are given below. For the former problem, greedy approach algorithms seem popular. One of the most efficient ones, GreConD, is proposed in [3] (where it is called Algorithm 2). Another one, GreEss [4], refines GreConD based on a deeper insight into the problem. For the latter problem, probably the most known (and a basic) algorithm is Asso, proposed in [11] together with a few of its variants. Other BMF algorithms proposed in the recent data mining literature that can be tailored to either of the two problems are e.g. Hyper [16] or Panda [10].

GreConD (and GreEss and their variants) finds factors as maximal rectangles, i.e. Boolean matrices whose entries with 1 form a maximal rectangular area (full of 1s) upon a suitable permutation of rows and columns. This concept comes from the geometric view of BMF and from formal concept analysis (FCA). The view tells us that finding a decomposition of I means finding a coverage of the 1s in I by rectangles full of 1s [3, 4], and in FCA maximal rectangles full of 1s correspond to so-called formal concepts [5]. We will use maximal rectangles as factors to describe the algorithms now. The GreConD algorithm, in its search for a factor, starts with the empty set of attributes, to which a selected attribute, possibly together with other attributes, is repeatedly added. The constructed set of attributes together with the set of all objects having all the attributes determines a maximal rectangle (forms a formal concept). The selected attribute is such that the rectangle with the attributes added covers as many still uncovered 1s in I as possible. The attributes are added repeatedly as long as the number of still uncovered 1s in I covered by the rectangle grows (note the greedy approach here). Further factors are sought the same way, and the algorithm stops when the prescribed number of 1s in I is covered by the maximal rectangles corresponding to factors (i.e. the algorithm is designed for the AFP). Characteristic vectors of the object sets determining the found maximal rectangles then constitute the columns of matrix A, and characteristic vectors of the attribute sets of the maximal rectangles constitute the rows of matrix B.

Asso, on the other hand, starts its search for each of (at most) the prescribed number k of factors with a selected set of attributes and searches for a set of objects such that the Boolean product of the characteristic vectors of the two sets, as a rectangle (not necessarily maximal) corresponding to the factor, covers as many 1s in I as possible.
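The greedy, cover-driven construction of GreConD factors described above can be sketched concretely. The following is an illustrative NumPy sketch of the idea only (the function names and data representation are ours; the authors' implementation is in MATLAB with critical parts in C, see section 4):

```python
import numpy as np

def closure(I, D):
    """Formal-concept closure in I of an attribute set D (Boolean mask):
    the objects having every attribute in D, then all attributes shared
    by those objects (the operators written D-down and C-up in the paper)."""
    C = I[:, D].all(axis=1)   # objects having all attributes in D
    Dc = I[C].all(axis=0)     # attributes common to all those objects
    return C, Dc

def grecond(I):
    """Greedy from-below factorization in the spirit of GreConD (sketch).

    I is a Boolean numpy array (n x m); returns Boolean A (n x k), B (k x m)
    with A o B = I (here the algorithm runs until full coverage)."""
    n, m = I.shape
    U = I.copy()                      # 1s of I not covered yet
    factors = []
    while U.any():
        D = np.zeros(m, dtype=bool)   # attribute set of the factor under construction
        C = np.ones(n, dtype=bool)    # its extent (all objects for empty D)
        best = 0
        while True:                   # greedily add attributes while coverage grows
            step = None
            for j in np.where(~D)[0]:
                Dj = D.copy()
                Dj[j] = True
                Cj, Dcj = closure(I, Dj)
                V = U[np.ix_(Cj, Dcj)].sum()   # uncovered 1s the rectangle covers
                if V > best:
                    best, step = V, (Cj, Dcj)
            if step is None:
                break
            C, D = step
        factors.append((C, D))
        U[np.ix_(C, D)] = False       # mark the rectangle's 1s as covered
    A = np.column_stack([C for C, _ in factors])
    B = np.vstack([D for _, D in factors])
    return A, B
```

Because every factor is a formal concept, its rectangle lies entirely within the 1s of I, so the sketch never overcovers a 0 (the from-below property discussed next).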
At the same time, however, Asso requires that the rectangle overcovers as few 0s in I by 1s as possible (i.e. the algorithm is designed for the DBP). Note here that Asso is thus admitted to overcover 0s in I in its aim to cover 1s, while GreConD does not do that (whereby GreConD is said to perform a from-below decomposition of I). Characteristic vectors of the selected sets of attributes constitute the rows of matrix B, while characteristic vectors of the corresponding found sets of objects constitute the columns of matrix A. The sets of attributes are selected, under the same conditions as for the search of objects, among candidate sets represented by the rows of an attribute-attribute matrix called the association matrix. This m × m Boolean matrix, created in the first step of the algorithm, contains 1 in the row corresponding to an attribute i and the column corresponding to an attribute j if the association rule i ⇒ j has (so-called) confidence – the ratio of the number of objects having both i and j to the number of objects having i – greater than a user-defined parameter τ; otherwise it contains 0 there.

1.2 Associations in BMF

The association rules in Asso, also known from association rule mining [1], are one type of relationship between two (Boolean) attributes in (Boolean) object-attribute data. There are other types, in general between two logical formulas over attributes instead of between just two single attributes, introduced in the literature. After all, general association rules, whose validity in data is defined through a function of the numbers of objects having and not having both or one of the attributes (or, in general, satisfying and not satisfying the formulas over attributes), are introduced and operated with in GUHA [7, 8, 9, 13], one of the oldest, less known but most sophisticated methods of knowledge discovery. In GUHA, several logical and statistical functions, called quantifiers, used to interpret several different types of association rules are introduced. Actually, one of the quantifiers (the basic one), founded implication, interprets the association rule used in Asso (and in association rule mining, see below).

The topic of this paper is twofold. First, we pick up several (other) types of association rules, interpreted by selected GUHA quantifiers, and use them in place of Asso's association rule in the Asso algorithm. Second, vice versa, we take the concept of the association matrix from Asso and utilize it in greedy-approach algorithms. Namely, we take a particular association matrix and use the rows of the matrix as characteristic vectors of candidates for the initial sets of attributes to which further attributes are added in GreConD, instead of the empty set. Both modifications of the algorithms are novel ideas not previously discussed in the literature. The main purpose of the paper is to evaluate the use of various types of association rules in the Asso algorithm and the use of the association matrix in the GreConD algorithm. The evaluation is done by an experimental comparison of the quality of decompositions obtained from the modified algorithms with those obtained from their respective original versions.

The rest of the paper is organized as follows. In the following section 2 we briefly make precise the BMF problems introduced above and recall GUHA, namely the quantifiers and general association rules. Then, in section 3, the modification of Asso with GUHA association rules and the modification of GreConD with rows of a particular association matrix as initial attribute sets are presented. The modified algorithms are experimentally evaluated in section 4, and section 5 draws a conclusion and future research directions.

2 Basic notions of BMF and GUHA

2.1 Boolean Matrix Factorization

We make precise the basic BMF notions and problems recalled at the beginning of the previous section. Denote by {0,1}^{n×m} the set of all n × m Boolean matrices (i.e. with entries either 1 or 0). The basic problem in BMF is to find, for a given I ∈ {0,1}^{n×m}, matrices A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m} for which

  I (approximately) equals A ∘ B,   (1)

where ∘ is the Boolean matrix product, i.e.

  (A ∘ B)_ij = max_{l=1,...,k} min(A_il, B_lj).

The approximate equality in (1) is assessed by means of the well-known L1-norm ||I|| = Σ_{i,j=1}^{n,m} |I_ij| and the corresponding distance (error) function E(I, A ∘ B) defined by

  E(I, A ∘ B) = ||I − A ∘ B|| = Σ_{i,j=1}^{n,m} |I_ij − (A ∘ B)_ij|.

In BMF, one is usually interested in two parts of the error function E: E_u, corresponding to 1s in I that are 0s (and hence uncovered) in A ∘ B, and E_o, corresponding to 0s in I that are 1s (and hence overcovered) in A ∘ B:

  E(I, A ∘ B) = E_u(I, A ∘ B) + E_o(I, A ∘ B), where
  E_u(I, A ∘ B) = |{⟨i, j⟩; I_ij = 1, (A ∘ B)_ij = 0}|,
  E_o(I, A ∘ B) = |{⟨i, j⟩; I_ij = 0, (A ∘ B)_ij = 1}|,

or, more often, in the relative errors

  e_u(l) = E_u(I, A ∘ B)/||I||,  e_o(l) = E_o(I, A ∘ B)/||I||.   (2)

e_u and e_o are functions of the first l factors delivered by the particular algorithm and measure how well the data is explained by the l factors. The value of 1 − e_u represents the (pure) coverage of the data by the observed factors. We will use 1 − e_u and e_o in the experiments in section 4 below. Note that the value of c = 1 − e_u − e_o represents the overall coverage of the data by the factors and is commonly used to assess the quality of decompositions delivered by BMF algorithms [3, 4, 6, 11].

The two optimization BMF problems are defined as follows:

Definition 1 (Approximate Factorization Problem, AFP [3]). Given I ∈ {0,1}^{n×m} and a prescribed error ε ≥ 0, find A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m} with k as small as possible such that ||I − A ∘ B|| ≤ ε.

AFP emphasizes the need to account for (and thus to explain) a prescribed (presumably reasonably large) portion of the data, which is specified by ε.

Definition 2 (Discrete Basis Problem, DBP [11]). Given I ∈ {0,1}^{n×m} and a positive integer k, find A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m} that minimize ||I − A ∘ B||.

DBP emphasizes the importance of the first k (presumably most important) factors. A thorough study of Boolean matrix factorization from the points of view of the two problems is provided in [4].

2.2 GUHA

We will only recall the necessary notions of the GUHA (General Unary Hypotheses Automaton) method [7, 8] which are required to describe the various types of association rules used in the modified Asso algorithm. For a thorough treatise on the foundations of the method (of mechanized formation of hypotheses, in particular its observational predicate calculi), see the books [9] or [13].

GUHA, more precisely its ASSOC procedure [9] (not to be confused with the Asso algorithm!) or the more enhanced 4FT-Miner procedure [14] for Boolean data, inputs (Boolean) object-attribute incidence data with Boolean attributes, which we represent by an n × m Boolean matrix I. A (general) association rule (over a given set of attributes) is an expression

  i ≈ j

where i and j are attributes. (Note that in its full form, a GUHA general association rule is an expression φ ≈ ψ where φ and ψ are arbitrary complex logical formulas over the attributes. We consider the simplified case with just single attributes for the formulas, as in the association rules used in Asso and in association rule mining.) i ≈ j is true in I if there is an association between i and j, interpreted by a function q assigning the value 1 (logical truth) to the four-fold table 4ft(i, j, I) corresponding to i ≈ j and I. 4ft(i, j, I) is the quadruple

  ⟨a, b, c, d⟩ = ⟨fr(i ∧ j), fr(i ∧ ¬j), fr(¬i ∧ j), fr(¬i ∧ ¬j)⟩

where fr(i ∧ j) is the number of objects having both i and j in I (rows of I in which there is 1 in both columns corresponding to i and j), and ¬i is the attribute corresponding to the negation of attribute i (i.e. the column of I corresponding to i in which 1s are replaced by 0s and vice versa). 4ft(i, j, I) is usually depicted as

  I  |  j              |  ¬j
  i  |  a = fr(i ∧ j)  |  b = fr(i ∧ ¬j)
  ¬i |  c = fr(¬i ∧ j) |  d = fr(¬i ∧ ¬j)

A function q which assigns to any four-fold table 4ft(i, j, I) a logical value 0 or 1 defines a so-called (generalized, GUHA) quantifier. There are several different quantifiers, logical and statistical, summarized e.g. in [13], which interpret different types of association rules (with different meanings of the association ≈ between the attributes):

• founded (p-)implication, ⇒_p (for ≈):

  q(a, b, c, d) = 1 if a/(a+b) ≥ p, and 0 otherwise.

The parameter 0 < p ≤ 1 has the meaning of a threshold for the confidence of the association rule i ⇒_p j, i.e. the ratio of the number of objects having in I both attributes i and j to the number of objects having i. Founded implication interprets the association rule used in the original Asso algorithm (with p denoted as τ instead) and in association rule mining, which is thus a particular case of GUHA general association rules.

• double founded implication, ⇔_p:

  q(a, b, c, d) = 1 if a/(a+b+c) ≥ p, and 0 otherwise.

Compared to founded implication, double founded implication, to evaluate to 1, requires that the number of objects having in I both i and j is at least 100·p% of the number of objects having i or j.

• founded equivalence, ≡_p:

  q(a, b, c, d) = 1 if (a+d)/(a+b+c+d) ≥ p, and 0 otherwise.

Meaning: at least 100·p% of all objects in I agree on the attributes, i.e. have both i and j or neither of them.

• E-equivalence, ∼_δ^E:

  q(a, b, c, d) = 1 if max(b/(a+b), c/(c+d)) < δ, and 0 otherwise.

Meaning: less than 100·δ% (0 < δ ≤ 0.5) of the objects having i do not have j, and less than 100·δ% of the objects not having i have j.

• negative Jaccard distance:

  q(a, b, c, d) = 1 if (b+c)/(b+c+d) ≥ p, and 0 otherwise.

This is our new quantifier resembling the Jaccard distance dissimilarity measure used in data mining (which is one minus the Jaccard similarity coefficient [15], which in turn is equal to double founded implication above). Compared to double founded implication, this quantifier, to evaluate to 1, requires that, among the objects not having both i and j, at least 100·p% have i or j.

In fact, the quantifiers presented above, except for the last one, are simplified versions of quantifiers defined in [9], where an additional parameter s > 0 is included:

• founded implication, ⇒_{p,s}:

  q(a, b, c, d) = 1 if a/(a+b) ≥ p and a ≥ s, and 0 otherwise,

and similarly for the other quantifiers. For an association rule i ≈ j, s has the meaning of a threshold for the so-called support of the rule – the number of objects having in I both attributes i and j (or, if normalized as in association rule mining, the ratio of this number to the number of all objects in I).

3 The modified algorithms

3.1 Asso with GUHA association rules

The modified Asso algorithm, involving a GUHA (general) association rule interpreted by a GUHA quantifier, is depicted in Algorithm 1.

Algorithm 1: Modified Asso algorithm
Input: A Boolean matrix I ∈ {0,1}^{n×m}, a positive integer k, a threshold value τ ∈ (0,1], real-valued weights w+, w− and a quantifier q_τ (with parameter τ) interpreting i ≈ j
Output: Boolean matrices A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m}
1  Q ← empty m × m Boolean matrix
2  for i = 1,...,m do
3    for j = 1,...,m do
4      Q_ij ← q_τ(a, b, c, d) where ⟨a, b, c, d⟩ = 4ft(i, j, I)
5    end
6  end
7  A ← empty n × k Boolean matrix, B ← empty k × m Boolean matrix
8  for l = 1,...,k do
9    (Q_i_, e) ← argmax_{Q_i_, e ∈ {0,1}^{n×1}} cover([B; Q_i_], [A e], I, w+, w−)
10   A ← [A e], B ← [B; Q_i_]
11 end
12 return A and B

Q_i_ denotes the i-th row vector of the (Boolean) association matrix Q, [A e] and [B; Q_i_] denote A extended by the column e and B extended by the row Q_i_, respectively, and the function cover(B, A, I, w+, w−) is defined as

  w+ · |{⟨i, j⟩; I_ij = 1, (A ∘ B)_ij = 1}| − w− · |{⟨i, j⟩; I_ij = 0, (A ∘ B)_ij = 1}|.

The original algorithm was described in the introduction, section 1. The only modification in Algorithm 1 with respect to the (generic) version of the original algorithm in [11] is computing the association matrix Q (lines 1–6) using the given quantifier q_τ(a, b, c, d) interpreting a (general) association rule i ≈ j, instead of using the confidence of the association rule i ⇒_τ j.

3.2 GreConD using the association matrix

Due to the particular way of finding factors in the GreConD algorithm, namely as maximal rectangles, we need to use a particular association matrix whose rows are used as characteristic vectors of candidates for the initial attribute sets in the algorithm. The matrix used is computed using the GUHA quantifier founded implication with parameter p set to 1; hence the confidence of the interpreted association rule i ⇒_p j must be 1 for the rule to be true (which precisely coincides with the notion of attribute implication between attributes i and j in formal concept analysis, see [5]). This, at the same time, means that this association matrix is the special case of the association matrix of the Asso algorithm with τ = 1.

The modified GreConD algorithm using the association matrix is depicted in Algorithm 2.

Algorithm 2: Modified GreConD algorithm
Input: A Boolean matrix I ∈ {0,1}^{n×m} and a prescribed error ε ≥ 0
Output: Boolean matrices A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m}
1  Q ← empty m × m Boolean matrix
2  for i = 1,...,m do
3    for j = 1,...,m do
4      if i ⇒_1 j is true in I then
5        Q_ij ← 1
6      end
7    end
8  end
9  A ← empty n × k Boolean matrix
10 B ← empty k × m Boolean matrix
11 while ||I − A ∘ B|| > ε do
12   D ← argmax_{Q_i_} cover(Q_i_, I, A, B)
13   V ← cover(D, I, A, B)
14   while there is j such that D_j = 0 and cover(D + [j], I, A, B) > V do
15     j ← argmax_{j, D_j = 0} cover(D + [j], I, A, B)
16     D ← (D + [j])↓↑
17     V ← cover(D, I, A, B)
18   end
19   A ← [A D↓], B ← [B; D]
20 end
21 return A and B

D_j denotes the j-th item of the (row Boolean) vector D ∈ {0,1}^{1×m}, [j] ∈ {0,1}^{1×m} denotes the (row Boolean) vector with the j-th item equal to 1 and all other items equal to 0, and the function cover(D, I, A, B) is defined as

  ||(D↓ × D↓↑) ⋅ (I − A ∘ B)||

(the number of so far uncovered 1s in I covered by the rectangle given by D↓ and D↓↑). The (formal concept-forming [5]) operators ↑ and ↓, applied to a (column Boolean) vector C ∈ {0,1}^{n×1} and to the vector D, respectively, are defined by

  C↑ ∈ {0,1}^{1×m} with (C↑)_j = 1 iff I_ij = 1 for each i with C_i = 1,
  D↓ ∈ {0,1}^{n×1} with (D↓)_i = 1 iff I_ij = 1 for each j with D_j = 1.

Again, the original algorithm was described in the introduction, section 1. The only modifications in Algorithm 2 with respect to the (generic) version of the original algorithm in [3] are computing the particular association matrix Q (lines 1–8), using the quantifier founded implication with p = 1 interpreting the association rule i ⇒_1 j, and using the rows of the matrix as characteristic vectors of candidates for the initial attribute sets (line 12) in the factor construction (lines 12–19).

4 Experimental evaluation

In this section, we provide an experimental evaluation of the modified algorithms and their comparison to the original versions, the Asso algorithm and the GreConD algorithm. Due to the lack of space we do not present a comparison with other algorithms and approaches to general BMF. A comparison that includes both algorithms can be found for example in [4].

As in the typical experiment scenario occurring in various BMF papers, we use both synthetic and real datasets. Experiments on synthetic datasets enable us to analyze the performance of the algorithms on data with the same and known characteristics – we can analyze results in the average case. On real data, we can study the meaning of the obtained results. Let us also note that synthetic data are artificial, while real data are influenced by real factors.

Synthetic data. We created 1000 randomly generated datasets. Every dataset X_i has 500 rows and 250 columns and was obtained as the Boolean product X_i = A_i ∘ B_i of matrices A_i and B_i that were generated randomly with the parameters shown in Table 1. The inner dimension k was set to 40 for all X_i, i.e. the expected number of factors is 40.

Table 1: Synthetic data
  Dataset | k  | dens A | dens B | dens I
  Set C1  | 40 | 0.07   | 0.04   | 0.10
  Set C2  | 40 | 0.07   | 0.06   | 0.15
  Set C3  | 40 | 0.11   | 0.05   | 0.20

Real data. We used the datasets DNA [12], Mushroom [2] and Zoo [2], the characteristics of which are shown in Table 2. All of them are well known and widely used in the literature on BMF.

Table 2: Real data
  Dataset  | Size     | ||I||
  DNA      | 4590×392 | 26527
  Mushroom | 8124×119 | 186852
  Zoo      | 101×28   | 862

4.1 Asso with GUHA association rules

We observe the values of 1 − e_u(l) (2) for l = 0,...,k, where k is the number of factors delivered by a particular algorithm. Clearly, for l = 0 (no factors, A and B are "empty") we have 1 − e_u(l) = 0. In accordance with general requirements on BMF, for a good factorization algorithm 1 − e_u(l) should be increasing in l and should have relatively large values already for small l, and it is desirable that for l = k we have I = A ∘ B, i.e. the data is fully explained by all k computed factors (in which case 1 − e_u(l) = 1). For the synthetic datasets C1, C2 and C3, the values of 1 − e_u(l) are shown in Figures 1, 2 and 3, respectively.

As we mentioned above, the Asso algorithm admits overcovering 0s of the input data matrix. The number of overcovered 0s is a very important value, and the values of e_o(l) (2) for the synthetic datasets C1, C2 and C3 are shown in Figures 4, 5 and 6. Let us note that the results marked as "founded implication" are in fact results for the original Asso algorithm. Note also that all variants of Asso require us to set τ and (one of) w+ and w−, see Algorithm 1. Based on some previous experimental results (see [4]) we fixed w+ = w− = 1 and used the value of τ for which the particular algorithm produces the best results. In most cases, the best choice was 0.8 < τ < 1. This observation corresponds with the results in [4].

We can see that the original algorithm is outperformed in terms of both coverage (1 − e_u) and overcoverage (e_o) by the modification utilizing double founded implication. This modification produces large values of coverage and, compared to the original Asso algorithm, commits a smaller overcoverage error. This is true for both synthetic and real datasets. Very promising is also the modification utilizing the negative Jaccard distance quantifier.

The modifications utilizing founded equivalence and E-equivalence, however, do not perform well on the synthetic datasets. In the case of dataset C1 – the most sparse one – both modifications commit an extremely large overcoverage error; the values are beyond the scale of Figure 4. In the cases of C2 and C3, Figures 2 and 3, they produce poor coverage while, for higher numbers of factors, the overcoverage error is not much different from the modifications utilizing the other quantifiers. On the other side, for the real datasets the results are comparable with the other modifications (Figures 7, 8), with significantly smaller overcoverage errors (Figures 10, 11). The only exception is the Mushroom dataset, where founded equivalence is again beyond the scale of Figure 10. Due to the limited space we do not include the results for the DNA dataset, which are very close to the results obtained for the Zoo dataset.
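The Asso variants compared above differ only in the quantifier used to fill the association matrix Q. For concreteness, the four-fold table and two of the quantifiers from section 2.2 can be sketched as follows (an illustrative NumPy sketch under our own naming; the paper's implementation is in MATLAB with critical parts in C):

```python
import numpy as np

def fourfold(I, i, j):
    """Four-fold table 4ft(i, j, I) = <a, b, c, d> for attribute columns i, j
    of a Boolean object-attribute matrix I."""
    ci, cj = I[:, i].astype(bool), I[:, j].astype(bool)
    a = int(np.sum(ci & cj))     # objects having both i and j
    b = int(np.sum(ci & ~cj))    # objects having i but not j
    c = int(np.sum(~ci & cj))    # objects having j but not i
    d = int(np.sum(~ci & ~cj))   # objects having neither
    return a, b, c, d

def founded_implication(a, b, c, d, p=0.8):
    # confidence of i => j is at least p
    return a + b > 0 and a / (a + b) >= p

def double_founded_implication(a, b, c, d, p=0.8):
    # Jaccard similarity of the columns i, j is at least p
    return a + b + c > 0 and a / (a + b + c) >= p

def association_matrix(I, quantifier):
    """m x m Boolean association matrix Q with Q[i, j] = quantifier(4ft(i, j, I))."""
    m = I.shape[1]
    Q = np.zeros((m, m), dtype=bool)
    for i in range(m):
        for j in range(m):
            Q[i, j] = quantifier(*fourfold(I, i, j))
    return Q
```

Swapping `quantifier` for a different predicate on ⟨a, b, c, d⟩ yields the other evaluated variants; with `founded_implication` and p = 1 one obtains the association matrix used in the GreConD modification.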
[Figure 1: Coverage for synthetic dataset C1]
[Figure 2: Coverage for synthetic dataset C2]
[Figure 3: Coverage for synthetic dataset C3]
[Figure 4: Overcoverage for synthetic dataset C1]
[Figure 5: Overcoverage for synthetic dataset C2]
[Figure 6: Overcoverage for synthetic dataset C3]
[Figure 7: Coverage for Mushroom dataset]
[Figure 8: Coverage for Zoo dataset]
[Figure 9: Original and modified GreConD on Mushroom dataset]
[Figure 10: Overcoverage for Mushroom dataset]
[Figure 11: Overcoverage for Zoo dataset]
[Figure 12: Original and modified GreConD on DNA dataset]

(Figures 1–8, 10 and 11 plot coverage, resp. overcoverage, against the number of factors for the founded implication, double founded implication, founded equivalence, negative Jaccard distance and E-equivalence variants; Figures 9 and 12 plot coverage of the original GreConD against the modification, labeled "GreConD implication".)

Presented results are representative. We performed the same evaluations on several other datasets which we have not included in the paper and observed the same type of results as described above.

4.2 GreConD using the association matrix

Due to the limited space we do not present a comparison of the original GreConD algorithm and the modification utilizing the association matrix (Algorithm 2) on the synthetic datasets. On each X_i the modified GreConD slightly outperforms the original algorithm from the standpoint of coverage. Moreover, the modified algorithm also tends to produce fewer factors, i.e. it outperforms the original GreConD from both the AFP and DBP views (see section 2).

The values of 1 − e_u for the Mushroom and DNA datasets are shown in Figures 9 and 12, respectively. We can see that the modified algorithm at first loses for small numbers of factors, but in the end it outperforms the original GreConD algorithm – i.e. it outperforms GreConD from the AFP view. We observed similar behavior for the other real datasets as well.

Time complexity. We implemented all the used algorithms in MATLAB with critical parts written in C. Theoretical time complexity is not of our primary concern. Practically, it follows from our observations that the modification of the GreConD algorithm is slightly faster than the original algorithm, and all modifications of the Asso algorithm are equally fast as the original.

5 Conclusion

We evaluated the use of various types of (general) association rules from the GUHA knowledge discovery method in the Boolean matrix factorization (BMF) algorithm Asso, and a utilization of (particular) association rules in the greedy-approach BMF algorithm GreConD, which is not based on association rules. The comparison of the quality of factorization produced by the modified algorithms with that produced by the original algorithms, on both synthetic and selected real data, showed that our modified algorithms outperform, for some types of rules, the original ones. Namely, the double founded implication and (our new) negative Jaccard distance quantifiers interpreting the association rules in Asso perform better than the founded implication quantifier used in the original Asso. Also, the utilization of the association matrix in the factor search initialization stage of the GreConD algorithm improves the factorization results produced by the algorithm.

The observed results encourage us to the following future research directions. First, as the role of the association matrix is crucial for the Asso algorithm (and, as we have seen, the algorithm can be improved by using other types of association rules), we have an idea for an algorithm which computes, or updates, the matrix in the process of searching for factors, instead of computing it once before the search. Second, we will look for a way to use, in the utilization of the association matrix, the so-called essential elements of the input data matrix, which are crucial for the GreEss algorithm [4] (which improves the GreConD algorithm).

References

[1] Agrawal R., Imieliński T., Swami A.: Mining association rules between sets of items in large databases, Proc. ACM SIGMOD 1993, 207–216.
[2] Bache K., Lichman M.: UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, 2013.
[3] Belohlavek R., Vychodil V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition, J. Comput. Syst. Sci. 76(1)(2010), 3–20 (preliminary version in Proc. SCIS & ISCIS 2006).
[4] Belohlavek R., Trnecka M.: From-below approximations in Boolean matrix factorization: Geometry and new algorithm, J. Comput. Syst. Sci. 81(8)(2015), 1678–1697.
[5] Ganter B., Wille R.: Formal Concept Analysis: Mathematical Foundations, Springer, Berlin, 1999.
[6] Geerts F., Goethals B., Mielikäinen T.: Tiling databases, Proc. Discovery Science 2004, 278–289.
[7] Hájek P., Holeňa M., Rauch J.: The GUHA method and its meaning for data mining, Journal of Computer and System Sciences 76(1)(2010), 34–48.
[8] Hájek P., Havel I., Chytil M.: The GUHA method of automatic hypotheses determination, Computing 1(1966), 293–308.
[9] Hájek P., Havránek T.: Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory), Springer-Verlag, 1978, 396 pp. New edition available online at http://www.cs.cas.cz/~hajek/guhabook/.
[10] Lucchese C., Orlando S., Perego R.: Mining top-K patterns from binary datasets in presence of noise, SIAM DM 2010, 165–176.
[11] Miettinen P., Mielikäinen T., Gionis A., Das G., Mannila H.: The discrete basis problem, IEEE Trans. Knowledge and Data Eng. 20(10)(2008), 1348–1362 (preliminary version in Proc. PKDD 2006).
[12] Myllykangas S. et al.: DNA copy number amplification profiling of human neoplasms, Oncogene 25(55)(2006), 7324–7332.
[13] Rauch J.: Observational Calculi and Association Rules, Springer-Verlag, 2013.
[14] Rauch J., Šimůnek M.: Mining for 4ft rules, Proc. Discovery Science 2000, Springer-Verlag.
[15] Tan P.-N., Steinbach M., Kumar V.: Introduction to Data Mining, Addison Wesley, Boston, MA, 2006.
[16] Xiang Y., Jin R., Fuhry D., Dragan F. F.: Summarizing transactional databases with overlapped hyperrectangles, Data Mining and Knowledge Discovery 23(2011), 215–251 (preliminary version in Proc. ACM KDD 2008).