Characteristics Characteristics of of cosymmetric cosymmetric association association rules rules Michal Burda, Marian Mindek, Jana Šarmanová Michal Burda, Marian Mindek, and Jana Šarmanová Dept. of Computer Science, FEI, VŠB – Technical university of Ostrava, Department of Computer 17. listopadu Science, 15, 708 VŠB – Technical 33 Ostrava–Poruba, University Czech of Ostrava Republic 17. listopadu 15,Marian.Mindek, {Michal.Burda, 708 33 Ostrava–Poruba, Czech Republic Jana.Sarmanova}@vsb.cz {michal.burda, marian.mindek, jana.sarmanova}@vsb.cz Abstract. Association rules are essential data mining tool and as such has been well researched. Many new types of association rules based on both categorial or quantitative data have been founded ([8], [7], [2], [4]). Our work is directed to the theoretical features of association rules; espe- cially, we study a specific class of association rules called δ-cosymmetric rules. We present here some interesting properties of such rules and pro- vide a definition of rules expressing the significant difference in position, as an example. We show here that even the usual implicational rules are special cases of δ-cosymmetric rules. Key words: Cosymmetric rules, association rules, typed relations, data mining 1 Preface This paper is intended to motivate the rise of a new class of association rules called δ-cosymmetric rules. First of all, we describe here briefly the notions of the Logic of typed relations used to write the association rules down (for more information see [6]). After that, we provide some motivating examples of the representative cosymmetric rule types. We also study several features of the δ- cosymmetric rules and define the δ-cosymmetric rule of significant difference in position. The end of this paper is dedicated to some notes on how to mine the δ-cosymmetric rules. 2 Logic of typed relations In [6], we have developed the Probabilistic Logic of Typed Relations (PLTR) suitable for the formal association rules representation. In this section we briefly and informally describe main notions of that logic to understand the meaning of its formulae. The main notion of PLTR is typed relation. Typed relation can be simply viewed as a data table with finite number of columns and rows. Each column represents one attribute and a set of such attributes is a type of the relation. A typed relation is similar to classical concept of mathematical relation. We can perform usual set operations as union (∪), intersection (∩) or difference K. Richta, V. Snášel, J. Pokorný (Eds.): Dateso 2005, pp. 20–31, ISBN 80-01-03204-3. Characteristics of cosymmetric association rules 21 Fig. 1. Selection and projection on the relation R. (−). Furthermore, there exist two crucial relational operations: selection and projection. Selection is an unary operation of the form R(c1 ∧ c2 ∧ ¬c3 ) where R is typed relation and c1 ∧ c2 ∧ ¬c3 is a formula called selection con- dition. Selection is used to select only the rows satisfying the given condition. For example, when R is a data table (typed relation) of university students, the selection R(age > 25) picks only the students older than 25. The projection on relation R is an unary operation of the form R[A1 , A2 , . . . , An ]. The projection is used to take out only several columns (attributes) of the rela- tion R. The choosed attributes are simply written in the comma-separated list in the brackets. The projection R[name, date of birth] results simply in the two-column data table with student’s basic personal infor- mation. Obviously, we can combine selection and projection together to pick up an arbitrary sub-relation of the original typed relation, e.g. R(age > 25)[name, date of birth], which results in a relation of basic personal information of students older than 25. (See also figure 1.) The rules written in PLTR use the relational operations described above to explicitly express a knowledge. For example, R(age > 65)[blood pressure] >?mean R(age < 21)[blood pressure] tells that the blood pressure of people older than 65 is in average significantly higher than for people younger than 21. In the above rule we use a mapping >?mean to express the strong difference in the mean value between two “data columns”. (See also figure 2.) The mapping >?mean is simply a function, which computes a truth value of the strong difference in mean from the given two typed relations. 22 Michal Burda, Marian Mindek, Jana Šarmanová Fig. 2. Comparison of the two disjunctive sub-tables. 3 The δ-cosymmetric rules There exist a wide variety of the association rule types. The best-known are the rules in the implicational form, which say that when the object satisfies some condition (called antecedent), it (very probably) gratifies some other condition (succedent), e.g.: tequila ∧ salt ⇒ lemon. (1) This rule simply says that customers who buy tequila and salt often buy lemons, too. However, there are many other rule types (e.g. associational, correlational etc. – see [1], [2], [5], [7], [8], [9]). It is not our goal to mention each of them. We preferable move the focus to the rules, which we later name δ-cosymmetric. Consider the subsequent rule from [4]: sex = “female” ⇒ wage: mean = $7.90/hr (overall mean wage = $9.02). (2) It indicates that the women’s wage mean is significantly different to the rest of examined objects. That is, the rule says that women earn in average less than men. (The overall wage is in the rule for information only. To be statistically consistent, we must compare two disjoint sets of values, e.g. female againts male – see [4].) In general, the statistical test in the background of the rule compares two sets of quantitative data – women’s wage against the wage of the remaining data table (in fact, against men’s wage). We can apply the same mechanism and mine similar rules, e.g.: non-smoker ∧ wine-drinker ⇒ life-expectancy = 85 (overall = 80). (3) Such rule says that people who drink wine and do not smoke live in average longer than the other people. One can see, we compare the life expectancy of people who don’t smoke and drink wine against the rest of the data table. Such property is more visible when re-writing the original rules (2) and (3) (see also [4]) into PLTR: R(sex = “female”)[wage] ?mean R(¬(non-smoker ∧ wine-drinker))[life-expectancy]. (5) Characteristics of cosymmetric association rules 23 Our research shows that many types of the associational rules can be trans- formed to the fashion of comparing “something” against “something else” (later in this paper, we mention some of them). Thus, it is natural to expect that such rules will have some equal properties and that it will behave similarly in alike situations. Therefore it is reasonable to identify the common features and use them in general definition of a new class of association rules. Later in this paper we try to do so and name the class of such association rules the δ-cosymmetric rules. Moreover, it is obvious to contemplate rules of type (4) or (5) as formulae of PLTR. That is, one can treat the symbol ?δ R(C2 )[A] (6) 24 Michal Burda, Marian Mindek, Jana Šarmanová or prefixually: >?δ R(C1 )[A], R(C2 )[A] .  (7) E.g. see the rule of the difference in wage of at least $5: R(sex = “female”)[wage] ? (A, B). What can we say about the truth value of the rule >? (B, A)? It is clear, if values of relation A are significantly higher than values of relation B, the contrary statement can’t be true as well (so the formula (10) holds). More generally, the truth value of a statement “objects of relation B are minimally over δ less than objects of relation A” equals to a negation of the statement “objects of relation A are minimally over (−δ) less than objects of the relation B”. Formally written: ?δ1 A, B = hl1 , h1 i and >?δ2 A, B = hl2 , h2 i where >? is a predicate similar to the previously discussed. One can observe that the following holds all the time:   δ1 < δ2 ⇒ (l1 ≥ l2 ) ∧ (h1 ≥ h2 ) . (11) Informally, this property says that the increase of the minimum difference δ leads to the reduction of the rule’s probability. 3.5 Quasi-transitivity We name probable the rule, which truth value i = hl, hi satisfies the condition 0, 5 < l. Let AW ;0 D(sex = “female”)[pressure]. Now we can take a closer look at the Aspin–Welch predicate ?AW ;δ (X, Y ) = hp1 , p1 i and >?AW ;−δ (Y, X) = hp2 , p2 i. We are going to show that p1 = 1 − p2 . Computing the values of p1 and p2 means accordingly to the definition 2 computing the T characteris- tics. Thus, X̄ − Ȳ − δ Ȳ − X̄ − (−δ) T1 = = tf (p1 ) and T2 = = tf (p2 ). S S We see that T1 = −T2 , so tf (p1 ) = −tf (p2 ). It is commonly known that tf (p) = −tf (1 − p), so p1 = 1 − p2 . (b) Monotony. It is commonly known that tf (p) is monotone, so when we increase δ, the value of characteristics T gets lower and so does the value of the resultant probability p. (c) Quasi-transitivity. the validity of quasi-transitivity condition is evident from the fact that ∀f ∈ N : tf (0, 5) = 0. Characteristics of cosymmetric association rules 27 4.2 Funded cosymmetric rules We can go on and define various other δ-cosymmetric predicates similar to the definition of Aspin–Welch predicate. We don’t have enough space for such defi- nitions, so let us leastwise mention some possibilities. We can define many other predicates for determining the significant difference in position. Such definitions could be based on various existing statistical tests – it is possible e.g. to employ the rank tests to achieve robust cosymmetric predicates etc. Similarly to the significant difference in position, we can define predicates deciding of the difference in variance (dispersion). For example, we can mine rules telling us whether the presence of some attribute puts there significant increase of dispersion of some other attribute etc. We can employ the two-sample tests on binomial distribution to generate rules about discrete attributes. Generally said, almost every two-sample statisti- cal test may be considered to be used in a definition of appropriate cosymmetric predicate. Let’s have a look on the implicational rules of type (1). We show that we can define δ-cosymmetric rules that are analogous to them. Before doing so, we should describe shortly the meaning of the implicational rules. The GUHA method ([8], [7]) works with the so-called generalized quantifiers. These quantifiers form the base for the association rule creation. The rules are of the form ϕ ∼ ψ, where ϕ and ψ are formulae and ∼ the generalized quantifier. The truth of the rule is determined from a 4-field table (see table 1), which summarizes the amount of objects satisfying ceratin configurations. Table 1. 4-field table of ϕ and ψ ψ ¬ψ ϕ a b ¬ϕ c d The a value denotes the number of objects satisfying both ϕ and ψ, b is the number of objects satisfying ϕ and not ψ etc. The quantifier ⇒p,base called also the funded implication is defined for 0 < p ≤ 1 and base ≥ 0 as follows. The rule ϕ ⇒p,base ψ is true if and only if (iff ) a ≥ p ∧ a ≥ Base. a+b More on such rules can be read from [8], [7] or [10]. The example of the rule based on the funded implication is (1). Now, we provide a definition of a predicate that is similar to the quantifier of funded implication. After that, we show that it is δ-cosymmetric. 28 Michal Burda, Marian Mindek, Jana Šarmanová Definition 3. Let A and B be the typed relations, each containing exactly one column with values from the set {0, 1} and let δ ∈ [−1, 1]. Let us denote sum(A) the number of A’s rows possessing “1”. We define the Funded relationship pred- icate ?f nd;δ (A, B) = h1, 1i iff > , sum(A) + sum(B) 2 sum(A) 1+δ >?f nd;δ (A, B) = h0, 5, 0, 5i iff = , sum(A) + sum(B) 2 sum(A) 1+δ >?f nd;δ (A, B) = h0, 0i iff < . sum(A) + sum(B) 2 Theorem 2. The set of all funded relationship predicates 1+δ b 1−δ 2 iff a+b < 2 . a 1+δ 2a 2a+2b−2b a+b > 2 ⇔ a+b − 1 > δ ⇔ a+b −1>δ ⇔ 2b b 1−δ ⇔ 1 − a+b >δ ⇔ a+b < 2 . (b) Monotony and (c) Quasi-transitivity are obvious. If we omit the minimum support constraint in the definition of the funded implication, we get the same rules as with the funded δ-cosymmetric predicate. In the other words, the rule ϕ ⇒p,0 ψ is true on data table R iff the following rule has truth value equal to h1, 1i: R(ψ)[ϕ] >?f nd;(2p−1) R(¬ψ)[ϕ]. As a result we can say that implicational GUHA rules are just special cases of δ-cosymmetric rules. This surprising result convinced us of the importance of the δ-cosymmetric rules research. 5 Schemes of δ-cosymmetric association rules Consider the general pattern of a δ-cosymmetric rule: R(C1 )[A] >? R(C2 )[A]. (13) When mining such rules, we can generate and test virtually every combination of C1 , C2 , A, but doing so makes not much sense. It is because the association rule mining process results often in a wide range of association rules and it is sometimes hard to be acquainted with it. Moreover, only several combinations of conditions C1 and C2 are easy to interpret. Consider the following rule – although it may be true, the analyst has probably no usage for it. R(eyes = “blue” ∧ sex = “male”)[fat] >? R(age > 30 ∧ wage < $200)[fat] (14) In the following, we try to recognize the patterns of δ-cosymmetric rules of better interest than general pattern (13). Characteristics of cosymmetric association rules 29 5.1 Scheme “one-against-the-rest” The easiest pattern of interesting δ-cosymmetric rules is R(C)[A] >? R(¬C)[A]. (15) We take one condition C and compare values of some quantitative attribute A for two sub-tables where the first satisfies the given condition C and the second doesn’t. Such rules express the condition at which the values of attribute A are “somehow” significantly higher (or lower) than the rest of the data table. This basic pattern we name one-against-the-rest. A pattern similar to (15) is conditional one-against-the-rest: R(C1 ∧ C2 )[A] >? R(¬C1 ∧ C2 )[A]. (16) This pattern stands for “when considering only values fulfilling the condition C2 , the additional condition C1 indicates the significant increase of value A (in the sense of >? ).” That is, we fisrt restrict ourselves on data rows satisfying C2 only and then we search simply the one-against-the-rest rules on them. 5.2 Scheme “one-on-one” The pattern one-on-one is a little more tricky. It is good in situations, when we want to compare groups created accordingly to one categorial attribute. Suppose attribute B be categorial with domain {b1 , b2 , . . . , bn }. Let moreover attribute A be quantitative. The pattern one-on-one is as follows: R(B = bi )[A] >? R(B = bj )[A] (for i 6= j). (17) The rule of such type means: “The objects with value bi in attribute B involve significantly higher values of attribute A than  objects with value bj in attribute B.” Generally, we can generate and test n2 different hypotheses for a categorial attribute with n various values. We can add an additional condition C to form conditional one-on-one pat- tern, too: R(B = bi ∧ C)[A] >? R(B = bj ∧ C)[A] (for i 6= j). (18) 6 Some notes of how to reduce the number of resultant rules The large size of the association rule mining results is the common problem. Analyst hardly orientates himself or herself in a big list of mined rules. Therefore, we enumerate here some hints of how to prune the result from less-interesting rules and so to restrict the resultant δ-cosymmetric formulae to the reasonable amount. 30 Michal Burda, Marian Mindek, Jana Šarmanová 1. The significance level – the basic restriction on eventual rules is stating the minimum probability of its validity – in the other words, one may set a number pmin and throw away every rule, which truth value is below that threshold. Significance level can be pre-set to any of the usual values as 0,95 or 0,99. 2. Contradictory conditions – the conditions appearing in the rule should be contradictory. That is, when considering the rule R(C1 )[A] >? R(C2 )[A] then the formula C1 ∧ C2 should be contradiction. Rules satisfying that cri- terion are more easily interpretable and we avoid the uncorrect statistical comparing of non-disjunctive samples. (Compare with rule (14). Note also that the rules based on one-against-the-rest or one-on-one are all of contra- dictory conditions.) 3. Minimum support – minimum support is the best-known instrument for pruning away the non-interesting conditions from which the association rules are going to be formed. The minimum support criterion simply says that there must exist minimally minsup objects satisfying condition that appears in the rule. If not, such condition isn’t used in the association rule generating process. The definition of the minsup value greatly improves the efficience of association rule mining algorithms (see [1], [9], [11] for more information). Minimum support should be set by expert only. 4. Minimum difference – setting the minimum difference δ is analogous to the stating of minimum rule probability. Doing so we express that we are inter- ested in the rules, which confirm the dissimilarity to be at least of size δ. Minimum difference should be set by expert only. 5. Easy-to-interpret rules only – in section 5 we have shown that generating all possible rules makes no sense. One may to generate only the rules, which are easy to interpret. That is, we should generate rules conforming to the patterns discussed in section 5. A similar criterion on that topic is to use conditions in conjunctive form only. 7 Conclusion and future work In this paper, we have introduced the new class of association rules – the δ- cosymmetric rules. We are the first who has shown, how to use the Probability logic of typed relations (PLTR, see [6]) to express rules of such type. This paper also shows the benefit of using PLTR as a language for writing the association rules in, too. We have identified the basic properties of δ-cosymmetric rules and provided the definition of rules of significant difference in position, as an example. The second part of this paper was dedidacted to some notes on how to generate the δ-cosymmetric rules to obtain the interesting rules only. This paper also presents two basic examples of concrete δ-cosymmetric rules: the Aspin–Welch predicate and the Funded predicate. The second is surprise for Characteristics of cosymmetric association rules 31 us, since it shows that GUHA’s implicational rules are just the special cases of more general δ-cosymmetric rules. Our future work will address the deeper research of δ-cosymmetric rules. We will try to unhide more interesting features of that rule class. For example, our actual research shows that the cosymmetric rules can be used in the definition of a function that is metric. An interesting task will be undisputably the clustering using such metrics, etc. We are also focused on finding the fast and efficient algorithm to mine the δ-cosymmetric rules. A lot of work was done in [4] by Aumann and Lindell. (However, they didn’t know that they are mining cosymmetric rules – their algorithm should be slightly modified to comply the wide range of possible rule types.) We are also interested in the methods of visualisation of δ-cosymmetric rules. The properties of δ-cosymmetric rules make rational to use the slightly modified Hasse’s diagrams to visualize the rules mined according to the pattern “one- on-one” discussed above. We also work on employing the conceptual lattices to represent mined δ-cosymmetric rules. References 1. Agrawal, R. Fast discovery of association rules. In Advances in knowledge dis- covery and data mining (1996), AAAI Press / MIT Press, pp. 307–328. 2. Agrawal, R., Imielinski, T., and Swami, A. Mining associations between sets of items in massive databases. In ACM SIGMOD 1993 Int. Conference on Man- agement of Data (Washington D.C., 1993), pp. 207–216. 3. Anděl, J. Statistické metody. MATFYZPRESS, Praha, 1998. 4. Aumann, Y., and Lindell, Y. A statistical theory for quantitative association rules. In Knowledge Discovery and Data Mining (1999), pp. 261–270. 5. Berka, P. Dobývánı́ znalostı́ z databázı́. Academia, Praha, 2003. 6. Burda, M., Hynar, M., and Šarmanová, J. Pravděpodobnostnı́ logika typo- vaných relacı́. In Znalosti, poster proceedings (2005). 7. Hájek, P., and Havránek, T. Mechanizing Hypothesis Formation. Springer– Verlag, Berlin, 1978. Internet: http://www.cs.cas.cz/~hajek/guhabook/ (May 2004). 8. Hájek, P., Havránek, T., and Chytil, M. K. Metoda GUHA – automatická tvorba hypotéz. Academia, Praha, 1983. 9. Han, J., and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, USA, 2000. 10. Rauch, J. Asociačnı́ pravidla a matematická logika. In Znalosti (Brno, 2004), pp. 114–125. 11. Rauch, J., and Šimůnek, M. Alternative approach to mining association rules. In FDM (Japan, 2002), pp. 157–162.