
Measures of Quality of Rulesets Extracted from Data

Martin Holeňa

Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 18207 Praha 8, Czech Republic
martin@cs.cas.cz

The paper deals with quality measures of whole sets of rules extracted from data, as a counterpart to the more commonly used measures of individual rules. This research has been motivated by the increasingly frequent extraction of non-classification rules, such as association rules and rules of observational logic, in real-world data mining tasks. The paper sketches the typology of rules extraction methods and of their rulesets, and recalls that quality measures for whole sets of rules have so far been used only in the case of classification rulesets. It then proposes three possible ways in which such measures can be extended to general rulesets. The paper also recalls the possibility of measuring the dependence of a classification ruleset on parameters of the classification method by means of ROC curves, and proposes a generalization of ROC curves to general rulesets. Finally, a brief illustration on rulesets extracted by means of the GUHA method is given.

1 Introduction

Logical formulas of specific kinds, usually called rules, are a traditional way of formally representing knowledge. Therefore, it is not surprising that they are also the most frequent representation of the knowledge discovered in data mining. Existing methods for rules extraction are based on a broad variety of paradigms and theoretical principles. However, methods relying on different underlying assumptions can lead to the extraction of different or even contradictory rulesets from the same data. Moreover, the set of rules extracted with a particular method can substantially depend on some tunable parameter or parameters of the method, such as significance level, thresholds, size parameters, trade-off coefficients etc. For that reason, it is desirable to have measures of various qualitative aspects of the extracted rulesets. So far, such measures are available only for sets of classification rules, and their dependence on tunable parameters can be described only for classification into two classes [10, 15]. As far as more general kinds of rules are concerned, measures of quality have been proposed only for individual rules [6, 11, 24, 26, 29], or for contrast sets of rules, which finally can be replaced with a single rule [2, 16]; if a whole ruleset is taken into consideration, then only as a context for measuring the quality of an individual rule [27, 28].

The research reported in this paper has been motivated by the increasingly frequent extraction of non-classification rules, such as association rules and rules of observational logic, in real-world data mining tasks.

2 Typology of rules extraction methods

The most natural base for differentiating between existing rules extraction methods is the syntax and semantics of the extracted rules. Syntactical differences between them are, however, not very deep since principally, any rule $r$ has one of the forms $S_r \sim S'_r$ or $A_r \to C_r$, where $S_r$, $S'_r$, $A_r$ and $C_r$ are formulas of the considered logic, and $\sim$, $\to$ are symbols of the language of that logic. The difference between both forms concerns semantic properties of the symbols $\sim$ and $\to$: $S_r \sim S'_r$ is symmetric with respect to $S_r$, $S'_r$ in the sense that its validity always coincides with that of $S'_r \sim S_r$, whereas $A_r \to C_r$ is not symmetric with respect to $A_r$, $C_r$ in that sense. In the case of a propositional logic, $\sim$ and $\to$ are the connectives equivalence and implication, respectively, whereas in the case of a predicate logic, they are generalized quantifiers. To distinguish the formulas involved in the asymmetric case, $A_r$ is called the antecedent and $C_r$ the consequent of $r$.

More important is the semantics of the rules (cf. [6]), especially the difference between rules of the Boolean logic and rules of a fuzzy logic. Due to the semantics of Boolean and fuzzy formulas, the former are valid for crisp sets of objects, whereas the validity of the latter is a fuzzy set on the universe of all considered objects. Boolean rulesets are extracted more frequently, especially some specific types of them, such as classification rulesets [11, 15]. Those are sets of implications such that $(A_r)_{r \in R}$ and $\{C_r\}_{r \in R}$ partition the set $O$ of considered objects, where $R$ is the considered ruleset, and $\{C_r\}_{r \in R}$ stands for the set of distinct formulas in $(C_r)_{r \in R}$. Abandoning the requirement that $(A_r)_{r \in R}$ partitions $O$ (at least in the sense of a crisp partitioning) allows generalizing those rulesets also to fuzzy antecedents. For Boolean antecedents, however, this requirement entails a natural definition of the validity of a whole classification ruleset $R$ for an object $x$. Assuming that all information about $x$ conveyed by $R$ is conveyed by the single rule $r$ covering $x$ (i.e., with $A_r$ valid for $x$), the validity of $R$ for $x$ can be defined to coincide with the validity of $A_r \to C_r$ for that $r$, which in turn equals the validity of $C_r$ for $x$.

As far as the Boolean predicate logic is concerned, generalized quantifiers both for symmetric and for asymmetric rules were studied in the 1970s within the framework of the observational logic [13], which is a Boolean predicate logic with generalized quantifiers. For a set of data about $n$ objects, the truth evaluation of the Boolean predicate $\varphi$ on those objects is a vector $\|\varphi\| \in \{0, 1\}^n$, whereas the truth evaluation of a sentence $(Qx)(\varphi_1(x), \dots, \varphi_m(x))$ consisting of $m$ Boolean predicates $\varphi_1, \dots, \varphi_m$ and an $m$-ary generalized quantifier $Q$ is the function value

$$\|(Qx)(\varphi_1(x), \dots, \varphi_m(x))\| = \mathrm{Tf}_Q(\|\varphi_1\|, \dots, \|\varphi_m\|) \tag{1}$$

of a $\{0, 1\}$-valued function $\mathrm{Tf}_Q$ on the set of $m$-column binary matrices, which is called the truth function of the quantifier $Q$. Observational logic underlies one of the earliest methods for the extraction of general rules from data, called General Unary Hypotheses Automaton (GUHA). In GUHA, the truth function $\mathrm{Tf}_Q$ of a generalized quantifier $Q$ is always a function of the 4-fold table

$$
\begin{array}{c|cc}
 & S'_r & \neg S'_r \\ \hline
S_r & a & b \\
\neg S_r & c & d
\end{array}
\qquad
\begin{array}{c|cc}
 & C_r & \neg C_r \\ \hline
A_r & a & b \\
\neg A_r & c & d
\end{array}
\tag{2}
$$

Hence, $\mathrm{Tf}_Q$ is a $\{0, 1\}$-valued function on quadruples of nonnegative integers. For symmetric rules, GUHA uses quantifiers fulfilling

$$a' \ge a \;\&\; b' \le b \;\&\; c' \le c \;\&\; d' \ge d \;\&\; \mathrm{Tf}_Q(a, b, c, d) = 1 \;\to\; \mathrm{Tf}_Q(a', b', c', d') = 1. \tag{3}$$

They are called associational quantifiers. For asymmetric rules, it uses quantifiers fulfilling the stronger condition

$$a' \ge a \;\&\; b' \le b \;\&\; \mathrm{Tf}_Q(a, b, c, d) = 1 \;\to\; \mathrm{Tf}_Q(a', b', c', d') = 1, \tag{4}$$

which are called implicational quantifiers. This condition covers also the frequently encountered association rules [1, 6, 40] (since methods for the extraction of association rules have been developed outside the framework of observational logic, the terminology is a bit confusing here: although association rules are asymmetric, their name evokes the quantifiers for the symmetric ones).

Orthogonally to the typology according to the semantics of the extracted rules, all extraction methods can be divided into two large groups:

– Methods that extract logical rules from data directly, without any intermediate formal representation of the discovered knowledge. Such methods have always formed the mainstream of the extraction of Boolean rules: from the observational logic methods [13] and the method AQ [30, 31] in the late 1970s, through the extraction of association rules [1, 40] and the method CN2 [4], relying on a paradigm similar to that of AQ, to recent methods based on inductive logic programming [5, 33] and genetic algorithms [9]. They include also important methods for fuzzy rules, in particular ANFIS [22, 23] and NEFCLASS [34, 35], fuzzy generalizations of observational logic [18, 19] and a recent method based on fuzzy transform [36].

– Methods that employ some intermediate representation of the extracted knowledge, useful by itself. This group includes two important kinds of methods: classification trees [3, 37] and methods based on artificial neural networks (ANN). The latter are used both for Boolean and for fuzzy rules [7, 21, 39] (cf. also the survey papers [32, 38]).

3 Existing measures for classification rulesets

A survey of measures of quality for classification rulesets (with possibly fuzzy antecedents) has been given in the monograph [15]. All measures have been divided there into four groups: inaccuracy, imprecision, inseparability and resemblance. Space limitations allow recalling here only the main representatives of the more important groups.

Inaccuracy measures the discrepancy between the true class of the considered objects and the class predicted by the ruleset. Its most frequently encountered representative is the quadratic score (also called Brier score):

$$\mathrm{Inacc} = \frac{1}{|O|} \sum_{x \in O} \sum_{C \in \{C_r\}_{r \in R}} \bigl(\delta_C(x) - \hat{\delta}_C(x)\bigr)^2. \tag{5}$$

In (5), $|\cdot|$ denotes cardinality, $O$ is the considered set of objects, $\delta_C(x) \in \{0, 1\}$ is the validity of the proposition $C$ for $x \in O$, and $\hat{\delta}_C(x)$ is the agreement between $C$ and the class predicted for $x$ by $R$. In the general case of a fuzzy logic, $\hat{\delta}_C(x) = \max_{C_r = C} \|A_r\|_x$, with $\|A_r\|_x \in [0, 1]$ denoting the truth grade of $A_r$ for $x$.

Imprecision measures the discrepancy between the probability distribution of the classes, conditioned on the values of attributes occurring in antecedents, and the class predicted by the ruleset. Its most common representative is

$$\mathrm{Impr} = \frac{1}{|O|} \sum_{x \in O} \sum_{C \in \{C_r\}_{r \in R}} \bigl(\delta_C(x) - \hat{\delta}_C(x)\bigr) \bigl(1 - \hat{\delta}_C(x)\bigr)^2. \tag{6}$$

As was already mentioned in the introduction, the extracted ruleset can substantially depend on tunable parameters of the employed method. So far, this has been systematically studied only for dichotomous classification with $R = \{A \to C, \neg A \to \neg C\}$. In that case, putting $A_r = A$, $C_r = C$ allows the information about the validity of $A$ and $C$ for $O$ to be again summarized by means of the 4-fold table (2), which also depends on the parameter values. The influence of the parameter values on the result of dichotomous classification is usually investigated by means of the measures sensitivity $= \frac{a}{a+c}$ and specificity $= \frac{d}{b+d}$ [15]. Connecting the points $(1 - \text{specificity}, \text{sensitivity}) = \bigl(\frac{b}{b+d}, \frac{a}{a+c}\bigr)$ for the considered parameter values forms a curve with graph in the unit square, called receiver operating characteristic (ROC), due to the area where such curves have first been in routine use. In machine learning, a modified version of those curves has been proposed, in which the points connected for the considered parameter values are $(b, a)$ [10]. The graph of such a curve then lies in the rectangle with vertices $(0, 0)$ and $(b+d, a+c)$, and is called coverage graph.

The graphs of ROC curves and coverage graphs can provide information about the influence of parameter values not only on the sensitivity and specificity, but also on other measures. It is sufficient to complement the graph with isolines of the measure and to investigate their intersections with the original curve [10].

4 Three extensions to more general kinds of rules

In the particular case of classification rulesets with Boolean antecedents, some algebra allows substantially simplifying (5)–(6):

$$\mathrm{Inacc} = \frac{2|O^-|}{|O|} = 1 - \frac{|O^+| - |O^-|}{|O|}, \qquad \mathrm{Impr} = \frac{|O^-|}{|O|} = 1 - \frac{|O^+|}{|O|}, \tag{7}$$

where

$$O^+ = \{x \in O : R \text{ is valid for } x\}, \qquad O^- = \{x \in O : R \text{ is not valid for } x\}. \tag{8}$$

This not only shows that, in the case of Boolean antecedents, the quadratic score is sufficient to describe also the imprecision, but also suggests an approach to extending those measures to general rulesets: to use (7)–(8) as the definition of the measures (5)–(6). More generally, any measure of quality of classification rulesets with Boolean antecedents (e.g., any measure surveyed in [15]) that can be reformulated by means of $O^+$ and $O^-$ can be extended in such a way that the reformulation is used as the definition of that measure for general rulesets.

For sets of asymmetric rules, also the notion of covering an object by a rule, which was recalled in Section 2, can be generalized. Notice, however, that for fuzzy antecedents, the validity of $A_r$, $r \in R$, is a fuzzy set on $O$. Consequently, the set $O_R$ of objects covered by $R$ is a fuzzy set on $O$ with the membership function

$$\mu_R(x) = \|(\exists r \in R)\, A_r\|_x = \max_{r \in R} \|A_r\|_x. \tag{9}$$

Observe that according to (9), $O_R = O$ for classification rulesets with Boolean antecedents. Therefore, various generalizations of classification measures to general rulesets of asymmetric rules are possible: wherever $O$ occurs in the definition of a measure for classification rulesets, either $O$ or $O_R$ can occur in its general definition, provided $O_R \neq \emptyset$. To allow unified treatment of symmetric and asymmetric rules, the concept of covering an object by a rule will be extended also to symmetric rules, in such a way that an object $x$ is covered by $S_r \sim S'_r$ if either $S_r$ or $S'_r$ is valid for $x$. Hence, a counterpart of (9) for a set $R$ of symmetric rules is a fuzzy set with the membership function

$$\mu_R(x) = \|(\exists r \in R)(S_r \vee S'_r)\|_x = \max_{r \in R} \max(\|S_r\|_x, \|S'_r\|_x). \tag{10}$$

According to (8), the proposed way of extending measures of quality from classification rulesets with Boolean antecedents to general rulesets requires generalizing the concept of validity of a general ruleset for an object. However, there are multiple possibilities for such a generalization. Indeed, at least any of the following points of view is possible:

– Boolean validity of the ruleset based on simultaneous validity of all covering rules. According to this point of view, the validity of a ruleset $R$ for a covered object $x$ is a Boolean property expressing the simultaneous validity of all rules that cover $x$.

[…]

$$O^+ = \Bigl\{x \in O : \mu_R(x) > 0 \;\&\; \sum_{r \in R} \|r \text{ covers } x \;\&\; r \text{ is valid for } x\| > \sum_{r \in R} \|r \text{ covers } x \;\&\; \neg r \text{ is valid for } x\|\Bigr\}, \tag{15}$$

$$O^- = \Bigl\{x \in O : \mu_R(x) > 0 \;\&\; \sum_{r \in R} \|r \text{ covers } x \;\&\; r \text{ is valid for } x\| \le \sum_{r \in R} \|r \text{ covers } x \;\&\; \neg r \text{ is valid for } x\|\Bigr\}, \tag{16}$$

where the truth grade $\|r \text{ covers } x \;\&\; \neg r \text{ is valid for } x\|$ is again evaluated according to (14), replacing $r$ with […]

[…]

$$\dots = 1 - \frac{\sum_{x \in O} \mu_+(x)}{\sum_{x \in O} \mu_R(x)} \qquad (21)\ (22)$$

[…] $|O_R|$ are generalizations of (6).

5 Extensions of ROC curves to more general kinds of rules
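As a point of reference for the generalization developed in this section, the classical quantities recalled in Section 3 can be computed directly from the 4-fold table (2). The following Python sketch is only illustrative: the class name `FourFold`, the function names and the example tables are assumptions introduced here, not part of the paper, and the code assumes $a + c > 0$ and $b + d > 0$.

```python
from typing import NamedTuple, Tuple


class FourFold(NamedTuple):
    """4-fold table (2): a = A & C, b = A & not C, c = not A & C, d = not A & not C."""
    a: int
    b: int
    c: int
    d: int


def sensitivity(t: FourFold) -> float:
    # a / (a + c), as recalled in Section 3
    return t.a / (t.a + t.c)


def specificity(t: FourFold) -> float:
    # d / (b + d), as recalled in Section 3
    return t.d / (t.b + t.d)


def roc_point(t: FourFold) -> Tuple[float, float]:
    # (1 - specificity, sensitivity) = (b / (b + d), a / (a + c))
    return t.b / (t.b + t.d), t.a / (t.a + t.c)


def coverage_point(t: FourFold) -> Tuple[int, int]:
    # Point (b, a) of the coverage graph
    return t.b, t.a


# Hypothetical tables for three values of some tunable parameter;
# a + c = 40 and b + d = 60 in each, since the data stay the same.
tables = [FourFold(10, 2, 30, 58), FourFold(25, 10, 15, 50), FourFold(38, 30, 2, 30)]
roc_curve = [roc_point(t) for t in tables]
coverage_graph = [coverage_point(t) for t in tables]
```

Connecting the entries of `roc_curve` gives a curve in the unit square, whereas the entries of `coverage_graph` lie in the rectangle with vertices (0, 0) and (60, 40) for these example tables.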

Observe that in the case of Boolean classification with $R = \{A \to C, \neg A \to \neg C\}$, the information about the validity of $R$ for objects $x \in O$ can be also viewed as the information about the validity of a ruleset $R' = \{A \to C\}$. However, $R'$ is no longer a classification ruleset, but only a general one, which can be described only by means of the above introduced sets $O_R$, $O^+$, $O^-$. In particular, $|O^+| = a$ and $|O^-| = b$, which suggests the possibility of generalizing the coverage graphs introduced in Section 3 to general rulesets by means of a curve connecting the points $(|O^-|, |O^+|)$ for each of the values of the considered parameters. For a generalization of ROC curves to general rulesets, those points have to be scaled to the unit square. Since the resulting curve will be used to investigate the dependence on parameter values, the scaling factor itself must be independent of those values. The only available factor fulfilling this condition is the number of objects, $|O|$ (the other available factors, $|O_R|$, $|O^+|$ and $|O^-|$, depend on the evaluations $\|S_r\|$ and $\|S'_r\|$, or $\|A_r\|$ and $\|C_r\|$, which in turn depend on the parameter values). Consequently, the proposed generalization of ROC curves will connect the points $\bigl(\frac{|O^-|}{|O|}, \frac{|O^+|}{|O|}\bigr)$.

For practical construction of the proposed generalization of ROC curves, the following proposition, proven in [17], can be quite useful:

Proposition 1. Let the covering of individual objects with individual rules be a Boolean property (i.e., the set of rules covering a particular object $x$ be a crisp subset of $R$). Then irrespective of which of the above points of view of ruleset validity is adopted, there always exists a constant $c \in (0, 1]$ and an increasing bijection $g : [0, c] \to [0, 1]$ such that

$$|O^+| + |O^-| \le \max\Bigl(1, \max_{x \in [0, c]} \bigl(x + g^{-1}(1 - g(x))\bigr)\Bigr)\, |O|. \tag{23}$$

Moreover, in the particular cases of Boolean logic and of all three fundamental fuzzy logics (Łukasiewicz, Gödel, product), (23) holds with $c = 1$ and $g$ equal to identity,

$$|O^+| + |O^-| \le |O|. \tag{24}$$

Thus in those cases, the points $\bigl(\frac{|O^-|}{|O|}, \frac{|O^+|}{|O|}\bigr)$, forming the generalization of ROC curves, lie on or below the diagonal connecting the points $(0, 1)$ and $(1, 0)$.

The proposition is illustrated in Figure 1, together with isolines of the three example measures introduced in (20)–(22). Observe that the isolines of Impr2 depend on the relationship between the three cardinalities $|O^+| = \sum_{x \in O} \mu_+(x)$, $|O^-| = \sum_{x \in O} \mu_-(x)$ and $|O_R| = \sum_{x \in O} \mu_R(x)$. The isolines depicted in Figure 1(c) correspond to the relationship $|O_R| = |O^+| + |O^-|$, which is true in Łukasiewicz logic (thus in particular also in Boolean logic).

6 Experimentally testing the approach

The proposed approach has so far been experimentally tested for six rules extraction methods on three benchmark data sets, as well as on data from one real-world knowledge discovery task [20]. For each method, 1–3 parameters were tuned, their values being chosen among 2–10 possibilities. For some data sets, some combinations of parameter values did not extract any rules. Whenever a particular combination of parameter values extracted a nonempty ruleset from the considered data, it was tested on those data by means of a 10-fold crossvalidation. Consequently, the number of rulesets extracted from each data set varied between 1000 and 1500.
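To make the construction from Section 5 concrete, the following Python sketch computes one point of the generalized ROC curve for crisp (Boolean) data. It rests on several assumptions that are not taken from the paper: rules are represented as pairs of Boolean predicates, a rule is read as valid for a covered object exactly when its consequent holds, and ruleset validity follows the comparison of sums in (15)–(16). All names and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Obj = Dict[str, int]
# A rule A_r -> C_r is represented by a pair of Boolean predicates (A_r, C_r).
Rule = Tuple[Callable[[Obj], bool], Callable[[Obj], bool]]


def count_valid_invalid(objects: Sequence[Obj], rules: Sequence[Rule]) -> Tuple[int, int]:
    """Return (|O+|, |O-|) for crisp data, comparing counts as in (15)-(16)."""
    n_plus = n_minus = 0
    for x in objects:
        covering = [(ant, con) for ant, con in rules if ant(x)]
        if not covering:  # mu_R(x) = 0: x belongs to neither O+ nor O-
            continue
        valid = sum(1 for _, con in covering if con(x))
        invalid = len(covering) - valid
        if valid > invalid:
            n_plus += 1
        else:  # ties count as "not valid", matching the <= in (16)
            n_minus += 1
    return n_plus, n_minus


def generalized_roc_point(objects: Sequence[Obj], rules: Sequence[Rule]) -> Tuple[float, float]:
    """Point (|O-| / |O|, |O+| / |O|), scaled by the number of objects only."""
    n_plus, n_minus = count_valid_invalid(objects, rules)
    return n_minus / len(objects), n_plus / len(objects)


# Toy data: five objects described by two binary attributes "a" and "c".
objects: List[Obj] = [
    {"a": 1, "c": 1}, {"a": 1, "c": 0}, {"a": 0, "c": 1},
    {"a": 0, "c": 0}, {"a": 1, "c": 1},
]
# Single-rule ruleset R' = {A -> C}.
ruleset: List[Rule] = [(lambda x: x["a"] == 1, lambda x: x["c"] == 1)]
point = generalized_roc_point(objects, ruleset)
```

For this single-rule ruleset, the counts coincide with the entries $a$ and $b$ of the 4-fold table, in line with the observation $|O^+| = a$, $|O^-| = b$ made in Section 5. Repeating the computation for rulesets extracted with different parameter values and connecting the resulting points yields the curve.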

As a very brief illustration, Figure 2 shows the proposed generalization of ROC curves for two rulesets extracted from the best known benchmark set, the iris data, originally used in the 1930s by R.A. Fisher [8], by means of the GUHA quantifier founded implication. This quantifier, denoted $\to_{s,\theta}$, $s, \theta \in (0, 1]$, has its truth function $\mathrm{Tf}_{\to_{s,\theta}}$ defined in such a way that the rule $A_r \to_{s,\theta} C_r$ is valid exactly for those data for which the conditional probability $p(C_r | A_r)$ of the validity of $C_r$ conditioned on $A_r$, estimated with the unbiased estimate $\frac{a}{a+b}$, is at least $\theta$, whereas $A_r$ and $C_r$ are simultaneously valid in at least the proportion $s$ of the data [13]. Hence, $\mathrm{Tf}_{\to_{s,\theta}} = 1$ iff $\frac{a}{a+b} \ge \theta \;\&\; \frac{a}{a+b+c+d} \ge s$. As was pointed out in [14], rules with this quantifier are actually association rules with support $s$ and confidence $\theta$. Each curve corresponds to changing only one of the parameters $s$, $\theta$, while the value of the other is fixed.

Fig. 2. Example of generalized ROC curves for rulesets extracted from the iris data by means of the GUHA quantifier founded implication.
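The truth function of founded implication stated above depends only on the 4-fold table, so it can be sketched in a few lines of Python. The function names, the encoding of data as 0/1 sequences and the example values are assumptions made for illustration; in particular, returning `False` when no object satisfies the antecedent is a choice made here, not something the paper specifies.

```python
from typing import Sequence, Tuple


def four_fold_table(antecedent: Sequence[int], consequent: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (a, b, c, d) of the 4-fold table from two 0/1 truth evaluations."""
    a = b = c = d = 0
    for ant, con in zip(antecedent, consequent):
        if ant and con:
            a += 1
        elif ant:
            b += 1
        elif con:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(a: int, b: int, c: int, d: int, s: float, theta: float) -> bool:
    """Tf = 1 iff a / (a + b) >= theta and a / (a + b + c + d) >= s."""
    if a + b == 0:
        # Assumption: with no object satisfying the antecedent, the
        # confidence estimate is undefined and the rule is treated as not valid.
        return False
    n = a + b + c + d
    return a / (a + b) >= theta and a / n >= s


# Hypothetical truth evaluations of A_r and C_r on eight objects.
antecedent = [1, 1, 1, 0, 0, 1, 0, 1]
consequent = [1, 1, 0, 1, 0, 1, 0, 1]
table = four_fold_table(antecedent, consequent)

# Varying theta with s fixed, as along one of the curves described above.
validity_by_theta = [founded_implication(*table, s=0.4, theta=t) for t in (0.5, 0.75, 0.9)]
```

Here the confidence estimate is 4/5 and the support is 4/8, so the rule is valid for the two smaller values of $\theta$ and not valid for the largest one.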

7 Conclusions

The paper has dealt with quality measures of rules extracted from data, though not in the usual context of individual rules, but in the context of whole rulesets. Three kinds of extensions of measures already in use for classification rulesets have been proposed. In addition, the concept of ROC-curves has been generalized, to enable investigating the dependence of general rulesets on the values of parameters of the extraction method.

The paper actually discusses some general aspects of an ongoing investigation into the possibility of reflecting the uncertain validity of rulesets extracted from data when measuring their quality. The outcomes of that investigation are intended to be published elsewhere [17]. They comprise a theoretical elaboration of the last proposed kind of extensions of ruleset quality measures, as well as results of extensive experimental tests on rulesets extracted from benchmark and real-world data sets by means of six methods chosen to cover as broad a spectrum of rules extraction methods as possible. Those results indicate that the approach is feasible and can contribute to the ultimate objective of quality measures: to allow comparing the knowledge extracted with different data mining methods and investigating how the extracted knowledge depends on the values of their parameters.

Acknowledgment

The research reported in this paper has been supported by the grant No. 201/08/1744 of the Grant Agency of the Czech Republic and partially supported by the Institutional Research Plan AV0Z10300504.

References

1. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, Menlo Park, 1996.
2. S.D. Bay and M.J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5:213–246, 2001.
3. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
4. P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Machine Learning – EWSL-91, pages 151–163. Springer Verlag, New York, 1991.
5. L. De Raedt. Interactive Theory Revision: An Inductive Logic Programming Approach. Academic Press, London, 1992.
6. D. Dubois, E. Hüllermeier, and H. Prade. A systematic approach to the assessment of fuzzy association rules. Data Mining and Knowledge Discovery, 13:167–192, 2006.
7. W. Duch, R. Adamczak, and K. Grabczewski. A new methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 11:277–306, 2000.
8. R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
9. A.A. Freitas. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer Verlag, Berlin, 2002.
10. J. Fürnkranz and P.A. Flach. ROC 'n' rule learning – towards a better understanding of covering algorithms. Machine Learning, 58:39–77, 2005.
11. L. Geng and H.J. Hamilton. Choosing the right lens: Finding what is interesting in data mining. In F. Guillet and H.J. Hamilton, editors, Quality Measures in Data Mining, pages 3–24. Springer Verlag, Berlin, 2007.
12. P. Hájek. Metamathematics of Fuzzy Logic. Kluwer Academic Publishers, Dordrecht, 1998.
13. P. Hájek and T. Havránek. Mechanizing Hypothesis Formation. Springer Verlag, Berlin, 1978.
14. P. Hájek and M. Holeňa. Formal logics of discovery and hypothesis formation by machine. Theoretical Computer Science, 292:345–357, 2003.
15. D.J. Hand. Construction and Assessment of Classification Rules. John Wiley and Sons, New York, 1997.
16. R.J. Hilderman and T. Peckham. Statistical methodologies for mining potentially interesting contrast sets. In F. Guillet and H.J. Hamilton, editors, Quality Measures in Data Mining, pages 153–177. Springer Verlag, Berlin, 2007.
17. M. Holeňa. Measures of ruleset quality capable to represent uncertain validity. Submitted to International Journal of Approximate Reasoning.
18. M. Holeňa. Fuzzy hypotheses for Guha implications. Fuzzy Sets and Systems, 98:101–125, 1998.
19. M. Holeňa. Fuzzy hypotheses testing in the framework of fuzzy logic. Fuzzy Sets and Systems, 145:229–252, 2004.
20. M. Holeňa. Neural networks for extraction of fuzzy logic rules with application to EEG data. In B. Ribeiro, R.F. Albrecht, and A. Dobnikar, editors, Adaptive and Natural Computing Algorithms, pages 369–372. Springer Verlag, Wien, 2005.
21. M. Holeňa. Piecewise-linear neural networks and their relationship to rule extraction from data. Neural Computation, 18:2813–2853, 2006.
22. J.S.R. Jang. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics, 23:665–685, 1993.
23. J.S.R. Jang and C.T. Sun. Neuro-fuzzy modeling and control. The Proceedings of the IEEE, 83:378–406, 1995.
24. K.A. Kaufman and R.S. Michalski. An adjustable description quality measure for pattern discovery using the AQ methodology. Journal of Intelligent Information Systems, 14:199–216, 2000.
25. E.P. Klement, R. Mesiar, and E. Pap. Triangular Norms. Kluwer Academic Publishers, Dordrecht, 2000.
26. S. Lallich, O. Teytaud, and E. Prudhomme. Association rule interestingness: Measure and statistical validation. In F. Guillet and H.J. Hamilton, editors, Quality Measures in Data Mining, pages 251–275. Springer Verlag, Berlin, 2007.
27. P. Lenca, B. Vaillant, P. Meyer, and S. Lallich. Association rule interestingness measures: Experimental and theoretical studies. In F. Guillet and H.J. Hamilton, editors, Quality Measures in Data Mining, pages 51–76. Springer Verlag, Berlin, 2007.
28. L. Lerman and J. Azé. Une mesure probabiliste contextuelle discriminante de qualité des règles d'association. In EGC 2003: Extraction et Gestion des Connaissances, pages 247–263. Hermes Science Publications, Lavoisier, 2003.
29. K. McGarry. A survey of interestingness measures for knowledge discovery. Knowledge Engineering Review, 20:39–61, 2005.
30. R.S. Michalski. Knowledge acquisition through conceptual clustering: A theoretical framework and algorithm for partitioning data into conjunctive concepts. International Journal of Policy Analysis and Information Systems, 4:219–243, 1980.
31. R.S. Michalski and K.A. Kaufman. Learning patterns in noisy data. In Machine Learning and Its Applications, pages 22–38. Springer Verlag, New York, 2001.
32. S. Mitra and Y. Hayashi. Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE Transactions on Neural Networks, 11:748–768, 2000.
33. S. Muggleton. Inductive Logic Programming. Academic Press, London, 1992.
34. D. Nauck. Fuzzy data analysis with NEFCLASS. International Journal of Approximate Reasoning, 32:103–130, 2002.
35. D. Nauck and R. Kruse. NEFCLASS-X: A neuro-fuzzy tool to build readable fuzzy classifiers. BT Technology Journal, 3:180–192, 1998.
36. V. Novák, I. Perfilieva, A. Dvořák, C.Q. Chen, Q. Wei, and P. Yan. Mining pure linguistic associations from numerical data. To appear in International Journal of Approximate Reasoning.
37. J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco, 1992.
38. A.B. Tickle, R. Andrews, M. Golea, and J. Diederich. The truth will come to light: Directions and challenges in extracting rules from trained artificial neural networks. IEEE Transactions on Neural Networks, 9:1057–1068, 1998.
39. H. Tsukimoto. Extracting rules from trained neural networks. IEEE Transactions on Neural Networks, 11:333–389, 2000.
40. M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery, 1:343–373, 1997.