
Fuzzy classification rules based on similarity ⋆

Martin Holeňa

David Štefka

Faculty of Nuclear Science and Physical Engineering, Czech Technical University, Trojanova 13, 120 00 Prague
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague


Abstract. The paper deals with the aggregation of classification rules by means of fuzzy integrals, in particular with the fuzzy measures employed in that aggregation. It points out that the kinds of fuzzy measures commonly encountered in this context do not take into account the diversity of classification rules. As a remedy, a new kind of fuzzy measures is proposed, called similarity-aware measures, and several useful properties of such measures are proven. Finally, results of extensive experiments on a number of benchmark datasets are reported, in which a particular similarity-aware measure was applied to a combination of Choquet or Sugeno integrals with three different ways of creating ensembles of classification rules. In the experiments, the new measure was compared with the traditional Sugeno λ-measure, to which it was clearly superior.

1 Introduction

Logical formulas of specific kinds, usually called rules, are a traditional way of formally representing knowledge. Therefore, it is not surprising that they are also the most frequent representation of the knowledge discovered in data mining.

The most natural basis for differentiating between existing rule extraction methods is the syntax and semantics of the extracted rules [10]. Syntactical differences between them are, however, not very deep because, principally, any rule $r$ from a ruleset $R$ has one of the forms $S_r \sim S'_r$ or $A_r \to C_r$, where $S_r$, $S'_r$, $A_r$ and $C_r$ are formulas of the considered logic, and $\sim$, $\to$ are symbols of the language of that logic. The difference between both forms concerns semantic properties of the symbols $\sim$ and $\to$: $S_r \sim S'_r$ is symmetric with respect to $S_r$, $S'_r$ in the sense that its validity always coincides with that of $S'_r \sim S_r$, whereas $A_r \to C_r$ is not symmetric with respect to $A_r$, $C_r$ in that sense. In the case of a propositional logic, $\sim$ and $\to$ are the connectives equivalence ($\equiv$) and implication, respectively, whereas in the case of a predicate logic, they are generalized quantifiers. To distinguish the formulas involved in the asymmetric case, $A_r$ is called the antecedent and $C_r$ the consequent of $r$.

More important is the semantics of the rules (cf. [5]), especially the difference between rules of the Boolean logic and rules of a fuzzy logic. Due to the semantics of Boolean and fuzzy formulas, the former are valid for crisp sets of objects, whereas the validity of the latter is a fuzzy set on the universe of all considered objects. Boolean rulesets are extracted more frequently, especially some specific types of them, such as classification rulesets [6, 9]. Those are sets of implications such that $\{A_r\}_{r \in R}$ and $\{C_r\}_{r \in R}$ partition the set $O$ of considered objects, where $\{\cdot\}_{r \in R}$ stands for the set of distinct formulas in $(\cdot)_{r \in R}$. Abandoning the requirement that $\{A_r\}_{r \in R}$ partitions $O$ (at least in the sense of a crisp partitioning) allows those rulesets to be generalized to fuzzy antecedents as well [15]. For Boolean antecedents, however, this requirement entails a natural definition of the validity of a whole classification ruleset $R$ for an object $x$. Assuming that all information about $x$ conveyed by $R$ is conveyed by the single rule $r$ covering $x$ (i.e., with $A_r$ valid for $x$), the validity of $R$ for $x$ can be defined to coincide with the validity of $A_r \to C_r$ for that $r$, which in turn equals the validity of $C_r$ for $x$.

It is also possible to combine several existing classification rules into a new one. Such aggregation can be either static, i.e., the result is the same for all inputs, or dynamic, where it is adapted to the currently classified input [11, 19]. In the aggregation of classification rules, we usually try to create a team of rules that are not similar. This property is called diversity [14]. There are many methods for building a diverse team of classifiers [2, 3, 16].

One popular aggregation operator is the fuzzy integral [7, 12, 13, 17]. It aggregates the outputs of the individual classification rules with respect to a fuzzy measure. The role of fuzzy measures in the aggregation of classification rules, in particular their role with respect to the diversity of the rules, was the subject of the research reported in this paper.

The following section recalls the fuzzy integrals and fuzzy measures encountered in the aggregation of classification rules. In Section 3, which is the key section of the paper, a new fuzzy measure, called similarity-aware measure, is introduced and its theoretical properties are studied. Finally, in Section 4, results of extensive experiments and comparison with the traditional Sugeno λ-measure are reported.

⋆ The research reported in this paper has been supported by the Czech Science Foundation (GA ČR) grant P202/11/1368.

2 Fuzzy integrals and measures in classification rules aggregation

Several definitions of a fuzzy integral exist in the literature; among them, the Choquet integral and the Sugeno integral are used most often. The role played in usual integration by additive measures (such as probability or Lebesgue measure) is in fuzzy integration played by fuzzy measures. In this section, basic concepts pertaining to different kinds of fuzzy measures will be recalled, as well as the definitions of Choquet and Sugeno integrals. Due to the intended context of aggregation of classification rules, we restrict attention to $[0, 1]$-valued functions on finite sets.

Definition 1. A fuzzy measure $\mu$ on a finite set $U = \{u_1, \dots, u_r\}$ is a function on the power set of $U$,

$$\mu : \mathcal{P}(U) \to [0, 1],$$

fulfilling:

1. the boundary conditions $\mu(\emptyset) = 0$, $\mu(U) = 1$,
2. the monotonicity $A \subseteq B \Rightarrow \mu(A) \le \mu(B)$.

The values $\mu(\{u_1\}), \dots, \mu(\{u_r\})$ are called fuzzy densities.

Definition 2. The Choquet integral of a function $f : U \to [0, 1]$, $f(u_i) = f_i$, $i = 1, \dots, r$, with respect to a fuzzy measure $\mu$ is defined as:

$$(\mathrm{Ch}) \int f \, d\mu = \sum_{i=1}^{r} \left(f_{\langle i \rangle} - f_{\langle i-1 \rangle}\right) \mu(A_{\langle i \rangle}),$$

where $\langle \cdot \rangle$ indicates that the indices have been permuted such that $0 = f_{\langle 0 \rangle} \le f_{\langle 1 \rangle} \le \dots \le f_{\langle r \rangle} \le 1$, and $A_{\langle i \rangle} = \{u_{\langle i \rangle}, \dots, u_{\langle r \rangle}\}$ denotes the set of elements of $U$ corresponding to the $(r - i + 1)$ highest values of $f$.

Definition 3. The Sugeno integral of a function $f : U \to [0, 1]$, $f(u_i) = f_i$, $i = 1, \dots, r$, with respect to a fuzzy measure $\mu$ is defined as:

$$(\mathrm{Su}) \int f \, d\mu = \max_{i=1}^{r} \min\left(f_{\langle i \rangle}, \mu(A_{\langle i \rangle})\right). \tag{5}$$

To define a general fuzzy measure in the discrete case, we need to define all its $2^r$ values, which is usually very complicated. To overcome this weakness, measures which do not need all the $2^r$ values have been developed [7, 17]:

Definition 4. A fuzzy measure $\mu$ on $U$ is called symmetric if

$$|A| = |B| \Rightarrow \mu(A) = \mu(B) \quad \text{for } A, B \subseteq U,$$

where $|\cdot|$ denotes the cardinality of a set.

Consequently, the value of a symmetric measure depends only on the cardinality of its argument. If a symmetric measure is used in the Choquet integral, the integral reduces to the ordered weighted average operator [17]. However, symmetric measures assume that all elements of $U$ have the same importance, thus they do not take into account the diversity of elements.

Definition 5. Let $\perp$ be a t-conorm. A fuzzy measure $\mu$ is called $\perp$-decomposable if

$$\mu(A \cup B) = \mu(A) \perp \mu(B) \quad \text{for disjoint } A, B \subseteq U. \tag{8}$$

Hence, $\perp$-decomposable measures need only the $r$ fuzzy densities, whereas all the other values are computed using the formula (8). Particular cases of this kind of fuzzy measures are additive measures, including probabilistic measures ($\perp$ being the bounded sum), and the Sugeno λ-measure.

Definition 6. The Sugeno λ-measure [7, 17] on a finite set $U = \{u_1, \dots, u_r\}$ is defined by

$$\mu(A \cup B) = \mu(A) + \mu(B) + \lambda \mu(A) \mu(B) \tag{9}$$

for disjoint $A, B \subseteq U$ and some fixed $\lambda > -1$. The value of $\lambda$ is:

a) computed as the unique non-zero root greater than $-1$ of the equation

$$\lambda + 1 = \prod_{i=1}^{r} \left(1 + \lambda \mu(\{u_i\})\right) \tag{10}$$

if the densities do not sum up to 1;
b) $\lambda = 0$ otherwise.

If the densities sum up to 1, the fuzzy measure is additive. The Sugeno λ-measure is a $\perp$-decomposable measure for the t-conorm

$$x \perp y = \min(1, x + y + \lambda x y). \tag{11}$$

A serious weakness of any $\perp$-decomposable measure is that the fuzzy measure of a set of two (or more) classification rules is fully determined by the formula (8) for a fixed $\perp$. Therefore, if interactions between elements are to be taken into account, then they have to be incorporated directly into the fuzzy measure. That fact motivated our attempt to elaborate the concept of similarity-aware fuzzy measures.
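The definitions of this section can be illustrated by a minimal Python sketch. It is not taken from the paper: the function names (`choquet`, `sugeno`, `solve_lambda`, `lambda_measure`), the dictionary representation of $f$ and of the fuzzy densities, and the use of bisection for equation (10) are all assumptions made for illustration. The sketch further assumes at least two densities, each strictly between 0 and 1, and uses the closed form $\mu(A) = \left(\prod_{u \in A}(1 + \lambda \mu(\{u\})) - 1\right)/\lambda$, which follows from applying rule (9) repeatedly.

```python
from math import prod


def choquet(f, mu):
    """Choquet integral of f (dict: element -> value in [0, 1]) w.r.t. mu.

    mu is any callable mapping a frozenset of elements to [0, 1].
    """
    order = sorted(f, key=f.get)           # u_<1>, ..., u_<r>, increasing in f
    total, prev = 0.0, 0.0                 # prev plays the role of f_<0> = 0
    for i, u in enumerate(order):
        a_i = frozenset(order[i:])         # A_<i> = {u_<i>, ..., u_<r>}
        total += (f[u] - prev) * mu(a_i)
        prev = f[u]
    return total


def sugeno(f, mu):
    """Sugeno integral: max over i of min(f_<i>, mu(A_<i>))."""
    order = sorted(f, key=f.get)
    return max(min(f[u], mu(frozenset(order[i:]))) for i, u in enumerate(order))


def solve_lambda(densities, iters=200):
    """Non-zero root > -1 of lambda + 1 = prod(1 + lambda * g_i), by bisection.

    Assumes at least two densities, all strictly between 0 and 1.
    """
    g = list(densities.values())
    s = sum(g)
    if abs(s - 1.0) < 1e-9:
        return 0.0                         # densities sum to 1: additive case

    def h(lam):
        return prod(1.0 + lam * gi for gi in g) - (lam + 1.0)

    if s > 1.0:                            # the root lies in (-1, 0)
        lo, hi = -1.0, -1e-9
    else:                                  # the root lies in (0, infinity)
        lo, hi = 1e-9, 1.0
        while h(hi) < 0.0:
            hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (h(mid) > 0.0) == (h(lo) > 0.0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lambda_measure(densities):
    """Return the Sugeno lambda-measure as a callable on sets of elements."""
    lam = solve_lambda(densities)

    def mu(subset):
        if lam == 0.0:
            return sum(densities[u] for u in subset)
        return (prod(1.0 + lam * densities[u] for u in subset) - 1.0) / lam

    return mu


# Hypothetical densities and rule outputs for three rules.
g = {"r1": 0.2, "r2": 0.3, "r3": 0.1}
f = {"r1": 0.9, "r2": 0.4, "r3": 0.6}
mu = lambda_measure(g)
print(choquet(f, mu), sugeno(f, mu))
```

Note that the measure of any set is computed from the densities alone, which is exactly the limitation of $\perp$-decomposable measures discussed above: no information about how the rules relate to each other enters the computation.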

3 Similarity-aware measures and their properties

Before introducing similarity-aware measures, let us first recall the notion of similarity [8].

Definition 7. Let ∧ be a t-norm and let ∼: U × U → [0, 1] be a fuzzy relation. ∼ is called a similarity on U with respect to ∧ if the following holds for a, b, c ∈ U :

$$\sim(a, a) = 1 \quad \text{(reflexivity)}, \tag{12}$$

$$\sim(a, b) = \sim(b, a) \quad \text{(symmetry)}, \tag{13}$$

$$\sim(a, b) \wedge \sim(b, c) \le \sim(a, c) \quad \text{(transitivity w.r.t. } \wedge\text{)}. \tag{14}$$

In the context of aggregation of crisp classification rules, we will work with an empirically defined relation, which, for rules φk, φl, is defined as the proportion of equal consequents on some validation set of patterns V ⊂ O,

$$\sim(\varphi_k, \varphi_l) = \frac{\sum_{x \in V} I\left(C_{\varphi_k}(x) = C_{\varphi_l}(x)\right)}{|V|}. \tag{15}$$

It is easily seen that the relation (15) is a similarity with respect to the Łukasiewicz t-norm

$$\wedge_L(a, b) = \max(a + b - 1, 0), \tag{16}$$

but it is not a similarity with respect to the standard (minimum, Gödel) t-norm or the product t-norm

$$\wedge_S(a, b) = \min(a, b), \tag{17}$$

$$\wedge_P(a, b) = ab. \tag{18}$$
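The empirical relation and the transitivity claim can be checked on a small example. The following Python sketch is only an illustration: the function names and the toy consequents on four validation patterns are assumptions, not data from the paper. It computes the proportion of equal consequents for each pair of rules and tests the transitivity condition for the three t-norms.

```python
from itertools import product


def similarity(cons_k, cons_l):
    """Proportion of validation patterns on which two rules give equal consequents."""
    if not cons_k or len(cons_k) != len(cons_l):
        raise ValueError("both rules must be evaluated on the same non-empty set V")
    return sum(a == b for a, b in zip(cons_k, cons_l)) / len(cons_k)


def similarity_matrix(consequents):
    """Matrix of pairwise similarities for a list of rules."""
    return [[similarity(p, q) for q in consequents] for p in consequents]


def is_transitive(S, tnorm, eps=1e-12):
    """Check T(s_ab, s_bc) <= s_ac for all triples (a, b, c)."""
    idx = range(len(S))
    return all(
        tnorm(S[a][b], S[b][c]) <= S[a][c] + eps
        for a, b, c in product(idx, repeat=3)
    )


def lukasiewicz(a, b):
    return max(a + b - 1.0, 0.0)


def product_tnorm(a, b):
    return a * b


# Hypothetical consequents (class labels) of three rules on four validation patterns.
consequents = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
S = similarity_matrix(consequents)
print(is_transitive(S, lukasiewicz))    # holds for the Lukasiewicz t-norm
print(is_transitive(S, min))            # fails for the minimum t-norm here
print(is_transitive(S, product_tnorm))  # fails for the product t-norm here
```

In this example the first and third rule never agree, although each agrees with the second rule on half of the patterns, which is what breaks transitivity for the minimum and product t-norms.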

The fuzzy integral represents a convenient tool to work with the diversity of classification rules: as we are computing the fuzzy measure values $\mu(A_{\langle i \rangle})$, we are considering a single rule $\varphi_{\langle i \rangle}$ at each step $i$, and therefore we can influence the increase of the fuzzy measure based on the similarity of $\varphi_{\langle i \rangle}$ to the set of rules already involved in the integration, i.e., $A_{\langle i+1 \rangle} = \{\varphi_{\langle i+1 \rangle}, \dots, \varphi_{\langle r \rangle}\}$. If $\varphi_{\langle i \rangle}$ is similar to the classifiers in $A_{\langle i+1 \rangle}$, the increase in the fuzzy measure should be small (since the importance of the set $A_{\langle i \rangle}$ should be similar to the importance of the set $A_{\langle i+1 \rangle}$), and if $\varphi_{\langle i \rangle}$ is not similar to the classifiers in $A_{\langle i+1 \rangle}$, the increase of the fuzzy measure should be large. These ideas motivated the following definition:

Definition 8. Let $U = \{u_1, \dots, u_r\}$ be a set, let $\sim$ be a similarity w.r.t. a t-norm $\wedge$, and let $S$ be an $r \times r$ matrix such that:

$$S = (s_{i,j})_{i,j=1}^{r} \quad \text{with } s_{i,j} = \sim(u_i, u_j). \tag{19}$$

Let further $\kappa_i \in [0, 1]$, $i = 1, \dots, r$, denote some kind of weight (confidence, importance) of $u_i$, and let $[\cdot]$ denote index ordering according to $\kappa$, such that $0 \le \kappa_{[1]} \le \dots \le \kappa_{[r]} \le 1$. Finally, let

$$\tilde{\mu}^{(S)} : \mathcal{P}(U) \to [0, \infty)$$

be a mapping such that for $X \subseteq U$,

$$\tilde{\mu}^{(S)}(X) = \sum_{i=1}^{r} I(u_{[i]} \in X)\, \kappa_{[i]} \left(1 - \max_{j=i+1}^{r} s_{[i],[j]}\right), \tag{21}$$

where we define $\max_{j=r+1}^{r} s_{[r],[j]} = 0$, and $I$ denotes the indicator of truth value, i.e.,

$$I(\text{true}) = 1, \quad I(\text{false}) = 0.$$

Then the mapping $\mu^{(S)} : \mathcal{P}(U) \to [0, 1]$, defined by

$$\mu^{(S)}(X) = \frac{\tilde{\mu}^{(S)}(X)}{\tilde{\mu}^{(S)}(U)}, \tag{23}$$

is called a similarity-aware measure based on $S$.

The following propositions show that if for some $i$, the $i$-th classification rule is totally similar to some other rule in $A_{\langle i+1 \rangle}$, then $\mu^{(S)}$ does not increase, and if it is totally dissimilar to all classifiers in $A_{\langle i+1 \rangle}$, the increase in $\mu^{(S)}$ is maximal.

Proposition 1. $\mu^{(S)}$ is a fuzzy measure on $U$.

Proof. The boundary conditions follow directly from the definition of $\mu^{(S)}$. For the monotonicity, let $A \subseteq B$; then

$$\tilde{\mu}^{(S)}(A) = \sum_{i=1}^{r} I(u_{[i]} \in A)\, \kappa_{[i]} \left(1 - \max_{j=i+1}^{r} s_{[i],[j]}\right) \le \sum_{i=1}^{r} I(u_{[i]} \in B)\, \kappa_{[i]} \left(1 - \max_{j=i+1}^{r} s_{[i],[j]}\right) = \tilde{\mu}^{(S)}(B), \tag{25}$$

due to $I(u_{[i]} \in A) = 1 \Rightarrow I(u_{[i]} \in B) = 1$.

Proposition 2. For any of the $2^r$ subsets $X \subseteq U$, the value $\mu^{(S)}(X)$ can be expressed simply as the sum of values of $\mu^{(S)}$ on singletons:

$$\mu^{(S)}(X) = \sum_{u_i \in X} \mu^{(S)}(\{u_i\}). \tag{26}$$

Proof. According to (21) and (23), the value of $\mu^{(S)}$ on the singletons $\{u_{[i]}\}$, $i = 1, \dots, r$, is

$$\mu^{(S)}(\{u_{[i]}\}) = \frac{1}{\tilde{\mu}^{(S)}(U)}\, \kappa_{[i]} \left(1 - \max_{j=i+1}^{r} s_{[i],[j]}\right).$$

Then (26) follows directly from (21).

Proposition 3. Let $f : U \to [0, 1]$, and let the matrix $S$ in (19) fulfill

$$s_{i,j} = 1 \quad \text{for } i \ne j. \tag{28}$$

Then:

1. $(\forall X \subseteq U)\ u_{[r]} \in X \Rightarrow \mu^{(S)}(X) = 1$,
2. $(\forall X \subseteq U)\ u_{[r]} \notin X \Rightarrow \mu^{(S)}(X) = 0$,
3. $(\mathrm{Ch}) \int f \, d\mu^{(S)} = (\mathrm{Su}) \int f \, d\mu^{(S)} = f_{[r]}$.

Proof. 1. and 2. follow directly from the fact that

$$\max_{j=i+1}^{r} s_{[i],[j]} = \begin{cases} 0 & \text{for } i = r, \\ 1 & \text{for } i < r, \end{cases} \tag{29}$$

and therefore

$$\tilde{\mu}^{(S)}(X) = I(u_{[r]} \in X)\, \kappa_{[r]}. \tag{30}$$

We will prove 3. only for the Choquet integral; the case of the Sugeno integral is analogous. Let $j \in \{1, \dots, r\}$ be such that $\langle j \rangle = [r]$; then $(\forall i > j)\ u_{[r]} \notin A_{\langle i \rangle}$, and therefore $\mu^{(S)}(A_{\langle i \rangle}) = 0$; $(\forall i \le j)\ u_{[r]} \in A_{\langle i \rangle}$, and therefore $\mu^{(S)}(A_{\langle i \rangle}) = 1$. Using this in the definition of the Choquet integral, we obtain

$$(\mathrm{Ch}) \int f \, d\mu^{(S)} = \sum_{i=1}^{r} \left(f_{\langle i \rangle} - f_{\langle i-1 \rangle}\right) \mu^{(S)}(A_{\langle i \rangle}) = \sum_{i=1}^{j} \left(f_{\langle i \rangle} - f_{\langle i-1 \rangle}\right) = f_{\langle j \rangle} = f_{[r]}. \tag{31}$$

Proposition 4. Let $f : U \to [0, 1]$, and let the matrix $S$ in (19) fulfill $s_{i,j} = 0$ for $i \ne j$. Then:

1. $(\forall X \subseteq U)\ \mu^{(S)}(X) = \dfrac{\sum_{i : u_{[i]} \in X} \kappa_{[i]}}{\sum_{i=1}^{r} \kappa_i}$,
2. $(\mathrm{Ch}) \int f \, d\mu^{(S)} = \dfrac{\sum_{i=1}^{r} \kappa_i f_i}{\sum_{i=1}^{r} \kappa_i}$,
3. $(\mathrm{Su}) \int f \, d\mu^{(S)} = \max_{k=1}^{r} \min\left(f_{\langle k \rangle}, \dfrac{\sum_{i=k}^{r} \kappa_{\langle i \rangle}}{\sum_{i=1}^{r} \kappa_i}\right)$.

Proof. 1. follows directly from the definition of the similarity-aware measure, and 2. and 3. are applications of 1. to the definition of the Choquet/Sugeno integral.

4 Experimental testing

We have experimentally compared the performance of the proposed measure with the Sugeno λ-measure for the aggregation of classification rules by fuzzy integrals (Choquet, Sugeno). The ensembles have been created as random forests from rules obtained with classification trees [3], by bagging [2] from rules obtained with k-NN classifiers, and by the multiple feature subset method [1] from rules obtained with quadratic discriminant analysis.

In this section, we present results of comparing the measures using 10-fold cross-validation on 5 artificial and 11 real-world datasets (the properties of the datasets are shown in Table 1). For the random forests, the number of trees was set to $r = 20$, the number of features to explore in each node varied between 2 and 5 (depending on the dimensionality of the particular dataset), and the maximal size of a leaf was set to 10 (see [3] for a description of the parameters). For the QDA and k-NN based ensembles, their size was also set to $r = 20$, and we used $k = 5$ as the number of neighbors for k-NN classifiers. As the weights $\kappa_1, \dots, \kappa_r$ of the classification rules, we used

$$\kappa_i(\varphi) = \frac{\sum_{x \in V(A_\varphi)} I\left(C_{\varphi'}(x) = C_\varphi(x)\right)}{|V(A_\varphi)|}, \tag{32}$$

where $V(A_\varphi) \subseteq V$ is the set of validation patterns belonging to some kind of neighborhood of $A_\varphi$. For example, if $A_\varphi$ concerns values of vectors in a Euclidean space, then $V(A_\varphi)$ is the set of $k$ nearest neighbors, under the Euclidean metric, of the set where the antecedent $A_\varphi$ is valid. The number of neighbors was set to 5, 10, or 20, depending on the size of the dataset.

Table 1. Properties of the datasets. Columns: dataset, nr. of patterns, nr. of classes, dimension. Datasets: clouds, concentric, gauss-3D, glass, letters, pendigits, phoneme, pima, poker, ringnorm, satimage, transfusion, vowel, waveform, wine, yeast.

Table 2 shows the results of the performed comparisons. We also measured the statistical significance of the pairwise improvements (using the analysis of variance at the 5% significance level by the Tukey-Kramer method).

We interpret the results presented in Table 2 as a confirmation of the usefulness of the similarity-aware fuzzy measures proposed in Definition 8.

5 Conclusion

In this paper, we have studied the application of the fuzzy integral as an aggregation operator for classification rules in the context of their similarities. We have shown that traditionally used symmetric, or additive and other $\perp$-decomposable measures are not a good choice for combining classification rules by fuzzy integral, and we have defined similarity-aware measures, which take into account both the confidence / importance and the similarities of the aggregated rules. We have shown some basic theoretical properties and special cases of the measures, including the fact that, apart from the singletons, the $2^r$ values of $\mu$ are obtained using only summation. In addition, we have experimentally compared the performance of the measures to the Sugeno λ-measure using Choquet and Sugeno fuzzy integrals on 16 benchmark datasets for 3 different ways of obtaining ensembles of classification rules. The experimental comparison clearly supports our theoretical conjecture that similarity-aware measures are more suitable for the aggregation of classification rules than traditionally used additive and $\perp$-decomposable fuzzy measures.
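As a closing illustration of Definition 8, the following Python sketch builds a similarity-aware measure from a weight vector and a similarity matrix and checks the two extreme cases of Propositions 3 and 4 numerically. It is a sketch under stated assumptions, not the implementation used in the experiments: the function names, the list-of-lists matrix representation, the index-based sets, the toy values of $\kappa$ and $f$, and the tie-breaking among equal weights (Python's stable sort) are all hypothetical choices.

```python
def similarity_aware_measure(kappa, S):
    """Build mu^(S) from weights kappa (list) and an r x r similarity matrix S.

    Returns a callable that maps a collection of indices 0..r-1 to [0, 1].
    """
    r = len(kappa)
    order = sorted(range(r), key=lambda i: kappa[i])   # [1], ..., [r]: increasing kappa
    contrib = [0.0] * r
    for pos, i in enumerate(order):
        later = order[pos + 1:]                        # rules with larger weight
        max_sim = max((S[i][j] for j in later), default=0.0)
        contrib[i] = kappa[i] * (1.0 - max_sim)        # summand of the unnormalized measure
    total = sum(contrib)                               # unnormalized measure of U
    if total <= 0.0:
        raise ValueError("the unnormalized measure of U must be positive")

    def mu(subset):
        return sum(contrib[i] for i in set(subset)) / total

    return mu


def choquet(f, mu):
    """Choquet integral of f (list of values) w.r.t. mu on index sets."""
    order = sorted(range(len(f)), key=lambda i: f[i])
    total, prev = 0.0, 0.0
    for pos, i in enumerate(order):
        total += (f[i] - prev) * mu(order[pos:])
        prev = f[i]
    return total


def sugeno(f, mu):
    """Sugeno integral of f (list of values) w.r.t. mu on index sets."""
    order = sorted(range(len(f)), key=lambda i: f[i])
    return max(min(f[i], mu(order[pos:])) for pos, i in enumerate(order))


# Hypothetical weights and rule outputs for three rules.
kappa = [0.5, 0.9, 0.7]
f = [0.8, 0.3, 0.6]
r = len(kappa)

ones = [[1.0] * r for _ in range(r)]                   # all rules totally similar
identity = [[1.0 if i == j else 0.0 for j in range(r)] for i in range(r)]

mu_same = similarity_aware_measure(kappa, ones)
mu_diff = similarity_aware_measure(kappa, identity)

# Totally similar rules: both integrals return f of the rule with the largest weight.
print(choquet(f, mu_same), sugeno(f, mu_same), f[kappa.index(max(kappa))])

# Totally dissimilar rules: the Choquet integral is the kappa-weighted mean of f.
print(choquet(f, mu_diff), sum(k * v for k, v in zip(kappa, f)) / sum(kappa))
```

Because every value of the measure is a sum of per-rule contributions, only $r$ numbers have to be stored, in line with Proposition 2.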

References

1. S. D. Bay: Nearest neighbor classification from multiple feature subsets. Intelligent Data Analysis 3, 1999, 191-209.

2. L. Breiman: Bagging predictors. Machine Learning 24, 1996, 123-140.

3. L. Breiman: Random forests. Machine Learning 45, 2001, 5-32.

4. Machine Learning Group, Catholic University of Leuven: Elena database. http://mlg.info.ucl.ac.be/index.php?page=Elena.

5. D. Dubois, E. Hüllermeier, H. Prade: A systematic approach to the assessment of fuzzy association rules. Data Mining and Knowledge Discovery 13, 2006, 167-192.

6. L. Geng, H. J. Hamilton: Choosing the right lens: Finding what is interesting in data mining. In F. Guillet and H. J. Hamilton (Eds), Quality Measures in Data Mining, Springer Verlag, Berlin, 2007, 3-24.

7. M. Grabisch, H. T. Nguyen, E. A. Walker: Fundamentals of uncertainty calculi with applications to fuzzy inference. Kluwer Academic Publishers, Dordrecht, 1994.

8. P. Hájek: Metamathematics of fuzzy logic. Kluwer Academic Publishers, Dordrecht, 1998.

9. D. J. Hand: Construction and assessment of classification rules. John Wiley and Sons, New York, 1997.

10. M. Holeňa: Measures of ruleset quality capable to represent uncertain validity. Submitted to International Journal of Approximate Reasoning.

11. A. H. R. Ko, R. Sabourin, A. S. Britto: From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition 41, 2008, 1718-1731.

12. L. I. Kuncheva: Fuzzy versus nonfuzzy in combining classifiers designed by boosting. IEEE Transactions on Fuzzy Systems 11, 2003, 729-741.

13. L. I. Kuncheva: Combining pattern classifiers: methods and algorithms. John Wiley and Sons, New York, 2004.

14. L. I. Kuncheva, C. J. Whitaker: Measures of diversity in classifier ensembles. Machine Learning 51, 2003, 181-207.

15. L. E. Peterson, M. A. Coleman: Machine learning based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research. International Journal of Approximate Reasoning 47, 2008, 17-36.

16. L. Rokach: Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography. Computational Statistics and Data Analysis 53, 2009, 4046-4072.

17. V. Torra, Y. Narukawa: Modeling decisions: information fusion and aggregation operators. Springer Verlag, Berlin, 2007.

18. Machine Learning Group, University of California Irvine: Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

19. D. Štefka, M. Holeňa: Dynamic classifier systems and their applications to random forest ensembles. In Adaptive and Natural Computing Algorithms. Lecture Notes in Computer Science 5495, Springer Verlag, Berlin, 2009, 458-468.