CEUR-WS Vol-990, paper 5: Fuzzy classification rules based on similarity (https://ceur-ws.org/Vol-990/paper5.pdf)
                    Fuzzy classification rules based on similarity⋆

                                          Martin Holeňa1 and David Štefka2
                     1
                         Institute of Computer Science, Academy of Sciences of the Czech Republic
                         Pod Vodárenskou věží 2, 182 07 Prague
                                   2
                                     Faculty of Nuclear Science and Physical Engineering
                                                 Czech Technical University
                                                Trojanova 13, 120 00 Prague

Abstract. The paper deals with the aggregation of classification rules by means of fuzzy integrals, in particular with the fuzzy measures employed in that aggregation. It points out that the kinds of fuzzy measures commonly encountered in this context do not take into account the diversity of classification rules. As a remedy, a new kind of fuzzy measures is proposed, called similarity-aware measures, and several useful properties of such measures are proven. Finally, results of extensive experiments on a number of benchmark datasets are reported, in which a particular similarity-aware measure was applied to a combination of Choquet or Sugeno integrals with three different ways of creating ensembles of classification rules. In the experiments, the new measure was compared with the traditional Sugeno λ-measure, to which it was clearly superior.

1 Introduction

Logical formulas of specific kinds, usually called rules, are a traditional way of formally representing knowledge. Therefore, it is not surprising that they are also the most frequent representation of the knowledge discovered in data mining.

The most natural basis for differentiating between existing rule-extraction methods is the syntax and semantics of the extracted rules [10]. Syntactic differences between them are, however, not very deep because, in principle, any rule r from a ruleset R has one of the forms Sr ∼ S′r or Ar → Cr, where Sr, S′r, Ar and Cr are formulas of the considered logic, and ∼, → are symbols of the language of that logic. The difference between the two forms concerns semantic properties of the symbols ∼ and →: Sr ∼ S′r is symmetric with respect to Sr, S′r in the sense that its validity always coincides with that of S′r ∼ Sr, whereas Ar → Cr is not symmetric with respect to Ar, Cr in that sense. In the case of a propositional logic, ∼ and → are the connectives of equivalence (≡) and implication, respectively, whereas in the case of a predicate logic, they are generalized quantifiers. To distinguish the formulas involved in the asymmetric case, Ar is called the antecedent and Cr the consequent of r.

More important is the semantics of the rules (cf. [5]), especially the difference between rules of the Boolean logic and rules of a fuzzy logic. Due to the semantics of Boolean and fuzzy formulas, the former are valid for crisp sets of objects, whereas the validity of the latter is a fuzzy set on the universe of all considered objects. Boolean rulesets are extracted more frequently, especially some specific types of them, such as classification rulesets [6, 9]. Those are sets of implications such that {Ar}r∈R and {Cr}r∈R partition the set O of considered objects, where {·}r∈R stands for the set of distinct formulas in (·)r∈R. Abandoning the requirement that {Ar}r∈R partitions O (at least in the sense of a crisp partitioning) allows those rulesets to be generalized also to fuzzy antecedents [15]. For Boolean antecedents, however, this requirement entails a natural definition of the validity of a whole classification ruleset R for an object x. Assuming that all information about x conveyed by R is conveyed by the single rule r covering x (i.e., with Ar valid for x), the validity of R for x can be defined to coincide with the validity of Ar → Cr for that r, which in turn equals the validity of Cr for x.

It is also possible to combine several existing classification rules into a new one. Such aggregation can be either static, i.e., the result is the same for all inputs, or dynamic, where it is adapted to the currently classified input [11, 19]. In the aggregation of classification rules, we usually try to create a team of rules that are not similar. This property is called diversity [14]. There are many methods for building a diverse team of classifiers [2, 3, 16].

One of the popular aggregation operators is the fuzzy integral [7, 12, 13, 17]. It aggregates the outputs of the individual classification rules with respect to a fuzzy measure. The role of fuzzy measures in the aggregation of classification rules, in particular their role with respect to the diversity of the rules, was the subject of the research reported in this paper.

The following section recalls the fuzzy integrals and fuzzy measures encountered in the aggregation of classification rules. In Section 3, which is the key section of the paper, a new fuzzy measure, called the similarity-aware measure, is introduced and its theoretical properties are studied. Finally, in Section 4, results of extensive experiments and a comparison with the traditional Sugeno λ-measure are reported.

⋆ The research reported in this paper has been supported by the Czech Science Foundation (GA ČR) grant P202/11/1368.

2 Fuzzy integrals and measures in classification rules aggregation

Several definitions of a fuzzy integral exist in the literature; among them, the Choquet integral and the Sugeno integral are used most often. The role played in usual integration by additive measures (such as probability or the Lebesgue measure) is in fuzzy integration played by fuzzy measures. In this section, basic concepts pertaining to different kinds of fuzzy measures will be recalled, as well as the definitions of the Choquet and Sugeno integrals. Due to the intended context of aggregation of classification rules, we restrict attention to [0, 1]-valued functions on finite sets.

Definition 1. A fuzzy measure µ on a finite set U = {u1, ..., ur} is a function on the power set of U,

    µ : P(U) → [0, 1]                                          (1)

fulfilling:
1. the boundary conditions

    µ(∅) = 0,  µ(U) = 1,                                       (2)

2. the monotonicity

    A ⊆ B ⇒ µ(A) ≤ µ(B).                                       (3)

The values µ({u1}), ..., µ({ur}) are called fuzzy densities.

Definition 2. The Choquet integral of a function f : U → [0, 1], f(ui) = fi, i = 1, ..., r, with respect to a fuzzy measure µ is defined as:

    (Ch) ∫ f dµ = Σ_{i=1}^{r} (f⟨i⟩ − f⟨i−1⟩) µ(A⟨i⟩),          (4)

where ⟨·⟩ indicates that the indices have been permuted, such that 0 = f⟨0⟩ ≤ f⟨1⟩ ≤ ... ≤ f⟨r⟩ ≤ 1, and A⟨i⟩ = {u⟨i⟩, ..., u⟨r⟩} denotes the set of elements of U corresponding to the (r − i + 1) highest values of f.

Definition 3. The Sugeno integral of a function f : U → [0, 1], f(ui) = fi, i = 1, ..., r, with respect to a fuzzy measure µ is defined as:

    (Su) ∫ f dµ = max_{i=1}^{r} min(f⟨i⟩, µ(A⟨i⟩)).             (5)

To define a general fuzzy measure in the discrete case, we need to define all its 2^r values, which is usually very complicated. To overcome this weakness, measures which do not need all the 2^r values have been developed [7, 17]:

Definition 4. A fuzzy measure µ on U is called symmetric if

    |A| = |B| ⇒ µ(A) = µ(B)  for A, B ⊆ U,                     (6, 7)

where |·| denotes the cardinality of a set.

Consequently, the value of a symmetric measure depends only on the cardinality of its argument. If a symmetric measure is used in the Choquet integral, the integral reduces to the ordered weighted average operator [17]. However, symmetric measures assume that all elements of U have the same importance; thus they do not take into account the diversity of elements.

Definition 5. Let ⊥ be a t-conorm. A fuzzy measure µ is called ⊥-decomposable if

    µ(A ∪ B) = µ(A) ⊥ µ(B)  for disjoint A, B ⊆ U.             (8)

Hence, ⊥-decomposable measures need only the r fuzzy densities, whereas all the other values are computed using the formula (8). Particular cases of this kind of fuzzy measures are additive measures, including probabilistic measures (⊥ being the bounded sum), and the Sugeno λ-measure.

Definition 6. The Sugeno λ-measure [7, 17] on a finite set U = {u1, ..., ur} is defined by

    µ(A ∪ B) = µ(A) + µ(B) + λµ(A)µ(B)                         (9)

for disjoint A, B ⊆ U and some fixed λ > −1. The value of λ is:
a) computed as the unique non-zero root greater than −1 of the equation

    λ + 1 = ∏_{i=1,...,r} (1 + λµ({ui}))                       (10)

if the densities do not sum up to 1;
b) λ = 0 otherwise.

If the densities sum up to 1, the fuzzy measure is additive. The Sugeno λ-measure is a ⊥-decomposable measure for the t-conorm

    x ⊥ y = min(1, x + y + λxy).                               (11)

A serious weakness of any ⊥-decomposable measure is that the fuzzy measure of a set of two (or more) classification rules is fully determined by the formula (8) for a fixed ⊥. Therefore, if interactions between elements are to be taken into account, they have to be incorporated directly into the fuzzy measure. That fact motivated our attempt to elaborate the concept of similarity-aware fuzzy measures.
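To make the definitions above concrete, the following self-contained Python sketch (our own illustration; the function names and the toy numbers are invented, not taken from the paper) implements Definitions 2, 3 and 6 for a small finite set; the λ-root of equation (10) is located by plain bisection:

```python
def choquet_integral(f, mu):
    """Choquet integral (Definition 2) of f: element -> [0, 1] w.r.t. measure mu."""
    elems = sorted(f, key=f.__getitem__)          # ascending: f<1> <= ... <= f<r>
    total, prev = 0.0, 0.0
    for i, u in enumerate(elems):
        total += (f[u] - prev) * mu(frozenset(elems[i:]))  # A<i> = {u<i>, ..., u<r>}
        prev = f[u]
    return total

def sugeno_integral(f, mu):
    """Sugeno integral (Definition 3): max over i of min(f<i>, mu(A<i>))."""
    elems = sorted(f, key=f.__getitem__)
    return max(min(f[u], mu(frozenset(elems[i:]))) for i, u in enumerate(elems))

def sugeno_lambda_measure(densities):
    """Sugeno lambda-measure (Definition 6) built from fuzzy densities in (0, 1).

    Returns (mu, lam); lam solves equation (10), found here by bisection."""
    g = list(densities.values())
    total = sum(g)
    if abs(total - 1.0) < 1e-9:
        lam = 0.0                                  # densities sum to 1: additive case
    else:
        def F(x):                                  # F(x) = prod(1 + x g_i) - (1 + x)
            p = 1.0
            for gi in g:
                p *= 1.0 + x * gi
            return p - 1.0 - x
        if total > 1.0:                            # unique non-zero root in (-1, 0)
            lo, hi = -1.0 + 1e-9, -1e-9
        else:                                      # unique non-zero root in (0, inf)
            lo, hi = 1e-9, 1.0
            while F(hi) < 0.0:
                hi *= 2.0
        for _ in range(200):                       # plain bisection
            mid = 0.5 * (lo + hi)
            if F(lo) * F(mid) <= 0.0:
                hi = mid
            else:
                lo = mid
        lam = 0.5 * (lo + hi)
    def mu(A):
        """Closed form of (9): mu(A) = (prod_{u in A}(1 + lam g_u) - 1) / lam."""
        if not A:
            return 0.0
        if abs(lam) < 1e-12:
            return sum(densities[u] for u in A)
        p = 1.0
        for u in A:
            p *= 1.0 + lam * densities[u]
        return (p - 1.0) / lam
    return mu, lam

f = {'a': 0.2, 'b': 0.5, 'c': 0.8}                 # outputs of three rules (toy data)
uniform = lambda A: len(A) / 3.0                   # a symmetric measure (Definition 4)
mu, lam = sugeno_lambda_measure({'a': 0.2, 'b': 0.2, 'c': 0.2})
# with the symmetric measure, the Choquet integral is the plain average 0.5
```

With the symmetric measure, the Choquet integral indeed reduces to an ordered weighted average, as noted after Definition 4; for the λ-measure, µ(U) comes out as 1 even though the densities only sum to 0.6.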
3 Similarity-aware measures and their properties

Before introducing similarity-aware measures, let us first recall the notion of similarity [8].

Definition 7. Let ∧ be a t-norm and let ∼ : U × U → [0, 1] be a fuzzy relation. ∼ is called a similarity on U with respect to ∧ if the following holds for a, b, c ∈ U:

    ∼(a, a) = 1  (reflexivity),                                (12)
    ∼(a, b) = ∼(b, a)  (symmetry),                             (13)
    ∼(a, b) ∧ ∼(b, c) ≤ ∼(a, c)  (transitivity w.r.t. ∧).      (14)

In the context of aggregation of crisp classification rules, we will work with an empirically defined relation, which, for rules φk, φl, is defined as the proportion of equal consequents on some validation set of patterns V ⊂ O,

    ∼(φk, φl) = ( Σ_{x∈V} I(Cφk(x) = Cφl(x)) ) / |V|.          (15)

It is easily seen that the relation (15) is a similarity with respect to the Łukasiewicz t-norm

    ∧L(a, b) = max(a + b − 1, 0),                              (16)

but it is not a similarity with respect to the standard (minimum, Gödel) t-norm

    ∧S(a, b) = min(a, b),                                      (17)

or the product t-norm

    ∧P(a, b) = ab.                                             (18)

The fuzzy integral represents a convenient tool to work with the diversity of classification rules: as we are computing the fuzzy measure values µ(A⟨i⟩), we are considering a single rule φ⟨i⟩ at each step i, and therefore we can influence the increase of the fuzzy measure based on the similarity of φ⟨i⟩ to the set of rules already involved in the integration, i.e., A⟨i+1⟩ = {φ⟨i+1⟩, ..., φ⟨r⟩}. If φ⟨i⟩ is similar to the classifiers in A⟨i+1⟩, the increase in the fuzzy measure should be small (since the importance of the set A⟨i⟩ should be similar to the importance of the set A⟨i+1⟩), and if φ⟨i⟩ is not similar to the classifiers in A⟨i+1⟩, the increase of the fuzzy measure should be large. These ideas motivated the following definition:

Definition 8. Let U = {u1, ..., ur} be a set, let ∼ be a similarity w.r.t. a t-norm ∧, and let S be an r × r matrix such that

    S = (si,j)_{i,j=1}^{r}  with  si,j = ∼(ui, uj).            (19)

Let further κi ∈ [0, 1], i = 1, ..., r, denote some kind of weight (confidence, importance) of ui, and let [·] denote index ordering according to κ, such that 0 ≤ κ[1] ≤ ... ≤ κ[r] ≤ 1. Finally, let

    µ̃(S) : P(U) → [0, ∞)                                      (20)

be a mapping such that for X ⊆ U,

    µ̃(S)(X) = Σ_{i=1}^{r} I(u[i] ∈ X) κ[i] (1 − max_{j=i+1}^{r} s[i],[j]),   (21)

where we define max_{j=r+1}^{r} s[r],[j] = 0, and I denotes the indicator of truth value, i.e.,

    I(true) = 1,  I(false) = 0.                                (22)

Then the mapping µ(S) : P(U) → [0, 1] defined by              (23)

    µ(S)(X) = µ̃(S)(X) / µ̃(S)(U)                               (24)

is called a similarity-aware measure based on S.

Proposition 1. µ(S) is a fuzzy measure on U.

Proof. The boundary conditions follow directly from the definition of µ(S). For the monotonicity, let A ⊆ B; then

    µ̃(S)(A) = Σ_{i=1}^{r} I(u[i] ∈ A) κ[i] (1 − max_{j=i+1}^{r} s[i],[j])
             ≤ Σ_{i=1}^{r} I(u[i] ∈ B) κ[i] (1 − max_{j=i+1}^{r} s[i],[j]) = µ̃(S)(B),   (25)

due to I(u[i] ∈ A) = 1 ⇒ I(u[i] ∈ B) = 1.

Proposition 2. For any of the 2^r subsets X ⊆ U, the value µ(S)(X) can be expressed simply as the sum of the values of µ(S) on singletons:

    µ(S)(X) = Σ_{ui∈X} µ(S)({ui}).                             (26)

Proof. According to (21) and (24), the value of µ(S) on the singletons {u[i]}, i = 1, ..., r, is

    µ(S)({u[i]}) = κ[i] (1 − max_{j=i+1}^{r} s[i],[j]) / µ̃(S)(U).   (27)

Then (26) follows directly from (21).

The following propositions show that if, for some i, the i-th classification rule is totally similar to some other rule in A⟨i+1⟩, then µ(S) does not increase, and if it is totally unsimilar to all classifiers in A⟨i+1⟩, the increase in µ(S) is maximal.
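As an illustration of how Definition 8 combines with the empirical similarity (15), the following Python sketch (our own, with invented names and toy data, not code from the paper) builds S from the agreement of rule consequents on a validation set and evaluates the resulting similarity-aware measure; elements are indexed 0, ..., r−1:

```python
def empirical_similarity(consequents):
    """Relation (15): consequents[k][x] is the class that rule k assigns to
    validation pattern x; similarity = proportion of equal consequents."""
    r, n = len(consequents), len(consequents[0])
    return [[sum(a == b for a, b in zip(consequents[k], consequents[l])) / n
             for l in range(r)] for k in range(r)]

def similarity_aware_measure(kappa, S):
    """Similarity-aware measure of Definition 8 on subsets of {0, ..., r-1}."""
    r = len(kappa)
    order = sorted(range(r), key=lambda i: kappa[i])        # [.]: ascending kappa
    def mu_tilde(X):                                        # equation (21)
        total = 0.0
        for pos, i in enumerate(order):
            if i in X:
                later = [S[i][j] for j in order[pos + 1:]]  # s_[i],[j] for j > i
                total += kappa[i] * (1.0 - (max(later) if later else 0.0))
        return total
    norm = mu_tilde(set(range(r)))                          # normalization (24)
    return lambda X: mu_tilde(set(X)) / norm

# consequents of three rules on a 5-pattern validation set (toy data)
C = [[0, 1, 1, 0, 2],
     [0, 1, 1, 0, 2],      # rule 1 duplicates rule 0: similarity 1
     [2, 0, 1, 0, 1]]
S = empirical_similarity(C)
mu = similarity_aware_measure([0.6, 0.6, 0.8], S)
# rule 0 is fully duplicated by the later rule 1, so the singleton {0}
# gets measure 0, and the measure of any subset is a sum over singletons
```

The two comments at the end correspond to the behaviour established above: a rule totally similar to a later rule in the κ-ordering contributes nothing to the measure, and by Proposition 2 the value on any subset is obtained by summing singleton values.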
Proposition 3. Let f : U → [0, 1], and let the matrix S in (19) fulfil

    si,j = 1 for i ≠ j.                                        (28)

Then:
1. (∀X ⊆ U) u[r] ∈ X ⇒ µ(S)(X) = 1,
2. (∀X ⊆ U) u[r] ∉ X ⇒ µ(S)(X) = 0,
3. (Ch) ∫ f dµ(S) = (Su) ∫ f dµ(S) = f[r].

Proof. 1. and 2. follow directly from the fact that

    max_{j=i+1}^{r} s[i],[j] = 0 for i = r,  1 for i < r,      (29)

and therefore

    µ̃(S)(X) = I(u[r] ∈ X) κ[r].                                (30)

We will prove 3. only for the Choquet integral; the case of the Sugeno integral is analogous. Let j ∈ {1, ..., r} be such that ⟨j⟩ = [r]; then (∀i > j) u[r] ∉ A⟨i⟩, and therefore µ(S)(A⟨i⟩) = 0; (∀i ≤ j) u[r] ∈ A⟨i⟩, and therefore µ(S)(A⟨i⟩) = 1. Using this in the definition of the Choquet integral, we obtain

    (Ch) ∫ f dµ(S) = Σ_{i=1}^{r} (f⟨i⟩ − f⟨i−1⟩) µ(S)(A⟨i⟩)
                   = Σ_{i=1}^{j} (f⟨i⟩ − f⟨i−1⟩) = f⟨j⟩ = f[r].   (31)

Proposition 4. Let f : U → [0, 1], and let the matrix S in (19) fulfil si,j = 0 for i ≠ j. Then:
1. (∀X ⊆ U) µ(S)(X) = Σ_{i: u[i]∈X} κ[i] / Σ_{i=1}^{r} κi,
2. (Ch) ∫ f dµ(S) = Σ_{i=1}^{r} κi fi / Σ_{i=1}^{r} κi,
3. (Su) ∫ f dµ(S) = max_{k=1}^{r} min(f⟨k⟩, Σ_{i=k}^{r} κ⟨i⟩ / Σ_{i=1}^{r} κi).

Proof. 1. follows directly from the definition of the similarity-aware measure, and 2. and 3. are applications of 1. to the definitions of the Choquet and Sugeno integrals.

4 Experimental testing

We have experimentally compared the performance of the proposed measure with the Sugeno λ-measure for the aggregation of classification rules by fuzzy integrals (Choquet, Sugeno). The ensembles have been created as random forests from rules obtained with classification trees [3], by bagging [2] from rules obtained with k-NN classifiers, and by the multiple feature subset method [1] from rules obtained with quadratic discriminant analysis.

In this section, we present the results of comparing the measures using 10-fold crossvalidation on 5 artificial and 11 real-world datasets (the properties of the datasets are shown in Table 1). For the random forests, the number of trees was set to r = 20, the number of features to explore in each node varied between 2 and 5 (depending on the dimensionality of the particular dataset), and the maximal size of a leaf was set to 10 (see [3] for a description of the parameters). For the QDA and k-NN based ensembles, their size was also set to r = 20, and we used k = 5 as the number of neighbors for the k-NN classifiers. As the weights κ1, ..., κr of the classification rules, we used

    κi(φ) = Σ_{x∈V(Aφ)} I(Cφ′(x) = Cφ(x)) / |V(Aφ)|,           (32)

where V(Aφ) ⊆ V is the set of validation patterns belonging to some kind of neighborhood of Aφ. For example, if Aφ concerns values of vectors in a Euclidean space, then V(Aφ) is the set of k nearest neighbors, under the Euclidean metric, of the set where the antecedent Aφ is valid. The number of neighbors was set to 5, 10, or 20, depending on the size of the dataset.

Table 2 shows the results of the performed comparisons. We also measured the statistical significance of the pairwise improvements (using the analysis of variance at the 5% confidence level with the Tukey-Kramer method).

We interpret the results presented in Table 2 as a confirmation of the usefulness of the similarity-aware fuzzy measures proposed in Definition 8.

5 Conclusion

In this paper, we have studied the application of the fuzzy integral as an aggregation operator for classification rules in the context of their similarities. We have shown that the traditionally used symmetric, or additive and other ⊥-decomposable, measures are not a good choice for combining classification rules by the fuzzy integral, and we have defined similarity-aware measures, which take into account both the confidence/importance and the similarities of the aggregated rules. We have shown some basic theoretical properties and special cases of the measures, including the fact that, apart from the singletons, the 2^r values of µ(S) are obtained using only summation. In addition, we have experimentally compared the performance of the measures to the Sugeno λ-measure, using the Choquet and Sugeno fuzzy integrals, on 16 benchmark datasets for 3 different ways
of obtaining ensembles of classification rules. The experimental comparison clearly supports our theoretical conjecture that similarity-aware measures are more suitable for the aggregation of classification rules than the traditionally used additive and ⊥-decomposable fuzzy measures.

Table 1. Datasets used in the experiments.

    dataset            nr. of patterns  nr. of classes  dimension
    Artificial
    clouds [4]              5000              2             2
    concentric [4]          2500              2             2
    gauss 3D [4]            5000              2             3
    ringnorm [18]           3000              2            20
    waveform [18]           5000              3            21
    Real-world
    glass [18]               214              7             9
    letters [18]           20000             26            16
    pendigits [18]         10992             10            16
    phoneme [4]             5427              2             5
    pima [18]                768              2             8
    poker [18]              4828              3            10
    satimage [4]            6435              6             4
    transfusion [18]         748              2             4
    vowel [18]               990             11            10
    wine [18]                178              3            13
    yeast [18]              1484              4             8

References

 1. S. D. Bay: Nearest neighbor classification from multiple feature subsets. Intelligent Data Analysis 3, 1999, 191–209.
 2. L. Breiman: Bagging predictors. Machine Learning 24, 1996, 123–140.
 3. L. Breiman: Random forests. Machine Learning 45, 2001, 5–32.
 4. Machine Learning Group, Catholic University of Leuven: Elena database. http://mlg.info.ucl.ac.be/index.php?page=Elena.
 5. D. Dubois, E. Hüllermeier, H. Prade: A systematic approach to the assessment of fuzzy association rules. Data Mining and Knowledge Discovery 13, 2006, 167–192.
 6. L. Geng, H. J. Hamilton: Choosing the right lens: Finding what is interesting in data mining. In F. Guillet and H. J. Hamilton (Eds), Quality Measures in Data Mining, Springer Verlag, Berlin, 2007, 3–24.
 7. M. Grabisch, H. T. Nguyen, E. A. Walker: Fundamentals of uncertainty calculi with applications to fuzzy inference. Kluwer Academic Publishers, Dordrecht, 1994.
 8. P. Hájek: Metamathematics of fuzzy logic. Kluwer Academic Publishers, Dordrecht, 1998.
 9. D. J. Hand: Construction and assessment of classification rules. John Wiley and Sons, New York, 1997.
10. M. Holeňa: Measures of ruleset quality capable to represent uncertain validity. Submitted to International Journal of Approximate Reasoning.
11. A. H. R. Ko, R. Sabourin, A. S. Britto: From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition 41, 2008, 1718–1731.
12. L. I. Kuncheva: Fuzzy versus nonfuzzy in combining classifiers designed by boosting. IEEE Transactions on Fuzzy Systems 11, 2003, 729–741.
13. L. I. Kuncheva: Combining pattern classifiers: methods and algorithms. John Wiley and Sons, New York, 2004.
14. L. I. Kuncheva, C. J. Whitaker: Measures of diversity in classifier ensembles. Machine Learning 51, 2003, 181–207.
15. L. E. Peterson, M. A. Coleman: Machine learning based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research. International Journal of Approximate Reasoning 47, 2008, 17–36.
16. L. Rokach: Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography. Computational Statistics and Data Analysis 53, 2009, 4046–4072.
17. V. Torra, Y. Narukawa: Modeling decisions: information fusion and aggregation operators. Springer Verlag, Berlin, 2007.
18. Machine Learning Group, University of California Irvine: Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
19. D. Štefka, M. Holeňa: Dynamic classifier systems and their applications to random forest ensembles. In Adaptive and Natural Computing Algorithms, Lecture Notes in Computer Science 5495, Springer Verlag, Berlin, 2009, 458–468.
30   Martin Holeňa, David Štefka




                       dataset            Choquet integral              Sugeno integral
                                       λ-measure        µ^S         λ-measure        µ^S
                                                  random forests
                       clouds        12.40 ± 1.81 12.25 ± 1.85 12.80 ± 1.64 12.33 ± 1.47
                       concentric     4.32 ± 1.25    2.82 ± 1.30 3.24 ± 1.37        2.98 ± 1.50
                       gauss-3D      23.92 ± 2.97 22.76 ± 1.59 24.60 ± 1.36 23.28 ± 1.58
                       glass          21.3 ± 10.3    14.1 ± 3.5        24.1 ± 7.0   17.5 ± 9.4
                       letters        7.1 ± 0.6       7.3 ± 0.2        8.0 ± 0.6     7.9 ± 0.8
                       pendigits       3.1 ± 0.5      2.7 ± 0.5        3.2 ± 0.4     3.8 ± 0.7
                       phoneme        12.4 ± 1.2      13.2 ± 1.9       12.7 ± 0.8   13.3 ± 1.6
                       pima           26.0 ± 4.8     23.8 ± 2.0        25.0 ± 2.2   23.9 ± 3.6
                       poker          46.5 ± 3.0     44.4 ± 1.3        46.5 ± 1.5   45.1 ± 1.9
                       ringnorm      13.27 ± 2.11 7.69 ± 2.06 12.74 ± 2.08 7.46 ± 1.79
                       satimage       14.7 ± 1.4     14.3 ± 1.3        14.9 ± 0.9   14.8 ± 1.4
                       transfusion     4.8 ± 1.1      2.3 ± 0.7        4.9 ± 1.0     2.6 ± 0.7
                       vowel          14.5 ± 3.0     13.1 ± 3.5        17.0 ± 5.3   13.4 ± 3.8
                       waveform      18.56 ± 2.42 17.93 ± 1.89 18.24 ± 3.04 18.23 ± 1.58
                       wine            5.6 ± 6.0      3.3 ± 5.5        3.4 ± 4.0     6.6 ± 5.8
                       yeast          38.2 ± 4.1     34.8 ± 2.6        38.5 ± 3.7   36.3 ± 3.4

                                                    k-NN classifiers
                       clouds        11.93 ± 2.29 12.12 ± 1.57 12.64 ± 2.48 12.96 ± 2.26
                       concentric     1.39 ± 0.77    1.72 ± 0.57 1.30 ± 0.80 1.56 ± 0.64
                       gauss-3D      26.71 ± 2.55 26.00 ± 2.88 27.68 ± 3.66 26.28 ± 2.74
                       glass          22.4 ± 9.8     20.7 ± 10.3   21.7 ± 11.1      19.3 ± 6.5
                       letters        17.7 ± 2.7     17.6 ± 2.9        19.3 ± 3.1   19.1 ± 2.7
                       pendigits       1.3 ± 0.8      1.4 ± 0.8        1.3 ± 0.5     1.3 ± 0.7
                       phoneme        14.6 ± 0.9     14.2 ± 2.4        14.4 ± 1.8   14.5 ± 1.7
                       pima           29.1 ± 5.1      30.2 ± 7.2       29.5 ± 4.4   30.3 ± 6.6
                       poker          45.3 ± 2.4     43.5 ± 2.3        47.2 ± 2.7   43.9 ± 1.4
                       ringnorm      36.20 ± 4.41 34.28 ± 2.59 33.56 ± 3.34 33.48 ± 2.94
                       satimage       16.5 ± 2.0     15.5 ± 1.7        16.8 ± 2.4   16.2 ± 2.3
                       transfusion    24.0 ± 4.0     23.4 ± 4.7        25.2 ± 3.5   24.0 ± 4.7
                       vowel           4.8 ± 2.2      4.0 ± 1.9        5.6 ± 2.1     7.0 ± 1.8
                       waveform      19.40 ± 2.10 18.28 ± 2.85 19.57 ± 2.20 19.04 ± 2.99
                       wine           30.0 ± 10.3    28.6 ± 14.8       31.2 ± 6.7   33.4 ± 16.0
                       yeast          41.8 ± 4.7     40.5 ± 2.6        42.6 ± 3.7   40.7 ± 3.6




                                             QDA with multiple subsets
                          clouds      26.00 ± 2.70 23.14 ± 2.49 25.74 ± 1.92 22.66 ± 1.10
                          concentric 4.36 ± 1.96    3.68 ± 1.68   5.72 ± 1.84   3.40 ± 0.98
                          gauss-3D    23.87 ± 1.86 22.06 ± 2.10 23.96 ± 2.03 22.36 ± 2.12
                          glass       42.3 ± 10.9   38.5 ± 12.0   43.2 ± 14.9   32.4 ± 12.5
                          letters      17.1 ± 0.7   14.7 ± 0.7    17.2 ± 0.7    14.7 ± 0.8
                          pendigits    2.8 ± 0.5     2.2 ± 0.2     2.7 ± 0.5     2.7 ± 0.6
                          phoneme      25.4 ± 2.4   20.8 ± 1.4    24.7 ± 1.0    20.2 ± 2.2
                          pima         27.9 ± 4.7   25.5 ± 4.2    28.3 ± 3.3     26.1 ± 5.1
                          poker        66.1 ± 2.1   55.1 ± 2.3    66.3 ± 3.9    55.1 ± 2.1
                          ringnorm    1.94 ± 0.96   2.53 ± 1.01 1.68 ± 0.60     3.66 ± 1.31
                          satimage     17.0 ± 1.3   15.7 ± 1.1    17.2 ± 2.0     16.2 ± 1.3
                          transfusion 29.6 ± 8.6    22.3 ± 4.5    29.2 ± 7.1     23.4 ± 3.4
                          vowel        16.7 ± 5.4   14.0 ± 3.8    18.1 ± 4.1     15.6 ± 3.3
                          waveform    15.73 ± 2.07 14.52 ± 1.59 15.33 ± 1.72 14.54 ± 1.80
                          wine         1.2 ± 2.5     2.8 ± 3.9    0.6 ± 1.9      3.3 ± 2.9
                          yeast        49.0 ± 4.3   39.8 ± 4.3    49.5 ± 4.9    39.1 ± 3.8

Table 2. Mean error rates ± standard deviation of the error rate [%], based on 10-fold cross-validation. The best result
for each dataset is displayed in boldface; statistically significant improvements (measured by analysis of variance
using the Tukey–Kramer method at the 5% level) are displayed in italics.
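To make the aggregation evaluated in Table 2 concrete, the following pure-Python sketch computes the discrete Choquet integral of classifier supports with respect to a Sugeno λ-measure — the kind of measure appearing in the λ-measure columns of the table (µ^S denotes the proposed similarity-aware measure, constructed as described earlier in the paper). The function names are illustrative, not the authors' implementation, and the densities are assumed to lie strictly in (0, 1):

```python
import math

def sugeno_lambda(densities, tol=1e-12):
    """Solve prod_i(1 + lam*g_i) = 1 + lam for the parameter lam > -1,
    lam != 0, of a Sugeno lambda-measure with densities g_i in (0, 1)."""
    s = sum(densities)
    if abs(s - 1.0) < tol:
        return 0.0  # densities sum to 1: the measure is additive, lam = 0
    f = lambda lam: math.prod(1.0 + lam * g for g in densities) - 1.0 - lam
    if s > 1.0:                    # the nonzero root lies in (-1, 0)
        lo, hi = -1.0 + tol, -tol
    else:                          # the nonzero root lies in (0, inf)
        lo, hi = tol, 1.0
        while f(hi) < 0.0:         # grow the bracket until the sign changes
            hi *= 2.0
    for _ in range(200):           # plain bisection on the bracket
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def lambda_measure(indices, densities, lam):
    """mu(A) of the subset A given by `indices` under the lambda-measure."""
    if lam == 0.0:
        return sum(densities[i] for i in indices)
    return (math.prod(1.0 + lam * densities[i] for i in indices) - 1.0) / lam

def choquet(supports, densities):
    """Discrete Choquet integral of classifier supports for one class,
    taken with respect to the Sugeno lambda-measure built from `densities`."""
    lam = sugeno_lambda(densities)
    order = sorted(range(len(supports)), key=lambda i: -supports[i])
    total, prev = 0.0, 0.0
    for k in range(len(order)):    # sweep level sets in decreasing support order
        m = lambda_measure(order[:k + 1], densities, lam)
        total += supports[order[k]] * (m - prev)
        prev = m
    return total
```

When the densities sum to 1 the measure degenerates to an additive one and the Choquet integral reduces to the weighted mean, e.g. `choquet([0.8, 0.5, 0.2], [0.5, 0.3, 0.2])` yields 0.59; otherwise the λ-measure redistributes weight among coalitions of classifiers, which is exactly where it ignores their diversity and the similarity-aware measure µ^S differs.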