Confusion Matrix-based Feature Selection

Sofia Visa, Computer Science Department, College of Wooster, Wooster, OH (svisa@wooster.edu)
Brian Ramsay, Sentry Data Systems, Inc., Fort Lauderdale, FL (brian.ramsay@gmail.com)
Anca Ralescu, Computer Science Department, University of Cincinnati, Cincinnati, OH (anca.ralescu@uc.edu)
Esther van der Knaap, Ohio Agricultural Research and Development Center, The Ohio State University, Wooster, OH (vanderknaap.1@osu.edu)

Abstract

This paper introduces a new technique for feature selection and illustrates it on a real data set. The proposed approach creates subsets of attributes based on two criteria: (1) individual attributes have high discrimination (classification) power; and (2) the attributes in the subset are complementary, that is, they misclassify different classes. The method uses information from a confusion matrix and evaluates one attribute at a time.

Keywords: classification, attribute selection, confusion matrix, k-nearest neighbors

Background

In classification problems, good accuracy is the primary concern; however, the identification of the attributes (or features) having the largest separation power is also of interest. Moreover, for very large data sets (such as MRI images of the brain), classification is highly dependent on feature selection. This is mainly because the larger the number of attributes, the sparser the data become, and thus many more (exponentially more) training examples are necessary to accurately sample such a large domain. In this sense, high-dimensional data sets are almost always under-represented. This problem is known in the literature as "the curse of dimensionality". For example, a 2-attribute data set having 10 examples in the square defined by the corners (0,0) and (1,1) covers the domain acceptably. If the domain to be learned is the cube defined by the corners (0,0,0) and (1,1,1), 10 points will not cover this 3-D domain as effectively.

Reducing the number of attributes for a classification problem is a much researched field. The brute force approach to finding the best combination of attributes for classification requires the trial of all possible combinations of the available n attributes: consider one attribute at a time, then investigate all combinations of two attributes, three attributes, and so on. However, this approach is unfeasible because there are 2^n - 1 such possible combinations for n attributes; even for n = 10 there are 1,023 different attribute combinations to be investigated. Additionally, feature selection is especially needed for data sets having large numbers of attributes (e.g. thousands). Examples of such data domains include text categorization and gene expression analysis. In the first example, each document is described by the most frequent words, leading to 20,000 words or more. In working with expressed genes in order to separate healthy from cancer patients, for example, the number of attributes may grow as high as 30,000 (Guyon and Elisseeff 2003). Another example of a challenging domain is the microarray data found in (Xin, Jordan, and Karp 2001), where a hybrid of filter and wrapper approaches is employed to successfully select relevant features to classify 72 data examples in a 7,130-dimensional space.

In addition to reducing the data dimensionality, selecting fewer attributes may improve classification and may give a better understanding of the underlying process that generated the data. Here we propose an attribute-selection technique based on a confusion matrix, with the two-fold objective of better classification and better data understanding.

Depending on where the feature selection module is placed in relation to the classification module, there are two classes of feature selection methods (Jain and Zongker 1997):

• Filter methods (Pudil, Novovicova, and Kittler 1994) rank features (or feature subsets) independently of the predictor. These methods identify irrelevant features to be eliminated by looking at correlation or at the underlying distribution. For example, if two attributes have the same probability distribution, then they are redundant and one of them can be dropped. Such analysis is performed regardless of the classification method. Another filtering method ranks attributes based on the notion of nearest hit (closest example of the same class) and nearest miss (closest example of a different class) (Kira and Rendell 1992). The ranking of the i-th feature is given by the score computed as the average, over all examples, of the difference between the distance to the nearest hit and the distance to the nearest miss, in the projection onto the i-th dimension (Guyon and Elisseeff 2003); a small code sketch of this score is given after this list.

• Wrapper methods (Kohavi and John 1997) use a classifier to assess features (or feature subsets). For example, the decision tree algorithm selects the attributes having the best discriminatory power and places them closer to the root. Hence, besides the classification tree, a ranking of attributes results.
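As a concrete illustration of the nearest hit / nearest miss ranking mentioned in the filter-method item above, the following is a minimal sketch, not taken from the original paper: it assumes X and y are numpy arrays, that at least two classes are present, and that the absolute difference is used as the distance in the single-feature projection. The function name relief_scores and the (miss - hit) orientation, under which larger scores suggest more relevant features, are our own choices.

    import numpy as np

    def relief_scores(X, y):
        # X: (n_examples, n_features) array; y: array of class labels.
        # For each feature, average over all examples the gap between the
        # distance to the nearest miss and the distance to the nearest hit,
        # measured in the 1-D projection onto that feature.
        n_examples, n_features = X.shape
        scores = np.zeros(n_features)
        for i in range(n_features):
            col = X[:, i]
            diffs = []
            for j in range(n_examples):
                d = np.abs(col - col[j])                     # distances along feature i
                same = (y == y[j])
                same[j] = False                              # exclude the example itself
                hit = d[same].min() if same.any() else 0.0   # nearest example of the same class
                miss = d[y != y[j]].min()                    # nearest example of another class
                diffs.append(miss - hit)
            scores[i] = np.mean(diffs)
        return scores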
Another classification of attribute selection methods considers the search strategy over feature subsets. There are two main greedy search strategies: forward selection and backward elimination. Both techniques yield nested subsets of features. Forward selection starts with one attribute and continues adding one attribute at a time as long as the newly formed subset gives better classification. During backward elimination, unpromising attributes are progressively eliminated (Guyon and Elisseeff 2003). This greedy search technique is often used in system identification. The result, in either case, is not guaranteed to be the optimal attribute subset (Sugeno and Yasukawa 1993).

In this research we investigate the use of the confusion matrix (Kohavi and Provost 1998), which contains information about actual and predicted classifications, for attribute selection. In the context described above, this approach is a wrapper method because it uses a classifier to estimate the classification power of an attribute (or subset of attributes).

The Confusion Matrix and Disagreement Score

A confusion matrix of size n x n associated with a classifier shows the predicted and actual classification, where n is the number of different classes. Table 1 shows a confusion matrix for n = 2, whose entries have the following meanings:

• a is the number of correct negative predictions;
• b is the number of incorrect positive predictions;
• c is the number of incorrect negative predictions;
• d is the number of correct positive predictions.

Table 1: The confusion matrix for a two-class classification problem.

                        PREDICTED NEGATIVE    PREDICTED POSITIVE
    ACTUAL NEGATIVE             a                     b
    ACTUAL POSITIVE             c                     d

The prediction accuracy and classification error can be obtained from this matrix as follows:

    Accuracy = (a + d) / (a + b + c + d)                          (1)

    Error = (b + c) / (a + b + c + d)                             (2)

We define the disagreement score associated with a confusion matrix in equation (3). According to this equation, the disagreement is 1 when one of the quantities b or c is 0 (in this case the classifier misclassifies examples of one class only), and it is 0 when b and c are the same.

    D = 0                        if b = c = 0
    D = |b - c| / max{b, c}      otherwise                        (3)

The attribute selection methodology proposed here selects attributes that not only have good discrimination power on their own, but, more importantly, are complementary to each other. For example, consider two attributes A1 and A2 having similar classification accuracy. Our approach will select them as a good subset of attributes if they have a large disagreement in terms of which examples they misclassify. A large disagreement is indicated by D values close to 1 for both attributes, but with distinct denominators in equation (3).
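To make equations (1)-(3) concrete, below is a minimal Python sketch (our own illustration, not code from the paper) that computes the accuracy, error, and disagreement score D from the four entries of a two-class confusion matrix as laid out in Table 1.

    def confusion_metrics(a, b, c, d):
        # a: correct negatives, b: incorrect positives,
        # c: incorrect negatives, d: correct positives (Table 1).
        total = a + b + c + d
        accuracy = (a + d) / total                  # equation (1)
        error = (b + c) / total                     # equation (2)
        if b == 0 and c == 0:
            disagreement = 0.0                      # equation (3), perfect classifier
        else:
            disagreement = abs(b - c) / max(b, c)   # equation (3)
        return accuracy, error, disagreement

    # Example: a classifier whose only errors are on the positive class (b = 0, c > 0)
    # has disagreement 1; equal error counts on both classes give disagreement 0.
    print(confusion_metrics(a=100, b=0, c=8, d=26))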
Algorithm for Confusion Matrix-based Attribute Selection

The pseudocode outlined below shows the steps to perform confusion matrix-based attribute selection for a 2-class classification problem. The method constructs attribute subsets that: (1) contain attributes with good individual classification power, and (2) contain attributes that are complementary (i.e. they disagree in their misclassifications).

Note that the algorithm may lead to several subsets of attributes to be further investigated; among these, the subset yielding the highest classification accuracy may be selected. Also, the algorithm does not account for the possibility that two individually lower-ranked attributes may combine into a subset with high classification accuracy due to their high complementarity.

Algorithm 1: Pseudocode for the Confusion Matrix-based Attribute Selection Algorithm

    Require: 2-class data with n attributes
    Require: a classification technique
    Require: k, the number of attributes in the selected subset
    Ensure: output the k-attribute subset as a tuple Sk = (A1, A2, ..., Ak)

    Compute classifier Ci based on feature Ai, for i = 1..n
    Obtain Accuracy(Ci) and ConfMatrix(Ci)
    Rank the Ai according to Accuracy(Ci), giving ranking RA
    for i = 1..n do
        Compute the disagreement from ConfMatrix(Ci) as Di = |b - c| / max{b, c}
    end for
    Rank the Ai according to Di, giving ranking RD
    Select the top k attributes (according to RA) having large D (according to RD)
        but whose largest errors fall in different classes, giving Sk = (A1, A2, ..., Ak)
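The following is a hedged Python sketch of these steps using scikit-learn's k-nearest neighbors classifier (the classifier used later in the experiments). The train/test split, the 4-neighbor default, and the simple two-pass selection loop are our own illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    def rank_single_attributes(X, y, n_neighbors=4):
        # Evaluate every attribute on its own: accuracy, disagreement score D,
        # and the class that receives the larger number of misclassifications.
        labels = np.unique(y)                                 # the two class labels
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        results = []
        for i in range(X.shape[1]):
            clf = KNeighborsClassifier(n_neighbors=n_neighbors)
            clf.fit(X_tr[:, [i]], y_tr)
            pred = clf.predict(X_te[:, [i]])
            (a, b), (c, d) = confusion_matrix(y_te, pred, labels=labels)   # as in Table 1
            disagreement = 0.0 if b == c == 0 else abs(b - c) / max(b, c)  # equation (3)
            error_class = labels[0] if b > c else labels[1]   # class with the larger error
            results.append((i, accuracy_score(y_te, pred), disagreement, error_class))
        return results

    def select_subset(results, k):
        # Greedy selection in the spirit of Algorithm 1: walk down the accuracy
        # ranking, first picking attributes whose largest errors fall in classes
        # not yet covered (complementarity), then filling the remaining slots.
        by_accuracy = sorted(results, key=lambda r: r[1], reverse=True)
        chosen, covered = [], set()
        for attr, acc, dis, err_cls in by_accuracy:
            if err_cls not in covered:
                chosen.append(attr)
                covered.add(err_cls)
        for attr, acc, dis, err_cls in by_accuracy:
            if attr not in chosen and len(chosen) < k:
                chosen.append(attr)
        return chosen[:k]

For the two-class tomato experiment described later, this kind of selection would pair attribute 14 (the attribute whose larger error falls on class 1) with accurate attributes that err mostly on class 7.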
The Tomato Fruit Data Set

The data set used in the experimental part of this research consists of 416 examples having 34 attributes and distributed in 8 classes (the class distribution is shown in Table 2). This set was obtained from the Ohio Agricultural Research and Development Center (OARDC) research group led by E. Van Der Knaap (Rodriguez et al. 2010), and the classification task is to correctly label a tomato fruit based on morphological measurements such as width, length, perimeter, circularity (i.e. how well a transversal cut of a tomato fits a circle), angle at the tip of the tomato, etc.

The data set was collected as follows: from the scanned image of a longitudinal section of a tomato fruit, the 34 measurements are extracted by the Tomato Analyzer (TA) software (Rodriguez et al. 2010), developed by the same group. For a complete description of the 34 tomato fruit measurements and the TA software see (Gonzalo et al. 2009).

Table 2: Data distribution across classes. In total, there are 416 examples, each having 34 attributes.

    Class   Class label    No. of examples
    1       Ellipse        110
    2       Flat           115
    3       Heart           29
    4       Long            36
    5       Obvoid          32
    6       Oxheart         12
    7       Rectangular     34
    8       Round           48

In addition to tomato classification, of interest here is to find which attributes have more discriminative power, and further to find a ranking of the attributes.

Data Classification and Attribute Selection

In this section we show the decision tree classification of the tomato data set, then we illustrate our attribute selection algorithm (in combination with a k-nearest neighbor classifier) on two of the 8 classes. These two classes (1 and 7) are identified by both decision trees and k-nearest neighbors as highly overlapping.

Classification with Decision Trees - CART

We used the Classification and Regression Trees (CART) method (Breiman et al. 1984) because it generates rules that can be easily understood and explained. At the same time, classification trees have a built-in mechanism to perform attribute selection (Breiman et al. 1984), so we can compare our set of selected attributes, obtained from the confusion matrix and k-nearest neighbors analysis, with the set of attributes identified by CART. However, we anticipate that these sets will not perfectly coincide, which only means that the two approaches quantify the importance of a given attribute (or subset of attributes) differently and that the two methods learn the data differently.

The pruned decision tree obtained using CART is shown in Figure 1. The train and test errors associated with this tree are 11.54% and 18.27%, respectively. As can be seen from this figure, 10 rules can be extracted. In addition, CART selects the following 8 attributes as best for classification (listed in decreasing order of their importance):

• 7 - Fruit Shape Idx Ext1
• 13 - Circular
• 12 - Ellipsoid
• 11 - Fruit Shape Triangle
• 14 - Rectangular
• 10 - Distal Fruit Blockiness
• 8 - Fruit Shape Idx Ext2
• 1 - Perimeter

Figure 1: Decision tree obtained with CART for all data and all attributes.

We also investigate the k-nearest neighbors classifier, as we will use this classification method in combination with our attribute selection approach. Figure 2 shows the k-nearest neighbors classification error for k = 2,...,15, when all 34 attributes are used and when only the best five attributes identified by the CART classification are used.

Figure 2: K-nearest neighbors classification error for k = 2,...,15. The top (blue) line corresponds to runs that include all 34 attributes and the bottom (red) line shows the error when only the best five attributes are used (these attributes were identified through CART classification).

As shown in Figure 2, the k-nearest neighbors classification technique consistently scores lower error when using only the 8 attributes selected by CART rather than all 34. Thus, a natural question arising here is: is there a better combination of attributes than the one selected by CART? For k = 4 the k-nearest neighbors technique yields the lowest error, which justifies our choice of k = 4 in the next experiments.
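As a rough illustration of the comparison reported in Figure 2, the sketch below contrasts the cross-validated error of a 4-nearest-neighbors classifier on all 34 attributes with its error on the CART-selected subset. The load_tomato_data helper is a hypothetical placeholder (the real measurements come from the Tomato Analyzer software), and the 5-fold cross-validation and the 1-based-to-0-based index mapping are our own assumptions, not the paper's protocol.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical loader: X should be a (416, 34) array of Tomato Analyzer
    # measurements and y the 8 shape-class labels of Table 2. The random data
    # below is only a stand-in so the sketch runs.
    def load_tomato_data():
        rng = np.random.default_rng(0)
        X = rng.normal(size=(416, 34))
        y = rng.integers(1, 9, size=416)
        return X, y

    X, y = load_tomato_data()

    # CART-selected attributes (1-based numbers 7, 13, 12, 11, 14, 10, 8, 1),
    # converted here to 0-based column indices; treat this mapping as an assumption.
    cart_subset = [a - 1 for a in (7, 13, 12, 11, 14, 10, 8, 1)]

    knn = KNeighborsClassifier(n_neighbors=4)
    err_all = 1 - cross_val_score(knn, X, y, cv=5).mean()
    err_subset = 1 - cross_val_score(knn, X[:, cart_subset], y, cv=5).mean()
    print(f"4-NN error, all 34 attributes: {err_all:.3f}")
    print(f"4-NN error, CART subset:       {err_subset:.3f}")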
The Confusion Matrix for the Tomato Data Set

When using all 34 attributes and all 8 classes, the confusion matrix obtained from the k-nearest neighbors classifier with k = 4 is shown in Figure 3, where along the x-axis are listed the true class labels and along the y-axis are the k-nearest neighbors class predictions. Along the first diagonal are the correct classifications, whereas all the other entries show misclassifications. The bottom right cell shows the overall accuracy.

Figure 3: Confusion matrix for all classes and all attributes. Class 7 has 8 examples wrongly predicted as class 1 (see top row).

In this confusion matrix it can be seen that 8 examples of class 7 are wrongly predicted as class 1. Additionally, from the CART classification, classes 1 and 7 are identified as the two classes overlapping the most. Thus, the experiment presented next uses the confusion matrix-based attribute selection to better separate these two classes. Namely, we search for a subset of the 34 attributes such that the attributes are complementary in the sense described above and quantified in equation (3).

Confusion Matrix-based Attribute Selection for Classes 1 and 7

When using the data from classes 1 (Ellipse) and 7 (Rectangular), a data set of size 145 is obtained (110 examples from class 1 and 34 from class 7).

As illustrated in Algorithm 1, for each of the 34 attributes the k-nearest neighbor algorithm (with k = 4) is used for classification, and the corresponding classification accuracy and confusion matrix are obtained for each attribute. Further, the 34 attributes are ranked in the order of their individual performance in distinguishing between classes 1 and 7, leading to the ranking set R = 14, 7, 8, 17, 1, 3, 6, 12, 30, 4, 9, 20, 29, 18, 26, 2, 10, 21, 34, 32, 11, 33, 5, 13, 19, 16, 15, 31, 25, 28, 24, 27, 22, 23.

We first create growing nested subsets of attributes in the order specified by their individual classification abilities. Note that this particular choice of subsets is not part of Algorithm 1 and makes no use of complementarity. We simply introduce it as a comparative model for our selection approach, which, besides the accuracy ranking, incorporates complementarity information as well; a code sketch of this baseline is given below.

The above approach is a (greedy) forward selection method, i.e. the attributes are progressively incorporated into larger and larger subsets. However, here we incorporate the attributes in the order dictated by their individual performance, yielding nested subsets of attributes. For example, we do not evaluate the performance of the subset consisting of the first and third attributes; indeed, it may be that this subset performs better than using all top three attributes.

Figure 6 shows the classification accuracy for subsets of attributes consisting of the top 1, top 2, top 3, etc. attributes from R (the subsets are shown on the x-axis, while the y-axis shows classification accuracy). From Figure 6 it can be seen that the highest accuracy is achieved when the top 3 attributes are used together (i.e. attributes 14, 7, and 8), yielding a 97.2% correct classification.

Figure 6: K-nearest neighbors classification accuracy for k = 4 when using data from classes 1 and 7 only. On the x-axis are listed the nested subsets of attributes having the top 1, 2, 3, ..., 34 attributes. The highest accuracy (97.2%) is obtained for the subset having the top 3 attributes: 14, 7, and 8.
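A minimal sketch of this nested-subset baseline is given below (our own illustration, not the authors' code). X17 and y17 are assumed to be a numpy feature matrix and label vector restricted to classes 1 and 7, the 5-fold cross-validation is an assumption, and the list R is the accuracy ranking quoted above.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Accuracy ranking R from the paper (1-based attribute numbers, best first).
    R = [14, 7, 8, 17, 1, 3, 6, 12, 30, 4, 9, 20, 29, 18, 26, 2, 10,
         21, 34, 32, 11, 33, 5, 13, 19, 16, 15, 31, 25, 28, 24, 27, 22, 23]

    def nested_subset_accuracies(X17, y17, ranking=R, n_neighbors=4):
        # Evaluate 4-NN on the top-1, top-2, ..., top-34 attribute subsets, taken
        # in the order of their individual accuracy (the comparative baseline
        # that ignores complementarity).
        accuracies = []
        for m in range(1, len(ranking) + 1):
            cols = [a - 1 for a in ranking[:m]]            # 0-based column indices
            knn = KNeighborsClassifier(n_neighbors=n_neighbors)
            accuracies.append(cross_val_score(knn, X17[:, cols], y17, cv=5).mean())
        return accuracies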
However, as indicated earlier in this paper, evaluating all possible combinations is not a feasible approach. Thus, we propose to combine the attributes that are complementary, i.e. two (or more) attributes that may individually achieve similar classification accuracy but have the largest disagreement (this information is extracted from the confusion matrix).

The disagreement scores for all 34 attributes when classifying data from classes 1 and 7 are listed in Table 3, column 2. As column 3 of the same table shows, only attribute 14 misclassifies more examples of class 1 than of class 7 (see the bottom row). All other 33 attributes have the largest number of misclassifications attributed to class 7. Thus, for this particular data set, we will consider subsets that combine attribute 14 with one or more other attributes having the largest disagreement.

Table 3: Attribute ranking based on disagreement score. The best classification attributes found by CART are shown in bold (they are also underlined). The attributes marked by (*) are the ones identified by our selection algorithm.

    Attr. number   Disagreement score   Class of largest error
    20             1                    7
    22             1                    7
    24             1                    7
    25             1                    7
    26             1                    7
    27             1                    7
    31             1                    7
    23             0.9655               7
    28             0.9565               7
    15             0.9524               7
    2              0.9375               7
    21             0.9375               7
    29             0.9231               7
    12             0.9167               7
    30             0.9167               7
    1              0.9091               7
    3              0.9091               7
    7              0.9000               7
    17             0.9000               7
    5              0.8889               7
    11             0.8824               7
    32             0.8750               7
    34             0.8667               7
    18             0.8462               7
    13             0.8235               7
    19             0.8235               7
    6              0.8182               7
    33             0.8125               7
    4              0.7273               7
    9*             0.7273               7
    16*            0.6875               7
    8*             0.6250               7
    10*            0.3000               7
    14*            0.2500               1

For example, Figures 4 and 5 show the confusion matrices for classes 1 and 7 when using attributes 14 and 20, respectively. These two attributes disagree the most, as these figures and Table 3 show.

Figure 4: Confusion matrix for classes 1 and 7 along attribute 14. Four examples of class 1 are misclassified as class 7, and 3 examples of class 7 are misclassified as class 1.

Figure 5: Confusion matrix for classes 1 and 7 along attribute 20. Fourteen examples of class 7 are misclassified as class 1.

Algorithm 1 is illustrated here for k = 2, 3, 4, and 5 only (note that for k = 1 the classification results are the same as in Figure 6). The classification accuracy for these experiments is shown in Figures 7, 8, 9, and 10, respectively. In addition to selecting the top k attributes from Table 3 (this combination yields the first star in these plots), we also plot the classification accuracy of a sliding window of k attributes, moving from top to bottom of Table 3.

Figures 7, 8, 9, and 10 show the classification accuracy of the k-nearest neighbor algorithm when attribute 14 is combined with the other attributes in decreasing order of their disagreement scores (see Table 3). Figure 7 shows results for 2-member subsets, and the results in Figure 8 are obtained for 3-member subsets: attribute 14 and two consecutive attributes from Table 3. A code sketch of this sliding-window evaluation is given below.

As Figure 10 illustrates, simply selecting the top attributes from Table 3 (those having the largest disagreement) does not ensure better classification, nor is an increasing or decreasing trend observed when sliding down the table. This is because classification ability is not additive, which opens up the question of whether a better k-attribute subset can be obtained by mixing attributes across the table, not only by taking k neighboring rows of the table as done here.
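Below is a hedged sketch of the sliding-window experiment (again our own illustration): attribute 14 is combined with k - 1 consecutive attributes taken in the disagreement order of Table 3. X17 and y17 denote the class-1/class-7 data as before, and the cross-validation protocol is an assumption rather than the paper's setup.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Disagreement ranking from Table 3 (1-based attribute numbers, attribute 14 excluded).
    DISAGREEMENT_ORDER = [20, 22, 24, 25, 26, 27, 31, 23, 28, 15, 2, 21, 29, 12, 30,
                          1, 3, 7, 17, 5, 11, 32, 34, 18, 13, 19, 6, 33, 4, 9, 16, 8, 10]

    def sliding_window_accuracies(X17, y17, k, n_neighbors=4):
        # For each window of k-1 consecutive attributes in the disagreement ranking,
        # evaluate 4-NN on {attribute 14} plus that window (cf. Figures 7-10).
        accuracies = []
        for start in range(len(DISAGREEMENT_ORDER) - (k - 1) + 1):
            window = DISAGREEMENT_ORDER[start:start + (k - 1)]
            cols = [13] + [a - 1 for a in window]      # attribute 14 is column index 13
            knn = KNeighborsClassifier(n_neighbors=n_neighbors)
            accuracies.append(cross_val_score(knn, X17[:, cols], y17, cv=5).mean())
        return accuracies

    # Example: accuracies of the 5-member subsets (attribute 14 plus four consecutive
    # attributes), the setting in which the paper reports its best result.
    # acc5 = sliding_window_accuracies(X17, y17, k=5)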
Figure 7: K-nearest neighbors classification accuracy for k = 4 when using 2-member subsets of attributes. Each subset contains attribute 14 and one of the remaining attributes; the x-axis shows these subsets listed in the order of their complementarity (see Table 3). The largest accuracy is 97.3%.

Figure 8: K-nearest neighbors classification accuracy for k = 4 when using 3-member subsets of attributes. Each subset contains attribute 14 and two consecutive attributes (in the order of their complementarity, see Table 3). The largest accuracy is 97.3%.

Figure 9: K-nearest neighbors classification accuracy for k = 4 when using 4-member subsets of attributes. Each subset contains attribute 14 and three consecutive attributes (in the order of their complementarity, see Table 3). The largest accuracy is 96.7%.

Figure 10: K-nearest neighbors classification accuracy for k = 4 when using 5-member subsets of attributes. Each subset contains attribute 14 and four consecutive attributes (in the order of their complementarity, see Table 3). The largest accuracy is 98%.

Among the k-member subsets investigated here (note that there are more sliding-window subsets for k > 5), the largest classification accuracy (98%) is achieved for a 5-member subset, namely the attribute set 14, 9, 16, 8, and 10. The CART classifier recognizes these two classes with 93% accuracy (using attributes 7, 13, 12, 14, 11, and 10), and the accuracy-ranking-only selection (no complementarity information incorporated) achieves 97.3% (using the top 3 attributes: 14, 7, and 8). The attribute subset with the largest discriminating power (when using the k-nearest neighbors classifier with k = 4) is thus obtained with the confusion matrix-based attribute selection; however, it is not a large improvement, as the classes are fairly well separated to begin with.

Conclusions and Future Work

A new technique for attribute selection is proposed here. The method selects attributes that are complementary to each other, in the sense that they misclassify different classes, and favors attributes that have good classification abilities by themselves. This new approach is illustrated on a real data set. For two classes of interest within this data set, the technique found a subset of attributes yielding higher classification accuracy than using all attributes or even the 8 attributes identified by CART. However, we must investigate this new approach on more data sets and in combination with other classification techniques (here only the k-nearest neighbor classifier was investigated). Another future direction is to investigate the use of subsets that combine complementary attributes even if these attributes are weak classifiers by themselves. The challenging factor for this approach is the large number of subsets that must be investigated; depending on the data set, if this search space is very large, then genetic algorithms can be used to explore it. We must also extend this method to multi-class data sets and investigate its scalability.
Acknowledgments

Esther van der Knaap acknowledges support from the NSF grant NSF DBI-0922661. Sofia Visa was partially supported by the NSF grant DBI-0922661 (60020128) and by the College of Wooster Faculty Start-up Fund.

References

Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C., eds. 1984. Classification and Regression Trees. CRC Press, Boca Raton, FL.

Gonzalo, M.; Brewer, M.; Anderson, C.; Sullivan, D.; Gray, S.; and van der Knaap, E. 2009. Tomato Fruit Shape Analysis Using Morphometric and Morphology Attributes Implemented in Tomato Analyzer Software Program. Journal of American Society of Horticulture 134:77-87.

Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157-1182.

Jain, A., and Zongker, D. 1997. Feature selection: evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(2):153-158.

Kira, K., and Rendell, L. 1992. A practical approach to feature selection. In International Conference on Machine Learning, 368-377.

Kohavi, R., and John, G. 1997. Wrappers for feature subset selection. Artificial Intelligence 97:273-324.

Kohavi, R., and Provost, F. 1998. On Applied Research in Machine Learning. In Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Columbia University, New York, volume 30.

Pudil, P.; Novovicova, J.; and Kittler, J. 1994. Floating search methods in feature selection. Pattern Recognition Letters 15(11):1119-1125.

Rodriguez, S.; Moyseenko, J.; Robbins, M.; Huarachi Morejon, N.; Francis, D.; and van der Knaap, E. 2010. Tomato Analyzer: A Useful Software Application to Collect Accurate and Detailed Morphological and Colorimetric Data from Two-dimensional Objects. Journal of Visualized Experiments 37.

Sugeno, M., and Yasukawa, T. 1993. A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems 1(1):7-31.

Xin, E.; Jordan, M.; and Karp, R. 2001. Feature Selection for High-Dimensional Genomic Microarray Data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), 601-608.