Learning Morphological Data of Tomato Fruits

Joshua Thomas, Matthew Lambert, Bennjamin Snyder, Michael Janning, Jacob Haning, Yanglong Hu, Mohammad Ahmad, Sofia Visa
Computer Science Department, College of Wooster
jet4416@gmail.com, mlambert13@wooster.edu, benn.snyder@gmail.com, mjanning13@wooster.edu, jacob.haning1@gmail.com, yhu12@wooster.edu, mahmad12@wooster.edu, svisa@wooster.edu

Abstract

Three methods for attribute reduction in conjunction with Neural Networks, Naive Bayes, and k-Nearest Neighbor classifiers are investigated here when classifying a particularly challenging data set. The difficulty encountered with this data set is mainly due to the high dimensionality and to some imbalance between classes. As a result of this research, a subset of only 8 attributes (out of 34) is identified, leading to a 92.7% classification accuracy. The confusion matrix analysis identifies class 7 as the one poorly learned across all combinations of attributes and classifiers. This information can be further used to upsample this underrepresented class or to investigate a classifier less sensitive to imbalance.

Keywords: classification, attribute selection, confusion matrix

Introduction

Knowing (or choosing) the best machine learning algorithm for classifying a particular real-world data set is still an ongoing research topic. Researchers have tackled this problem more as experimental studies, such as the ones shown in (Michie, Spiegelhalter, and Taylor 1999) and (Visa and Ralescu 2004), than as theoretical ones. Currently, it is difficult to study the problem of the best classification method given a particular data set (or the reverse problem, for that matter), because data classification depends on many variables, e.g. number of attributes, number of examples and their distribution across classes, underlying distribution along each attribute, etc. Additionally, it is difficult to study classifier induction in general, because different classifiers learn in different ways, or, stated differently, different classifiers may have different learning biases. Thus, this research focuses on finding the best classification method for a particular data set of interest through experimental means.

We investigate several machine learning techniques for learning a particular 8-class domain having 34 attributes and only 416 examples. We also combine these methods with various subsets of attributes selected based on their discriminating power. The main goal of this research is to find the best classification algorithm (or ensemble of algorithms) for this particular data set.

The research presented here is part of a bigger project, with the classification of morphological tomato data (i.e. data describing the shape and size of tomato fruits, such as the data set used here) being the first step. Namely, having the morphological and gene expression data, the dependencies between these two sets are to be investigated. The goal of such a computational study is to reveal genes that affect particular shapes (e.g. elongated or round tomato) or sizes (e.g. cherry versus beef tomato) in the tomato fruit. However, as mentioned above, a high classification accuracy of tomatoes based on their morphological attributes is required first.

The Tomato Fruit Morphological Data

The experimental data set was obtained from the Ohio Agricultural Research and Development Center (OARDC) research group led by E. Van Der Knaap (Rodriguez et al. 2010). This morphological data of tomato fruits consists of 416 examples having 34 attributes and distributed in 8 classes. The 8 classes are illustrated in Figure 1 and the distribution of the 416 examples is shown in Table 1 (Visa et al. 2011).

Table 1: Data distribution across classes. In total, there are 416 examples, each having 34 attributes. (Table from (Visa et al. 2011))

Class   Class label    No. of examples
1       Ellipse        110
2       Flat           115
3       Heart          29
4       Long           36
5       Obvoid         32
6       Oxheart        12
7       Rectangular    34
8       Round          48

Figure 1: Sketch of the 8 morphological classes of the tomato fruits.

The 34 attributes numerically quantify morphological properties of the tomato fruits such as perimeter, width, length, circularity (i.e. how well a transversal cut of a tomato fits a circle), rectangle (similarly, how well it fits a rectangle), angle at the tip of the tomato, etc. A more detailed description of the 34 attributes and how they are calculated can be found in (Gonzalo et al. 2009).
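The exact attribute definitions used by the Tomato Analyzer software are those given in (Gonzalo et al. 2009); purely as an illustration, two of the shape attributes mentioned above can be approximated from a fruit cross-section contour with common geometric proxies (isoperimetric quotient for circularity, bounding-box fill ratio for rectangularity). The sketch below is a generic Python approximation under those assumptions, not the Tomato Analyzer formulas.

```python
# Illustrative proxies only; the Tomato Analyzer definitions are in (Gonzalo et al. 2009).
import numpy as np

def shape_descriptors(contour: np.ndarray) -> dict:
    """contour: (n, 2) array of (x, y) boundary points of a fruit cross-section."""
    # Perimeter: sum of distances between consecutive boundary points (closed curve).
    diffs = np.diff(np.vstack([contour, contour[:1]]), axis=0)
    perimeter = np.sum(np.linalg.norm(diffs, axis=1))

    # Area via the shoelace formula.
    x, y = contour[:, 0], contour[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    # Circularity proxy: 4*pi*A / P^2 (equals 1.0 for a perfect circle).
    circularity = 4.0 * np.pi * area / perimeter ** 2

    # Rectangularity proxy: fraction of the axis-aligned bounding box that is filled.
    width, length = x.max() - x.min(), y.max() - y.min()
    rectangularity = area / (width * length)

    return {"perimeter": perimeter, "area": area, "width": width,
            "length": length, "circularity": circularity,
            "rectangularity": rectangularity}
```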
Problem Description and Methodology

The focus of this research is to find the best (defined here as high classification accuracy, e.g. 90%) classification technique (or combination of classifiers) for the morphological tomato data set. In addition to tomato fruit classification, it concentrates on finding which attributes have more discriminative power and on finding a ranking of these attributes.

As seen in Figure 1, the round class and several others may have smaller or much larger instances of tomato fruits. Thus, attributes 1 (perimeter) and 2 (area), for example, might have no positive influence in classifying these classes; at worst, they may hinder classification. The tomato data set of interest here has 34 attributes and only 416 examples available for learning and testing. One can argue that many more examples are needed for effective learning in such a high-dimensional space. Furthermore, the class distribution is imbalanced, with the largest and the smallest classes having 115 examples (class 2, Flat tomatoes) and 12 examples (class 6, Oxheart tomatoes), respectively (see Table 1).

For these reasons, our strategy is to investigate several machine learning classifiers on subsets of top-ranked attributes in an effort to reduce the data dimensionality and to achieve better data classification. Finding whether different classification algorithms make identical errors (for example, they all misclassify class 7 as class 1) is also of interest in this experimental study. Our hypothesis is that (some) different classifiers misclassify different data examples and thus, by combining different classifiers, one can achieve better accuracy merely through their complementarity. The misclassification error for each individual class is tracked through the use of confusion matrices.

Attribute Selection Techniques

Two filter methods (analysis of variance, ANOVA (Hogg and Ledolter 1987), and the RELIEF method (Kira and Rendell 1992b), (Kira and Rendell 1992a)) and one wrapper method are used in our experiments for attribute ranking. The first two algorithms are independent of the choice of classifier (Guyon and Elisseeff 2003), whereas the third one is "wrapped" around a classifier - here the attributes selected by the CART decision tree are used (Breiman et al. 1984).

The first attribute ranking method considered here is based on the analysis of variance, which scores each attribute by testing whether its class means differ, i.e. by comparing the variation between classes with the variation within the classes (Hogg and Ledolter 1987).

The second ranking method we use is the RELIEF algorithm, introduced by (Kira and Rendell 1992b) and described and expanded upon by (Sun and Wu 2009). In short, the algorithm examines all instances of every attribute and calculates the distance to each instance's nearest hit (the nearest instance that has the same classification) and nearest miss (the nearest instance that has a different classification). It then calculates the difference between the nearest-miss and nearest-hit distances over all instances of each attribute, as shown in equation (1):

d_n = \|x_n - \mathrm{NearestMiss}(x_n)\| - \|x_n - \mathrm{NearestHit}(x_n)\|    (1)

where x_n is an instance, d_n is its resulting score, NearestMiss(x_n) is the nearest miss of the instance, and NearestHit(x_n) is its nearest hit. Then, the d-values are summed over all instances and the attributes are ranked from the largest value to the smallest. Zero may provide an appropriate cut-off point when selecting attributes.
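The experiments in this paper were run in Matlab; as an illustration only, the two filter rankings can be sketched in Python as follows. The sketch assumes a feature matrix X of shape (416, 34) and a label vector y, that nearest hits/misses are located in the full attribute space, and that the score of equation (1) is accumulated per attribute from absolute value differences; the ANOVA ranking is approximated with a per-attribute one-way F statistic (scipy). None of these implementation choices are claimed to be the authors'.

```python
# Minimal RELIEF-style and ANOVA attribute-ranking sketch (not the authors' Matlab code).
import numpy as np
from scipy.stats import f_oneway

def relief_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    n, p = X.shape
    scores = np.zeros(p)
    # Pairwise Euclidean distances in the full attribute space.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor
    for i in range(n):
        same = (y == y[i]) & (np.arange(n) != i)
        diff = y != y[i]
        hit = np.argmin(np.where(same, dist[i], np.inf))   # nearest hit
        miss = np.argmin(np.where(diff, dist[i], np.inf))  # nearest miss
        # Per-attribute contribution of equation (1) for instance i.
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores  # attributes ranked by descending score; zero as possible cut-off

def anova_f_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # One-way ANOVA F statistic per attribute (between-class vs. within-class variation).
    groups = [X[y == c] for c in np.unique(y)]
    return np.array([f_oneway(*[g[:, j] for g in groups]).statistic
                     for j in range(X.shape[1])])

# Example ranking: attribute indices sorted from most to least relevant.
# relief_ranking = np.argsort(relief_scores(X, y))[::-1]
```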
The third ranking is obtained as a result of decision tree classification (CART), which, through a greedy approach, places the most important attributes (based on information gain) closer to the root of the tree.

Table 2: Top 10 ANOVA and RELIEF attribute rankings. Column 3 shows the top 8 ranked attributes resulting from classification and regression trees (CART) (Visa et al. 2011).

ANOVA   RELIEF   CART
17      21       7
20      18       13
18      7        12
21      8        11
2       33       14
1       13       10
28      11       8
26      9        1
6       19       -
5       22       -

Classification Techniques

We use Matlab to conduct these experiments. For each experiment, 75% of the data is randomly selected for training and the remaining 25% is used for testing. Matlab implementations of Naive Bayes (NB), k-Nearest Neighbors (kNN) for k=4, and various Artificial Neural Network (NN) configurations are tested in conjunction with the three reduced-attribute tomato data sets, as well as with the whole data set (i.e. having all 34 attributes). For the latter case, the classifiers are ordered by their accuracies as follows: NN (89.1%), NB (80%), kNN (79.1%). Here, kNN is investigated for k=4 only because (Visa et al. 2011) shows that it achieves the lowest error over a larger range of k.
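For readers who want to reproduce the protocol outside Matlab, a minimal scikit-learn sketch of the same setup (random 75/25 split; NB, kNN with k=4, and a small feed-forward NN) is given below. The data loading, the attribute subset, and the MLP settings are assumptions for illustration; the Matlab toolboxes used for the reported results may behave differently.

```python
# Illustrative scikit-learn sketch of the evaluation protocol (the reported
# experiments were run in Matlab; exact settings may differ).
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# X: (416, 34) attribute matrix, y: class labels 1..8 (loading not shown).
# Top 8 CART attributes from Table 2 (1-indexed attribute ids):
cart_top8 = [7, 13, 12, 11, 14, 10, 8, 1]

def evaluate(clf, X, y, attribute_subset=None, test_size=0.25, seed=0):
    if attribute_subset is not None:
        X = X[:, [a - 1 for a in attribute_subset]]  # convert 1-indexed ids to columns
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=seed)  # random 75/25 split
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    return accuracy_score(y_te, y_hat), confusion_matrix(y_te, y_hat)

classifiers = {
    "NB": GaussianNB(),
    "kNN (k=4)": KNeighborsClassifier(n_neighbors=4),
    # One hidden layer of 10 neurons, mirroring the best CART configuration in Table 3.
    "NN": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
}
```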
Results

The top 10 ANOVA and RELIEF attribute rankings are shown in the first two columns of Table 2. Column 3 shows the top 8 ranked attributes resulting from classification and regression trees.

NN Results

Many NN configurations (in terms of number of layers, number of neurons in each layer, training method, and activation function) were tried for each of the three data sets obtained from selecting the subsets of attributes shown in Table 2. However, only the ones leading to the best results are reported in Table 3. Among the subsets of attributes studied here, the 8 attributes resulting from the decision tree classification lead to the best classification in the NN case (92.7%). The confusion matrix associated with this case is shown in Figure 2. From this matrix, it can be seen that the largest error comes from misclassifying 3 test data points of class 7 (Rectangular) as class 1 (Ellipse). Indeed, Figure 1 shows that these two classes are the most similar in terms of shape.

Figure 2: Confusion matrix of NN for top 8 CART attributes. This case achieved the highest classification accuracy when using NN (92.7%).

Table 3: Best NN configurations and their corresponding classification accuracies.

Attribute subset   No. of layers   No. of neurons   Accuracy
Top 10 ANOVA       1               10               84.5%
Top 10 RELIEF      2               25+15            88.2%
Top 8 CART         1               10               92.7%
All 34             2               25+15            89.1%

NB and kNN Results

Figure 3 shows the accuracy of NB and kNN for the top k (k=1,...,10) ANOVA attributes, the top k (k=1,...,10) RELIEF attributes, and the top k (k=1,...,8) CART attributes. The two largest accuracy values are obtained for kNN (83.6%) with the top 5 ANOVA attributes, and for NB (81.1%) with the top 9 RELIEF attributes. For these two cases, the confusion matrices showing the misclassifications across the 8 classes are shown in Figures 4 and 5, respectively. Similar to the NN classifier, NB and kNN both misclassify class 7 as class 1 (by 4 and 3 examples, respectively). However, contrary to NN, NB and kNN carry some additional class confusions:

• NB misclassifies class 3 as class 1 (3 instances) and as class 8 (3 instances);
• Additional error for kNN comes from misclassifying class 3 as class 2 (3 examples) and class 7 as class 2 (3 examples).

Figure 3: Accuracy of NB and kNN for top k (k=1,10) ANOVA attributes (top figure), top k (k=1,10) RELIEF attributes, and top k (k=1,8) CART attributes (k is shown on the x-axis).

Figure 4: Confusion matrix of kNN for top 5 ANOVA attributes. This case achieved the highest classification accuracy when using kNN (83.6%).

Figure 5: Confusion matrix of NB for top 9 RELIEF attributes. This case achieved the best classification accuracy when using NB (81.1%).
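The per-class confusions discussed above (e.g. class 7 being absorbed into class 1) can be read directly off the confusion matrices. As a small, hypothetical helper (not part of the original Matlab experiments), the sketch below lists the largest off-diagonal entries of a confusion matrix; the example counts in the comment are made up and only show the output format.

```python
# Hypothetical helper: list the most frequent confusions in a confusion matrix.
import numpy as np

def top_confusions(cm: np.ndarray, labels, k=3):
    """cm[i, j] = number of examples of true class labels[i] predicted as labels[j]."""
    errors = cm.astype(float).copy()
    np.fill_diagonal(errors, 0)                      # ignore correct predictions
    flat = np.argsort(errors, axis=None)[::-1][:k]   # indices of the k largest errors
    rows, cols = np.unravel_index(flat, errors.shape)
    return [(labels[i], labels[j], int(errors[i, j]))
            for i, j in zip(rows, cols) if errors[i, j] > 0]

# Example with made-up counts for classes 1 and 7 only:
# cm = np.array([[26, 1], [3, 5]]); top_confusions(cm, labels=[1, 7])
# -> [(7, 1, 3), (1, 7, 1)]
```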
Conclusions and Future Work

Several machine learning algorithms for classifying the 8-class tomato data are investigated here. In addition, 3 attribute selection strategies are combined with these learning algorithms to reduce the data set dimensionality. The best combination of attribute selection and classification method among the ones investigated here leads to a 92.7% classification accuracy (for the NN classifier on the 8 CART attributes).

The confusion matrix analysis points out that class 7 (Rectangular) is the one most frequently misclassified (or very poorly learned) across all three classifiers. It is most often misclassified as class 1. This is consistent with the observations that (1) based on Figure 1, these two classes are very similar, and (2) class 1 is larger in terms of available examples (110 versus only 34 in class 7, see Table 1), so we can conclude that the classifiers are biased toward the larger class. This situation is known in the literature as learning with imbalanced data (Visa and Ralescu 2004). As a future direction, we point out that, for imbalanced data sets, classifiers less sensitive to the imbalance can be used, such as the one proposed in (Visa and Ralescu 2004). Also, the imbalance can be corrected by intentional upsampling (if possible) of the underrepresented class.

A similar study that considers additional classification techniques applied to a larger overall data set (the 416 examples in the current data set poorly cover the 34-dimensional space) in which the classes are less imbalanced would provide more insight as to which attributes should be selected for better classification accuracy. Also, a more thorough analysis of the confusion matrices will identify complementary classification techniques which can subsequently be combined to obtain a higher classification accuracy for the data set of interest.

Acknowledgments

This research was partially supported by the NSF grant DBI-0922661(60020128) (E. Van Der Knaap, PI, and S. Visa, Co-PI) and by the College of Wooster Faculty Start-up Fund awarded to Sofia Visa in 2008.

References

Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C., eds. 1984. Classification and Regression Trees. Boca Raton, FL: CRC Press.

Gonzalo, M.; Brewer, M.; Anderson, C.; Sullivan, D.; Gray, S.; and van der Knaap, E. 2009. Tomato Fruit Shape Analysis Using Morphometric and Morphology Attributes Implemented in Tomato Analyzer Software Program. Journal of American Society of Horticulture 134:77-87.

Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157-1182.

Hogg, R., and Ledolter, J., eds. 1987. Engineering Statistics. New York: MacMillan.

Kira, K., and Rendell, L. 1992a. The Feature Selection Problem: Traditional Methods and a New Algorithm. In Proceedings of AAAI, 129-134.

Kira, K., and Rendell, L. 1992b. A practical approach to feature selection. In International Conference on Machine Learning, 368-377.

Michie, D.; Spiegelhalter, D.; and Taylor, C., eds. 1999. Machine Learning, Neural and Statistical Classification. http://www.amsta.leeds.ac.uk/~charles/statlog/.

Rodriguez, S.; Moyseenko, J.; Robbins, M.; Huarachi Morejón, N.; Francis, D.; and van der Knaap, E. 2010. Tomato Analyzer: A Useful Software Application to Collect Accurate and Detailed Morphological and Colorimetric Data from Two-dimensional Objects. Journal of Visualized Experiments 37.

Sun, I., and Wu, D. 2009. Feature extraction through local learning. Statistical Analysis and Data Mining, 34-47.

Visa, S., and Ralescu, A. 2004. Fuzzy Classifiers for Imbalanced, Complex Classes of Varying Size. In Proceedings of the Information Processing and Management of Uncertainty in Knowledge-Based Systems Conference, Perugia, Italy, 393-400.

Visa, S.; Ramsay, B.; Ralescu, A.; and Van der Knaap, E. 2011. Confusion Matrix-based Feature Selection. In Proceedings of the 23rd Midwest Artificial Intelligence and Cognitive Science Conference, Cincinnati.