=Paper=
{{Paper
|id=None
|storemode=property
|title= Learning Morphological Data of Tomato Fruits
|pdfUrl=https://ceur-ws.org/Vol-710/paper43.pdf
|volume=Vol-710
|dblpUrl=https://dblp.org/rec/conf/maics/ThomasLSJHHAV11
}}
== Learning Morphological Data of Tomato Fruits==
Joshua Thomas, Matthew Lambert, Bennjamin Snyder,
Michael Janning, Jacob Haning, Yanglong Hu, Mohammad Ahmad, Sofia Visa
Computer Science Department
College of Wooster
jet4416@gmail.com, mlambert13@wooster.edu, benn.snyder@gmail.com,
mjanning13@wooster.edu, jacob.haning1@gmail.com,
yhu12@wooster.edu, mahmad12@wooster.edu, svisa@wooster.edu
Abstract

Three methods for attribute reduction in conjunction with Neural Networks, Naive Bayes, and k-Nearest Neighbor classifiers are investigated here when classifying a particularly challenging data set. The difficulty encountered with this data set is mainly due to the high dimensionality and to some imbalance between classes. As a result of this research, a subset of only 8 attributes (out of 34) is identified, leading to a 92.7% classification accuracy. The confusion matrix analysis identifies class 7 as the one poorly learned across all combinations of attributes and classifiers. This information can be further used to upsample this underrepresented class or to investigate a classifier less sensitive to imbalance.

Keywords: classification, attribute selection, confusion matrix
Introduction

Knowing (or choosing) the best machine learning algorithm for classifying a particular real-world data set is still an ongoing research topic. Researchers have tackled this problem more as experimental studies, such as the ones shown in (Michie, Spiegelhalter, and Taylor 1999) and (Visa and Ralescu 2004), than as theoretical ones. Currently, it is difficult to study the problem of the best classification method given a particular data set (or the reverse problem, for that matter), because data classification depends on many variables, e.g. number of attributes, number of examples and their distribution across classes, underlying distribution along each attribute, etc. Additionally, it is difficult to study classifier induction in general, because different classifiers learn in different ways or, stated differently, different classifiers may have different learning biases. Thus, this research focuses on finding the best classification method for a particular data set of interest through experimental means.

We investigate several machine learning techniques for learning a particular 8-class domain having 34 attributes and only 416 examples. We also combine these methods with various subsets of attributes selected based on their discriminating power. The main goal of this research is to find the best classification algorithm (or ensemble of algorithms) for this particular data set.

The research presented here is part of a bigger project, with the classification of morphological tomato data (i.e. data describing the shape and size of tomato fruits, such as the data set used here) being the first step. Namely, having the morphological and gene expression data, the dependencies between these two sets are to be investigated. The goal of such a computational study is to reveal genes that affect particular shapes (e.g. elongated or round tomato) or sizes (e.g. cherry versus beef tomato) in the tomato fruit. However, as mentioned above, a high classification accuracy of tomatoes based on their morphological attributes is required first.

The Tomato Fruit Morphological Data

The experimental data set was obtained from the Ohio Agricultural Research and Development Center (OARDC) research group led by E. Van Der Knaap (Rodriguez et al. 2010). This morphological data of tomato fruits consists of 416 examples having 34 attributes and distributed in 8 classes. The 34 attributes numerically quantify morphological properties of the tomato fruits such as perimeter, width, length, circularity (i.e. how well a transversal cut of a tomato fits a circle), rectangle (similarly, how well it fits a rectangle), angle at the tip of the tomato, etc. A more detailed description of the 34 attributes and how they are calculated can be found in (Gonzalo et al. 2009).

The 8 classes are illustrated in Figure 1 and the distribution of the 416 examples is shown in Table 1 (Visa et al. 2011).

Figure 1: Sketch of the 8 morphological classes of the tomato fruits.

Table 1: Data distribution across classes. In total, there are 416 examples, each having 34 attributes. (Table from (Visa et al. 2011))

Class | Class label | No. of examples
1 | Ellipse | 110
2 | Flat | 115
3 | Heart | 29
4 | Long | 36
5 | Obvoid | 32
6 | Oxheart | 12
7 | Rectangular | 34
8 | Round | 48
Problem Description and Methodology

The focus of this research is to find the best (defined here as high classification accuracy, e.g. 90%) classification technique (or combination of classifiers) for the morphological tomato data set. In addition to tomato fruit classification, it concentrates on finding which attributes have more discriminative power and finding a ranking of these attributes.

As seen in Figure 1, the round class and several others may have smaller or much larger instances of tomato fruits. Thus, attributes 1 (perimeter) and 2 (area), for example, might have no positive influence in classifying these classes; at worst, they may hinder classification. The tomato data set of interest here has 34 attributes and only 416 examples available for learning and testing. One can argue that many more examples are needed to have effective learning in such a high-dimensional space. Furthermore, the class distribution is imbalanced, with the largest and the smallest classes having 115 (class 2, Flat tomatoes) and 12 examples (class 6, Oxheart tomatoes), respectively (see Table 1).

For these reasons, our strategy is to investigate several machine learning classifiers on subsets of top-ranked attributes, in an effort to reduce the data dimensionality and to achieve better data classification. Finding whether different classification algorithms make identical errors (for example, they all misclassify class 7 as class 1) is also of interest in this experimental study. Our hypothesis is that (some) different classifiers misclassify different data examples and thus, by combining different classifiers, one can achieve better accuracy merely through their complementarity. The misclassification error for each individual class is tracked through the use of confusion matrices.
Attribute Selection Techniques
Two filter methods (analysis of variance, ANOVA (Hogg and Ledolter 1987), and the RELIEF method (Kira and Rendell 1992b), (Kira and Rendell 1992a)) and one wrapper method are used in our experiments for attribute ranking. The first two algorithms are independent of the choice of classifier (Guyon and Elisseeff 2003), whereas the third one is "wrapped" around a classifier - here, the attributes selected by the CART decision tree are used (Breiman et al. 1984).

The first attribute ranking method considered here is based on the analysis of variance, which ranks each attribute by comparing the variation of its values between classes with the variation within the classes (Hogg and Ledolter 1987).
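The paper reports only the resulting rankings (Table 2); as an illustration of this ANOVA-style filter, the sketch below (a Python/scipy sketch, not the authors' Matlab code) ranks attributes by their one-way ANOVA F-statistic across the 8 classes, assuming X is the 416 x 34 data matrix and y the vector of class labels.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_rank(X, y):
    """Rank attributes by their one-way ANOVA F-statistic across classes.

    X: (n_examples, n_attributes) array of morphological measurements.
    y: (n_examples,) array of class labels (1..8 for the tomato data).
    Returns attribute indices ordered from most to least discriminative.
    """
    classes = np.unique(y)
    f_stats = []
    for j in range(X.shape[1]):
        # Split attribute j into per-class groups and compare the
        # between-class variation with the within-class variation.
        groups = [X[y == c, j] for c in classes]
        f_stat, _ = f_oneway(*groups)
        f_stats.append(f_stat)
    return np.argsort(f_stats)[::-1]
```

The top entries of such a ranking play the role of column 1 in Table 2.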
The second ranking method we use is the RELIEF algorithm, introduced by (Kira and Rendell 1992b) and described and expanded upon by (Sun and Wu 2009). In short, the algorithm examines all instances of every attribute and calculates the distance to each instance's nearest hit (the nearest instance that has the same classification) and nearest miss (the nearest instance that has a different classification). It then accumulates the differences between the nearest-miss and nearest-hit distances over all instances of each attribute, as shown in equation (1):

d_n = ||x_n - NearestMiss(x_n)|| - ||x_n - NearestHit(x_n)||   (1)

where d_n is the score contribution of instance x_n, NearestMiss(x_n) is the nearest miss of the instance, and NearestHit(x_n) is the nearest hit of the instance. Then, the d-values are summed over all instances and the attributes are ranked from largest value to smallest value. Zero may provide an appropriate cut-off point when selecting attributes.
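As a minimal sketch of this scoring rule (again an illustration in Python, not the authors' implementation), the per-attribute reading of equation (1) described above can be written as:

```python
import numpy as np

def relief_rank(X, y):
    """Score attributes with the per-attribute reading of equation (1).

    For each attribute j and each instance x_n, the nearest hit is the
    closest other instance of the same class along attribute j, and the
    nearest miss is the closest instance of a different class; the score
    accumulates |x_n - miss| - |x_n - hit|. Larger totals indicate more
    discriminative attributes, and zero is a natural cut-off.
    """
    n, m = X.shape
    scores = np.zeros(m)
    for j in range(m):
        col = X[:, j]
        for i in range(n):
            dist = np.abs(col - col[i])
            dist[i] = np.inf                 # exclude the instance itself
            hit = dist[y == y[i]].min()
            if not np.isfinite(hit):         # class with a single example
                hit = 0.0
            miss = dist[y != y[i]].min()
            scores[j] += miss - hit
    return np.argsort(scores)[::-1], scores
```

Note that the classic RELIEF of (Kira and Rendell 1992b) finds the nearest hit and miss in the full attribute space and then updates each attribute's weight; the sketch follows the simpler per-attribute description given in the text above.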
The third ranking is obtained as a result of decision tree classification (CART), which, through a greedy approach, places the most important attributes (based on information gain) closer to the root of the tree.
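For the wrapper ranking, a comparable sketch (an assumption using scikit-learn's CART-style tree, not the paper's exact tool) reads an ordering off the tree's impurity-based feature importances, which reward attributes chosen near the root:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cart_rank(X, y, n_top=8, random_state=0):
    """Rank attributes with a CART-style decision tree.

    The greedy tree places the most informative attributes near the root;
    impurity-based feature importances give a numeric proxy for that
    ordering. criterion="entropy" approximates an information-gain split
    criterion; the paper does not specify the exact CART settings used.
    """
    tree = DecisionTreeClassifier(criterion="entropy", random_state=random_state)
    tree.fit(X, y)
    return np.argsort(tree.feature_importances_)[::-1][:n_top]
```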
Table 2: Top 10 ANOVA and RELIEF attribute rankings. Column 3 shows the top 8 ranked attributes resulting from classification and regression trees (CART) (Visa et al. 2011).

ANOVA | RELIEF | CART
17 | 21 | 7
20 | 18 | 13
18 | 7 | 12
21 | 8 | 11
2 | 33 | 14
1 | 13 | 10
28 | 11 | 8
26 | 9 | 1
6 | 19 | -
5 | 22 | -
Classification Techniques
We use Matlab to conduct these experiments. For each experiment, 75% of the data is randomly selected for training, and the remaining 25% of the data is used for testing.

Matlab implementations of the Naive Bayes (NB), k-Nearest Neighbors (kNN) for k=4, and various Artificial Neural Network (NN) configurations are tested in conjunction with the three reduced-attribute tomato data sets, as well as with the whole data set (i.e. having all 34 attributes). For the latter case, the classifiers are ordered by their accuracies as follows: NN (89.1%), NB (80%), kNN (79.1%). Here, kNN is investigated for k=4 only because (Visa et al. 2011) shows that it achieves the lowest error over a larger range of k.
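The experiments themselves were run in Matlab; purely as an illustration of the protocol, the following scikit-learn sketch (an assumption, not the authors' code) applies the same 75/25 random split and the three classifiers to one attribute subset.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_subset(X, y, attribute_subset, seed=0):
    """Train NB, kNN (k=4), and a small NN on one attribute subset.

    X, y: the full 416 x 34 data matrix and class labels.
    attribute_subset: indices such as the top 8 CART attributes of Table 2.
    Returns test-set accuracies for a single random 75/25 split.
    Stratifying the split is a choice of this sketch (it keeps all 8 classes
    represented in both splits); the paper only states a random 75/25 split.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, attribute_subset], y, test_size=0.25,
        random_state=seed, stratify=y)

    models = {
        "NB": GaussianNB(),
        "kNN (k=4)": KNeighborsClassifier(n_neighbors=4),
        # One hidden layer of 10 neurons, mirroring the best NN of Table 3.
        "NN": make_pipeline(StandardScaler(),
                            MLPClassifier(hidden_layer_sizes=(10,),
                                          max_iter=2000, random_state=seed)),
    }
    return {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
            for name, clf in models.items()}
```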
Results
The top 10 ANOVA and RELIEF attribute rankings are shown in the first two columns of Table 2. Column 3 shows the top 8 ranked attributes resulting from classification and regression trees.
NN Results
Many NN configurations (in terms of number of layers, number of neurons in each layer, training method, and activation function) were tried for each of the three data sets obtained from selecting the subsets of attributes shown in Table 2. However, only the ones leading to the best results are reported in Table 3. Among the subsets of attributes studied here, the 8 attributes resulting from the decision tree classification lead to the best classification in the NN case (92.7%). The confusion matrix associated with this case is shown in Figure 2. From this matrix, it can be seen that the largest error comes from misclassifying 3 test data points of class 7 (Rectangular) as class 1 (Ellipse). Indeed, Figure 1 shows that these two classes are the most similar in terms of shape.

Figure 2: Confusion matrix of NN for top 8 CART attributes. This case achieved the highest classification accuracy when using NN (92.7%).

Table 3: Best NN configurations and their corresponding classification accuracies.

Attribute subset | No. of layers | No. of neurons | Accuracy
Top 10 ANOVA | 1 | 10 | 84.5%
Top 10 RELIEF | 2 | 25+15 | 88.2%
Top 8 CART | 1 | 10 | 92.7%
All 34 | 2 | 25+15 | 89.1%
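The per-class errors discussed here and in the next subsection are read off confusion matrices such as the ones in Figures 2, 4, and 5; a minimal sketch of how such a matrix is obtained (assuming scikit-learn, not the authors' Matlab code) is:

```python
from sklearn.metrics import confusion_matrix

def class_confusions(model, X_te, y_te, labels=range(1, 9)):
    """Return the 8 x 8 confusion matrix of a fitted classifier.

    Row i / column j counts test examples of class i predicted as class j,
    so off-diagonal cells such as (class 7, class 1) expose the
    Rectangular-versus-Ellipse confusion discussed in the text.
    """
    y_pred = model.predict(X_te)
    return confusion_matrix(y_te, y_pred, labels=list(labels))
```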
NB and kNN Results

Figure 3 shows the accuracy of NB and kNN for the top k (k=1,...,10) ANOVA attributes (top panel), the top k (k=1,...,10) RELIEF attributes, and the top k (k=1,...,8) CART attributes (k is shown on the x-axis). The two largest accuracy values are obtained for kNN (83.6%) for the top 5 ANOVA attributes and for NB (81.1%) in the case of the top 9 RELIEF attributes. For these two cases, the confusion matrices showing the misclassifications across the 8 classes are shown in Figures 4 and 5, respectively. Similar to the NN classifier, NB and kNN both misclassify class 7 as class 1 (by 4 and 3 examples, respectively). However, contrary to NN, NB and kNN carry some additional class confusions:

• NB misclassifies class 3 as class 1 (3 instances) and as class 8 (3 instances);

• additional error for kNN comes from misclassifying class 3 as class 2 (3 examples) and class 7 as class 2 (3 examples).

Figure 3: Accuracy of NB and kNN for the top k (k=1,...,10) ANOVA attributes (top panel), the top k (k=1,...,10) RELIEF attributes, and the top k (k=1,...,8) CART attributes (k is shown on the x-axis).

Figure 4: Confusion matrix of kNN for top 5 ANOVA attributes. This case achieved the highest classification accuracy when using kNN (83.6%).

Figure 5: Confusion matrix of NB for top 9 RELIEF attributes. This case achieved the best classification accuracy when using NB (81.1%).

Conclusions and Future Work

Several machine learning algorithms for classifying the 8-class tomato data are investigated here. In addition, 3 attribute selection strategies are combined with these learning algorithms to reduce the data set dimensionality. The best combination of attribute selection and classification method among the ones investigated here leads to a 92.7% classification accuracy (for the NN classifier on the 8 CART attributes).

The confusion matrix analysis points out that class 7 (Rectangular) is the one most frequently misclassified (or most poorly learned) across all three classifiers. It is most often misclassified as class 1. This is consistent with the observations that (1) based on Figure 1, these two classes are very similar in shape, and (2) since class 1 is larger in terms of available examples (110 versus only 34 in class 7, see Table 1), the classifiers are biased toward the larger class. This situation is known in the literature as learning with imbalanced data (Visa and Ralescu 2004). As a future direction, we point out that, for imbalanced data sets, classifiers less sensitive to the imbalance can be used, such as the one proposed in (Visa and Ralescu 2004). Also, the imbalance can be corrected by intentional upsampling (if possible) of the underrepresented class.

A similar study that considers some additional classification techniques applied to a larger overall data set (the 416 examples in the current data set poorly cover the 34-dimensional space) in which the classes are less imbalanced will provide more insight as to which attributes should be selected for better classification accuracy. Also, a more thorough analysis of the confusion matrices will identify complementary classification techniques which can be subsequently combined to obtain a larger classification accuracy for the data set of interest.
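One possible reading of the upsampling suggestion above (an illustration only, not part of the paper) is to resample the underrepresented class with replacement before training:

```python
import numpy as np
from sklearn.utils import resample

def upsample_class(X_train, y_train, target_class=7, n_target=110, seed=0):
    """Upsample one underrepresented class by sampling with replacement.

    target_class and n_target are illustrative values (e.g. growing class 7
    from 34 training examples toward the size of class 1); resampling should
    be applied to the training split only, never to the test data.
    """
    mask = (y_train == target_class)
    extra = max(n_target - int(mask.sum()), 0)
    X_extra, y_extra = resample(X_train[mask], y_train[mask],
                                replace=True, n_samples=extra,
                                random_state=seed)
    return np.vstack([X_train, X_extra]), np.concatenate([y_train, y_extra])
```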
Acknowledgments

This research was partially supported by the NSF grant DBI-0922661 (60020128) (E. Van Der Knaap, PI, and S. Visa, Co-PI) and by the College of Wooster Faculty Start-up Fund awarded to Sofia Visa in 2008.

References

Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C., eds. 1984. Classification and Regression Trees. CRC Press, Boca Raton, FL.

Gonzalo, M.; Brewer, M.; Anderson, C.; Sullivan, D.; Gray, S.; and van der Knaap, E. 2009. Tomato Fruit Shape Analysis Using Morphometric and Morphology Attributes Implemented in Tomato Analyzer Software Program. Journal of American Society of Horticulture 134:77–87.

Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157–1182.

Hogg, R., and Ledolter, J., eds. 1987. Engineering Statistics. New York: MacMillan.

Kira, K., and Rendell, L. 1992a. The Feature Selection Problem: Traditional Methods and a New Algorithm. In Proceedings of AAAI, 129–134.

Kira, K., and Rendell, L. 1992b. A practical approach to feature selection. In International Conference on Machine Learning, 368–377.

Michie, D.; Spiegelhalter, D.; and Taylor, C., eds. 1999. Machine Learning, Neural and Statistical Classification. http://www.amsta.leeds.ac.uk/~charles/statlog/.

Rodriguez, S.; Moyseenko, J.; Robbins, M.; Huarachi Morejon, N.; Francis, D.; and van der Knaap, E. 2010. Tomato Analyzer: A Useful Software Application to Collect Accurate and Detailed Morphological and Colorimetric Data from Two-dimensional Objects. Journal of Visualized Experiments 37.

Sun, I., and Wu, D. 2009. Feature extraction through local learning. In Statistical Analysis and Data Mining, 34–47.

Visa, S., and Ralescu, A. 2004. Fuzzy Classifiers for Imbalanced, Complex Classes of Varying Size. In Proceedings of the Information Processing and Management of Uncertainty in Knowledge-Based Systems Conference, Perugia, Italy, 393–400.

Visa, S.; Ramsay, B.; Ralescu, A.; and Van der Knaap, E. 2011. Confusion Matrix-based Feature Selection. In Proceedings of the 23rd Midwest Artificial Intelligence and Cognitive Science Conference, Cincinnati.