Learning Morphological Data of Tomato Fruits

Joshua Thomas, Matthew Lambert, Bennjamin Snyder, Michael Janning, Jacob Haning, Yanglong Hu, Mohammad Ahmad, Sofia Visa
Computer Science Department, College of Wooster
jet4416@gmail.com, mlambert13@wooster.edu, benn.snyder@gmail.com, mjanning13@wooster.edu, jacob.haning1@gmail.com, yhu12@wooster.edu, mahmad12@wooster.edu, svisa@wooster.edu

Abstract

Three methods for attribute reduction in conjunction with Neural Networks, Naive Bayes, and k-Nearest Neighbor classifiers are investigated here when classifying a particularly challenging data set. The difficulty encountered with this data set is mainly due to the high dimensionality and to some imbalance between classes. As a result of this research, a subset of only 8 attributes (out of 34) is identified, leading to a 92.7% classification accuracy. The confusion matrix analysis identifies class 7 as the one poorly learned across all combinations of attributes and classifiers. This information can be further used to upsample this underrepresented class or to investigate a classifier less sensitive to imbalance.

Keywords: classification, attribute selection, confusion matrix

Introduction

Knowing (or choosing) the best machine learning algorithm for classifying a particular real-world data set is still an ongoing research topic. Researchers have tackled this problem more as experimental studies, such as the ones shown in (Michie, Spiegelhalter, and Taylor 1999) and (Visa and Ralescu 2004), than as theoretical ones. Currently, it is difficult to study the problem of the best classification method given a particular data set (or the reverse problem, for that matter), because data classification depends on many variables, e.g. number of attributes, number of examples and their distribution across classes, underlying distribution along each attribute, etc. Additionally, it is difficult to study classifier induction in general, because different classifiers learn in different ways, or, stated differently, different classifiers may have different learning biases. Thus, this research focuses on finding the best classification method for a particular data set of interest through experimental means.

We investigate several machine learning techniques for learning a particular 8-class domain having 34 attributes and only 416 examples. We also combine these methods with various subsets of attributes selected based on their discriminating power. The main goal of this research is to find the best classification algorithm (or ensemble of algorithms) for this particular data set.

The research presented here is part of a bigger project, with the classification of morphological tomato data (i.e. data describing the shape and size of tomato fruits, such as the data set used here) being the first step. Namely, having the morphological and gene expression data, the dependencies between these two sets are to be investigated. The goal of such a computational study is to reveal genes that affect particular shapes (e.g. elongated or round tomato) or sizes (e.g. cherry versus beef tomato) in the tomato fruit. However, as mentioned above, a high classification accuracy of tomatoes based on their morphological attributes is required first.

The Tomato Fruit Morphological Data

The experimental data set was obtained from the Ohio Agricultural Research and Development Center (OARDC) research group led by E. Van Der Knaap (Rodriguez et al. 2010). This morphological data of tomato fruits consists of 416 examples having 34 attributes and distributed in 8 classes. The 8 classes are illustrated in Figure 1 and the distribution of the 416 examples is shown in Table 1 (Visa et al. 2011).

Table 1: Data distribution across classes. In total, there are 416 examples, each having 34 attributes. (Table from (Visa et al. 2011))

Class   Class label    No. of examples
1       Ellipse        110
2       Flat           115
3       Heart          29
4       Long           36
5       Obvoid         32
6       Oxheart        12
7       Rectangular    34
8       Round          48

Figure 1: Sketch of the 8 morphological classes of the tomato fruits.

The 34 attributes numerically quantify morphological properties of the tomato fruits such as perimeter, width, length, circularity (i.e. how well a transversal cut of a tomato fits a circle), rectangle (similarly, how well it fits a rectangle), angle at the tip of the tomato, etc. A more detailed description of the 34 attributes and how they are calculated can be found in (Gonzalo et al. 2009).
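The exact attribute definitions used by the Tomato Analyzer software are those given in (Gonzalo et al. 2009); purely as an illustration, two of the shape attributes mentioned above can be approximated from a fruit cross-section contour with common geometric proxies (isoperimetric quotient for circularity, bounding-box fill ratio for rectangularity). The sketch below is a generic Python approximation under those assumptions, not the Tomato Analyzer formulas.

```python
# Illustrative proxies only; the Tomato Analyzer definitions are in (Gonzalo et al. 2009).
import numpy as np

def shape_descriptors(contour: np.ndarray) -> dict:
    """contour: (n, 2) array of (x, y) boundary points of a fruit cross-section."""
    # Perimeter: sum of distances between consecutive boundary points (closed curve).
    diffs = np.diff(np.vstack([contour, contour[:1]]), axis=0)
    perimeter = np.sum(np.linalg.norm(diffs, axis=1))

    # Area via the shoelace formula.
    x, y = contour[:, 0], contour[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    # Circularity proxy: 4*pi*A / P^2 (equals 1.0 for a perfect circle).
    circularity = 4.0 * np.pi * area / perimeter ** 2

    # Rectangularity proxy: fraction of the axis-aligned bounding box that is filled.
    width, length = x.max() - x.min(), y.max() - y.min()
    rectangularity = area / (width * length)

    return {"perimeter": perimeter, "area": area, "width": width,
            "length": length, "circularity": circularity,
            "rectangularity": rectangularity}
```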
Problem Description and Methodology

The focus of this research is to find the best (defined here as high classification accuracy, e.g. 90%) classification technique (or combination of classifiers) for the morphological tomato data set. In addition to tomato fruit classification, it concentrates on finding which attributes have more discriminative power and on finding a ranking of these attributes.

As seen in Figure 1, the round class and several others may have smaller or much larger instances of tomato fruits. Thus, attributes 1 (perimeter) and 2 (area), for example, might have no positive influence in classifying these classes; at worst, they may hinder classification. The tomato data set of interest here has 34 attributes and only 416 examples available for learning and testing. One can argue that many more examples are needed for effective learning in such a high-dimensional space. Furthermore, the class distribution is imbalanced, with the largest and the smallest classes having 115 examples (class 2, Flat tomatoes) and 12 examples (class 6, Oxheart tomatoes), respectively (see Table 1).

For these reasons, our strategy is to investigate several machine learning classifiers on subsets of top-ranked attributes in an effort to reduce the data dimensionality and to achieve better data classification. Finding whether different classification algorithms make identical errors (for example, they all misclassify class 7 as class 1) is also of interest in this experimental study. Our hypothesis is that (some) different classifiers misclassify different data examples and thus, by combining different classifiers, one can achieve better accuracy merely through their complementarity. The misclassification error for each individual class is tracked through the use of confusion matrices.

Attribute Selection Techniques

Two filter methods (analysis of variance, ANOVA (Hogg and Ledolter 1987), and the RELIEF method (Kira and Rendell 1992b), (Kira and Rendell 1992a)) and one wrapper method are used in our experiments for attribute ranking. The first two algorithms are independent of the choice of classifier (Guyon and Elisseeff 2003), whereas the third one is "wrapped" around a classifier - here the attributes selected by the CART decision tree are used (Breiman et al. 1984).

The first attribute ranking method considered here is based on the analysis of variance, which scores each attribute by testing whether its class means differ, i.e. by comparing the variation between classes with the variation within the classes (Hogg and Ledolter 1987).

The second ranking method we use is the RELIEF algorithm, introduced by (Kira and Rendell 1992b) and described and expanded upon by (Sun and Wu 2009). In short, the algorithm examines all instances of every attribute and calculates the distance to each instance's nearest hit (the nearest instance that has the same classification) and nearest miss (the nearest instance that has a different classification). It then calculates the difference between the nearest-miss and nearest-hit distances over all instances of each attribute, as shown in equation (1):

d_n = \|x_n - \mathrm{NearestMiss}(x_n)\| - \|x_n - \mathrm{NearestHit}(x_n)\|    (1)

where x_n is an instance, d_n is its resulting score, NearestMiss(x_n) is the nearest miss of the instance, and NearestHit(x_n) is its nearest hit. Then, the d-values are summed over all instances and the attributes are ranked from the largest value to the smallest. Zero may provide an appropriate cut-off point when selecting attributes.
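The experiments in this paper were run in Matlab; as an illustration only, the two filter rankings can be sketched in Python as follows. The sketch assumes a feature matrix X of shape (416, 34) and a label vector y, that nearest hits/misses are located in the full attribute space, and that the score of equation (1) is accumulated per attribute from absolute value differences; the ANOVA ranking is approximated with a per-attribute one-way F statistic (scipy). None of these implementation choices are claimed to be the authors'.

```python
# Minimal RELIEF-style and ANOVA attribute-ranking sketch (not the authors' Matlab code).
import numpy as np
from scipy.stats import f_oneway

def relief_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    n, p = X.shape
    scores = np.zeros(p)
    # Pairwise Euclidean distances in the full attribute space.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor
    for i in range(n):
        same = (y == y[i]) & (np.arange(n) != i)
        diff = y != y[i]
        hit = np.argmin(np.where(same, dist[i], np.inf))   # nearest hit
        miss = np.argmin(np.where(diff, dist[i], np.inf))  # nearest miss
        # Per-attribute contribution of equation (1) for instance i.
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores  # attributes ranked by descending score; zero as possible cut-off

def anova_f_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # One-way ANOVA F statistic per attribute (between-class vs. within-class variation).
    groups = [X[y == c] for c in np.unique(y)]
    return np.array([f_oneway(*[g[:, j] for g in groups]).statistic
                     for j in range(X.shape[1])])

# Example ranking: attribute indices sorted from most to least relevant.
# relief_ranking = np.argsort(relief_scores(X, y))[::-1]
```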
The third ranking is obtained as a result of decision tree classification (CART), which, through a greedy approach, places the most important attributes (based on information gain) closer to the root of the tree.

Table 2: Top 10 ANOVA and RELIEF attribute rankings. Column 3 shows the top 8 ranked attributes resulting from classification and regression trees (CART) (Visa et al. 2011).

ANOVA   RELIEF   CART
17      21       7
20      18       13
18      7        12
21      8        11
2       33       14
1       13       10
28      11       8
26      9        1
6       19       -
5       22       -

Classification Techniques

We use Matlab to conduct these experiments. For each experiment, 75% of the data is randomly selected for training and the remaining 25% is used for testing. Matlab implementations of Naive Bayes (NB), k-Nearest Neighbors (kNN) for k=4, and various Artificial Neural Network (NN) configurations are tested in conjunction with the three reduced-attribute tomato data sets, as well as with the whole data set (i.e. having all 34 attributes). For the latter case, the classifiers are ordered by their accuracies as follows: NN (89.1%), NB (80%), kNN (79.1%). Here, kNN is investigated for k=4 only because (Visa et al. 2011) shows that it achieves the lowest error over a larger range of k.
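For readers who want to reproduce the protocol outside Matlab, a minimal scikit-learn sketch of the same setup (random 75/25 split; NB, kNN with k=4, and a small feed-forward NN) is given below. The data loading, the attribute subset, and the MLP settings are assumptions for illustration; the Matlab toolboxes used for the reported results may behave differently.

```python
# Illustrative scikit-learn sketch of the evaluation protocol (the reported
# experiments were run in Matlab; exact settings may differ).
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# X: (416, 34) attribute matrix, y: class labels 1..8 (loading not shown).
# Top 8 CART attributes from Table 2 (1-indexed attribute ids):
cart_top8 = [7, 13, 12, 11, 14, 10, 8, 1]

def evaluate(clf, X, y, attribute_subset=None, test_size=0.25, seed=0):
    if attribute_subset is not None:
        X = X[:, [a - 1 for a in attribute_subset]]  # convert 1-indexed ids to columns
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=seed)  # random 75/25 split
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    return accuracy_score(y_te, y_hat), confusion_matrix(y_te, y_hat)

classifiers = {
    "NB": GaussianNB(),
    "kNN (k=4)": KNeighborsClassifier(n_neighbors=4),
    # One hidden layer of 10 neurons, mirroring the best CART configuration in Table 3.
    "NN": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
}
```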
Results

The top 10 ANOVA and RELIEF attribute rankings are shown in the first two columns of Table 2. Column 3 shows the top 8 ranked attributes resulting from classification and regression trees.

NN Results

Many NN configurations (in terms of number of layers, number of neurons in each layer, training method, and activation function) were tried for each of the three data sets obtained from selecting the subsets of attributes shown in Table 2. However, only the ones leading to the best results are reported in Table 3. Among the subsets of attributes studied here, the 8 attributes resulting from the decision tree classification lead to the best classification in the NN case (92.7%). The confusion matrix associated with this case is shown in Figure 2. From this matrix, it can be seen that the largest error comes from misclassifying 3 test data points of class 7 (Rectangular) as class 1 (Ellipse). Indeed, Figure 1 shows that these two classes are the most similar in terms of shape.

Figure 2: Confusion matrix of NN for top 8 CART attributes. This case achieved the highest classification accuracy when using NN (92.7%).

Table 3: Best NN configurations and their corresponding classification accuracies.

Attribute subset   No. of layers   No. of neurons   Accuracy
Top 10 ANOVA       1               10               84.5%
Top 10 RELIEF      2               25+15            88.2%
Top 8 CART         1               10               92.7%
All 34             2               25+15            89.1%

NB and kNN Results

Figure 3 shows the accuracy of NB and kNN for the top k (k=1,...,10) ANOVA attributes, the top k (k=1,...,10) RELIEF attributes, and the top k (k=1,...,8) CART attributes. The two largest accuracy values are obtained for kNN (83.6%) with the top 5 ANOVA attributes, and for NB (81.1%) with the top 9 RELIEF attributes. For these two cases, the confusion matrices showing the misclassifications across the 8 classes are shown in Figures 4 and 5, respectively. Similar to the NN classifier, NB and kNN both misclassify class 7 as class 1 (by 4 and 3 examples, respectively). However, contrary to NN, NB and kNN carry some additional class confusions:

• NB misclassifies class 3 as class 1 (3 instances) and as class 8 (3 instances);
• Additional error for kNN comes from misclassifying class 3 as class 2 (3 examples) and class 7 as class 2 (3 examples).

Figure 3: Accuracy of NB and kNN for top k (k=1,10) ANOVA attributes (top figure), top k (k=1,10) RELIEF attributes, and top k (k=1,8) CART attributes (k is shown on the x-axis).

Figure 4: Confusion matrix of kNN for top 5 ANOVA attributes. This case achieved the highest classification accuracy when using kNN (83.6%).

Figure 5: Confusion matrix of NB for top 9 RELIEF attributes. This case achieved the best classification accuracy when using NB (81.1%).
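The per-class confusions discussed above (e.g. class 7 being absorbed into class 1) can be read directly off the confusion matrices. As a small, hypothetical helper (not part of the original Matlab experiments), the sketch below lists the largest off-diagonal entries of a confusion matrix; the example counts in the comment are made up and only show the output format.

```python
# Hypothetical helper: list the most frequent confusions in a confusion matrix.
import numpy as np

def top_confusions(cm: np.ndarray, labels, k=3):
    """cm[i, j] = number of examples of true class labels[i] predicted as labels[j]."""
    errors = cm.astype(float).copy()
    np.fill_diagonal(errors, 0)                      # ignore correct predictions
    flat = np.argsort(errors, axis=None)[::-1][:k]   # indices of the k largest errors
    rows, cols = np.unravel_index(flat, errors.shape)
    return [(labels[i], labels[j], int(errors[i, j]))
            for i, j in zip(rows, cols) if errors[i, j] > 0]

# Example with made-up counts for classes 1 and 7 only:
# cm = np.array([[26, 1], [3, 5]]); top_confusions(cm, labels=[1, 7])
# -> [(7, 1, 3), (1, 7, 1)]
```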
Conclusions and Future Work

Several machine learning algorithms for classifying the 8-class tomato data are investigated here. In addition, 3 attribute selection strategies are combined with these learning algorithms to reduce the data set dimensionality. The best combination of attribute selection and classification method among the ones investigated here leads to a 92.7% classification accuracy (for the NN classifier on the 8 CART attributes).

The confusion matrix analysis points out that class 7 (Rectangular) is the one most frequently misclassified (or very poorly learned) across all three classifiers. It is most often misclassified as class 1. This is consistent with the observations that (1) based on Figure 1, these two classes are very similar, and (2) class 1 is larger in terms of available examples (110 versus only 34 in class 7, see Table 1), so we can conclude that the classifiers are biased toward the larger class. This situation is known in the literature as learning with imbalanced data (Visa and Ralescu 2004). As a future direction, we point out that, for imbalanced data sets, classifiers less sensitive to the imbalance can be used, such as the one proposed in (Visa and Ralescu 2004). Also, the imbalance can be corrected by intentional upsampling (if possible) of the underrepresented class.

A similar study that considers additional classification techniques applied to a larger overall data set (the 416 examples in the current data set poorly cover the 34-dimensional space) in which the classes are less imbalanced would provide more insight as to which attributes should be selected for better classification accuracy. Also, a more thorough analysis of the confusion matrices will identify complementary classification techniques which can subsequently be combined to obtain a higher classification accuracy for the data set of interest.

Acknowledgments

This research was partially supported by the NSF grant DBI-0922661(60020128) (E. Van Der Knaap, PI, and S. Visa, Co-PI) and by the College of Wooster Faculty Start-up Fund awarded to Sofia Visa in 2008.

References

Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C., eds. 1984. Classification and Regression Trees. Boca Raton, FL: CRC Press.

Gonzalo, M.; Brewer, M.; Anderson, C.; Sullivan, D.; Gray, S.; and van der Knaap, E. 2009. Tomato Fruit Shape Analysis Using Morphometric and Morphology Attributes Implemented in Tomato Analyzer Software Program. Journal of American Society of Horticulture 134:77-87.

Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157-1182.

Hogg, R., and Ledolter, J., eds. 1987. Engineering Statistics. New York: MacMillan.

Kira, K., and Rendell, L. 1992a. The Feature Selection Problem: Traditional Methods and a New Algorithm. In Proceedings of AAAI, 129-134.

Kira, K., and Rendell, L. 1992b. A practical approach to feature selection. In International Conference on Machine Learning, 368-377.

Michie, D.; Spiegelhalter, D.; and Taylor, C., eds. 1999. Machine Learning, Neural and Statistical Classification. http://www.amsta.leeds.ac.uk/~charles/statlog/.

Rodriguez, S.; Moyseenko, J.; Robbins, M.; Huarachi Morejón, N.; Francis, D.; and van der Knaap, E. 2010. Tomato Analyzer: A Useful Software Application to Collect Accurate and Detailed Morphological and Colorimetric Data from Two-dimensional Objects. Journal of Visualized Experiments 37.

Sun, I., and Wu, D. 2009. Feature extraction through local learning. Statistical Analysis and Data Mining, 34-47.

Visa, S., and Ralescu, A. 2004. Fuzzy Classifiers for Imbalanced, Complex Classes of Varying Size. In Proceedings of the Information Processing and Management of Uncertainty in Knowledge-Based Systems Conference, Perugia, Italy, 393-400.

Visa, S.; Ramsay, B.; Ralescu, A.; and Van der Knaap, E. 2011. Confusion Matrix-based Feature Selection. In Proceedings of the 23rd Midwest Artificial Intelligence and Cognitive Science Conference, Cincinnati.