Feature Engineering for Tree Leaf Classifier

Daria Korotaeva and Maksim Khlopotov

ITMO University, Saint Petersburg, Russia
daria.korotayeva@yandex.ru

Abstract. The article presents an approach to classifying trees based on the image of a leaf. In this work, 1464 images of 18 tree species typical of Russian flora were used, and the classification model was built with the k-nearest neighbors method. The feature retrieval process includes image verification, binarization and removal of the leaf petiole. The 22 selected features are based on the analysis of image moments and of the distances from the centroid to the boundary coordinates. An accuracy of more than 95% is obtained on the testing set.

Keywords: tree recognition, leaf image classification, feature engineering, machine learning.

1 Introduction

Trees are an important part of terrestrial ecosystems and have a great impact on the environment. Some species of deciduous trees can be a source of allergens for people prone to pollinosis, and the ability to identify such trees is necessary for them. Trees are also of interest to collectors of herbariums and to parents with children. Despite the importance of being able to determine the tree type, most people lack the necessary skills. The proposed method may be used for creating a tree recognition application.

Despite the large variety of plant organs that can be used for determining the type of a plant, a leaf is the most convenient object in terms of image analysis. Unlike fruits and inflorescences, which are present on the tree only for a short period, leaves can be collected throughout the year and can also be used in a dried form.

In this article, feature engineering for the k-NN classifier is presented, as well as the method of image preprocessing. Section 2 deals with the existing research in this area. Section 3 is focused on data collection and the method of image verification.
The processing of the image, the removal of the petiole and the computation of the one-dimensional array of distances are described in Section 4. The final set of features is presented in Section 5, Section 6 deals with the achieved results, and the conclusion and further work are described in Section 7.

2 Related Work

The existing solutions offer different approaches to leaf image analysis. Most authors prefer to analyze leaf shape for extracting basic features. For example, the Flavia Plant Leaf Recognition System [1] is based on a probabilistic neural network that classifies leaves using 12 features describing the shape of a leaf. The authors achieved a classification accuracy of more than 90% for 32 plant species growing in China. J.-X. Du et al. [2] also extracted features from the leaf shape, but proposed a new classification method, referred to as a hypersphere classifier. The authors compare this method to the k-NN method for classifying 20 plant species. The study by M. Kumar et al. [3] surveys various classification techniques for the plant classification task. The authors concluded that the simplest method is the k-NN classifier, whose main disadvantage is sensitivity to noise.

Various ways of solving the classification problem on one dataset are presented in the study by H. Goëau et al. [4]. The 55 plant species in the dataset are represented by both scans and photos with a natural background. Participants from different countries presented their solutions; the most popular approach was the analysis of shape boundaries. At the same time, the authors note the prospects of using metadata, in particular geo-tags. The best result was achieved by the INRIA team [5], who used a contour-based shape descriptor called the Directional Fragment Histogram. The essence of the method is to represent the leaf shape as groups of elementary components having the same direction.
The Swedish tree leaf dataset [6][7] was used to implement and test the descriptor. That dataset was also partially used in the present study. B. Wang et al. [8] also investigate the leaf shape, focusing on the convexity and concavity properties of the leaf arches as the major features. The achieved accuracy is estimated at more than 96% for the Swedish tree leaf dataset. A similar approach was used by the authors of the LeafSnap application [9] to identify leaves of the 184 tree species of the Northeastern United States. The dataset [10] collected specifically for this task is considered the largest leaf dataset to date. A curvature-based shape descriptor is used for extracting features, and high results are obtained within the top 5 matches shown to the user. For the correct work of the application, the authors implemented a verification of the image uploaded by the user. The method of removing the petiole described in that study was applied in this work for the correct extraction of features.

A study by P. Novotný and T. Suk [11] suggests applying a Fourier descriptor to the leaf shape. An accuracy of more than 88% was obtained on a dataset called Middle European Woody Plants, containing 153 species [12]. The analysis of the distances from the centroid to the boundaries described in this paper was also presented in a study by J. Chaki and R. Parekh [13]. There, the feature vector was obtained by describing 36 radii, and the evaluation method differs from the one presented in this article. Only 3 plant species are classified in that paper, which does not allow a full evaluation of the method's efficiency.

Although most authors do not consider leaf color analysis a reliable method of classification, T. Munisami et al. [14] included the color histogram as one of the features, resulting in a classification accuracy of more than 87% for 32 species of plants.
An accuracy of more than 97% was obtained on the same dataset by implementing the leaf venation analysis described by K.-B. Lee and K.-S. Hong [15], although the study [9] noted that most mobile phone cameras are unable to detect leaf venation. Leaf texture analysis is presented in a recent study by Vijayashree T. and A. Gopal [16], with an obtained accuracy of 89% for 50 images.

3 Collecting Data

The dataset¹ used in this research was formed on the basis of the following publicly available datasets: LeafSnap [10], MEW 2012 [12] and the Swedish tree dataset [7]. The species were chosen with regard to their specificity for the territory of Russia. Although species with different leaf configurations are present in the list, the leaf image dataset consists only of simple leaves and terminal leaflets of compound leaves, as suggested in the MEW 2012 dataset. The species considered in this article are listed in Table 1.

Table 1: Tree species represented in the dataset (# stands for the number of images in the dataset; example images are omitted here)

Latin name               #   | Latin name          #
Acer platanoides         119 | Ilex aquifolium     73
Aesculus hippocastanum   63  | Populus nigra       75
Alnus incana             62  | Populus tremula     64
Betula pendula           62  | Prunus padus        68
Betula pubescens         127 | Quercus robur       101
Corylus avellana         55  | Salix alba          75
Crataegus monogyna       66  | Syringa vulgaris    62
Fraxinus excelsior       60  | Tilia cordata       134
Ginkgo biloba            79  | Ulmus laevis        59

¹ https://goo.gl/3zBxzR

To ensure the correct work of the assumed application and to avoid errors during the processing of images from the dataset, a binary classifier was created to check whether the image meets the following criteria:

– one leaf must be present on the image;
– at least 10% of the image area must be occupied by a leaf;
– a leaf should not touch the image borders (except for thin parts, such as the petiole);
– the image must be taken on a light and neutral background.
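As a rough illustration, the area and border criteria above might be checked on an already binarized mask as follows. This is a sketch, not the authors' implementation: the function name, the choice of a 3x3 erosion to define "thick" parts, and the handling of multiple objects by keeping the largest one are our assumptions.

```python
import numpy as np
from scipy import ndimage

def verify_leaf_mask(binary):
    """Check a binarized image against the acceptance criteria above.

    Returns the mask of the largest connected object if the image is
    usable, or None otherwise. The 10% area threshold follows the text;
    treating anything that survives a 3x3 erosion as a "thick" part is
    our simplification of the "except for thin parts" rule.
    """
    labels, n = ndimage.label(binary)
    if n == 0:
        return None
    # If several objects are present, keep the largest (the assumed leaf).
    sizes = ndimage.sum(binary, labels, index=np.arange(1, n + 1))
    leaf = labels == (int(np.argmax(sizes)) + 1)
    # Criterion: at least 10% of the image area must be occupied by the leaf.
    if leaf.sum() < 0.10 * leaf.size:
        return None
    # Criterion: only thin parts (e.g. the petiole) may touch the borders.
    thick = ndimage.binary_erosion(leaf, structure=np.ones((3, 3), bool),
                                   border_value=1)
    border = np.zeros_like(leaf)
    border[[0, -1], :] = True
    border[:, [0, -1]] = True
    if (thick & border).any():
        return None
    return leaf
```

A mask that is too small, or one whose body (not merely a thin protrusion) reaches the image edge, is rejected; otherwise the largest object is passed on for processing.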
An image that doesn't meet these criteria will not proceed to further evaluation, unless it contains more than one object: in that case, the algorithm is applied to the largest object on the image.

4 Image Processing

To retrieve the characteristics, an image that was successfully verified is passed through several processing steps. First, threshold binarization using the Otsu method is applied to the image [17]. This allows us to highlight the objects on the image. The next step is to eliminate noise, since small objects are present in most in-situ photos of leaves, including the images used as part of the dataset in this research.

In most cases, photos of leaves contain the petiole, since its removal requires additional manipulation. However, when analyzing the contour of the leaf, the petiole can seriously affect the extracted characteristics, for example, the eccentricity or the convex hull. The solution to this problem is to remove the petiole during image processing. For this, the top-hat operation is applied to the binary image. The top-hat transformation of a binary leaf image L is L minus its opening:

    That(L) = L − (L ◦ SE),    (1)

where L ◦ SE is the opening of L by the structuring element SE, defined as the erosion of L by SE, followed by a dilation of the result by SE [18]. This method was proposed in [9] and is effective for most images of leaves. The shape thus obtained corresponds to the shape of the leaf blade without the petiole and can be used for further analysis. In case the petiole is absent from the image, the longest object remaining after the top-hat operation will be removed from the image.

To evaluate the primary characteristic of a leaf, we search for the centroid of the obtained shape, and then for the coordinates of the points on the contour boundaries. The distances from the centroid to each of these points form a one-dimensional array, which we investigate for further feature extraction.
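The binarization and petiole-removal steps just described could be sketched as follows, using SciPy's binary morphology in place of the paper's unspecified implementation; the structuring-element size is an assumed parameter that would need tuning to the image resolution.

```python
import numpy as np
from scipy import ndimage

def otsu_threshold(gray):
    # Otsu's method [17]: pick the threshold that maximizes the
    # between-class variance of the gray-level histogram.
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    omega = np.cumsum(p)                 # cumulative class probability
    mu = np.cumsum(p * np.arange(256))   # cumulative class mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(sigma_b))

def top_hat_petiole(binary, se_size=9):
    # Eq. (1): That(L) = L - (L o SE). The opening erases structures
    # thinner than the structuring element, so the difference isolates
    # the petiole while the opening keeps the leaf blade.
    se = np.ones((se_size, se_size), bool)
    blade = ndimage.binary_opening(binary, structure=se)
    petiole = binary & ~blade
    return blade, petiole
```

On a typical leaf photo, `gray > otsu_threshold(gray)` (or its inverse, for a dark leaf on a light background) yields the binary mask, and `top_hat_petiole` then splits it into the blade used for feature extraction and the thin residue containing the petiole.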
However, since the leaf position and orientation on an image may vary, in order to obtain informative data it is necessary to start calculating the distances at the same point for all leaves. The most convenient point in this case is the base of the petiole, which we removed, so the area occupied by the petiole is examined for proximity to the main object. The closest point is considered the base of the petiole, and the distance to it is set as the first element during the formation of the one-dimensional array of distances (AOD). If more than one point of the petiole borders the main figure (petioles on most leaf images have a thickness of several pixels), the first one found is considered the petiole base. The algorithm is thus invariant to leaf rotation. The processed images and the bar graphs of the AOD are shown in Fig. 1 and Fig. 2.

Fig. 1. Processing of a leaf of Crataegus monogyna
Fig. 2. Processing of a leaf of Tilia cordata

5 Extracting Features

All the retrievable features are invariant to rotation. They can be divided into two groups: the features proposed in this paper, extracted from the AOD, and the image moments described by many authors, obtained by analyzing the binary image of a leaf after petiole removal. The following features were used for building the learning model:

1. Eccentricity of the ellipse that has the same second moments as the leaf shape.
2. Extent: ratio of the area of the leaf to the smallest rectangle (bounding box) containing the leaf (as shown in Fig. 3).
3. Solidity: ratio of the area of the leaf to its convex hull (see Fig. 4).
4. Equivalent diameter: the diameter of a circle with the same area as the leaf.
5. Ratio of the leaf area to the area of a circle whose radius is the minimal centroid-boundary distance (shown in Fig. 5).

Fig. 3. Bounding box of a leaf
Fig. 4. Convex hull of a leaf
Fig. 5. Feature #5

6. Expectation: mean value of the AOD.
7. Variance of the AOD.
8. Median of the AOD.
9. Mode of the AOD.
10. Vertical symmetry: ratio of the areas of the leaf halves divided vertically.
11. Horizontal symmetry: ratio of the areas of the leaf halves divided horizontally.
12. Minimal distance: ratio of the minimal value of the AOD to its mean value.
13. Maximal distance: ratio of the maximal value of the AOD to its mean value.
14. Length ratio: ratio of the length of the AOD to its maximum.

For features 15-22, local maxima and minima were analyzed (see Fig. 6).

15. Number of peaks: number of local maxima of the AOD.
16. Peak width: mean peak width of the AOD.
17. Peak prominence: mean peak prominence of the AOD.
18. Minimal peak: the minimal value in the array of local maxima of the AOD.
19. Number of valleys: number of local minima of the AOD.
20. Valley width: mean valley width of the AOD.
21. Valley prominence: mean valley prominence of the AOD.
22. Maximal valley: the maximal value in the array of local minima of the AOD.

Fig. 6. Peaks, peak width and peak prominence (AOD of a Quercus robur leaf)

6 Results

Fig. 7. Confusion matrix for classification results of the testing set

The k-nearest neighbors classifier was implemented, with the dataset divided into training and testing sets in a 3:1 ratio. The value k = 1 was chosen based on the model performance. The number of extracted features was reduced to 22 according to the achieved results, as many features extracted during the study proved to be uninformative. A learning model based only on the features extracted from the AOD showed a classification accuracy of up to 90.5% on the testing set, while using only features 1-5 allowed an accuracy of 80.4%. Using the complete set of features allows achieving 95.5%, depending on the partitioning of the dataset; the confusion matrix for this result is shown in Fig. 7. For the Swedish tree dataset [7], an accuracy of 94% is obtained on the testing set. Other machine learning algorithms, including the random forest algorithm and bagging, achieved the same results.
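The AOD construction and the peak-based features from Section 5 can be sketched as follows. This is our illustration, not the paper's code: starting the array at its minimum rather than at the petiole base is a simplification that likewise preserves rotation invariance, and scipy.signal.find_peaks stands in for whatever peak detector the authors used.

```python
import numpy as np
from scipy import ndimage
from scipy.signal import find_peaks

def distance_array(mask):
    # AOD: distances from the centroid to the boundary pixels, ordered
    # by polar angle. The paper starts the array at the petiole base;
    # rolling it to start at its minimum is our simplification.
    cy, cx = ndimage.center_of_mass(mask)
    boundary = mask & ~ndimage.binary_erosion(mask)
    ys, xs = np.nonzero(boundary)
    order = np.argsort(np.arctan2(ys - cy, xs - cx))
    d = np.hypot(ys - cy, xs - cx)[order]
    return np.roll(d, -int(np.argmin(d)))

def aod_features(d):
    # Features 6-8, 12-13 and the peak-based features 15-18; the valley
    # features 19-22 would repeat the find_peaks call on -d.
    peaks, props = find_peaks(d, prominence=0.0, width=1)
    return {
        "mean": d.mean(), "variance": d.var(), "median": np.median(d),
        "min_ratio": d.min() / d.mean(), "max_ratio": d.max() / d.mean(),
        "n_peaks": len(peaks),
        "peak_width": props["widths"].mean() if len(peaks) else 0.0,
        "peak_prominence": props["prominences"].mean() if len(peaks) else 0.0,
        "min_peak": d[peaks].min() if len(peaks) else 0.0,
    }
```

Feature vectors like these, together with the moment-based features 1-5, could then be fed to a 1-NN classifier with a 3:1 train/test split, e.g. scikit-learn's KNeighborsClassifier(n_neighbors=1), matching the setup reported above.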
7 Conclusion and Future Work

In this article, a method for extracting features for classifying trees based on a leaf image has been described. A dataset was formed based on the images from MEW 2012, LeafSnap and the Swedish leaf dataset. The developed verification algorithm allows excluding errors during image processing. For the 18 classes considered, a classification accuracy of more than 95% was obtained with a 1-NN method based on 22 features. The method was also applied to the Swedish tree dataset and showed an accuracy of 94%. The learning model showed resistance to increasing the number of classes.

The considered method can be used in combination with the previously described methods of leaf image analysis to develop an application covering a larger number of tree species. Additional features may be extracted from the AOD to achieve better results on a larger dataset; for example, the fast Fourier transform may be applied to the AOD. It is also expected that Russian species that are not represented in the existing datasets will be added after a collaboration with botanists.

References

1. Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y., Chang, Y., Xiang, Q.: A Leaf Recognition Algorithm for Plant Classification Using Probabilistic Neural Network. In: IEEE International Symposium on Signal Processing and Information Technology, pp. 11-16 (2007).
2. Du, J.X., Wang, X.F., Zhang, G.J.: Leaf shape based plant species recognition. Applied Mathematics and Computation 185, pp. 883-893 (2007).
3. Kumar, M., Kamble, M., Pawar, S., Patil, P., Bonde, N.: Survey Techniques for Plant Leaf Classification. International Journal of Modern Engineering Research (IJMER), vol. 1, issue 2, pp. 538-544 (2011).
4. Goëau, H., Bonnet, P., Joly, A., Boujemaa, N., Barthelemy, D., Molino, J.F., Birnbaum, P., Mouysset, E., Picard, M.: The CLEF 2011 plant images classification task. In: CLEF (Notebook Papers/Labs/Workshop) (2011).
5. Yahiaoui, I., Hervé, N., Boujemaa, N.: Shape-based image retrieval in botanical collections. In: Advances in Multimedia Information Processing - PCM 2006, vol. 4261, pp. 357-364 (2006).
6. Swedish Leaf Dataset, http://www.cvl.isy.liu.se/en/research/datasets/swedish-leaf/, last accessed 2017/04/20.
7. Söderkvist, O.J.O.: Computer Vision Classification of Leaves from Swedish Trees. Master's thesis, Linköping University, SE-581 83 Linköping, Sweden (2001).
8. Wang, B., Brown, D., Gao, Y., La Salle, J.: Mobile plant leaf identification using smart-phones. In: 2013 20th IEEE International Conference on Image Processing (ICIP), pp. 4417-4421 (2013).
9. Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., Kress, W.J., Lopez, I.C., Soares, J.V.B.: Leafsnap: A Computer Vision System for Automatic Plant Species Identification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 502-516. Springer, Heidelberg (2012).
10. Leafsnap Dataset — Leafsnap: An Electronic Field Guide, http://leafsnap.com/dataset/, last accessed 2016/09/15.
11. Novotný, P., Suk, T.: Leaf recognition of woody species in Central Europe. Biosystems Engineering 115(4), pp. 444-452 (2013).
12. Download Middle European Woods (MEW 2012, 2014) — Department of Image Processing, http://zoi.utia.cas.cz/node/662, last accessed 2017/01/15.
13. Chaki, J., Parekh, R.: Plant Leaf Recognition using Shape based Features and Neural Network classifiers. International Journal of Advanced Computer Science and Applications (IJACSA), vol. 2, no. 10, pp. 41-47 (2011).
14. Munisami, T., Ramsurn, M., Kishnah, S., Pudaruth, S.: Plant leaf recognition using shape features and colour histogram with k-nearest neighbour classifiers. Procedia Computer Science, vol. 58, pp. 740-747 (2015).
15. Lee, K.-B., Hong, K.-S.: An Implementation of Leaf Recognition System using Leaf Vein and Shape. International Journal of Bio-Science and Bio-Technology, vol. 5, no. 2, pp. 58-66 (2013).
16. Vijayashree, T., Gopal, A.: Authentication of Leaf Image Using Image Processing Technique. ARPN Journal of Engineering and Applied Sciences, vol. 10, no. 9, pp. 4287-4291 (2015).
17. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, pp. 62-66 (1979).
18. Gonzalez, R., Woods, R.: Digital Image Processing. Pearson/Prentice Hall (2008).