<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Feature Engineering for Tree Leaf Classifier</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daria Korotaeva</string-name>
          <email>daria.korotayeva@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maksim Khlopotov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The article presents an approach to classifying trees based on the image of a leaf. In this work, 1464 images of 18 tree species typical for Russian flora were used, and the method of k-nearest neighbors was used for the classification model. The process of feature retrieval includes image verification, binarization and removal of the leaf petiole. The selected 22 features are based on the analysis of the image moments and of the distances from the centroid to the boundary coordinates. An accuracy of more than 95% is obtained on the testing set.</p>
      </abstract>
      <kwd-group>
        <kwd>tree recognition</kwd>
        <kwd>leaf image classification</kwd>
        <kwd>feature engineering</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Trees are an important part of terrestrial ecosystems and have a great impact
on the environment. Some deciduous tree species can be a source of
allergens for people prone to pollinosis, so the ability to identify such a tree
is important for them. Trees are also of interest to herbarium collectors and
to parents with children. Despite the importance of being able to determine the
tree type, most people do not have the necessary skills for this. The proposed
method may be used for creating a tree recognition application.
Despite the large variety of plant organs that can be used for determining the
type of a plant, a leaf is the most convenient object in terms of image analysis.
Unlike fruits and inflorescences, which are present on the tree only for a short period
of time, leaves can be collected throughout a long period of the year, and can
also be used in dried form.</p>
      <p>
        In this article, feature engineering for the k-NN classifier is presented, as well as
the method of image preprocessing. Section 2 deals with the existing research
in this area. Section 3 is focused on data collection and the method of image
verification. The processing of the image, including removal of the petiole and computation
of the one-dimensional array of distances, is described in section 4. The final set
of features is presented in section 5, section 6 deals with the achieved results, and the
conclusion and further work are described in section 7.
The existing solutions offer different approaches to leaf image analysis. Most
authors prefer to analyze leaf shape for extracting basic features. For example,
the Flavia Plant Leaf Recognition System [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is based on a probabilistic neural
network that classifies leaves based on 12 features describing the shape of a leaf.
The authors achieved a classification accuracy of more than 90% for 32 plant
species growing in China. J.-X. Du et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also extracted features from the
leaf shape, but proposed a new method for classification, referred to as a
hypersphere classifier. The authors compare this method to the k-NN method for
classifying 20 plant species. The study by M. Kumar et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is focused on
various classification techniques for the plant classification task. The authors
concluded that the simplest method is the k-NN classifier, whose main
disadvantage is sensitivity to noise.
      </p>
      <p>
        Various ways of solving the classification problem on one dataset are presented
in the study by H. Goeau et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The 55 plant species in the dataset are
represented by both scans and photos with a natural background. Participants from
different countries presented their ways of solving the problem; the most
popular solution was the analysis of shape boundaries. At the same time, the authors
note the prospects of using metadata, in particular geo-tags. The best result
was achieved by INRIA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], whose authors used a contour-based shape
descriptor called the Directional Fragment Histogram. The essence of the method
is to represent the leaf shape as groups of elementary components having the
same direction. The Swedish tree leaf dataset [6][7] was used to implement
and test the descriptor. That dataset was also partially used in this
study. B. Wang et al. [8] also investigate the leaf shape, focusing on the convexity
and concavity properties of the leaf arches as the major features. The achieved
accuracy is estimated at more than 96% for the Swedish tree leaf dataset. A
similar approach was used by the authors of the application LeafSnap [9] to identify
leaves of the 184 tree species of the Northeastern United States. The dataset
[10] collected specifically for this task is estimated to be the largest leaf dataset
to date. A curvature-based shape descriptor is used for extracting features, and
high results are obtained within the top 5 results shown to the user. For the
correct operation of the application, the authors implemented a verification of the
image uploaded by the user. The method of removing the petiole described in
that study was applied in this work for the correct extraction of features.
A study by P. Novotny and T. Suk [11] suggests applying a Fourier
descriptor to the leaf shape. An accuracy of more than 88% was obtained on a dataset
called Middle European Woody Plants containing 153 species [12]. The analysis
of the distances from centroid to boundaries described in this paper was also
presented in a study by J. Chaki and R. Parekh [13]. There the feature vector was
obtained by describing 36 radii, and the evaluation method differs from the
one presented in this article. The classification of only 3 plant species is described in
that paper, which does not allow a full evaluation of the efficiency of the method.
Although most authors do not consider leaf color analysis a reliable
method of classification, T. Munisami et al. [14] included the color histogram as
one of the features, resulting in a classification accuracy of more than 87% for 32
plant species. An accuracy of more than 97% was obtained on the same dataset
by implementing the leaf venation analysis described by K.-B. Lee and K.-S. Hong
[15], although the study [9] noted that most mobile phone cameras are unable
to detect leaf venation. Leaf texture analysis is presented in a recent study
by Vijayashree T. and A. Gopal [16], with an obtained accuracy of 89% for 50
images.
    </sec>
    <sec id="sec-2">
      <title>Collecting Data</title>
      <p>The dataset used in this research (available at https://goo.gl/3zBxzR) was formed based on the following publicly
available datasets: LeafSnap [10], MEW 2012 [12] and the Swedish tree dataset
[7]. The choice of species was made regarding their specificity for the territory
of Russia. Although species with different leaf configurations are present in the
list, the leaf image dataset consists only of simple leaves and terminal leaflets of
compound leaves, as suggested in the MEW 2012 dataset. The species considered
in this article are listed in Table 1.</p>
      <p>[Table 1: the considered species with per-species image counts. The recoverable species names are Ilex aquifolium, Populus nigra, Alnus incana, Populus tremula, Betula pendula, Prunus padus, Betula pubescens, Corylus avellana, Quercus robur, Salix alba, Syringa vulgaris, Tilia cordata, Ginkgo biloba and Ulmus laevis; the image counts are garbled in extraction.]</p>
      <p>To ensure the correct operation of the assumed application and to avoid errors
during the processing of images from the dataset, a binary classifier was created
to check whether an image meets the following criteria:
- one leaf must be present in the image;
- at least 10% of the image area must be occupied by the leaf;
- the leaf must not touch the image borders (except for thin parts, such as the
petiole);
- the image must be taken on a light, neutral background.</p>
      <p>An image that fails any of these criteria is not passed on for
further evaluation, unless it contains more than one object: in that case the algorithm
is applied to the largest object in the image.</p>
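      <p>The criteria above can be checked programmatically. The sketch below is a hypothetical reconstruction using SciPy's connected-component labelling; the paper does not give its verifier's implementation, and the light-background criterion is assumed to be checked separately on the gray image.

```python
import numpy as np
from scipy import ndimage

def verify_leaf_image(binary, min_area_frac=0.10):
    """Check a binarized image against the criteria from the text:
    a single (largest) object covering at least 10% of the image
    and not touching the image borders."""
    labels, n = ndimage.label(binary)
    if n == 0:
        return False                                  # no leaf at all
    sizes = ndimage.sum(binary, labels, index=range(1, n + 1))
    main = labels == (int(np.argmax(sizes)) + 1)      # fall back to the largest object
    if main.sum() < min_area_frac * binary.size:      # area criterion
        return False
    edge = np.concatenate([main[0], main[-1], main[:, 0], main[:, -1]])
    return not edge.any()                             # must not touch the borders
```

A leaf-sized centered blob passes; a blob that touches the border or covers too little area is rejected.</p>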
    </sec>
    <sec id="sec-3">
      <title>Image Processing</title>
      <p>To retrieve the characteristics, an image that has been successfully verified is passed
through several processing steps. First, threshold binarization using the Otsu
method [17] is applied to the image. This allows us to highlight the objects in
the image. The next step is to eliminate noise, since small objects are present
in most in-situ photos of leaves, including the images used in this research
as part of the dataset.</p>
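      <p>The thresholding step can be sketched in plain NumPy. This is an illustrative re-implementation of Otsu's method [17], not the authors' code; it assumes an 8-bit grayscale image with the leaf darker than the light background.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the gray level maximizing the between-class
    variance of the histogram (8-bit input assumed)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                    # probability of the "dark" class
    mu = np.cumsum(p * np.arange(256))      # cumulative mean of the "dark" class
    mu_t = mu[-1]                           # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    return int(np.nanargmax(sigma_b))

# Binarization: with a light background, the leaf is the darker class.
gray = np.where(np.arange(2500).reshape(50, 50) % 7 < 3, 60, 210).astype(np.uint8)
binary = gray <= otsu_threshold(gray)       # True where the (dark) leaf is
```

On a bimodal image like this synthetic one, the threshold lands between the two gray-level clusters.</p>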
      <p>In most cases, photos of leaves contain the petiole, since its removal requires
additional manipulation. However, when analyzing the contour of the leaf, the petiole
can seriously affect the extracted characteristics, for example eccentricity or
the convex hull. The solution to this problem is to remove the petiole during image
processing. For this, the top-hat operation is applied to the binary image. The
top-hat transformation of a binary leaf image L is L minus its opening:
That(L) = L − (L ∘ SE), (1)
where L ∘ SE is the opening of L by a structuring element SE, defined
as the erosion of L by SE, followed by a dilation of the result by SE [18]. This
method was proposed in [9] and is effective for most images of leaves. The shape
thus obtained corresponds to the shape of the leaf blade without a petiole and
can be used for further analysis. In case the petiole is absent from the image, the
longest object remaining after the top-hat operation will be removed from the
image.</p>
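      <p>Equation (1) can be illustrated with SciPy's binary morphology. This is a sketch of the transform rather than the authors' exact pipeline, and the structuring-element size is an assumed tuning parameter.

```python
import numpy as np
from scipy import ndimage

def remove_petiole(leaf, se_size=5):
    """Top-hat petiole removal, Eq. (1): That(L) = L - (L opened by SE).
    Thin structures such as the petiole vanish under the opening, so the
    top-hat residue isolates them; subtracting it keeps the leaf blade."""
    se = np.ones((se_size, se_size), bool)
    opened = ndimage.binary_opening(leaf, structure=se)
    top_hat = leaf & ~opened                # That(L): mostly the petiole
    return leaf & ~top_hat, top_hat         # (blade, petiole candidate)

# Synthetic leaf: a disk (blade) with a one-pixel-wide line (petiole).
yy, xx = np.mgrid[:40, :40]
leaf = (yy - 20) ** 2 + (xx - 15) ** 2 <= 100
leaf[20, 25:38] = True
blade, petiole = remove_petiole(leaf)
```

The one-pixel line cannot survive erosion by the 5x5 element, so it ends up in the top-hat residue, while the disk-shaped blade is preserved.</p>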
      <p>To evaluate the primary characteristic of a leaf, we search for the centroid of the
obtained shape, and then for the coordinates of the points on the contour boundary.
The distances from the centroid to each of these points form a one-dimensional
array, which we investigate for further feature extraction. However, since leaf
position and orientation in an image may vary, in order to obtain informative
data it is necessary to start calculating the distances at the same point for all
leaves. The most convenient point in this case is the base of the petiole,
which we removed, so the area occupied by the petiole is examined for proximity
to the main object. The closest point is considered the base of the petiole, and
the distance to it is set as the first during the formation of the one-dimensional
array of distances (AOD). If more than one point of the petiole borders the
main figure (petioles in most leaf images have a thickness of several pixels), the
first one found is considered the petiole base. The algorithm is thus invariant
to leaf rotation. The processed image and the bar graph of the AOD are shown in
Fig. 1 and Fig. 2.</p>
      <p>All the retrievable features are invariant to rotation. They can be divided into two
groups: the features proposed in this paper, extracted from the AOD, and the
image moments described by many authors, obtained by analyzing the binary image of
a leaf after petiole removal. The following features were used for building the
learning model:
1. Eccentricity of the ellipse that has the same second moments as the leaf
shape.
2. Extent: ratio of the area of the leaf to the smallest rectangle (bounding box)
containing the leaf (as shown in Fig. 3).
3. Solidity: ratio of the area of the leaf to its convex hull (see Fig. 4).
4. Equivalent diameter: the diameter of a circle with the same area as the leaf.
5. Ratio of the leaf area to a circle with a radius equal to the minimal
centroid-boundary distance (shown in Fig. 5).
6. Expectation: mean value of the AOD.
7. Variance of the AOD.
8. Median of the AOD.
9. Mode of the AOD.
10. Vertical symmetry: ratio of the areas of the leaf halves divided vertically.
11. Horizontal symmetry: ratio of the areas of the leaf halves divided horizontally.
12. Minimal distance: ratio of the minimal value of the AOD to its mean value.
13. Maximal distance: ratio of the maximal value of the AOD to its mean value.
14. Length ratio: ratio of the length of the AOD to its maximum.</p>
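      <p>A minimal sketch of the AOD and of the statistical features 6-8 and 12-13 built on it. Contour ordering here is by polar angle rather than by the paper's petiole-anchored starting point, which does not affect these particular statistics; a clean binary mask is assumed.

```python
import numpy as np
from scipy import ndimage

def aod(mask):
    """One-dimensional array of centroid-to-boundary distances (AOD).
    Boundary = mask pixels removed by a single erosion."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                     # centroid
    border = mask & ~ndimage.binary_erosion(mask)
    by, bx = np.nonzero(border)
    order = np.argsort(np.arctan2(by - cy, bx - cx))  # walk around the contour
    return np.hypot(by - cy, bx - cx)[order]

def aod_stats(d):
    return {
        "expectation": d.mean(),              # feature 6
        "variance": d.var(),                  # feature 7
        "median": float(np.median(d)),        # feature 8
        "min_distance": d.min() / d.mean(),   # feature 12
        "max_distance": d.max() / d.mean(),   # feature 13
    }
```

For a circular mask the AOD is nearly constant, so the min and max ratios stay close to 1; lobed leaves spread them apart.</p>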
      <p>For features 15-22, local maxima and minima were analyzed (see Fig. 6).
15. Number of peaks: number of local maxima of the AOD.
16. Peak width: mean peak width of the AOD.
17. Peak prominence: mean peak prominence of the AOD.
18. Minimal peak: the minimal value in the array of local maxima of the AOD.
19. Number of valleys: number of local minima of the AOD.
20. Valley width: mean valley width of the AOD.
21. Valley prominence: mean valley prominence of the AOD.
22. Maximal valley: the maximal value in the array of local minima of the AOD.
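</p>
      <p>Features 15-18 (and, applied to the negated signal, features 19-22) can be sketched with SciPy's peak finder. The prominence threshold below is an assumption; the paper does not state how small contour fluctuations are filtered out.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_features(d, prominence=0.5):
    """Local-maximum features of the AOD (15-18); running the same
    routine on -d yields the valley features (19-22)."""
    peaks, props = find_peaks(d, prominence=prominence, width=1)
    n = len(peaks)
    return {
        "n_peaks": n,                                                  # feature 15
        "peak_width": props["widths"].mean() if n else 0.0,            # feature 16
        "peak_prominence": props["prominences"].mean() if n else 0.0,  # feature 17
        "min_peak": float(d[peaks].min()) if n else 0.0,               # feature 18
    }

# A five-lobed synthetic AOD, starting at a valley so every lobe is interior.
theta = np.linspace(0, 2 * np.pi, 500, endpoint=False)
d = 10 + 3 * np.cos(5 * theta - np.pi)
feats = peak_features(d)
```

On this signal the five lobes are counted as five peaks, each with a prominence of roughly 6.</p>
      <p>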
The k-nearest neighbors classifier was implemented, with the dataset divided into
training and testing sets in a 3:1 ratio. k = 1 was chosen based on the model
performance. The number of extracted features was reduced to 22 according to
the achieved results, as many features extracted during the study proved to be
uninformative. A learning model based only on the features extracted from the
AOD showed a classification accuracy of up to 90.5% on the testing set, while
using only features 1-5 gave an accuracy of 80.4%. Using the complete set of
features achieves 95.5%, depending on the partitioning of the dataset; the
confusion matrix for this result is shown in Fig. 7. For the Swedish tree
dataset [7], an accuracy of 94% is obtained on the testing set. Other machine
learning algorithms, including random forest and bagging, achieved the same
results.</p>
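      <p>The classification step can be illustrated with a minimal nearest-neighbor implementation. This toy Euclidean-distance version stands in for whatever library implementation the authors used, and the data below is synthetic, not the leaf dataset.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=1):
    """Plain k-NN by Euclidean distance (k = 1 in the paper): each test
    vector receives the majority label of its k nearest training vectors."""
    preds = []
    for x in test_X:
        dist = np.linalg.norm(train_X - x, axis=1)  # distance to every training sample
        nearest = train_y[np.argsort(dist)[:k]]     # labels of the k closest
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])     # majority vote
    return np.array(preds)

# Toy example: two well-separated classes, 3:1 train/test split.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.],
              [0.3, 0.3], [5.5, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])
pred = knn_predict(X[:6], y[:6], X[6:], k=1)
accuracy = (pred == y[6:]).mean()
```

With well-separated clusters, the 1-NN rule labels both held-out points correctly; in practice, features on very different scales would first be normalized.</p>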
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this article, a method for extracting features for classifying trees based on
a leaf image has been described. A dataset was formed based on images from
MEW 2012, LeafSnap and the Swedish leaf dataset. The developed verification
algorithm makes it possible to exclude errors during image processing. For the 18 classes
considered, a classification accuracy of more than 95% was obtained with a
1-NN method based on 22 features. The method was also applied to the Swedish
tree dataset and showed an accuracy of 94%. The learning model has shown
resistance to an increasing number of classes. The considered method can be
used in combination with the previously described methods of leaf image analysis
to develop an application covering a larger number of tree species.
Additional features may be extracted from the AOD to achieve better results
on a larger dataset; for example, the fast Fourier transform may be applied to
the AOD. It is also expected that the Russian species that are not represented
in the existing datasets will be added after collaboration with botanists.</p>
      <p>6. Swedish Leaf Dataset, http://www.cvl.isy.liu.se/en/research/datasets/swedishleaf/, last accessed 2017/04/20.</p>
      <p>7. Soderkvist, O.J.O.: Computer Vision Classification of Leaves from Swedish Trees. Master's thesis, Linkoping University, SE-581 83 Linkoping, Sweden (2001).</p>
      <p>8. Wang, B., Brown, D., Gao, Y., La Salle, J.: Mobile plant leaf identification using smart-phones. In: 2013 20th IEEE International Conference on Image Processing (ICIP), pp. 4417-4421 (2013).</p>
      <p>9. Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., Kress, W.J., Lopez, I.C., Soares, J.V.B.: Leafsnap: A Computer Vision System for Automatic Plant Species Identification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 502-516. Springer, Heidelberg (2012).</p>
      <p>10. Leafsnap Dataset | Leafsnap: An Electronic Field Guide, http://leafsnap.com/dataset/, last accessed 2016/09/15.</p>
      <p>11. Novotny, P., Suk, T.: Leaf recognition of woody species in Central Europe. Biosystems Engineering 115(4), pp. 444-452 (2013).</p>
      <p>12. Download Middle European Woods (MEW 2012, 2014) | Department of Image Processing, http://zoi.utia.cas.cz/node/662, last accessed 2017/01/15.</p>
      <p>13. Chaki, J., Parekh, R.: Plant Leaf Recognition using Shape based Features and Neural Network Classifiers. International Journal of Advanced Computer Science and Applications (IJACSA), vol. 2, no. 10, pp. 41-47 (2011).</p>
      <p>14. Munisami, T., Ramsurn, M., Kishnah, S., Pudaruth, S.: Plant leaf recognition using shape features and colour histogram with k-nearest neighbour classifiers. Procedia Computer Science, vol. 58, pp. 740-747 (2015).</p>
      <p>15. Lee, K.-B., Hong, K.-S.: An Implementation of Leaf Recognition System using Leaf Vein and Shape. International Journal of Bio-Science and Bio-Technology, vol. 5, no. 2, pp. 58-66 (2013).</p>
      <p>16. Vijayashree, T., Gopal, A.: Authentication of Leaf Image Using Image Processing Technique. ARPN Journal of Engineering and Applied Sciences, vol. 10, no. 9, pp. 4287-4291 (2015).</p>
      <p>17. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 9, pp. 62-66 (1979).</p>
      <p>18. Gonzalez, R., Woods, R.: Digital Image Processing. Pearson/Prentice Hall (2008).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>F.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>E.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>A Leaf Recognition Algorithm for Plant Classification Using Probabilistic Neural Network</article-title>
          .
          <source>In: IEEE International Symposium on Signal Processing and Information Technology</source>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          :
          <article-title>Leaf shape based plant species recognition</article-title>
          .
          <source>Applied Mathematics and Computation 185</source>
          , pp.
          <fpage>883</fpage>
          -
          <lpage>893</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kumar</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamble</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pawar</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patil</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonde</surname>
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Survey Techniques for Plant Leaf Classification</article-title>
          .
          <source>International Journal of Modern Engineering Research (IJMER)</source>
          , vol.
          <volume>1</volume>
          , issue
          <issue>2</issue>
          , pp.
          <fpage>538</fpage>
          -
          <lpage>544</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Goeau, H.,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boujemaa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barthelemy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molino</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birnbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mouysset</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Picard</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The CLEF 2011 plant images classification task</article-title>
          . In: CLEF (Notebook Papers/Labs/Workshop) (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Yahiaoui</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herve</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boujemaa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Shape-based image retrieval in botanical collections</article-title>
          .
          <source>In: Advances in Multimedia Information Processing - PCM</source>
          <year>2006</year>
          , vol.
          <volume>4261</volume>
          , pp.
          <fpage>357</fpage>
          -
          <lpage>364</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>