Participation of LSIS/DYNI to ImageCLEF 2012 plant images classification task

Sébastien Paris¹⋆, Xanadu Halkias², and Hervé Glotin²,³

¹ LSIS/DYNI, Aix-Marseille University — sebastien.paris@lsis.org
² LSIS/DYNI, University of South Toulon-Var — halkias@univ-tln.fr
³ Institut Universitaire de France — glotin@univ-tln.fr

Abstract. This paper presents the participation of the LSIS/DYNI team in the ImageCLEF 2012 plant identification challenge. ImageCLEF's plant identification task provides a testbed for the system-oriented evaluation of tree species identification based on leaf images. The goal is to investigate image retrieval approaches in the context of crowd-sourced images of leaves collected in a collaborative manner. The LSIS/DYNI team submitted three runs to this task and obtained the best evaluation score (S = 0.32) for the "photograph" image category with an automatic method. Our approach is based on a modern computer vision framework involving local, highly discriminative visual descriptors, a sophisticated visual-patch encoder, and large-scale supervised classification. The paper presents the three procedures employed and provides an analysis of the obtained evaluation results.

Keywords: LSIS, DYNI, ImageCLEF, plant, leaves, images, collection, identification, classification, evaluation, benchmark

1 Introduction

This paper presents the contribution of the LSIS/DYNI group to the plant identification task organized within ImageCLEF 2012⁴ for the system-oriented evaluation of visual-based plant identification. As in the ImageCLEF 2011 challenge, this second-year pilot task focused on tree species identification based on leaf images. This year, the challenge was organized as a classification task over 126 tree species, with visual content being the main available information. Three types of image content were considered: leaf "scans", leaf photographs with a white uniform background (referred to as "scan-like" pictures), and unconstrained leaf "photographs" acquired on trees with natural background (see Fig. 1). The LSIS/DYNI team submitted three runs, all of them based on local feature extraction and large-scale supervised classification. We obtained the best score for the "photographs" category with an automatic method (S = 0.32).

⋆ Funded by COGNILEGO ANR 2010-CORD-013 and PEPS RUPTURE Scale Swarm Vision.
⁴ http://www.imageclef.org/2012/plant

Fig. 1. From left to right: "scans", "scan-like" and "photographs" categories.

2 Task description

The task was evaluated as a plant species retrieval task.

2.1 Training and Test data

A part of the Pl@ntLeaves II dataset was provided as training data, while the remaining part was used later as test data. The training subset was built by including the training AND test subsets of last year's Pl@ntLeaves I dataset, and by randomly selecting 2/3 of the individual plants for each NEW species (several pictures may belong to the same individual plant, but pictures of one plant are never split across training and test data).

– The training data comprises 8422 images (4870 "scans", 1819 "scan-like" photos, 1733 natural photos), each with a full XML metadata file (see Fig. 1 for examples). A complementary ground-truth file listing all images of each species was also provided.
– The test data comprises 3150 images (1760 "scans", 907 "scan-like" photos, 483 natural photos) with purged XML files (i.e. without the taxon information that has to be predicted).

2.2 Task objective and evaluation metric

The goal of the task was to retrieve the correct species among the top k species of a ranked list of retrieved species for each test image. Each participant was allowed to submit up to 4 runs built from different methods. Any number of species could be associated with each test image, sorted by decreasing confidence score; however, only the most confident species was used in the primary evaluation metric described below. Providing an extended ranked list of species was encouraged in order to derive complementary statistics (e.g. recognition rate at other taxonomic levels, suggestion rate on top k species, etc.).

The primary metric used to evaluate the submitted runs was a normalized classification rate evaluated on the first species returned for each test image. Each test image is attributed a score of 1 if the first returned species is correct, and 0 if it is wrong. An average normalized score is then computed over all test images. A simple mean over all test images would introduce a bias with regard to a real-world identification system. Indeed, recall that the Pl@ntLeaves II dataset was built in a collaborative manner, so a few contributors provided many more pictures than the many other contributors who provided only a few. Since we want to evaluate the ability of a system to provide correct answers for all users, we rather measure the mean of the average classification rate per author. Furthermore, some authors provided many pictures of the same individual plant (to enrich the training data with less effort). Since we want to evaluate the ability of a system to provide the correct answer based on a single plant observation, we also average the classification rate over each individual plant. Finally, the primary metric is defined as the following average classification score S:

S = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{P_u} \sum_{p=1}^{P_u} \frac{1}{N_{u,p}} \sum_{n=1}^{N_{u,p}} s_{u,p,n},    (1)

where
– U: number of users (who have at least one image in the test data),
– P_u: number of individual plants observed by the u-th user,
– N_{u,p}: number of pictures taken of the p-th plant observed by the u-th user,
– s_{u,p,n}: classification score (1 or 0) for the n-th picture taken of the p-th plant observed by the u-th user.

Finally, to isolate and evaluate the impact of the image acquisition type ("scans", "scan-like", "photograph"), a normalized classification score S was computed for each type separately. Participants were therefore allowed to train distinct classifiers, use different training subsets, or use distinct methods for each data type.
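To make the three nested means of eq. (1) concrete, here is a minimal Python sketch (our own illustrative code with a hypothetical input format, not the official evaluation tool) that computes S from per-image 0/1 scores:

```python
from collections import defaultdict

def normalized_score(records):
    """Compute the score S of eq. (1).

    `records` is a list of (user_id, plant_id, correct) tuples, one per
    test image, where `correct` is 1 if the top-ranked species is the
    true one and 0 otherwise.  (Hypothetical input format.)
    """
    # Group the 0/1 scores by user, then by individual plant.
    per_user = defaultdict(lambda: defaultdict(list))
    for user_id, plant_id, correct in records:
        per_user[user_id][plant_id].append(correct)

    # Average over pictures of a plant, then over plants of a user,
    # then over users -- exactly the three nested means of eq. (1).
    user_means = []
    for plants in per_user.values():
        plant_means = [sum(s) / len(s) for s in plants.values()]
        user_means.append(sum(plant_means) / len(plant_means))
    return sum(user_means) / len(user_means)

# Toy usage: two users; user "a" photographed the same plant twice.
print(normalized_score([("a", "p1", 1), ("a", "p1", 0), ("b", "p2", 1)]))
# -> (0.5 + 1.0) / 2 = 0.75
```

Note how the duplicate picture of plant "p1" does not count twice: it is averaged away at the plant level, which is precisely the bias the metric is designed to remove.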
3 Description of used methods

For all submitted runs, whatever the particular image type, we followed the same two-stage pipeline: i) local feature extraction coupled with a spatial pyramid (SP) analysis, and ii) large-scale linear supervised classification. For this first participation we did not perform any (supervised) segmentation, which would have allowed the extraction of more elaborate, leaf-specific descriptors.

3.1 Common procedures

Spatial pyramid local analysis. We define our SP matrix Λ with L levels as Λ ≜ [r_y, r_x, d_y, d_x, λ], a matrix of size (L × 5). For a level l ∈ {0, ..., L − 1}, the image I, of size (n_y × n_x), is divided into potentially overlapping sub-windows R_{l,v} of size (h_l × w_l). All windows of a level share the same associated weight λ_l. In our implementation, h_l ≜ ⌊n_y · r_{y,l}⌋ and w_l ≜ ⌊n_x · r_{x,l}⌋, where r_{y,l}, r_{x,l} and λ_l are the l-th elements of the vectors r_y, r_x and λ respectively. Sub-window shifts along the y and x axes are defined by the integers δ_{y,l} ≜ ⌊n_y · d_{y,l}⌋ and δ_{x,l} ≜ ⌊n_x · d_{x,l}⌋, where d_{y,l} and d_{x,l} are elements of d_y and d_x respectively. Overlapping can be obtained if d_{y,l} ≤ r_{y,l} and/or d_{x,l} ≤ r_{x,l}. The total number of sub-windows is equal to

V = \sum_{l=0}^{L-1} V_l = \sum_{l=0}^{L-1} \left\lfloor \frac{1 - r_{y,l}}{d_{y,l}} + 1 \right\rfloor \cdot \left\lfloor \frac{1 - r_{x,l}}{d_{x,l}} + 1 \right\rfloor.    (2)

Fig. 2 shows an example of SP with the particular choice

Λ = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ \frac{1}{2} & \frac{1}{4} & \frac{1}{4} & \frac{1}{8} & 1 \end{pmatrix}.

For this particular Λ matrix, we divide the vertical axis twice as much as the horizontal one, in accordance with the aspect-ratio distribution of the images in the dataset.

Fig. 2. Example of SP with L = 2 and V = 1 + 21. The upper-left corner of each window R_{l,v} is indicated with a red cross. Left: R_{0,0} = I for l = 0 (first level). Right: {R_{1,v}}, v = 0, ..., 20, for l = 1 (second level).

Linear support vector machines for large-scale classification. Assume a training set {x_i, y_i}_{i=1}^{N} is available, where x_i ∈ R^d is a descriptor extracted from image I_i and y_i ∈ {1, ..., M}, with M = 126 the number of classes and N = 8422 the number of training samples. As in [13, 1], we use a simple large-scale linear SVM, LIBLINEAR [6], with the 1-vs-all multi-class strategy. The associated binary unconstrained convex optimization problem to solve is

\min_{w} \left\{ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \max(1 - y_i w^T x_i, 0)^2 \right\},    (3)

where the parameter C controls the generalization error and is tuned on a specific validation set. LIBLINEAR converges linearly to the solution, with cost O(dN), compared to O(dN_{sv}^2) for kernel-based solvers. Moreover, in order to obtain an estimate of p(y = l|x), we perform a regression on the outputs of the binary classifiers of the classification stage. Both common procedures are illustrated by the two short sketches below.
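First, a minimal sketch of the sub-window enumeration behind eq. (2). This is our own illustrative code (the authors' implementation details, e.g. boundary handling, may differ); it reproduces V = 1 + 21 for the example Λ of Fig. 2:

```python
import math

def sp_windows(ny, nx, Lambda):
    """Enumerate spatial-pyramid sub-windows as (y, x, h, w, weight) tuples.

    `Lambda` is a list of rows [ry, rx, dy, dx, lam], one per level l,
    following the definition of Sec. 3.1.
    """
    windows = []
    for ry, rx, dy, dx, lam in Lambda:
        h, w = math.floor(ny * ry), math.floor(nx * rx)    # window size (h_l, w_l)
        sy, sx = math.floor(ny * dy), math.floor(nx * dx)  # shifts (delta_y, delta_x)
        vy = math.floor((1 - ry) / dy + 1)                 # positions per axis,
        vx = math.floor((1 - rx) / dx + 1)                 # as counted in eq. (2)
        for iy in range(vy):
            for ix in range(vx):
                windows.append((iy * sy, ix * sx, h, w, lam))
    return windows

Lambda = [[1, 1, 1, 1, 1],          # level 0: the whole image
          [1/2, 1/4, 1/4, 1/8, 1]]  # level 1: 3 x 7 overlapping windows
print(len(sp_windows(256, 256, Lambda)))  # -> 22, i.e. V = 1 + 21
```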
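Second, a toy 1-vs-all training loop in the spirit of eq. (3). We use scikit-learn's LinearSVC, which wraps LIBLINEAR with the squared hinge loss of eq. (3), and random data in place of the real descriptors; this is a sketch of the strategy, not the authors' exact setup, and C would be tuned on a validation set as described above:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))   # 500 toy descriptors of dimension d = 64
y = rng.integers(0, 5, size=500)     # 5 toy classes (M = 126 in the paper)

# One binary L2-loss (squared hinge) SVM per class, as in eq. (3).
classifiers = {}
for label in np.unique(y):
    clf = LinearSVC(C=1.0, loss="squared_hinge")
    clf.fit(X, np.where(y == label, 1, -1))   # 1-vs-all relabeling
    classifiers[label] = clf

# Predict the class whose binary SVM returns the largest margin.
labels = sorted(classifiers)
scores = np.column_stack([classifiers[l].decision_function(X) for l in labels])
pred = np.array(labels)[scores.argmax(axis=1)]
print("training accuracy:", (pred == y).mean())
```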
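To make eqs. (4)-(7) concrete, here is a sketch of a plain LPQ histogram for a single gray-scale region. It omits the coefficient decorrelation, the four scales, the color channels and the SP concatenation of the full MSCLPQ descriptor, and the separable filter construction is our assumption of a standard LPQ implementation:

```python
import numpy as np
from scipy.signal import convolve2d

def lpq_histogram(img, M=3):
    """l2-normalized 256-bin LPQ histogram of eqs. (4)-(7) for one region."""
    r = np.arange(M) - M // 2              # window offsets, e.g. [-1, 0, 1]
    e = np.exp(-2j * np.pi * r / M)        # 1-D exponential at frequency a = 1/M
    one = np.ones(M)
    # Separable 2-D filters for u1=[a,0], u2=[0,a], u3=[a,a], u4=[a,-a].
    stft = [convolve2d(convolve2d(img, fy[:, None], mode="same"),
                       fx[None, :], mode="same")
            for fy, fx in [(e, one), (one, e), (e, e), (e, e.conj())]]
    # Eq. (5): one bit for the sign of each real and imaginary part.
    code = np.zeros(img.shape, dtype=np.int32)
    for i, f in enumerate(stft):
        code += (f.real >= 0).astype(np.int32) << (2 * i)
        code += (f.imag >= 0).astype(np.int32) << (2 * i + 1)
    hist = np.bincount(code.ravel(), minlength=256).astype(float)  # eq. (6)
    return hist / (np.linalg.norm(hist) + 1e-12)                   # eq. (7)

img = np.random.default_rng(0).random((64, 64))
print(lpq_histogram(img).shape)  # -> (256,)
```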
3.3 Late fusion of MSCLPQ, MSCILBP and MSILBP+ScSPM → LSIS DYNI run 2

Multiscale Color Local Phase Quantization. See Sec. 3.2.

Multiscale Color Improved Local Binary Pattern (MSCILBP). Basically, the ILBP operator encodes the relationship between a central block of (s × s) pixels located at z_c = [y_c, x_c]^T and its 8 neighboring blocks [8], and also adds a ninth bit encoding a term homogeneous to the differential excitation. This operator can be considered as a non-parametric local texture encoder at scale s. In order to capture information at different scales, the analysis range s ∈ 𝒮 is set to 𝒮 = {1, 2, 3, 4} for this task, where S = Card(𝒮). These micro-codes are defined as follows:

ILBP(z_c, s) = \sum_{i=0}^{7} 2^i\, 1_{\{A_i \geq A_c\}} + 2^8\, 1_{\{\sum_{i=0}^{7} A_i \geq 8 A_c\}},    (8)

where, ∀ z_c ∈ R ⊂ I, ILBP(z_c, s) ∈ {0, ..., 2⁹ − 1}. The different areas {A_i} and A_c in eq. (8) can be computed efficiently using the integral image technique [12]. Let us define the integral image II of I by

II(y, x) = \sum_{y'=0}^{y} \sum_{x'=0}^{x} I(y', x').
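A sketch of the ILBP micro-codes of eq. (8), computing all block sums at once from the integral image II; the neighbor ordering (which bit maps to which neighboring block) is our assumption, as eq. (8) does not fix it:

```python
import numpy as np

def block_sums(img, s):
    """Sum over every (s x s) block of `img`, via the integral image II."""
    II = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    II[1:, 1:] = img.cumsum(0).cumsum(1)   # II(y,x) = sum_{y'<=y, x'<=x} I(y',x')
    return II[s:, s:] - II[:-s, s:] - II[s:, :-s] + II[:-s, :-s]

def ilbp_codes(img, s=1):
    """9-bit ILBP micro-codes of eq. (8) at scale s, one per valid site."""
    A = block_sums(img, s)
    c = A[s:-s, s:-s]                      # central block areas A_c
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]   # assumed neighbor order
    code = np.zeros(c.shape, dtype=np.int32)
    total = np.zeros_like(c)
    ny, nx = A.shape
    for i, (dy, dx) in enumerate(offsets):
        Ai = A[s + dy * s:ny - s + dy * s, s + dx * s:nx - s + dx * s]
        code += (Ai >= c).astype(np.int32) << i    # bits 0..7 of eq. (8)
        total += Ai
    code += (total >= 8 * c).astype(np.int32) << 8  # ninth bit of eq. (8)
    return code                                     # values in {0, ..., 511}

codes = ilbp_codes(np.random.default_rng(0).random((32, 32)), s=2)
print(codes.shape, codes.max() < 512)
```

The four-term difference in `block_sums` is the usual O(1) rectangle-sum identity on the integral image, which is what makes the multi-scale analysis over s ∈ 𝒮 cheap.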