
A comparative study of fine-grained classification methods in the context of the LifeCLEF plant identification challenge 2015

Julien Champ (1,2), Titouan Lorieul (1,2), Maximilien Servajean (1,2), Alexis Joly (1,2)

1 Inria ZENITH team, France
2 LIRMM, Montpellier, France

Abstract. This paper describes the participation of Inria in the plant identification task of the LifeCLEF 2015 challenge. The aim of the task was to produce a list of relevant species for a large set of plant observations related to 1000 species of trees, herbs and ferns living in Western Europe. Each plant observation contained several annotated pictures with organ/view tags: Flower, Leaf, Fruit, Stem, Branch, Entire, Scan (exclusively of leaf). To address this challenge, we experimented with two popular families of classification techniques: convolutional neural networks (CNN) on one side and Fisher vector-based discriminant models on the other. Our results show that the CNN approach achieves much better performance than the Fisher vectors. Furthermore, we show that the fusion of both techniques, based on a Bayesian inference using the confusion matrix of each classifier, did not improve on the results of the CNN alone.

Keywords: LifeCLEF, plant, leaves, leaf, flower, fruit, bark, stem, branch, species, retrieval, images, collection, species identification, citizen-science, fine-grained classification, evaluation, benchmark

1 Introduction

Content-based image retrieval and computer vision approaches are considered to be among the most promising solutions to help bridge the taxonomic gap, as discussed in [5,1,36,34,17]. We therefore see an increasing interest in this transdisciplinary challenge in the multimedia community (e.g. in [26,10,2,25,20,12]). Beyond the raw identification performance achievable by state-of-the-art computer vision algorithms, recent visual search paradigms actually offer much more efficient and interactive ways of browsing large floras than standard field guides or online web catalogs [3]. Smartphone applications relying on such image-based identification services are particularly promising for setting up massive ecological monitoring systems, involving thousands of contributors at a very low cost. A first step in this direction was achieved by the US consortium behind LeafSnap (http://leafsnap.com/), an iPhone application allowing the identification of 184 common American plant species based on pictures of cut leaves on a uniform background (see [23] for more details). Then, the French consortium supporting Pl@ntNet [17] went one step further by building an interactive image-based plant identification application that is continuously enriched by the members of a social network specialized in botany. Inspired by the principles of citizen science and participatory sensing, this project quickly reached a large audience, with more than 300K downloads of the mobile applications [8,7]. A related initiative is the plant identification evaluation task organized since 2011 in the context of the international evaluation forum CLEF (http://www.clef-initiative.eu/) and based on the data collected within Pl@ntNet. This paper presents the participation of the Inria ZENITH team in the 2015 edition of this challenge [9,19].

2 Related work

From a computer vision and technological perspective, our work is more generally related to image classification. The most popular methods for this problem are typically based on the pooling of local visual features into global image representations and the use of powerful classifiers, such as linear support vector machines, in the resulting high-dimensional embedded space [24,28]. The Bag-of-Words representation (BoW) notably remains a key concept, although the raw initial scheme of [33] is now outperformed by several alternative schemes [24,16,27,6,14]. Its principle is to first train a so-called visual vocabulary with an unsupervised clustering algorithm computed on a given training set of local features. The resulting partition is then used to quantize the visual features of a new image into visual words that are aggregated within a single high-dimensional histogram. Partial geometry can be embedded in the image representation by using the Spatial Pyramid Matching scheme of [24]. As it relies on vector quantization, the BoW representation is however affected by quantization errors. Very similar visual features might be split across distinct clusters, whereas more dissimilar ones might be assigned to the same visual word. This results in both mismatches and potentially irrelevant matches. To alleviate this problem, several improvements have been proposed in the literature. The first one consists in expanding the assignment of a given local feature to its several nearest visual words [16,29,6,14]. This reduces the number of mismatches without degrading the encoding time much. Other researchers have investigated alternative ways to avoid the vector quantization step, using sparse coding [38] or locality-constrained linear coding [37]. Such methods optimize the assignment of a given local feature to a small number of visual words thanks to sparsity or locality constraints on the global representation.

Another alternative is to use aggregation-based models such as the improved Fisher Vector of [27] or the VLAD encoding scheme [14]. Such methods do not only encode the number of occurrences of each visual word but also encode additional information about the distribution of the descriptors by aggregating the component-wise differences. When used with discriminative linear classifiers, such high-dimensional representations benefit from both generative and discriminative approaches, leading to state-of-the-art classification performance on fine-grained classification benchmarks [11].
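To make the quantization step of the BoW representation concrete, the following minimal sketch builds a histogram of visual words from local features. It is illustrative only: the function names are hypothetical, and the multiple-assignment variant simply spreads each feature equally over its nearest words, which is a simplification of the weighting schemes used in the cited works.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(features: Sequence[Sequence[float]],
                  vocabulary: Sequence[Sequence[float]],
                  n_nearest: int = 1) -> List[float]:
    """Aggregate local features into a histogram of visual words.

    n_nearest=1 is the hard assignment of the original BoW scheme.
    n_nearest>1 spreads each feature equally over its nearest words
    (assumption: equal weights, chosen here for simplicity).
    """
    histogram = [0.0] * len(vocabulary)
    for feature in features:
        # Rank visual words by distance to the current feature.
        ranked = sorted(range(len(vocabulary)),
                        key=lambda k: squared_distance(feature, vocabulary[k]))
        for k in ranked[:n_nearest]:
            histogram[k] += 1.0 / n_nearest
    return histogram
```

With hard assignment, a feature lying close to the border between two clusters contributes to only one word; with `n_nearest=2` its mass is shared, which is the intuition behind the reduction of mismatches mentioned above.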

A radically different approach to image classification is the use of deep convolutional neural networks. Rather than extracting features according to hand-tuned or psycho-vision oriented filters, such methods work directly on the image signal. The weights learned by the first convolutional layers automatically build relevant image filters, whereas the intermediate layers are in charge of pooling these raw responses into high-level visual patterns. The last fully connected layers work more traditionally, like any discriminative classifier, on the image representation resulting from the previous layers. Deep convolutional neural networks have recently been shown to achieve better results on large-scale image classification datasets such as ImageNet [22] and attract more and more interest in the computer vision and multimedia communities. A known drawback of deep convolutional neural networks is however that they require a lot of training data, mainly because of the huge number of parameters to be learned. Their performance on fine-grained classification has consequently been more controversial, and they have often been outperformed by local-feature-based approaches, although this was not the case in our experiments. Besides, it is important to note that they inspire the investigation of new deep learning models making use of more traditional visual feature embedding methods (e.g. [31]).

3 Experimented fine-grained image classification systems

We experimented with two families of image classification techniques that are known to provide state-of-the-art classification performance, in particular in fine-grained recognition challenges [11,18].

3.1 Convolutional neural networks

Convolutional Neural Networks (CNN) have been used since the 1990s, mainly for their performance in digit classification. In recent years, however, they have surpassed previous state-of-the-art methods for large-scale image classification [22]. In our experiments, we used Caffe [15], a deep learning framework that allowed us to use CNN architectures and models from the literature. From the Caffe Model Zoo we chose the "GoogLeNet GPU implementation" model, based on Google's winning architecture in the ImageNet 2014 challenge [35], and we fine-tuned this model on the LifeCLEF datasets.

The GoogLeNet architecture consists of a 22-layer deep network with a softmax loss as the top classifier. It is composed of nine "inception modules" stacked on top of each other. Two of the intermediate inception modules are connected to auxiliary classifiers during training, so as to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These auxiliary classifiers are only used during training, and are then discarded.

Experimental setup. The previously described GoogLeNet CNN takes square images as input. For each image in the training and test sets, we therefore cropped the largest centered square and resized it to 256x256 pixels. Instead of training our CNN from scratch on plant images only, and as authorized in this year's challenge, we started from a CNN trained on the popular generalist ImageNet dataset. We only removed its top layers (the fully connected ones), changed the number of outputs, and trained this new model on the target dataset. As implemented in the Caffe library, training also makes use of a simple data augmentation technique, which consists in randomly cropping a 224x224 pixels region and optionally mirroring it horizontally.
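The geometry of this preprocessing can be sketched as follows. The snippet only computes crop boxes and leaves the actual pixel resampling to an imaging library; the function names are hypothetical, and the 50% mirroring probability is an assumption about the augmentation rather than a value stated in this paper.

```python
import random
from typing import Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def center_square_box(width: int, height: int) -> Box:
    """Largest square centered in a width x height image."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def random_training_view(rng: random.Random,
                         size: int = 256,
                         crop: int = 224) -> Tuple[Box, bool]:
    """Random crop box inside a size x size image, plus a mirror flag.

    Assumption: the image is mirrored with probability 0.5.
    """
    left = rng.randint(0, size - crop)  # randint bounds are inclusive
    top = rng.randint(0, size - crop)
    mirror = rng.random() < 0.5
    return (left, top, left + crop, top + crop), mirror
```

For example, an 800x600 picture is first reduced to the box `(100, 0, 700, 600)`, then resized to 256x256, and each training pass draws a different 224x224 view from it.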

During our preliminary experiments, we tried several training strategies, which are presented in Table 1.

We tested all these configurations using the PlantCLEF 2014 data and ground truth (500 species, 47815 training images and 13146 test images). The CNN1 configuration was the simplest and the first that we tested, but in the end also the one providing the best results. The data augmentation method proposed for the CNN2 configuration significantly increased the number of training images, since 8 new images were generated by applying rotations and a set of colorimetric transformations with randomized parameters, i.e. brightness and saturation modulation in the HSL color space (multiplier factor randomized between 0.8 and 1.2), and contrast modulation (multiplier factor randomized between 0.7 and 1.3). Even with additional iterations to train the CNN, results remained nearly the same as those of CNN1.

The CNN3 configuration consisted in training several CNNs, one for each view type (using the tags provided in the metadata). On the one hand, as some species have no images for some views, the number of outputs of each CNN is lower than 1000, which could help to obtain better results by reducing the risk of confusion. On the other hand, some images from a given view (Branch for example) can really help to identify images tagged with another view (Entire for example). Results were slightly lower for the Branch, Entire, Leaf, Fruit and Flower views than those obtained with the standalone CNN. This could be explained by the smaller number of images available to train each network, and suggests that images from a given view can help when identifying an image tagged with another view. This conclusion does not hold for the Stem and LeafScan views. The reason is probably that the LeafScan view is specific, very different from the other views, and does not contain background information, and that the Stem tag identifies a close-up view of a part of the plant that is not really apparent in other images.
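The colorimetric part of the CNN2 augmentation can be illustrated with the sketch below, which works pixel by pixel using only the Python standard library. The factor ranges come from the text; everything else is an assumption made for illustration, in particular applying contrast as a scaling around mid-grey and the helper names.

```python
import colorsys
import random
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # RGB channels in [0, 1]


def clamp(value: float) -> float:
    """Keep a channel value inside [0, 1]."""
    return max(0.0, min(1.0, value))


def jitter_pixel(rgb: Pixel, brightness: float,
                 saturation: float, contrast: float) -> Pixel:
    """Modulate lightness and saturation in HSL, then contrast in RGB."""
    h, l, s = colorsys.rgb_to_hls(*rgb)  # note the H, L, S order
    r, g, b = colorsys.hls_to_rgb(h, clamp(l * brightness), clamp(s * saturation))
    # Assumption: contrast scales the distance to mid-grey (0.5).
    return tuple(clamp(0.5 + contrast * (c - 0.5)) for c in (r, g, b))


def sample_factors(rng: random.Random) -> Tuple[float, float, float]:
    """Draw (brightness, saturation, contrast) multipliers."""
    return (rng.uniform(0.8, 1.2), rng.uniform(0.8, 1.2), rng.uniform(0.7, 1.3))


def augment(pixels: List[Pixel], rng: random.Random) -> List[Pixel]:
    """Apply one randomly drawn colorimetric transform to a whole image."""
    brightness, saturation, contrast = sample_factors(rng)
    return [jitter_pixel(p, brightness, saturation, contrast) for p in pixels]
```

One set of factors is drawn per generated image, so all pixels of a given augmented image undergo the same transformation.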

Training parameters. As a reminder, here are the most important Caffe parameters used to obtain our submitted run (CNN1). The base learning rate was set to $10^{-5}$ and divided by 10 every 60k iterations. Training stopped after 150k iterations, and the batch size was fixed to 32. All other parameters were left unchanged.
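The resulting step schedule can be written in a few lines. This is a sketch of the arithmetic only, not of the Caffe solver itself; it assumes a base learning rate of 1e-5 and uses a hypothetical function name.

```python
def step_learning_rate(iteration: int,
                       base_lr: float = 1e-5,
                       gamma: float = 0.1,
                       step_size: int = 60_000) -> float:
    """Learning rate after `iteration` steps of a step-decay schedule.

    The rate is multiplied by `gamma` every `step_size` iterations.
    """
    return base_lr * gamma ** (iteration // step_size)


if __name__ == "__main__":
    # Training runs for 150k iterations, so three rate levels are used.
    for it in (0, 60_000, 120_000):
        print(it, step_learning_rate(it))
```

Under these assumptions the rate is divided by 10 twice over the 150k iterations, at 60k and 120k.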

3.2 Fisher vectors & Logistic Regression

Fisher vectors (FV) were first introduced in image classification by [27] and later proved to be very efficient in fine-grained classification tasks [11]. According to recent surveys such as [13], it is the best performing pooling strategy currently available. We only recall here the main steps used to extract Fisher vectors; for detailed explanations of the theoretical derivation and for performance analysis, we refer the reader to [30]. The pipeline for computing the Fisher vector describing an image consists of:

1. Dense extraction of local features: descriptors, often SIFT descriptors, are extracted on densely sampled overlapping patches at several scales.

2. PCA transformation: the descriptors are then de-correlated and compressed using a Principal Component Analysis.

3. Feature space density estimation: the distribution of features is modeled as a Gaussian Mixture Model (GMM) that is learned using the popular Expectation-Maximisation (EM) algorithm. We thus obtain a probability distribution of the form

$$u(x) = \sum_{k=1}^{K} w_k u_k(x)$$

where $u_k$ follows a Gaussian distribution of mean $\mu_k$ and covariance matrix $\Sigma_k$, $u_k \sim \mathcal{N}(\mu_k, \Sigma_k)$, with $\Sigma_k$ being diagonal because the features are decorrelated, and where $w_k$ is the weight of the $k$-th Gaussian; these weights satisfy $\sum_k w_k = 1$.

4. Encoding and pooling: the $N$ features $x_1, \dots, x_N$ are encoded and pooled using

$$G_{\mu_k} = \frac{1}{\sqrt{w_k}} \sum_{i=1}^{N} \gamma_k(x_i) \, \frac{x_i - \mu_k}{\sigma_k}$$

$$G_{\sigma_k} = \frac{1}{\sqrt{w_k}} \sum_{i=1}^{N} \gamma_k(x_i) \, \frac{1}{\sqrt{2}} \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^2} - 1 \right]$$

where $\sigma_k^2$ denotes the diagonal of $\Sigma_k$, all the divisions and squarings are element-wise operations, and

$$\gamma_k(x) = \frac{w_k u_k(x)}{\sum_{k'=1}^{K} w_{k'} u_{k'}(x)}.$$

These $2K$ vectors are concatenated to produce the final representation of dimension $2dK$, where $d$ is the dimension of the descriptors after PCA.

5. Post-processing: the vectors are L2-normalized and element-wise square-rooted using $x \mapsto \mathrm{sign}(x)\sqrt{|x|}$.
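The encoding, pooling and post-processing steps can be sketched in plain Python as follows, assuming a diagonal-covariance GMM that has already been fitted. This is an illustrative, unoptimized sketch with hypothetical function names; it omits any 1/N normalization since the final L2 normalization makes it irrelevant, and it applies the signed square root before the L2 normalization, which is an assumption about the order of the two post-processing operations.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def diag_gaussian_pdf(x: Vector, mean: Vector, var: Vector) -> float:
    """Density of a Gaussian with diagonal covariance (var = sigma^2)."""
    log_p = 0.0
    for xj, mj, vj in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vj) + (xj - mj) ** 2 / vj)
    return math.exp(log_p)


def posteriors(x: Vector, weights: Vector,
               means: Sequence[Vector], variances: Sequence[Vector]) -> List[float]:
    """Soft assignment gamma_k(x) of a feature to each Gaussian."""
    joint = [w * diag_gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # may underflow for very high-dimensional features
    return [j / total for j in joint]


def fisher_vector(features: Sequence[Vector], weights: Vector,
                  means: Sequence[Vector], variances: Sequence[Vector]) -> List[float]:
    """Concatenate the K mean gradients and the K variance gradients (2dK values)."""
    n_gauss, dim = len(weights), len(means[0])
    g_mu = [[0.0] * dim for _ in range(n_gauss)]
    g_sigma = [[0.0] * dim for _ in range(n_gauss)]
    for x in features:
        gamma = posteriors(x, weights, means, variances)
        for k in range(n_gauss):
            for j in range(dim):
                z = (x[j] - means[k][j]) / math.sqrt(variances[k][j])
                g_mu[k][j] += gamma[k] * z
                g_sigma[k][j] += gamma[k] * (z * z - 1.0) / math.sqrt(2.0)
    fv: List[float] = []
    for block in (g_mu, g_sigma):
        for k in range(n_gauss):
            fv.extend(v / math.sqrt(weights[k]) for v in block[k])
    return fv


def post_process(fv: Vector) -> List[float]:
    """Signed square root followed by L2 normalization (order is an assumption)."""
    rooted = [math.copysign(math.sqrt(abs(v)), v) for v in fv]
    norm = math.sqrt(sum(v * v for v in rooted))
    return [v / norm for v in rooted] if norm > 0 else rooted
```

A production implementation would work with log-densities throughout and vectorized array operations; the loops are kept explicit here so that each term can be matched with the formulas of step 4.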

Usually, the classification of Fisher vectors is performed using a linear classifier, as it has been shown that kernelization techniques do not significantly improve performance in such high-dimensional spaces. In our experiments, we used the Logistic Regression classifier implemented in the LibLinear library [4]. This method was preferred over Support Vector Machines because it directly outputs probabilities, which can then be used for fusion purposes.

Here, we used two types of Fisher vectors built on two different types of descriptors. The first system was built with RootSIFT descriptors, i.e. L2-normalized and square-rooted SIFT descriptors, of 128 dimensions, which are then reduced to 80 dimensions through PCA. The second one was based on complementary descriptors used in the Pl@ntNet application [17]. It consists of the concatenation of several basic descriptors such as Fourier histograms, Edge Orientation Histograms (EOH), HSV histograms and Hough transform histograms. This concatenation was then compressed and de-correlated using PCA. The combination of descriptors used depends on the organ: for Branch, Entire, Leaf, LeafScan and Stem, only Fourier, EOH and Hough histograms are used, resulting in 44-dimensional descriptors compressed to 14 dimensions after PCA, while Flower and Fruit add HSV histograms, giving descriptors of dimension 74 reduced to 38 after compression. In both systems, the GMM used to estimate the probability distribution of the features learns a codebook of 128 words.
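As a quick check of the representation sizes implied by these settings, the dimension 2dK of a Fisher vector can be computed for each descriptor type. The helper name below is hypothetical; the descriptor dimensions and the codebook size are those given in the text.

```python
def fisher_vector_dim(descriptor_dim: int, n_gaussians: int = 128) -> int:
    """Dimension 2*d*K of a Fisher vector with mean and variance gradients."""
    return 2 * descriptor_dim * n_gaussians


if __name__ == "__main__":
    settings = [
        ("RootSIFT after PCA", 80),
        ("Pl@ntNet descriptors without HSV after PCA", 14),
        ("Pl@ntNet descriptors with HSV after PCA", 38),
    ]
    for name, dim in settings:
        print(f"{name}: {fisher_vector_dim(dim)} dimensions")
```

This gives 20480 dimensions for the RootSIFT system, and 3584 or 9728 dimensions for the second system depending on the organ.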

4 Fusion methods

Combining multiple classifiers, or even multiple results from a single classifier (i.e. several images of a single observation), is a way to increase the classification quality. This section presents the three main approaches we used to merge the various results from our classifiers.

4.1 Max and Borda

Maximum and Borda Count are two approaches used to merge top-K lists. While the maximum relies on the score of each class within the lists, Borda Count uses their rank.
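Both rules can be sketched in a few lines. The snippet assumes that each list is sorted best first, that a class at rank r (starting at 1) in a top-K list receives K - r Borda points, and that a class absent from a list receives nothing from it; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

RankedList = Sequence[Tuple[str, float]]  # (class, score), best first


def max_fusion(lists: Sequence[RankedList]) -> List[Tuple[str, float]]:
    """Keep, for each class, the maximum score it reaches among the lists."""
    fused: Dict[str, float] = {}
    for ranked in lists:
        for cls, score in ranked:
            fused[cls] = max(score, fused.get(cls, float("-inf")))
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


def borda_fusion(lists: Sequence[RankedList], k: int) -> List[Tuple[str, float]]:
    """Sum, over the lists, a score that decreases as the rank increases."""
    fused: Dict[str, float] = defaultdict(float)
    for ranked in lists:
        for rank, (cls, _score) in enumerate(ranked[:k], start=1):
            fused[cls] += k - rank  # assumption: K - rank points
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    image_1 = [("A", 0.6), ("B", 0.3), ("C", 0.1)]
    image_2 = [("B", 0.5), ("D", 0.4), ("A", 0.1)]
    print(max_fusion([image_1, image_2]))
    print(borda_fusion([image_1, image_2], k=3))
```

On this toy input the two rules disagree: the maximum favours class A, which has the single highest score, while Borda Count favours class B, which is ranked consistently high in both lists.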

More precisely, the maximum-based approach associates to each class the maximum score it reaches among the different lists. In the Borda Count approach, each class within a list is associated with a score that decreases as the rank increases. In more detail, since we only retrieve the top-K most likely classes, the score of a given species $s$ is computed as follows:

$$\mathrm{score}(s) = \sum_{c \in C} \left( K - r_c(s) \right) \qquad (1)$$

where $r_c(s)$ is the rank of species $s$ returned by the classifier $c$.

4.2 Bayesian inference

Framework presentation. This fusion method is inspired by what is done in crowdsourcing multi-labeled classification tasks [21,32]. For this purpose we used the Bayesian inference framework described in Figure 1.

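[Figure 1: graphical model of the Bayesian inference framework, with the parameters matrix and the confusion matrix of each classifier k = 1, …, K, and the true class t_i and classifier answers c_i(k) of each observation i = 1, …, N.]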

In such an inference framework, we are given a set of classifiers $k \in \{1, \dots, K\}$, and a confusion matrix $\pi^{(k)}$ is assigned to each of them. Such a matrix makes it possible to evaluate the classification quality of each classifier. More precisely, $\pi^{(k)}_{i,j}$ refers to the probability that the classifier $k$, given an image, will answer class $j$ while the right class is $i$. The set of all confusion matrices is noted $\Pi$. Notice that, as presented in Figure 1, the confusion matrix $\pi^{(k)}$ is directly derived from the parameters matrix $\alpha^{(k)}$. The set of all parameters matrices is noted $A$. In parallel, each observation (i.e. the set of images corresponding to a single plant) is associated with a probability distribution over the classes, noted $t_i$ for the $i$-th observation. This probability depends on the proportion of each species in the database, and we note $\kappa$ the vector referring to this proportion. Finally, based on the probabilities $t_i$ and on the confusion matrix of a given classifier $k$, we can infer the probability of the classifier's answer for the $i$-th observation, noted $c_i^{(k)}$.

Therefore, the joint probability of this Bayesian framework follows Equation 2.

$$p(\Pi, t, c \mid A, \kappa) = \prod_{i=1}^{N} \left\{ \kappa_{t_i} \prod_{k=1}^{K} \pi^{(k)}_{t_i, c_i^{(k)}} \right\} p(\Pi \mid A) \qquad (2)$$

Once the classifiers' answers (i.e. the set of answers $c_i^{(k)}$ for all $k$ and $i$) are known, the probabilities of $A$, $\Pi$ and $t$ can be updated, thus inferring the correct class of each observation (i.e. the one with the highest probability in $t_i$). In the following, we suppose $\kappa$ known thanks to the very large size of the training set.
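To illustrate how the correct class of one observation can be inferred from the classifiers' answers, the sketch below computes the class posterior for fixed confusion matrices: each class is weighted by its proportion and by the probability that every classifier gives its observed answer when that class is the true one. Treating the confusion matrices as fixed point estimates is a simplifying assumption, and the function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def class_posterior(answers: Sequence[int],
                    confusions: Sequence[Sequence[Sequence[float]]],
                    proportions: Sequence[float]) -> List[float]:
    """Posterior over the true class of one observation.

    answers[k]          : class returned by classifier k
    confusions[k][i][j] : probability that classifier k answers j when the
                          true class is i (entries must be strictly positive)
    proportions[i]      : prior proportion of class i
    """
    log_post = []
    for i, prior in enumerate(proportions):
        lp = math.log(prior)
        for k, answer in enumerate(answers):
            lp += math.log(confusions[k][i][answer])
        log_post.append(lp)
    # Normalize in log space to limit underflow with many classifiers.
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

For instance, with two equiprobable classes, a first classifier with confusion rows (0.9, 0.1) and (0.2, 0.8) answering class 0, and a second one with rows (0.6, 0.4) and (0.4, 0.6) answering class 1, the posterior is (0.75, 0.25): the more reliable classifier dominates the decision.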

Addressing the large dimensionality. In state-of-the-art solutions, several approaches are generally proposed to compute the posterior probabilities, such as Gibbs sampling [21] or Variational Bayes [32]. In our experiments we had to face the very large dimension of the problem, each confusion matrix being of size 1000 × 1000. Classical methods are therefore intractable in our context. To address this challenge, we used a single-shot approach: only $p(t_i = j \mid \text{rest})$ is computed and used to update $A$ and $\Pi$ (recall that $\kappa$ is known and does not need to be updated). Thus, the confusion matrix of each classifier evolves as the number of identifications increases, and the quality of inference is progressively refined.

Experimental setup. In this subsection, we present three aspects of the setup: parameter initialization, parameter refinement and classifier confusion refinement.

An important part of the fusion is to learn the confusion matrix (and its parameters). To do so, we initialized each parameters matrix in $A$ with a value of $S$ on the diagonal and $S/(\text{dimension} - 1)$ in the other cells, meaning that there is a 50% probability that the classifier will be correct and that, given the correct class and a wrong one, it is more likely that the classifier will return the correct one. In our experiments the value of $S$ was fixed to 5 (the best choice among several runs).

Then, we tried to enhance the confusion matrix quality based on the training data. For each image of the set, we asked the classifiers to propose a top-30 classification again and, given the correct class $i$, we added to the cell $a_{i,j}$ of the matrices in $A$, for each species $j$ of the top-30, a value inversely proportional to its rank: $\frac{1}{\mathrm{rank}}$.
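The initialization and the rank-based refinement can be sketched as follows. Deriving the confusion matrix by normalizing each row of the parameters matrix is an assumption (the text only states that one is derived from the other), and the function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def init_parameters(dimension: int, s: float = 5.0) -> Matrix:
    """S on the diagonal and S / (dimension - 1) in every other cell."""
    off_diagonal = s / (dimension - 1)
    return [[s if i == j else off_diagonal for j in range(dimension)]
            for i in range(dimension)]


def confusion_from_parameters(params: Matrix) -> Matrix:
    """Row-normalize the parameters (assumption) to obtain probabilities."""
    confusion = []
    for row in params:
        total = sum(row)
        confusion.append([value / total for value in row])
    return confusion


def refine_with_top_list(params: Matrix, true_class: int,
                         ranked_classes: Sequence[int]) -> None:
    """Add 1 / rank to cell (true class, returned class) for a top-K answer."""
    for rank, predicted in enumerate(ranked_classes, start=1):
        params[true_class][predicted] += 1.0 / rank
```

With this initialization each row sums to 2S, so the diagonal cell carries exactly half of the mass, which is the 50% probability of a correct answer mentioned above, whatever the number of classes.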

Finally, to be as fine-grained as possible, each classifier was associated with several confusion matrices, one for each plant organ. Thus, the system knows the confusion of each classifier for all possible organs. In a way, we consider each {organ, classifier} pair as a single classifier.

5 Official Results

5.1 Runs details

Three runs were finally submitted to the LifeCLEF 2015 plant challenge:

- INRIA Zenith Run 1 is based on the results provided by the single Convolutional Neural Network fine-tuned using all provided data (CNN1), as described in Section 3.1. For observations composed of several images, image results are combined using a Max function to provide observation results.
- INRIA Zenith Run 2 is based on the Fisher vectors described in Section 3.2. To obtain observation results, we used the Borda Count algorithm.
- INRIA Zenith Run 3 is the combination of the results obtained by the previous methods (CNN and Fisher vectors) using the Bayesian inference method described in Section 4.2.

If we compare the best runs of each team, INRIA Zenith Run 1, the one using the CNN, is ranked 3rd with respect to observation results. We can note that the 4 best teams all used deep neural networks. Our second run, INRIA Zenith Run 2, the one using Fisher vectors, is disappointingly outdistanced by the CNN runs: its final score is about two times lower (0.3 instead of 0.609 for INRIA Zenith Run 1). In LifeCLEF 2014, the best performance was obtained by Fisher vectors, but the use of external training data was not allowed, which explains why CNNs were not performing better.

Our final run, INRIA Zenith Run 3, is the Bayesian inference fusion of the previous runs. It was designed to benefit from both technologies. Unfortunately, the results obtained are slightly lower than those of the standalone CNN of INRIA Zenith Run 1 (0.592 instead of 0.609). Two main reasons can be put forward to explain this quality loss. First, the two classifiers are not necessarily independent; thus, their combination does not yield a quality gain. Second, building a confusion matrix for such a high-dimensional problem (i.e. 1000 × 1000) is very challenging, and the size of the test set is not sufficient to learn an accurate confusion.

6 Conclusion

The Inria Zenith team submitted 3 runs, using different strategies. The first run was based on the well-known GoogLeNet CNN architecture, pre-trained on the ImageNet dataset and fine-tuned on the LifeCLEF data, and used a max method to fuse image results into observation results. Our second run did not use external data, and was based on Fisher vectors, which was last year's winning technology. The conclusion is that deep neural networks outperform Fisher vectors for such classification tasks, particularly with a large number of classes and when large training datasets are available. Our last run consisted in trying a new fusion method, based on Bayesian inference, to merge the results of the two previous runs. However, results were not as good as expected, probably because the first run is already two times better than the second one.

7 Appendix: Complementary Results

References

1. Cai, J., Ee, D., Pham, B., Roe, P., Zhang, J.: Sensor network for the monitoring of ecosystem: Bird species recognition. In: Intelligent Sensors, Sensor Networks and Information, 2007. ISSNIP 2007. 3rd International Conference on. pp. 293–298 (Dec 2007)

2. Cerutti, G., Tougne, L., Vacavant, A., Coquin, D.: A Parametric Active Polygon for Leaf Segmentation and Shape Estimation. In: 7th International Symposium on Visual Computing. p. 1. Las Vegas, United States (Sep 2011), https://hal.archives-ouvertes.fr/hal-00622269

3. Ellison, A.M., Farnsworth, E.J., Chu, M., Kress, W.J., Neill, A.K., Best, J.H., Pickering, J., Stevenson, R.D., Courtney, G.W., VanDyk, J.K.: Next-generation field guides (2013)

4. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large linear classification. The Journal of Machine Learning Research 9, 1871–1874 (2008)

5. Gaston, K.J., O'Neill, M.A.: Automated species identification: why not? Philosophical Transactions of the Royal Society of London B: Biological Sciences 359(1444), 655–667 (2004)

6. van Gemert, J.C., Veenman, C.J., Smeulders, A.W., Geusebroek, J.M.: Visual word ambiguity. Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(7), 1271–1283 (2010)

7. Goeau, H., Bonnet, P., Joly, A., Affouard, A., Bakic, V., Barbe, J., Dufour, S., Selmi, S., Yahiaoui, I., Vignau, C., et al.: Pl@ntnet mobile 2014: Android port and new features. In: Proceedings of International Conference on Multimedia Retrieval. p. 527. ACM (2014)

8. Goeau, H., Bonnet, P., Joly, A., Bakic, V., Barbe, J., Yahiaoui, I., Selmi, S., Carre, J., Barthelemy, D., Boujemaa, N., et al.: Pl@ntnet mobile app. In: Proceedings of the 21st ACM international conference on Multimedia. pp. 423–424. ACM (2013)

9. Goeau, H., Joly, A., Bonnet, P.: Lifeclef plant identification task 2015. In: CLEF working notes 2015 (2015)

10. Goeau, H., Joly, A., Selmi, S., Bonnet, P., Mouysset, E., Joyeux, L.: Visual-based plant species identification from crowdsourced data. In: MM'11 - ACM Multimedia 2011. pp. 0–0. ACM, Scottsdale, United States (Nov 2011), https://hal.inria.fr/hal-00642236

11. Gosselin, P.H., Murray, N., Jegou, H., Perronnin, F.: Revisiting the fisher vector for fine-grained classification. Pattern Recognition Letters 49, 92–98 (2014)

12. Hsu, T.H., Lee, C.H., Chen, L.H.: An interactive flower image recognition system. Multimedia Tools Appl. 53(1), 53–73 (May 2011), http://dx.doi.org/10.1007/s11042-010-0490-6

13. Huang, Y., Wu, Z., Wang, L., Tan, T.: Feature coding in image classification: A comprehensive study. Pattern Analysis and Machine Intelligence, IEEE Transactions on 36(3), 493–506 (2014)

14. Jegou, H., Perronnin, F., Douze, M., Sanchez, J., Perez, P., Schmid, C.: Aggregating local image descriptors into compact codes. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34(9), 1704–1716 (2012)

15. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)

16. Jiang, Y.G., Ngo, C.W., Yang, J.: Towards optimal bag-of-features for object categorization and semantic video retrieval. In: Proceedings of the 6th ACM international conference on Image and video retrieval. pp. 494–501. ACM (2007)

17. Joly, A., Goeau, H., Bonnet, P., Bakic, V., Barbe, J., Selmi, S., Yahiaoui, I., Carre, J., Mouysset, E., Molino, J.F., et al.: Interactive plant identification based on social image data. Ecological Informatics 23, 22–34 (2014)

18. Joly, A., Goeau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.P., Planque, R., Rauber, A., Fisher, R., Muller, H.: Lifeclef 2014: multimedia life species identification challenges. In: Information Access Evaluation. Multilinguality, Multimodality, and Interaction, pp. 229–249. Springer (2014)

19. Joly, A., Muller, H., Goeau, H., Glotin, H., Spampinato, C., Rauber, A., Bonnet, P., Vellinga, W.P., Fisher, B.: Lifeclef 2015: multimedia life species identification challenges

20. Kebapci, H., Yanikoglu, B., Unal, G.: Plant image retrieval using color, shape and texture features. Comput. J. 54(9), 1475–1490 (Sep 2011), http://dx.doi.org/10.1093/comjnl/bxq037

21. Kim, H.C., Ghahramani, Z.: Bayesian classifier combination. In: International conference on artificial intelligence and statistics. pp. 619–627 (2012)

22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)

23. Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., Kress, W.J., Lopez, I.C., Soares, J.V.: Leafsnap: A computer vision system for automatic plant species identification. In: Computer Vision–ECCV 2012, pp. 502–516. Springer (2012)

24. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. vol. 2, pp. 2169–2178. IEEE (2006)

25. Mouine, S., Yahiaoui, I., Verroust-Blondet, A.: Advanced shape context for plant species identification using leaf image retrieval. In: Ip, H.H.S., Rui, Y. (eds.) ICMR '12 - 2nd ACM International Conference on Multimedia Retrieval. ACM, Hong Kong, China (Jun 2012), https://hal.inria.fr/hal-00726785

26. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Computer Vision, Graphics Image Processing, 2008. ICVGIP '08. Sixth Indian Conference on. pp. 722–729 (Dec 2008)

27. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. pp. 1–8. IEEE (2007)

28. Perronnin, F., Sanchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: Computer Vision–ECCV 2010, pp. 143–156. Springer (2010)

29. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8. IEEE (2008)

30. Sanchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the fisher vector: Theory and practice. International journal of computer vision 105(3), 222–245 (2013)

31. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep Fisher networks for large-scale image classification. In: Advances in Neural Information Processing Systems (2013)

32. Simpson, E., Roberts, S., Psorakis, I., Smith, A.: Dynamic Bayesian Combination of Multiple Imperfect Classifiers. In: Decision Making with Imperfect Decision Makers. Springer (2012)

33. Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. pp. 1470–1477. IEEE (2003)

34. Spampinato, C., Mezaris, V., van Ossenbruggen, J.: Multimedia analysis for ecological data. In: Proceedings of the 20th ACM international conference on Multimedia. pp. 1507–1508. ACM (2012)

35. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014), http://arxiv.org/abs/1409.4842

36. Trifa, V.M., Kirschel, A.N.G., Taylor, C.E., Vallejo, E.E.: Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models. Journal of The Acoustical Society of America 123 (2008)

37. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. pp. 3360–3367. IEEE (2010)

38. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 1794–1801. IEEE (2009)