<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning with Noisy and Trusted Labels for Fine-Grained Plant Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Milan Sulc</string-name>
          <email>sulcmila@cmp.felk.cvut.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiri Matas</string-name>
          <email>matas@cmp.felk.cvut.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Machine Perception, Dept. of Cybernetics, Faculty of Electrical Eng., Czech Technical University in Prague</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>The paper describes the deep learning approach to automatic visual recognition of 10 000 plant species submitted to the PlantCLEF 2017 challenge. We evaluate modifications and extensions of the state-of-the-art Inception-ResNet-v2 CNN architecture, including maxout, bootstrapping for training with noisy labels, and filtering the data with noisy labels using a classifier pre-trained on the trusted dataset. The final pipeline consists of a set of CNNs trained with different modifications on different subsets of the provided training data. With the proposed approach, we were ranked as the third best team in the LifeCLEF 2017 challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The plant identification challenge PlantCLEF 2017 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a part of the LifeCLEF
activity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] organized within CLEF 2017 – The Conference and Labs of the
Evaluation Forum. The task of the challenge is automatic plant identification
using computer vision. A similar task has been the subject of previous challenges
[
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ], yet PlantCLEF 2017 aims at a significantly larger scale: recognizing plants
from 10 000 species.
      </p>
      <p>Two sets of training data, with different properties and sources but both
covering the same 10 000 plant species, were provided by the organizers:
1. A set based on the online collaborative Encyclopedia Of Life (EoL),
containing 256 287 images and corresponding XML files with meta-information. An
important field in the meta-information is the "Observation ID", an
identifier connecting images of the same specimen (object of observation).
This dataset is considered "trusted", i.e. the ground truth labels should all be
assigned correctly.
2. A noisy training set built using web crawlers, or more precisely, obtained by
Google and Bing image search. It thus contains images not related to the given
plant species. This set is provided in the form of a list of more than 1442k
image URLs. We obtained nearly 1405k images from the list; the remaining
images failed to download.</p>
      <p>The evaluation is performed on a test set containing 25 170 images of 13 471
observations (specimens).</p>
      <p>The rest of the paper is structured as follows: the deep learning approach and
all proposed modifications are described in Section 2. Preliminary experiments
are described and their evaluation is discussed in Section 3. Post-processing steps
are described in Section 4. The run files submitted to PlantCLEF are listed in Section 5.
Conclusions are drawn in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>The Proposed Methods</title>
      <p>
        In recent years, deep Convolutional Neural Networks (CNNs) have become the
core of state-of-the-art solutions to many computer vision tasks, especially those
related to recognition and detection of objects. This is also the case for plant
recognition, where in the previous PlantCLEF challenges 2015 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and 2016 [
        <xref ref-type="bibr" rid="ref3 ref5">5,3</xref>
        ]
the deep learning submissions [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref6 ref7 ref8 ref9">6,7,8,9,10,11,12</xref>
        ] outperformed combinations of
hand-crafted methods significantly.
      </p>
      <sec id="sec-2-1">
        <title>Inception-ResNet-v2</title>
        <p>The submitted model is based on the state-of-the-art convolutional neural
network architecture, the Inception-ResNet-v2 model [13], which introduced residual
Inception modules, i.e. Inception modules with residual connections. Both the
paper [13] and our preliminary experiments show that this network architecture
leads to superior results compared with other state-of-the-art CNN
architectures. The publicly available TensorFlow model (https://github.com/tensorflow/models/blob/master/slim/README.md#pre-trained-models) pretrained on ImageNet was
used for the initial values of the network parameters. The main hyperparameters were
set as follows:</p>
      </sec>
      <sec id="sec-2-2">
        <title>Hyperparameters</title>
        <p>Optimizer: RMSProp with momentum 0.9 and decay 0.9. Weight decay: 0.00004.
Learning rate: starting LR 0.01, exponential decay with factor 0.94, ending LR 0.0001.
Batch size: 32.</p>
      </sec>
      <sec id="sec-2-6">
        <title>MaxOut</title>
        <p>We experimented with adding maxout to the end of the network, which was
helpful in our submission to PlantCLEF 2016: an additional fully-connected
(FC) layer was added on top of the network, before the classification FC layer.
The activation function in the added layer is maxout [14], the maximum over slices
of the layer:
h_i(x) = max_{j ∈ [1,k]} z_ij,   (1)
where z_ij = x^T W_:ij + b_ij is a standard FC layer with parameters W ∈ R^(d×m×k),
b ∈ R^(m×k).</p>
        <p>One can understand maxout as a piecewise linear approximation to a convex
function, specified by the weights of the previous layer. This is illustrated in
Figure 1.</p>
        <p>We added an FC layer with 4096 units. The maxout activation operates over
k = 4 linear pieces of the FC layer, i.e. m = 1024. Dropout with a keep
probability of 80% is applied before the FC layers. The final layer is a 10000-way softmax
classifier corresponding to the number of plant species to be recognized.</p>
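        <p>The maxout activation of equation (1) can be sketched in a few lines. The following is an illustrative NumPy sketch with toy dimensions, not the authors' TensorFlow implementation:</p>

```python
import numpy as np

def maxout(x, W, b, k):
    """Maxout FC layer: pre-activations z = x^T W + b are reshaped to
    (m, k) and the output is the maximum over the k linear pieces,
    h_i = max_j z_ij (equation (1))."""
    z = x @ W + b            # (m*k,) pre-activations of the FC layer
    z = z.reshape(-1, k)     # (m, k): m maxout units, k pieces each
    return z.max(axis=1)     # max over the k pieces per unit

# Toy example with d=8 inputs, m=4 maxout units, k=4 pieces
# (the paper uses m=1024, k=4 on a 4096-unit FC layer)
rng = np.random.default_rng(0)
d, m, k = 8, 4, 4
x = rng.standard_normal(d)
W = rng.standard_normal((d, m * k))
b = rng.standard_normal(m * k)
h = maxout(x, W, b, k)
```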
        <p>We observed that the additional FC layer has to be batch normalized
[15]. Without normalization, the architecture becomes unstable with the default
setting of hyperparameters, leading to an unexpected drop in accuracy.</p>
        <p>In order to improve learning from noisy labels, Reed et al. [16] proposed a simple
consistency objective, which does not require explicit information about the
noise distribution.</p>
        <p>Intuitively, the new objectives take into account the current predictions of
the network, lowering the damage done by incorrect labels. Reed et al. propose two
variants of the objective, denoted as bootstrapping, for consistency in multi-class
prediction:</p>
        <p>Soft bootstrapping uses the probabilities q_k estimated by the network
(softmax):
L_soft(q, t) = Σ_{k=1..N} [β t_k + (1 − β) q_k] log q_k.   (2)
Reed et al. [16] point out that this objective is equivalent to softmax
regression with minimum entropy regularization, which was previously studied in
[17]; it encourages high confidence in predicting labels.</p>
        <p>Hard bootstrapping uses the strongest prediction, z_k = 1 if k = argmax_i q_i
and z_k = 0 otherwise:
L_hard(q, t) = Σ_{k=1..N} [β t_k + (1 − β) z_k] log q_k.   (3)</p>
        <p>The experiments of [16] show that the two objectives improve learning in the
case of label noise, achieving the best accuracy with hard bootstrapping. We
decided to follow the result of [16] and use hard bootstrapping with β = 0.8
in our experiments. The search for the optimal value of β was omitted for
computational reasons and the limited time for the competition, yet the dependence
between the amount of label noise and the optimal setting of the hyperparameter β
is an interesting topic for future work.</p>
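        <p>The hard bootstrapping objective with β = 0.8 can be sketched as follows. This is an illustrative NumPy reading of [16] for a single example, written as a cross-entropy to be minimized; it is not the authors' TensorFlow training code:</p>

```python
import numpy as np

def hard_bootstrap_loss(q, t, beta=0.8):
    """Per-example hard bootstrapping loss, following Reed et al. [16].

    q: (N,) softmax probabilities; t: (N,) one-hot (possibly noisy) label.
    The target mixes the given label with the network's strongest prediction.
    """
    z = np.zeros_like(q)
    z[np.argmax(q)] = 1.0                       # strongest current prediction z_k
    target = beta * t + (1.0 - beta) * z        # bootstrapped target
    return -np.sum(target * np.log(q + 1e-12))  # cross-entropy against the mix

# Toy example: the network is confident in class 2, the (noisy) label says class 0
q = np.array([0.2, 0.1, 0.7])
t = np.array([1.0, 0.0, 0.0])
loss = hard_bootstrap_loss(q, t)
```

        <p>With β = 1 the loss reduces to the usual cross-entropy against the given label; with β = 0 it trusts only the network's own prediction.</p>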
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We used a subset of the test data from the previous year's PlantCLEF 2016
challenge to thoroughly evaluate the proposed methods. We only used 2583 images
from the previous year's dataset, for which we found species correspondences in
the 2017 task. This small validation set covers only a small subset of the classes,
but should be sufficient for an approximate evaluation of the method.</p>
      <p>The sections below describe the experiments and the corresponding design choices.</p>
      <sec id="sec-3-1">
        <title>Fine-tuning vs. Training from Scratch</title>
        <p>The first issue tested was whether the network should be trained from scratch,
or fine-tuned from an ImageNet-pretrained model. We compared the two
scenarios by training only on the "trusted" dataset. As illustrated in Figure 2,
training from scratch converges very slowly. After 150k iterations (mini-batches),
fine-tuning leads to 65.1% accuracy, while training from scratch only gets to
44.5%. For illustration, 150k training iterations take ca. 65 hours on an NVIDIA
Titan X GPU. Therefore, we decided in favor of fine-tuning.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Training on Trusted and Noisy Data</title>
        <p>We fine-tuned the system with the different settings described in Section 2 on the
"trusted" (EOL) data only, as well as on the combination of both "trusted"
and "noisy" data (EOL+WEB). Soft and hard bootstrapping were used
for training with the "noisy" data. Figure 3 shows that after 200k iterations, the
networks trained only on the "trusted" data performed slightly better. The two
best performing networks trained on the "trusted" (EOL) dataset will be used
in the follow-up experiments.</p>
        <p>In order to filter out wrongly labeled examples from the "noisy" part of the
training set, we used the network pretrained on the "trusted" set (from Section
3.2) to predict the labels from the images. Only images where the network prediction
was equal to the label were kept in the "filtered noisy" dataset. This reduced
the size of the "noisy" set from ca. 1405k images to ca. 425k images.</p>
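        <p>The filtering step can be sketched as follows. This is illustrative Python; `predict_class` is a hypothetical stand-in for the top-1 prediction of the network pre-trained on the trusted set:</p>

```python
def filter_noisy(samples, predict_class):
    """Keep only noisy-set samples whose given label matches the top-1
    prediction of a classifier pre-trained on the trusted set.

    samples: iterable of (image, label) pairs from the noisy set.
    predict_class: callable image -> predicted class id (hypothetical
    stand-in for the model fine-tuned on the trusted EOL data).
    """
    return [(img, lbl) for img, lbl in samples if predict_class(img) == lbl]

# Toy example: a "model" that always predicts class 0 keeps only class-0 samples
samples = [("img_a", 0), ("img_b", 1), ("img_c", 0)]
kept = filter_noisy(samples, lambda img: 0)
```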
        <p>Let us denote the two networks fine-tuned on the "trusted" (EOL) dataset
in Section 3.2 as follows:
– Net #1: Fine-tuned on the "trusted" (EOL) set without maxout for 200k
iterations.
– Net #2: Fine-tuned on the "trusted" (EOL) set with maxout for 200k
iterations.</p>
        <p>
          Further fine-tuning was performed from these models pre-trained (fine-tuned)
on the "trusted" set. In order to perform bagging from several networks, we
divide the data into 3 disjoint folds. Then each setting is used to further
fine-tune three networks, each on a different 2 of the 3 folds. Each network is further
fine-tuned for 50k iterations.
– Net #3, #4, #5: Fine-tuned from #1 for 50k iterations on the "trusted"
dataset.
– Net #6, #7, #8: Fine-tuned from #2 for 50k iterations on the "trusted"
dataset, with maxout.
As shown by the previous year's challenge winner [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and confirmed by the
experiments described in this report, averaging the predictions over images of
the same observation (specimen) increases accuracy significantly. Therefore we
also average the scores per observation in all submitted run files.
Given the fact that we are evaluating the whole test set of images, we decided
to experiment with adjusting the prediction distribution over the test set. Some
plant species are certainly much rarer to observe than others. We assumed that
the species in the test set might not follow the same distribution as the species
in the training set. We computed the prior p(K) for each class K among the
observations in the "trusted" dataset, and estimated the prior p_t(K) on the
test set. Let q(K|X) be the prediction confidence for class K, given input image
X. The final prediction, taking into account the possible shift in the distributions,
was:
q'(K|X) = q(K|X) · sqrt( p(K) / p_t(K) ),   (4)
where the square root is used to make the adjustment less severe.
        </p>
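        <p>The adjustment of equation (4) can be sketched as follows. This is an illustrative NumPy sketch with toy priors; the renormalization at the end is an added assumption for readability, not stated in the text:</p>

```python
import numpy as np

def adjust_predictions(q, p_train, p_test):
    """Re-weight per-class confidences q by sqrt(p_train / p_test), as in
    equation (4), then renormalize so the scores sum to one again
    (the renormalization is an assumption of this sketch)."""
    q_adj = q * np.sqrt(p_train / p_test)
    return q_adj / q_adj.sum()

# Toy example with 3 classes: class 0 is rarer in the test set than in training,
# so its confidence gets boosted by sqrt(0.5 / 0.25)
q = np.array([0.5, 0.3, 0.2])
p_train = np.array([0.5, 0.25, 0.25])
p_test = np.array([0.25, 0.5, 0.25])
q_new = adjust_predictions(q, p_train, p_test)
```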
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Description of the Submitted Run Files</title>
      <p>In PlantCLEF 2017, each participant is allowed to submit up to four run files
with the results. We submitted the following run files:
– CMP Run 1 combines all 17 networks by summing their results.
– CMP Run 2 uses the prediction distribution adjustment from Section 4.2 on
top of the results from the first run file.
– CMP Run 3 combines only the networks trained on the "trusted" data.
– CMP Run 4 again adds the prediction distribution adjustment on top of the
results from the third run file.
The difficulties of the challenge lie in the high number of classes, high intra-class
variations, small inter-class variations, and learning from noisy data downloaded
by web crawlers.</p>
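      <p>The Run 1 style combination of networks amounts to summing their per-class scores. A minimal illustrative NumPy sketch (not the submission code):</p>

```python
import numpy as np

def combine_runs(score_matrices):
    """Sum per-class scores from several networks (CMP Run 1 style) and
    rank classes by the combined score.

    score_matrices: list of (n_images, n_classes) arrays, one per network.
    Returns the combined scores and the top-1 class id per image.
    """
    total = np.sum(score_matrices, axis=0)   # elementwise sum over networks
    return total, np.argmax(total, axis=1)   # combined scores, top-1 classes

# Toy example: two "networks", 2 images, 3 classes
a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
b = np.array([[0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
total, top1 = combine_runs([a, b])
```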
      <p>To overcome these difficulties, we employed a state-of-the-art deep learning
architecture and compared a number of approaches to increase the accuracy of
very fine-grained classification when learning from noisy data. The results of the
challenge are depicted in Figure 5. Based on our evaluation, the following steps
increase the classification accuracy:
– Maxout [14] with batch normalisation [15] of the added FC layer.
– Filtering the noisy data using a model trained on a trusted database.
– Bagging of several networks fine-tuned under different conditions.
Adjusting the species distribution on the test set, on the other hand,
decreased the recognition accuracy noticeably.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Milan Sulc was supported by the Electrolux Student Support Programme and by
CTU student grant SGS17/185/OHK3/3T/13. Jiri Matas was supported by the
Czech Science Foundation project GACR P103/12/G084.</p>
      <p>13. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4,
Inception-ResNet and the impact of residual connections on learning. arXiv preprint
arXiv:1602.07261, 2016.
14. Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua
Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
15. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. arXiv preprint
arXiv:1502.03167, 2015.
16. Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru
Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with
bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
17. Yves Grandvalet and Yoshua Bengio. Entropy regularization. Semi-Supervised
Learning, pages 151–168, 2006.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Herve Goeau, Pierre Bonnet, and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>Plant identification based on noisy web data: the amazing performance of deep learning (lifeclef 2017)</article-title>
          .
          <source>In CLEF working notes</source>
          <year>2017</year>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          , Herve Goeau, Herve Glotin, Concetto Spampinato, Pierre Bonnet,
          <string-name>
            <surname>Willem-Pier</surname>
            <given-names>Vellinga</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jean-Christophe</surname>
            <given-names>Lombardo</given-names>
          </string-name>
          , Robert Planque, Simone Palazzo, and Henning Muller.
          <article-title>Lifeclef 2017 lab overview: multimedia species identification challenges</article-title>
          .
          <source>In Proceedings of CLEF</source>
          <year>2017</year>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          , Herve Goeau, Herve Glotin, Concetto Spampinato, Pierre Bonnet,
          <string-name>
            <surname>Willem-Pier</surname>
            <given-names>Vellinga</given-names>
          </string-name>
          , Julien Champ, Robert Planque, Simone Palazzo, and Henning Muller.
          <source>Lifeclef</source>
          <year>2016</year>
          :
          <article-title>multimedia life species identification challenges</article-title>
          .
          <source>In Proceedings of CLEF</source>
          <year>2016</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Herve Goeau, Pierre Bonnet, and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>Lifeclef plant identification task 2015</article-title>
          . In Working Notes of CLEF 2015 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          . CEUR-WS,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Herve Goeau, Pierre Bonnet, and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>Plant identification in an open-world (lifeclef 2016)</article-title>
          .
          <source>In CLEF working notes</source>
          <year>2016</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Plant identification with deep convolutional neural network: Snumedinfo at lifeclef plant identification task 2015</article-title>
          . In Working Notes of CLEF 2015 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          . CEUR-WS,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>ZongYuan</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chris</surname>
            <given-names>McCool</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Condrad</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Corke</surname>
          </string-name>
          .
          <article-title>Content specific feature learning for fine-grained plant classification</article-title>
          .
          <source>In Working Notes of CLEF</source>
          <year>2015</year>
          <article-title>- Conference and Labs of the Evaluation forum</article-title>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          . CEUR-WS,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Julien</given-names>
            <surname>Champ</surname>
          </string-name>
          , Titouan Lorieul, Maximilien Servajean, and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>A comparative study of fine-grained classification methods in the context of the lifeclef plant identification challenge 2015</article-title>
          . In Working Notes of CLEF 2015 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          . CEUR-WS,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Angie</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Reyes</surname>
          </string-name>
          ,
          <string-name>
            <surname>Juan C. Caicedo</surname>
            , and
            <given-names>Jorge E.</given-names>
          </string-name>
          <string-name>
            <surname>Camargo</surname>
          </string-name>
          .
          <article-title>Fine-tuning deep convolutional networks for plant recognition</article-title>
          .
          <source>In Working Notes of CLEF</source>
          <year>2015</year>
          <article-title>- Conference and Labs of the Evaluation forum</article-title>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          . CEUR-WS,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Milan</surname>
            <given-names>Sulc</given-names>
          </string-name>
          , Dmytro Mishkin, and
          <string-name>
            <given-names>Jir</given-names>
            <surname>Matas</surname>
          </string-name>
          .
          <article-title>Very deep residual networks with maxout for plant identification in the wild</article-title>
          .
          <source>In Working notes of CLEF 2016 conference</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Mostafa Mehdipour Ghazi, Berrin Yanikoglu, and
          <string-name>
            <given-names>Erchan</given-names>
            <surname>Aptoula</surname>
          </string-name>
          .
          <article-title>Open-set plant identification using an ensemble of deep convolutional neural networks</article-title>
          .
          <source>Working notes of CLEF</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Siang Thye Hang, Atsushi Tatsuma, and
          <string-name>
            <given-names>Masaki</given-names>
            <surname>Aono</surname>
          </string-name>
          .
          <article-title>Bluefield (kde tut) at lifeclef 2016 plant identification task</article-title>
          .
          <source>In Working notes of CLEF 2016 conference</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>