Non-local DenseNet for PlantCLEF 2019 Contest

Dat Nguyen Thanh, Georges Quénot, and Lorraine Goeuriot

Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France
datnt.hust59@gmail.com, georges.quenot@imag.fr, lorraine.goeuriot@imag.fr

Abstract. Image-based plant identification is a promising tool for the automation of agriculture and environmental conservation. As an attempt to tackle the data-deficiency challenge of PlantCLEF 2019, the DenseNet architecture, which offers competitive performance with a relatively low number of parameters, is augmented with a non-local block. A variety of data sampling schemes are also evaluated as part of the work. The evaluation of the model and the methods is detailed in the paper.

Keywords: DenseNet · Non-local block · Plant Identification.

1 Introduction

Various types of plants grow all around us, yet few among us are plant experts. Knowing which plants are available and where they grow would be extremely helpful in pharmacy, from both production and academic perspectives, and in environmental protection. With the rise of machine learning, artificial neural networks and convolutional neural networks are able to perform at near-human capability in image processing tasks. The popular use cases of such technologies automate tasks at which humans already excel: face recognition, image classification, etc. Still, it would be highly beneficial to leverage these technologies for a task at which most humans do not yet excel: plant identification.

Image-based plant identification can be formulated as a plant classification problem, where the input is an image containing the plant and the output is the identifier of the plant, taken from a set pre-defined by the user.
With the PlantCLEF contest formulated as an image-classification task, it benefits from the drastic improvements that deep learning based methods have brought to image classification in general. The overview of PlantCLEF 2017 [2] shows that the best competitors reached over 90% accuracy with such methods. Notably, in the LifeCLEF 2018 contest [3], quite a number of systems achieved accuracy comparable to that of the top experts.

In this work, we present our methods for PlantCLEF 2019 [4], which is part of LifeCLEF 2019 [1] and focuses on 10,000 species from data-deficient regions. The rest of this paper is structured as follows: Section 2 gives an overview of related work on deep-learning-based automatic plant identification from previous contests, Section 3 describes the proposed architecture for prediction, Section 4 details the data augmentation and data sampling schemes along with the experiments, and Section 5 concludes our work. The source code and trained models are available at https://github.com/datvo06/PlantCLEF2019MRIM.

Supported by MRIM Team.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

2 Related Work

Ever since AlexNet [9] won the ImageNet classification competition in 2012, Convolutional Neural Networks (CNNs) have been at the center of image classification. Following AlexNet, there have been three lines of research on CNNs: modifying the operations in the CNNs, dividing the networks into several sub-modules and improving each of them, and altering the information flow by adding connections.
Fine-tuning modules and adding auxiliary loss. The Inception models [12,13,11] follow the principle of repeating carefully designed blocks of filters stacked horizontally (the filters receive the same input and their output feature maps are concatenated). With each new version of Inception, the authors optimize some of these blocks to reduce the computation, the memory consumption and the number of parameters. Inception-v1 was used as the baseline of PlantCLEF 2017, achieving a top-1 accuracy of 0.513.

Adding residual connections. One problem with the original deep neural networks is that the more layers are added, the more prone the model is to vanishing gradients. Various works have been proposed to amend this problem (e.g., LSTM [7] for sequential input and highway networks [7], which introduce a gating mechanism for ANNs). For convolutional neural networks, the additive residual connection proposed in [5] is one of them. The authors later carefully analyzed the effect of the order of operations within each residual block, resulting in [6]. A modified version of ResNet used in PlantCLEF 2017 [2] achieved the best score among non-ensemble runs, with a top-1 accuracy of 0.853.

Combining Inception and ResNet. The Inception design and the ResNet design were first merged in the Inception-ResNet design [11]. The network architecture is still based on the original principle of carefully designed blocks; the authors added residual connections to a few variants of the Inception blocks. Inception-ResNet v2 achieved a score similar to that of the modified ResNet in PlantCLEF 2017 [2], with an MRR of 0.847, a top-1 accuracy of 0.794 and a top-5 accuracy of 0.913, and was used by the majority of participants in PlantCLEF 2018.

Ensemble prediction. The top performer of PlantCLEF 2017 [10] combined the predictions of multiple models with bagged averaging; the models used were ResNet, ResNeXt [10] and Inception-v1.
DenseNet. As residual connections have been shown to allow better gradient flow and to boost the performance of convolutional neural networks, the authors of DenseNet [8] tested the idea of densely connected layers. The model is capable of achieving state-of-the-art accuracy in classification tasks with a relatively low number of parameters, making it a potentially good baseline for the data-deficient context. For this reason, we choose DenseNet as the baseline model.

Data sampling schemes. To the best of our knowledge, there is little existing work on data sampling for training that overlaps with the strategies proposed here.

3 Model Architecture

3.1 Non-local Networks

The non-local neural network [14] was proposed to address the limited propagation of information in CNNs and LSTMs. The idea is to compute correlations between pixels at different positions in the feature maps, which yields a more powerful pixel-wise representation. The non-local operation, according to [14], is defined as:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} d(x_i, x_j)\, h(x_j) \qquad (1)$$

where $i$ is the index of a position on the output feature maps (in space, time, or spacetime in the original case of video classification and annotation), $j$ is the index of a position on the input feature maps $x$, and $d$ computes a scalar representing the pairwise relationship between the entities located at these two positions. The function $h$ computes a representation of the input at position $j$, and $C(x)$ is a normalization factor. The next section describes where the non-local block is added to the DenseNet baseline.

3.2 Adding Non-local operation to the DenseNet

The non-local operation is added between the output of the third dense block and the 1 × 1 channel-squashing convolution.
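As an illustration of Eq. (1), the following minimal sketch applies the non-local operation to a flattened list of feature vectors. It is not the block used in this work: the names `gaussian_similarity` and `non_local`, the Gaussian pairwise function $d(x_i, x_j) = \exp(x_i \cdot x_j)$, the identity $h$, and the choice $C(x) = \sum_j d(x_i, x_j)$ are assumptions made for illustration (the last one follows the Gaussian instantiation of [14]), whereas the actual block relies on learned 1 × 1 convolutions.

```python
import math


def gaussian_similarity(a, b):
    """Assumed pairwise function d: exp of the dot product of two feature vectors."""
    return math.exp(sum(p * q for p, q in zip(a, b)))


def non_local(x, d=gaussian_similarity, h=lambda v: v):
    """Compute y_i = (1 / C(x)) * sum_j d(x_i, x_j) * h(x_j) for every position i.

    x is a list of feature vectors, one per position of the flattened feature map.
    """
    hx = [h(xj) for xj in x]                 # representation of every position j
    dim = len(hx[0])
    y = []
    for xi in x:
        weights = [d(xi, xj) for xj in x]    # pairwise relationship with every j
        c = sum(weights)                     # assumed normalization C(x)
        y.append([sum(w * v[k] for w, v in zip(weights, hx)) / c
                  for k in range(dim)])
    return y
```

With a constant $d$, every output position reduces to the mean of $h(x_j)$ over all positions, which makes the global nature of the operation explicit: each output depends on the whole feature map rather than on a local neighborhood.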
The non-local block was added after the third dense block for several reasons:
– First, in the original introduction of the non-local block [14], multiple positions for the non-local block were tested, of which the best was after the third residual convolution block.
– Second, following the mechanism of self-attention, the non-local block performs a pairwise dot product between two transformations of every pair of pixels on the grid. It is therefore necessary to place a few convolution blocks before the non-local block, so that the operation can leverage information from local neighbors.

Fig. 1. Non-local block. f(x), g(x), h(x) are three 1 × 1 convolutions, where f(x), g(x) are channel-squashing functions.

Fig. 2. Placing the non-local block within DenseNet.

3.3 Ensemble prediction

For the final predictions, each observation has multiple samples, so an intermediate layer is needed to aggregate the predictions of multiple models over multiple samples. For this, a two-level pooling is used: the first level ensembles the predictions of the different trained model instances, and the second level aggregates the predictions over the multiple samples of an observation.

Fig. 3. Ensemble prediction using two-level pooling.

4 Experiments and Results

4.1 Data Augmentation

Several data augmentation strategies have been applied:
– Random resizing of the image
– Random cropping
– Random horizontal flipping
– Random alteration of the brightness and contrast

Fig. 4. Illustration of data augmentations.

4.2 Data Sampling

Notation:
– $N$: total number of samples
– $n_i$: number of samples of the $i$-th class
– $o_i$: oversampling factor of the $i$-th class
– $w_i$: sampling weight of the $i$-th class
– $m$: the median number of samples per class
– $\mu$: the mean number of samples per class.
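The dataset-level quantities in this notation can be computed directly from the list of training labels. The sketch below is only an illustration; the function name `class_statistics` and the keys of the returned dictionary are hypothetical and are not taken from the released code.

```python
from collections import Counter
from statistics import mean, median


def class_statistics(labels):
    """Return N, the per-class counts n_i, the median m and the mean mu."""
    counts = Counter(labels)                   # n_i for every class i
    sizes = list(counts.values())
    return {
        "N": len(labels),                      # total number of samples
        "n": dict(counts),                     # number of samples per class
        "m": median(sizes),                    # median number of samples per class
        "mu": mean(sizes),                     # mean number of samples per class
        "imbalance": max(sizes) / min(sizes),  # largest-to-smallest class ratio
    }
```

The extra `imbalance` entry is the ratio between the largest and the smallest class, which is the quantity the sampling strategies aim to reduce.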
Minimum Threshold Resampling. This strategy only augments the classes that have fewer samples than the average number of samples per class. For each such class with $n_i$ samples, the oversampling factor $o_i$ is set to $\mu / n_i$. The oversampling might make some samples of a class appear too many times compared to the others, making the model prone to overfitting, so on each epoch the samples of each class are reshuffled and resampled. Another problem is that the training time is prolonged by the increase in the number of samples. For this reason, another strategy is also applied, which is described below.

Smoothed Re-sampling. This strategy partly oversamples small classes while also subsampling classes with a large number of samples. All of the aforementioned parameters are constant during training. The total number of samples used in each epoch of the training session stays the same: $N$. On each epoch, however, each class is under-sampled or oversampled based on its weight $w_i$, and the total weight is normalized so that the total number of samples is always equal to $N$: $N = \sum_i w_i o_i n_i$. We now turn to how the factors $o_i$ and $w_i$ are chosen. With $m = 10$ for example, the oversampling factor $o_i$ is first applied to all classes. The oversampling ensures a minimal number of samples per class, with diversity provided by data augmentation.
– $o_i = 1$ for $n_i > m$ (no oversampling beyond the median).
– $o_i = (1 + m/n_i)/2$ for $n_i \le m$ (oversampling so that the effective size $o_i n_i = (n_i + m)/2$ varies linearly between $m/2$ and $m$).

Oversampling reduces the imbalance from about 1000:1 to about 100:1. Weighting further improves the balance using a power law:

$$w_i = (o_i n_i)^{\gamma - 1}$$

– With $\gamma = 1.0$, no weighting, original case (except for the oversampling effect).
– With $\gamma = 0.5$, weighting further reduces the imbalance from about 100:1 to about 10:1.
– With $\gamma = 0.25$, weighting further reduces the imbalance from about 100:1 to about 3:1.
– With $\gamma = 0.15$, weighting further reduces the imbalance from about 100:1 to about 2:1.

In all cases, the weights are re-normalized (each $w_i$ is divided by the same value so that $\sum_i w_i o_i n_i = N$).

4.3 Experiment Results on the PlantCLEF 2017

All the candidate models were trained on PlantCLEF 2017 for preliminary testing before being used on PlantCLEF 2019. The models are trained on the EOL set and tested on the Web dataset with the data augmentation strategies described in Subsection 4.1. The results are shown in Table 1. The non-local block increases the top-1 accuracy of both DenseNet-121 (from 0.8126 to 0.8618) and DenseNet-201 (from 0.8515 to 0.8744), and DenseNet slightly outperforms ResNet.

Table 1. Evaluation of trained models on PlantCLEF 2017 Web.

Model                    Top 1 Accuracy
ResNet-18                0.5111
ResNet-152               0.7888
ShuffleNet               0.7222
DenseNet-121             0.8126
Non-local DenseNet-121   0.8618
DenseNet-201             0.8515
Non-local DenseNet-201   0.8744

4.4 Experiment results on PlantCLEF 2019

Initial result. The models are further tested on the PlantCLEF 2019 dataset. The initial results are shown in Table 2. Compared with PlantCLEF 2017, a drastic performance drop can be observed.

Table 2. Model Performance on PlantCLEF 2019 Validation Set.

Model                    Top 1 Accuracy
DenseNet-121             0.2510
DenseNet-201             0.3503
Non-local DenseNet-201   0.4525

Further inspection of the dataset reveals some challenging properties:
1. The classes are imbalanced.
2. Samples repeated across classes make the learning harder.
3. Some samples are noisy.

Experiment Results on the Class-Filtered PlantCLEF 2019. We first test the effects of the following strategies:
– Temporarily removing all the classes with fewer than 5 samples.
– Further removing noisy or incorrectly formatted images.

The result is an 8,500-class dataset that still contains over 400,000 samples. The evaluation of the models is shown in Table 3. The results do not differ much from the initial ones.
Experiment Results on the Repetition-Filtered PlantCLEF 2019. Further experiments are performed on the dataset with different thresholds for repetition. The following training/validation split strategy is applied: for each class, at least $\lceil n_{\text{samples}}/5 \rceil$ samples are taken as part of the validation set; if a class has only one sample, the training set for that class is empty. Here, the minimum threshold sampling is applied. The evaluation results are shown in Table 4.

Table 3. Model evaluation from small-class-filtered dataset.

Model                    Top 1 Accuracy   Additional Condition
DenseNet-121             0.3020           None
DenseNet-201             0.4220           None
DenseNet-201             0.4890           Balanced Sampling
Non-local DenseNet-201   0.5215           Oversampling data-deficient classes

Table 4. Filtering out inter-class repeated samples makes training and validating set different.

Max repetitions   Number of empty classes   Top 1 Training   Top 1 Testing
1                 1539                      0.8425           0.1925
2                 1040                      0.6530           0.1512
3                 778                       0.6930           0.1451

Removing all the repeated samples creates empty classes, which makes the training and validation sets differ heavily and makes it hard to validate the model.

Experiment Results on conditional repetition filtered PlantCLEF 2019. The final attempt at filtering the dataset removes all repeated samples unless doing so would create an empty class. The statistics of the resulting dataset are given in Table 5.

Table 5. Conditional Repetition Filtered PlantCLEF 2019. Attributes of Dataset

Attribute                             Value
Number of classes                     10,000
Number of samples                     279,832
Mean number of samples                27.98
Minimum number of samples per class   1
Median number of samples              5
Max number of samples per class       1202
Number of unique samples              278,906
Number of samples duplicated          158

With all the repeated samples trimmed, the distribution is still quite imbalanced. Figure 5 shows the distributions obtained with the Smoothed Re-sampling strategy.
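To make the effect of $\gamma$ on such an imbalanced distribution concrete, the Smoothed Re-sampling rules of Section 4.2 can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `smoothed_class_sizes`, its signature and the toy class sizes mentioned afterwards are assumptions. The function returns the expected number of samples drawn per class in one epoch, $w_i o_i n_i$, after normalizing the weights so that the total equals $N$.

```python
def smoothed_class_sizes(class_sizes, m=10, gamma=0.5):
    """Return the per-class number of samples per epoch, w_i * o_i * n_i."""
    total = sum(class_sizes)                                   # N
    # Oversampling factor o_i: 1 above m, (1 + m / n_i) / 2 otherwise.
    over = [1.0 if n > m else (1.0 + m / n) / 2.0 for n in class_sizes]
    # Power-law weight w_i = (o_i * n_i) ** (gamma - 1), before normalization.
    weights = [(o * n) ** (gamma - 1.0) for o, n in zip(over, class_sizes)]
    raw = [w * o * n for w, o, n in zip(weights, over, class_sizes)]
    # Divide every weight by the same value so that the sizes sum to N.
    scale = total / sum(raw)
    return [r * scale for r in raw]
```

For toy class sizes of 1, 10 and 1000 with $m = 10$, the largest-to-smallest ratio is $1000/5.5 \approx 182$ with $\gamma = 1$ and its square root, about 13.5, with $\gamma = 0.5$, because the effective class size is proportional to $(o_i n_i)^{\gamma}$.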
Since this is the final attempt, the whole dataset has to be used for training, so external datasets have to be used for testing. Further inspection of the PlantCLEF 2017 dataset reveals that there are 551 categories in common between the PlantCLEF 2017 and PlantCLEF 2019 datasets. The samples are sorted by size and filtered to avoid having them in the training set. The statistics of the dataset are shown in Table 6.

Fig. 5. Effect of different γ. (a) γ = 0.5 (b) γ = 0.25

Table 6. Validation Dataset Statistics.

Dataset             PlantCLEF 2017 EOL Common   PlantCLEF 2017 Web Common
Number of classes   551                         551
Number of samples   10,803                      63,242

The final results obtained on these datasets before submission are given in Table 7.

Table 7. Non-local DenseNet 201 Evaluation on PlantCLEF 2017 Common Categories.

Training Set                          γ       No. instances   Pooling   2017 EOL   2017 Web   2017 EOL + Web
Conditional Filtered PlantCLEF 2019   0.25    4               Mean      0.9171     0.6635     0.6983
                                      0.25    4               Max       0.9169     0.6637     0.6984
                                      0.5     4               Mean      0.9455     0.6970     0.7311
                                      0.5     4               Max       0.9413     0.6941     0.7280
                                      1       2               Mean      0.9138     0.6476     0.6842
                                      1       2               Max       0.9011     0.6338     0.6705
                                      Mixed   10              Mean      0.9478     0.6957     0.7303
                                      Mixed   10              Max       0.9371     0.6812     0.7163
All Data                              No      1               No        0.7852     0.5497     0.5821

With the same model trained for the same number of epochs, the filtering strategies make a difference: the ensemble of 4 models trained with γ = 0.5 gives the best performance on the Web and EOL + Web sets. The model trained with all the data from PlantCLEF 2019 is also evaluated for comparison.

Final Test Results. The final results are given by the top-1 accuracy on the test dataset and on the subset hand-picked by experts. The details of each run are given in Table 8. The best top-1 accuracy on the expert-chosen samples is achieved by 4 instances trained with γ = 0.25 and mean pooling at both levels (run 1), and the best top-1 accuracy on all samples is achieved with γ = 0.5 and max pooling at both levels (run 8).

Table 8. Final Run evaluation.

Run   γ       No. instances   Pooling 1   Pooling 2   Top 1 (expert)   Top 1 (all)   Top 3 (expert)
1     0.25    4               Mean        Mean        0.043            0.042         0.051
2     0.5     4               Mean        Max         0.017            0.036         0.043
3     0.25    4               Max         Mean        0.017            0.030         0.060
4     0.25    4               Max         Max         0.009            0.027         0.060
5     0.5     4               Mean        Mean        0.017            0.036         0.043
6     0.25    4               Mean        Max         0.017            0.028         0.051
7     0.5     4               Max         Mean        0.026            0.042         0.085
8     0.5     4               Max         Max         0.034            0.046         0.068
9     1       2               Mean        Max         0.017            0.031         0.043
10    Mixed   10              Mean        Max         0.026            0.034         0.068

Fig. 6. Top 1 on the test set. The best automatic solution reached 0.316, while the best experts reach an accuracy of 0.675. Despite obtaining an accuracy of only 4.3%, the team ranks in the top 3; our best-performing run on top-1 accuracy is the first run.

Fig. 7. Top 3 on the test set.

Fig. 8. Top 5 on the test set.

5 Conclusion

Plant identification is an important step in medical, agricultural and environmental resource planning. However, the problem is still challenging for both humans and computer-vision-based technologies, even with the development of deep learning. With the data-deficiency challenge, the problem is even harder to conquer. This work aims to provide a decent-performing model, supported by extensive experiments, along with a variety of data-handling strategies, yet it still cannot solve the whole problem. One remaining problem is the bias between classes belonging to the same genus; this could perhaps be addressed by hierarchical classification, where the system first identifies the genus and then the species. The data-deficiency challenge also still needs to be tackled, by leveraging either unsupervised or semi-supervised learning methods. From the model design perspective, the authors believe that the model can potentially be improved by adding inter-channel correlations to the non-local block.

References

1. Joly, A., Goëau, H., et al.: Overview of LifeCLEF 2019: Identification of Amazonian plants, South & North American birds, and niche prediction. In: Proceedings of CLEF 2019 (2019)
2. Goëau, H., Bonnet, P., Joly, A.: Plant identification based on noisy web data: The amazing performance of deep learning (LifeCLEF 2017). CEUR Workshop Proceedings 1866(LifeCLEF) (2017)
3. Goëau, H., Bonnet, P., Joly, A.: Overview of ExpertLifeCLEF 2018: How far automated identification systems are from the best experts? CEUR Workshop Proceedings 2125 (2018)
4. Goëau, H., Bonnet, P., Joly, A.: Overview of LifeCLEF plant identification task 2019: diving into data deficient tropical countries. In: CLEF working notes 2019 (2019)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. IEEE (jun 2016). https://doi.org/10.1109/CVPR.2016.90, http://ieeexplore.ieee.org/document/7780459/
6. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9908 LNCS, 630–645 (2016). https://doi.org/10.1007/978-3-319-46493-0_38
7. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735, http://www.mitpressjournals.org/doi/10.1162/neco.1997.9.8.1735
8. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (2017). https://doi.org/10.1109/CVPR.2017.243
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the 25th International Conference on Neural Information Processing Systems 1, 1097–1105 (2012). http://dl.acm.org/citation.cfm?id=2999134.2999257
10. Lasseck, M.: Image-based plant species identification with deep Convolutional Neural Networks. CEUR Workshop Proceedings 1866 (2017)
11. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016). http://arxiv.org/abs/1602.07261
12. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going Deeper with Convolutions pp. 1–12 (sep 2014), http://arxiv.org/abs/1409.4842
13. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception Architecture for Computer Vision (dec 2015), http://arxiv.org/abs/1512.00567
14. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local Neural Networks. Tech. rep. (2018)