=Paper=
{{Paper
|id=Vol-2696/paper_141
|storemode=property
|title=Domain Adaptation in the Context of Herbarium Collections: A submission to PlantCLEF 2020
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_141.pdf
|volume=Vol-2696
|authors=Juan Villacis,Hervé Goëau,Pierre Bonnet,Alexis Joly,Erick Mata-Montero
|dblpUrl=https://dblp.org/rec/conf/clef/Villacis-Llobet20
}}
==Domain Adaptation in the Context of Herbarium Collections: A submission to PlantCLEF 2020==
Domain Adaptation in the context of herbarium collections A submission to PlantCLEF 2020 Juan Villacis1 , Hervé Goëau2,3 , Pierre Bonnet2,3 , Alexis Joly4 , and Erick Mata-Montero1 1 Costa Rica Institute of Technology, Cartago, Costa Rica jvillacis@ic-itcr.ac.cr, emata@itcr.ac.cr 2 CIRAD, UMR AMAP, France, herve.goeau@cirad.fr, pierre.bonnet@cirad.fr 3 AMAP, Univ Montpellier, CIRAD, CNRS, INRAE, IRD, Montpellier, France 4 INRIA, Zenith Team, UMR LIRMM, Montpellier, France,alexis.joly@inria.fr Abstract. This paper describes a submission to the PlantCLEF 2020 challenge, whose topic was the classification of plant images in the field, based on a dataset composed mainly of herbaria.. This work proposes the usage of domain adaptation techniques to tackle the problem. In particular, it makes use of the Few-Shot Adversarial Domain Adapta- tion method proposed by Motiian et al. (9). Additionally, a modification of this architecture is proposed to take advantage of upper taxa relations between species in the dataset. Experiments performed show that do- main adaptation can provide very significant increases in accuracy when compared with traditional CNN-based approaches. 1 Introduction Recent approaches to automated plant identification have relied on deep learning- based techniques (1). These techniques can be very effective and compete with human experts if a large amount of labeled data is available, even if it is par- tially noisy (2; 6). However, in the path towards achieving the goal of universal plant species identification, a significant obstacle is posed by the large number of species for which there are none or very few samples of their appearance in their natural state, thus making it very difficult to use this kind of methods. Carrying out missions to collect more data, typically in tropical regions, is not a feasible solution due to the elevated cost, difficulty to access the areas where the species are located and vast amount of data still not labeled. Nonetheless, vast amounts of data about these species exist in the form of herbarium sheets, collected over centuries by botanists and which has been recently massively dig- itized and published online. This is the topic of the PlantCLEF 2020 challenge Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 Septem- ber 2020, Thessaloniki, Greece. 5 . Given a large dataset of digitized herbarium sheets and very few photos in the field, the objective is to develop a classifier that can perform well on a test set consisting only of field photos after being trained primarily on herbariums. This article describes in detail the methods used for our submissions to the challenge (identified by the acronym aabab on the challenge web page6 ). (b) Testing stage (a) Training stage Fig. 1: Description of the problem 2 Methodology 2.1 Data Dataset The main dataset to be used is PlantCLEF 2020 (3), (8). This dataset has 320,752 herbarium images from 997 species, 4,482 field images from 375 species and 1,816 images from 244 species where for each specimen there are images both in its natural state and in herbarium sheets. In addition to this dataset, some experiments will include additional data from sources like GBIF7 and PlantCLEF 2019 (7). These images come from the dataset used by (11). Figure 2 shows examples of images in the PlantCLEF 2020 dataset. As can be observed in these examples, the herbarium and field pictures differ greatly, even if they come from the same species (and even the same specimen). This aspect makes the task at hand a particularly challenging one. 2.2 Architecture and Models Convolutional Neural Networks A common way to perform classification is to take a pretrained CNN and to re-train it on the new target classes. As mentioned earlier, this approach usually requires vast amounts of data. Given 5 https://www.imageclef.org/PlantCLEF2020 6 https://www.aicrowd.com/challenges/lifeclef-2020-plant/submissions 7 https://www.gbif.org Fig. 2: Herbarium and Field images from the Asystasia gangetica (L.) T.Anderson specie. Significant visual variations between them in color, back- ground and shape can be observed even though they are from the same species that the training set is comprised mostly of herbariums and the test set of field photos, we expect the performance of such an approach to be low. This reason motivates us to look for alternate solutions (which will be described in the next sections), but it is still necessary to measure the performance of the baseline CNN approach. Therefore, the submitted runs also include experiments with a CNN-based approach. The architecture chosen is Resnet50 (4) to maintain the same conditions as those used in the other experiments. Experiments will be per- formed using the PlantCLEF2020 dataset solely, or the union of all the datasets described in section 2.1. To take advantage of the data available, training will be performed in three stages. First the model will be trained on the ImageNet dataset, in a second stage only the herbariums will be used and in the last stage only the photos will be used. Domain Adaptation Architectures To tackle the problem of having very few photos in the field domain we base our solution on the architecture presented in (9). This architecture which was devised to tackle the problem of few-shot domain adaptation has the following elements (see Figure 3): – a CNN-based feature extractor E that maps from the source dataset (herbaria) and target dataset (field images) into a common space, in which it is expected for the features represented to be independent from the original domain. – a classifier F that performs species classification on the common space – a discriminator D that determines to which of the following categories a pair of samples from the common space belongs to 1. Samples from different domains and different classes 2. Samples from different domains but the same class 3. Samples from the same domain but different classes 4. Samples from the same domain and the same class The division into these four categories instead of just into two categories determined by the domain is done to take advantage of label information in the target domain (9). The feature extractor and the classifier are trained in an adversarial approach with the discriminator in order to guarantee a domain agnostic common space and a robust classifier. In addition to this strategy, data augmentation is done in the target domain in order to complete the feature space with more training samples from this domain. Fig. 3: Approach by (9) to tackle the problem of few-shot domain adaptation. It has an encoder E, classifier F and discriminator D The training is completed in 3 stages. During the first stage the encoder E and the classifier F are trained in a standard way with samples from only the source domain. In the second stage the discriminator D is trained to distinguish between samples from the 4 categories mentioned before. The objective of the first two stages is to initialize the weights of E F and D. Finally, during the third stage they are all trained together with the objective of performing domain adaptation. It can be said that domain adaptation has been achieved once the discriminator is not able to distinguish samples from categories 1 and 2 and categories 3 and 4. This means that once the samples have been encoded into the common domain, it is difficult for the discriminator to tell which was the original domain of such sample. The architecture used has a ResNet50 (4) based encoder, which provides a good compromise between performance, memory use and training time. This is done by removing the last fully connected layer from the ResNet50 architecture. After applying these changes, the dimensionality of the common domain becomes 2048 features. This decision affects the architecture of the classifier and the discriminator. The first one is composed of a single fully-connected layer with 2048 inputs and 997 outputs. The discriminator is a multilayer perceptron, the input is composed of two feature vectors of 2048 features stacked together and it has 6 fully-connected layers that reduce the input size from 4096 features to just 4 outputs. Figure 4 portrays these components. Fig. 4: Details of the FADA architecture used in the experiments 2.3 Additions to the main FADA architecture Data Augmentation To obtain better results, data augmentation is used to increase the performance of the model. The traditional data augmentation op- erations used are: random rotations of ± 15 degrees, color jittering and random horizontal flips. Additionally, special transformations are added for each domain. In the herbariums, a special tilling around the center is added. This operation creates a crop of the original herbarium that is centered around the center and can has a zoom level randomly set between 0.9 and 1.3. This is done because in herbariums the plant samples are commonly placed around the center of the sheet. Examples of this crops can be observed in figure 5. In the field domain, the transformation used is a center crop as large as the original picture permits it to be. Self Supervision Self supervision is a technique derived from unsupervised learning that tries to address situations found in supervised learning in which there might not be enough labeled data in order to train an efficient model. The objective of these tasks is to extract robust visual information from the pictures which can be useful either as initial weights or to help the main model during training. In our experiments self supervision is used following the ideas presented in (12), where it is used in a multi-task learning approach to help the main classifier. Figure 6 depicts how it is performed in the context of the FADA architecture. Self-supervision is only used during the third stage in the training of the encoder and the classifier. The self-supervision task is applied to the image after it has undergone the data augmentation process. This modification is also used in the traditional CNN approach. In this case, the main model is joined by an additional classifier in a multi-task learning approach. The new classifier tries to predict the correct self-supervision label for the data. In this case, the extra classifier loss is combined with the loss from the main classifier From the several self supervision tasks in existence (5), we used the jigsaw puzzle solving (10). This decision was taken based on the findings of (12) that when incorporating self-supervision into domain adaptation it is important to choose tasks that do not reinforce domain-dependent features and the fact that the spatial information learned from this showed to be useful when compared to other tasks like recolorization. This task consists in dividing the original image into tiles, rearranging them randomly into one of the 64 possible orderings with the largest distance between them and then having the network try to determine which of the rearrangements is used. Figure 6 shows this process. Upper taxa Given the nature of the dataset, it is possible to obtain taxo- nomical information from each species, like the genus or family name (= upper taxa). Because of the lack of data, we try to incorporate this information to the architecture on a multi-task learning approach, so that the features from the common domain are not only used to predict the species name, but also the genus or family of the specimen. This is done in two different approaches. For Fig. 5: Special tilling used for herbariums to take maximum advantage of the information they have the FADA architecture, both the classifier and the discriminator are extended with two additional sub-tasks, one for the genus level and one for the family level. These components have the same function as the original species classifier and discriminator, but they have to perform the discrimination and classifica- tion tasks with the genus or family instead. It is expected that specimens from the same group share partially similar visual content, and as such this training taking into account upper taxa can be indirectly used to increase performance on specimens that are poorly represented in the dataset but which might have related species in the dataset. In the flowers in figure 7 it is possible to observe the visual similarities between plants from a different specie and the same genus. This is the kind of information we hope to take advantage of. Fig. 6: Addition of jigsaw self-supervision complementary task (10) to the FADA architecture. Fig. 7: Multi-task learning with upper taxons 2.4 Training procedure details Several hyperparameters had to be tuned in order to obtain the best results possible. These are detailed in table 1 Hyperparameter Value Framework Pytorch Learning rate E and F in FADA 0.001 Learning rate D in FADA 0.001 Learning rate in CNN 0.03 Batch size in FADA 15 Batch size in CNN 64 Number of epochs, stage 1 FADA 60 Number of epochs, stage 2 FADA 60 Number of epochs, stage 3 FADA 30 Number of epochs CNN 60 Learning rate scheduler criterion Step Learning rate scheduler gamma 0.1 Learning rate scheduler step in 15 FADA stage 3 Learning rate scheduler step in 50 CNN and FADA stage 1 Table 1: Hyperparameters used in the training 3 Results Results from the runs submitted to the challenge can be seen in table 2 and figure 8. The metric used to present the results of the challenge is the Mean Reciprocal Rank (MRR), which measures the average rank of the correct answer in a series of predictions. It is described by the following formula |Q| 1 X 1 M RR = |Q| i=1 ranki Two distinct MRRs are computed. A first MRR is computed on the full test set. Then, a second one is computed on a subset of the whole test set whose classes have particularly few field (or none) images in the training set. As can be expected from the inherent difficulty of the challenge, the over- all results obtained are low compared to previous editions of PlantCLEF. The method described here obtains the best result on the whole test set, and the second best result on the difficult subset of the test set. In the runs submitted to the challenge, domain adaptation had a very signif- icant impact on the results. Between the runs with a CNN and those that used this technique there is a 2600% and 1850% increase in the MRR All and the MRR Few. These results can be observed in figure 9. Additional training data also seems to be a significant factor in obtaining higher values for the evaluation metric. In the MRR All there is a 5500% and a 165% increase in the values when comparing the results from a CNN and FADA approaches with the same techniques but adding the complementary dataset into the training process. This high increase can be observed in figure 10 The last improvement that this work highlights is the benefit of the proposed extensions of FADA. The usage of self supervision and upper taxa information actually leads to slight but consistent increases of performance. Among this, the most useful to improve the MRR All turns out to be the combination of self supervision and upper taxa, with a 12% increase in the metric. Looking at the MRR on the difficult species, the use of upper taxa alone leads to an increase of 59% on the obtained value. This results can be observed in figure 11 and shows that the visual similarities between species of the same genus or family are particularly useful for species with very few training samples. Fig. 8: Results of the PlantCLEF 2020 Challenge, the submissions by the ITCR Pl@ntNet team (AIcrowd username aabab) are highlighted 4 Conclusions and Future Work The main conclusions from the experiments are the following – Domain Adaptation can have a very significant impact in obtaining better results in scenarios where there is a very limited availability of data in one domain but a large dataset on the other – The addition of extra data showed to be a very significant factor in achieving higher MRRs on the complete test set. On the subset of the most difficult species, however, this conclusion is not as clear-cut. The additional data provided an consistent gain when using the CNN approach, but on the other hand, the gain when using FADA was very small. This is expected to occur due to the fact that FADA is very sensitive to data (even one additional Fig. 9: Impact of domain adaptation on the results Fig. 10: Impact of extra data on the results picture in the target domain showed to have a significant effect in modifying the results (9)) and the noisiness in the extra dataset. – The main modification performed to the FADA architecture, i.e. the intro- duction of multi-task learning, proved to be important in obtaining better results on both metrics. Extending the classifier and discriminator to extra tasks at upper taxa level was successful for boosting results, in particular on the few shot classes. Run MRR MRR Diffi- cult Species 1. Resnet50 trained on PlantCLEF20 0,002 0,002 2. Resnet50 trained on PlantCLEF20 + extra datasets 0,112 0,013 3. FADA trained on PlantCLEF20 0,054 0,039 4. FADA trained on PlantCLEF20 + extra datasets 0,143 0,036 5. FADA trained on PlantCLEF20 + extra datasets 0,148 0,039 with Self Supervision 6. FADA trained on PlantCLEF20 + extra datasets 0,161 0,037 with MTL from genus and family 7. FADA trained on PlantCLEF20 + extra datasets 0,134 0,062 with Self Supervision and MTL from genus and family 8. Ensemble or runs 6 and 7 0,167 0,06 9. Ensemble or runs 5 and 6 0,17 0,039 10. Ensemble or runs 4, 5, 6 and 7 0,18 0,052 Table 2: Results from the runs submitted to the challenge Fig. 11: Impact of performance improving techniques on the results – As well as upper taxa information, self supervision was an important factor in obtaining increases in performance. Although the increases were smaller, they were nonetheless consistent in particular when combined with multi- task at upper taxa and on the few shot classes. As future work, some other paths that can be explored are – Test different botanical or morphological information in addition to tax- onomy. For instance, whether a species is Woody/Non-Woody may be an additional task to be solved. The usage of this information is expected to help even more general features that can boost even more the results. – Exploit additional metadata contained in the dataset like geolocation infor- mation or individual pairs. This might require modifications to the architec- tures used. Bibliography [1] Carranza-Rojas, J., Goëau, H., Bonnet, P., Mata-Montero, E., Joly, A.: Going deeper in the automated identification of herbarium specimens. BMC evolutionary biology 17(1), 181 (2017) [2] Goëau, H., Bonnet, P., Joly, A.: Plant identification based on noisy web data: the amazing performance of deep learning (lifeclef 2017). In: CLEF task overview 2017, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2017, Dublin, Ireland. (2017) [3] Goëau, H., Bonnet, P., Joly, A.: Overview of the lifeclef 2020 plant identi- fication task. In: CLEF task overview, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece. (2020) [4] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recog- nition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016) [5] Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) [6] Joly, A., Goëau, H., Botella, C., Glotin, H., Bonnet, P., Vellinga, W.P., Planqué, R., Müller, H.: Overview of lifeclef 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of ai. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 247–266. Springer (2018) [7] Joly, A., Goëau, H., Botella, C., Kahl, S., Servajean, M., Glotin, H., Bonnet, P., Planqué, R., Robert-Stöter, F., Vellinga, W.P., et al.: Overview of lifeclef 2019: Identification of amazonian plants, south & north american birds, and niche prediction. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 387–401. Springer (2019) [8] Joly, A., Goëau, H., Kahl, S., Deneu, B., Servajean, M., Cole, E., Picek, L., Ruiz De Castañeda, R., é, Lorieul, T., Botella, C., Glotin, H., Champ, J., Vellinga, W.P., Stöter, F.R., Dorso, A., Bonnet, P., Eggel, I., Müller, H.: Overview of lifeclef 2020: a system-oriented evaluation of automated species identification and species distribution prediction. In: Proceedings of CLEF 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece. (2020) [9] Motiian, S., Jones, Q., Iranmanesh, S., Doretto, G.: Few-shot adversarial do- main adaptation. In: Advances in Neural Information Processing Systems. pp. 6670–6680 (2017) [10] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. pp. 69–84. Springer (2016) [11] Picek, L., Sulc, M., Matas, J.: Recognition of the amazonian flora by in- ception networks with test-time class prior estimation. In: CLEF working notes 2019, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2019, Lugano, Switzerland. (2019) [12] Sun, Y., Tzeng, E., Darrell, T., Efros, A.A.: Unsupervised domain adapta- tion through self-supervision. arXiv preprint arXiv:1909.11825 (2019)