<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Domain Adaptation in the context of herbarium collections</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juan Villacis</string-name>
          <email>jvillacis@ic-itcr.ac.cr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Herve Goeau</string-name>
          <email>herve.goeau@cirad.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Bonnet</string-name>
          <email>pierre.bonnet@cirad.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexis Joly</string-name>
          <email>alexis.joly@inria.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erick Mata-Montero</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AMAP, Univ Montpellier</institution>
          ,
          <addr-line>CIRAD, CNRS, INRAE, IRD, Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CIRAD, UMR AMAP</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Costa Rica Institute of Technology</institution>
          ,
          <addr-line>Cartago</addr-line>
          ,
          <country country="CR">Costa Rica</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>INRIA, Zenith Team, UMR LIRMM</institution>
          ,
          <addr-line>Montpellier</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes a submission to the PlantCLEF 2020 challenge, whose topic was the classification of plant images in the field based on a dataset composed mainly of herbarium sheets. This work proposes the use of domain adaptation techniques to tackle the problem. In particular, it makes use of the Few-Shot Adversarial Domain Adaptation (FADA) method proposed by Motiian et al. (9). Additionally, a modification of this architecture is proposed to take advantage of upper-taxa relations between species in the dataset. The experiments performed show that domain adaptation can provide very significant increases in accuracy when compared with traditional CNN-based approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        Recent approaches to automated plant identification have relied on deep
learning-based techniques (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ). These techniques can be very effective and compete with
human experts if a large amount of labeled data is available, even if it is
partially noisy (2; 6). However, on the path towards universal
plant species identification, a significant obstacle is posed by the large number
of species for which there are no, or very few, samples of their appearance in
their natural state, which makes it very difficult to use this kind of method.
Carrying out missions to collect more data, typically in tropical regions, is not
a feasible solution due to the elevated cost, the difficulty of accessing the areas where
the species are located, and the vast amount of data still left unlabeled. Nonetheless,
vast amounts of data about these species exist in the form of herbarium sheets,
collected over centuries by botanists, which have recently been massively
digitized and published online. This is the topic of the PlantCLEF 2020 challenge5.
Given a large dataset of digitized herbarium sheets and very few photos in the
field, the objective is to develop a classifier that can perform well on a test set
consisting only of field photos after being trained primarily on herbarium sheets. This
article describes in detail the methods used for our submissions to the challenge
(identified by the acronym aabab on the challenge web page6).
Fig. 1: (a) Training stage, (b) Testing stage
      </p>
</sec>
    <sec id="sec-2">
      <title>Methodology: Data</title>
<p>
        Dataset The main dataset used is PlantCLEF 2020 (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ;
        <xref ref-type="bibr" rid="ref8">8</xref>
        ). This dataset
has 320,752 herbarium images from 997 species, 4,482 field images from 375
species, and 1,816 images from 244 species for which each specimen is
pictured both in its natural state and as a herbarium sheet. In addition to this
dataset, some experiments include additional data from sources like GBIF7
and PlantCLEF 2019 (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ). These images come from the dataset used by (
        <xref ref-type="bibr" rid="ref11">11</xref>
        ).
Figure 2 shows examples of images in the PlantCLEF 2020 dataset. As can be
observed in these examples, the herbarium and field pictures differ greatly, even
when they come from the same species (and even the same specimen). This aspect
makes the task at hand particularly challenging.
      </p>
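<p>As a rough illustration of the domain imbalance described above, the per-species image counts of the two domains can be compared (a minimal sketch using only the dataset sizes quoted in this section; the uniform-spread assumption is ours, real collections are far more skewed):</p>

```python
# Dataset sizes quoted for PlantCLEF 2020 (herbarium vs. field domains).
HERBARIUM_IMAGES, HERBARIUM_SPECIES = 320_752, 997
FIELD_IMAGES, FIELD_SPECIES = 4_482, 375

# Average images per species in each domain (assumes a uniform spread,
# so this is only an order-of-magnitude check).
avg_herbarium = HERBARIUM_IMAGES / HERBARIUM_SPECIES
avg_field = FIELD_IMAGES / FIELD_SPECIES

print(f"herbarium: ~{avg_herbarium:.0f} images/species")    # ~322
print(f"field:     ~{avg_field:.0f} images/species")        # ~12
print(f"species with no field images at all: {997 - 375}")  # 622
```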
    </sec>
    <sec id="sec-3">
      <title>Architecture and Models</title>
      <p>
Convolutional Neural Networks A common way to perform classification
is to take a pretrained CNN and re-train it on the new target classes. As
mentioned earlier, this approach usually requires vast amounts of data. Given
that the training set is comprised mostly of herbarium sheets and the test set of field
photos, we expect the performance of such an approach to be low. This
motivates us to look for alternative solutions (which will be described in the next
sections), but it is still necessary to measure the performance of the baseline
CNN approach. Therefore, the submitted runs also include experiments with a
CNN-based approach. The architecture chosen is ResNet50 (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) to maintain the
same conditions as those used in the other experiments. Experiments will be
performed using either the PlantCLEF 2020 dataset alone or the union of all the datasets
described in section 2.1. To take advantage of the data available, training will
be performed in three stages: first the model will be trained on the ImageNet
dataset, in a second stage only the herbarium sheets will be used, and in the last stage
only the field photos will be used.
5 https://www.imageclef.org/PlantCLEF2020
6 https://www.aicrowd.com/challenges/lifeclef-2020-plant/submissions
7 https://www.gbif.org
      </p>
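<p>The three-stage schedule above can be sketched as a simple driver loop (a minimal sketch; <code>train_on</code> is a stand-in stub for a full training pass, not the actual training code):</p>

```python
# Three-stage fine-tuning schedule for the baseline CNN (ResNet50):
# 1) ImageNet pretraining, 2) herbarium-only fine-tuning, 3) field-photo fine-tuning.

def train_on(model_state, dataset):
    # Illustrative stub: record which dataset the weights were last tuned on.
    return model_state + [dataset]

STAGES = ["imagenet", "herbarium", "field"]

weights = []                    # stands in for the model parameters
for stage in STAGES:
    weights = train_on(weights, stage)

print(weights)  # ['imagenet', 'herbarium', 'field']
```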
      <p>
Domain Adaptation Architectures To tackle the problem of having very few
photos in the field domain, we base our solution on the architecture presented
in (
        <xref ref-type="bibr" rid="ref9">9</xref>
        ). This architecture, which was devised to tackle the problem of few-shot
domain adaptation, has the following elements (see Figure 3):
- a CNN-based feature extractor E that maps the source dataset (herbaria)
and the target dataset (field images) into a common space, in which the
represented features are expected to be independent of the original domain;
- a classifier F that performs species classification in the common space;
- a discriminator D that determines to which of the following categories a pair
of samples from the common space belongs:
1. samples from different domains and different classes;
2. samples from different domains but the same class;
3. samples from the same domain but different classes;
4. samples from the same domain and the same class.
The division into these four categories, instead of just two categories
determined by the domain, is done to take advantage of label information in
the target domain (
        <xref ref-type="bibr" rid="ref9">9</xref>
        ).
      </p>
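<p>The four pair categories used by the discriminator can be written down directly (a minimal sketch; the encoding of the categories as integers 1-4 follows the list above):</p>

```python
# FADA pair grouping: label a pair of samples by whether they share
# a domain ("herbarium"/"field") and whether they share a species class.

def pair_category(domain_a, label_a, domain_b, label_b):
    same_domain = domain_a == domain_b
    same_class = label_a == label_b
    if not same_domain and not same_class:
        return 1  # different domains, different classes
    if not same_domain and same_class:
        return 2  # different domains, same class
    if same_domain and not same_class:
        return 3  # same domain, different classes
    return 4      # same domain, same class

print(pair_category("herbarium", "Quercus robur", "field", "Quercus robur"))  # 2
print(pair_category("field", "Quercus robur", "field", "Acer rubrum"))        # 3
```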
<p>The feature extractor and the classifier are trained in an adversarial fashion
against the discriminator in order to guarantee a domain-agnostic common space
and a robust classifier. In addition to this strategy, data augmentation is performed
in the target domain in order to populate the feature space with more training
samples from this domain.</p>
<p>The training is completed in three stages. During the first stage, the encoder E
and the classifier F are trained in a standard way with samples from only the
source domain. In the second stage, the discriminator D is trained to distinguish
between samples from the four categories mentioned before. The objective of the
first two stages is to initialize the weights of E, F, and D. Finally, during the
third stage they are all trained together with the objective of performing domain
adaptation. Domain adaptation can be considered achieved once the
discriminator is not able to distinguish samples of category 1 from category 2, nor
samples of category 3 from category 4. This means that once the samples have been encoded into
the common space, it is difficult for the discriminator to tell which was the
original domain of each sample.</p>
      <p>
        The architecture used has a ResNet50 (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) based encoder, which provides a
good compromise between performance, memory use and training time. This is
done by removing the last fully connected layer from the ResNet50 architecture.
After applying these changes, the dimensionality of the common domain becomes
2048 features. This decision a ects the architecture of the classi er and the
discriminator. The rst one is composed of a single fully-connected layer with
2048 inputs and 997 outputs. The discriminator is a multilayer perceptron, the
input is composed of two feature vectors of 2048 features stacked together and
it has 6 fully-connected layers that reduce the input size from 4096 features to
just 4 outputs. Figure 4 portrays these components.
      </p>
<p>Fig. 4: Details of the FADA architecture used in the experiments</p>
    </sec>
    <sec id="sec-4">
      <title>Additions to the main FADA architecture</title>
<p>Data Augmentation Data augmentation is used to
increase the performance of the model. The traditional data augmentation
operations used are random rotations of up to 15 degrees, color jittering, and random
horizontal flips. Additionally, special transformations are added for each domain.
For the herbarium sheets, a special tiling around the center is added. This operation
creates a crop of the original herbarium sheet that is centered on the sheet's center and
has a zoom level randomly set between 0.9 and 1.3. This is done because
in herbarium sheets the plant samples are commonly placed around the center of the
sheet. Examples of these crops can be observed in figure 5. In the field domain,
the transformation used is a center crop as large as the original picture
permits.</p>
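<p>The center-zoom crop can be reduced to simple box arithmetic (a minimal sketch; the interpretation of a zoom level z as cropping a centered region of size min(w, h) / z, clamped to the image, is our own assumption):</p>

```python
import random

def center_zoom_crop(width, height, zoom):
    """Return the (left, top, right, bottom) box of a crop centered on the
    image center. zoom = 1.0 keeps the full short side; zoom > 1 crops a
    smaller (zoomed-in) region. Zooms below 1.0 would need padding; here
    the box is simply clamped to the short side for simplicity."""
    side = min(min(width, height) / zoom, min(width, height))
    cx, cy = width / 2, height / 2
    half = side / 2
    return (cx - half, cy - half, cx + half, cy + half)

z = random.uniform(0.9, 1.3)  # random zoom level, as described in the text
box = center_zoom_crop(1000, 1400, z)

print(center_zoom_crop(1000, 1000, 1.25))  # (100.0, 100.0, 900.0, 900.0)
```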
<p>Self Supervision Self-supervision is a technique derived from unsupervised
learning that tries to address situations found in supervised learning in which
there might not be enough labeled data to train an efficient model.
The objective of self-supervised tasks is to extract robust visual information from the
pictures, which can be useful either as initial weights or to help the main model
during training. In our experiments, self-supervision is used following the ideas
presented in (12), where it is used in a multi-task learning approach to help the
main classifier. Figure 6 depicts how it is performed in the context of the FADA
architecture. Self-supervision is only used during the third stage, in the training
of the encoder and the classifier. The self-supervision task is applied to the image
after it has undergone the data augmentation process. This modification is also
used in the traditional CNN approach. In this case, the main model is joined
by an additional classifier in a multi-task learning approach. The new classifier
tries to predict the correct self-supervision label for the data, and the
extra classifier's loss is combined with the loss from the main classifier.</p>
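<p>In its simplest form, the multi-task combination of the two losses is a weighted sum (a minimal sketch; the weighting factor <code>aux_weight</code> is an assumed hyperparameter, not a value reported here):</p>

```python
# Multi-task loss: combine the species-classification loss with the
# self-supervision (e.g. jigsaw) loss.  The weighting factor is an assumption.

def combined_loss(main_loss, selfsup_loss, aux_weight=0.5):
    return main_loss + aux_weight * selfsup_loss

print(combined_loss(2.0, 1.2))                   # 2.6
print(combined_loss(2.0, 1.2, aux_weight=1.0))   # 3.2
```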
<p>
        From the several self-supervision tasks in existence (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ), we used jigsaw
puzzle solving (
        <xref ref-type="bibr" rid="ref10">10</xref>
        ). This decision was based on the findings of (12) that,
when incorporating self-supervision into domain adaptation, it is important to
choose tasks that do not reinforce domain-dependent features, and on the fact that
the spatial information learned from this task proved useful when compared to
other tasks like recolorization. The task consists of dividing the original image
into tiles, rearranging them randomly into one of the 64 possible orderings with
the largest distance between them, and then having the network try to determine
which of the rearrangements was used. Figure 6 shows this process.
      </p>
      <p>Upper taxa Given the nature of the dataset, it is possible to obtain
taxonomic information for each species, like the genus or family name (= upper
taxa). Because of the lack of data, we try to incorporate this information into the
architecture in a multi-task learning approach, so that the features from the
common space are used to predict not only the species name, but also the
genus or family of the specimen. This is done in two different ways. For
the FADA architecture, both the classifier and the discriminator are extended
with two additional sub-tasks, one for the genus level and one for the family
level. These components have the same function as the original species classifier
and discriminator, but they perform the discrimination and classification
tasks with the genus or family instead. It is expected that specimens from
the same group share partially similar visual content, and as such, training
that takes upper taxa into account can indirectly increase performance
on specimens that are poorly represented in the dataset but which might have
related species in the dataset. In the flowers in figure 7 it is possible to observe
the visual similarities between plants from different species of the same genus.
This is the kind of information we hope to take advantage of.
      </p>
      <p>Several hyperparameters had to be tuned in order to obtain the best results
possible. These are detailed in table 1.
Results from the runs submitted to the challenge can be seen in table 2 and
figure 8.
      </p>
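<p>The jigsaw task can be sketched over tile indices rather than pixels (a minimal sketch: the 3x3 tiling is our assumption, and the fixed set of 64 maximally Hamming-distant permutations used by (10) is represented here by just two hand-picked illustrative entries):</p>

```python
import random

# Jigsaw self-supervision: cut the image into a 3x3 grid of tiles, shuffle the
# tiles according to one of a fixed set of permutations, and train the network
# to predict which permutation was used.  Here a tile is just its index 0..8.
# A real implementation pre-computes 64 maximally-distant permutations.
PERMUTATIONS = [
    (8, 7, 6, 5, 4, 3, 2, 1, 0),
    (4, 5, 6, 7, 8, 0, 1, 2, 3),
]

def shuffle_tiles(tiles, perm_id):
    perm = PERMUTATIONS[perm_id]
    # Return the rearranged tiles plus the permutation id the network must predict.
    return [tiles[i] for i in perm], perm_id

tiles = list(range(9))  # tile 0 = top-left, ..., tile 8 = bottom-right
perm_id = random.randrange(len(PERMUTATIONS))
shuffled, label = shuffle_tiles(tiles, perm_id)
print(shuffled, label)
```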
<p>The metric used to present the results of the challenge is the Mean Reciprocal
Rank (MRR), which measures the average reciprocal rank of the correct answer in a series
of predictions. It is described by the following formula:</p>
<p>MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i, where Q is the set of test queries and rank_i is the rank of the correct answer for query i.</p>
      <p>Two distinct MRRs are computed. A first MRR is computed on the full test set.
Then, a second one is computed on a subset of the whole test set whose classes
have particularly few (or no) field images in the training set.</p>
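<p>The metric can be computed directly from the rank of the correct species in each prediction list (a minimal sketch; giving queries whose correct answer never appears a reciprocal rank of 0 is the usual convention):</p>

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct species for each test query,
    or None when the correct species does not appear in the predictions."""
    rr = [1.0 / r if r is not None else 0.0 for r in ranks]
    return sum(rr) / len(rr)

# Four queries: correct answer ranked 1st, 2nd, 4th, and missing entirely.
print(mean_reciprocal_rank([1, 2, 4, None]))  # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```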
<p>As can be expected from the inherent difficulty of the challenge, the
overall results obtained are low compared to previous editions of PlantCLEF. The
method described here obtains the best result on the whole test set, and the
second best result on the difficult subset of the test set.</p>
<p>In the runs submitted to the challenge, domain adaptation had a very
significant impact on the results. Between the runs with a plain CNN and those that used
this technique there is a 2600% increase in the MRR All and an 1850% increase in the
MRR Few. These results can be observed in figure 9.</p>
<p>Additional training data also seems to be a significant factor in obtaining
higher values for the evaluation metric. In the MRR All there is a 5500% and a
165% increase in the values when comparing the results of the CNN and FADA
approaches, respectively, with the same techniques trained with the complementary dataset
added to the training process. This large increase can be observed in figure 10.</p>
<p>
        The last improvement that this work highlights is the benefit of the proposed
extensions of FADA. The usage of self-supervision and upper-taxa information
leads to slight but consistent increases in performance. Among these, the
most useful for improving the MRR All turns out to be the combination of self-supervision
and upper taxa, with a 12% increase in the metric. Looking at the
MRR on the difficult species, the use of upper taxa alone leads to an increase
of 59% in the obtained value. These results can be observed in figure 11 and
show that the visual similarities between species of the same genus or family
are particularly useful for species with very few training samples.
      </p>
      <p>
- Domain adaptation can have a very significant impact in obtaining better
results in scenarios where there is very limited availability of data in one
domain but a large dataset in the other.
- The addition of extra data proved to be a very significant factor in achieving
higher MRRs on the complete test set. On the subset of the most difficult
species, however, this conclusion is not as clear-cut. The additional data
provided a consistent gain when using the CNN approach, but on the other
hand, the gain when using FADA was very small. This is expected to occur
because FADA is very sensitive to data (even one additional
picture in the target domain was shown to have a significant effect on
the results (
        <xref ref-type="bibr" rid="ref9">9</xref>
        )) and because of the noisiness of the extra dataset.
- The main modification performed to the FADA architecture, i.e. the
introduction of multi-task learning, proved to be important in obtaining better
results on both metrics. Extending the classifier and discriminator to extra
tasks at upper-taxa level was successful in boosting results, in particular on
the few-shot classes.
- As with upper-taxa information, self-supervision was an important factor
in obtaining increases in performance. Although the increases were smaller,
they were nonetheless consistent, in particular when combined with
multi-tasking at upper-taxa level and on the few-shot classes.
      </p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Results from the runs submitted to the challenge</p></caption>
        <table>
          <thead>
            <tr><th>Run</th><th>MRR All</th><th>MRR Few</th></tr>
          </thead>
          <tbody>
            <tr><td>1. ResNet50 trained on PlantCLEF20</td><td>0.002</td><td>0.002</td></tr>
            <tr><td>2. ResNet50 trained on PlantCLEF20 + extra datasets</td><td>0.112</td><td>0.013</td></tr>
            <tr><td>3. FADA trained on PlantCLEF20</td><td>0.054</td><td>0.039</td></tr>
            <tr><td>4. FADA trained on PlantCLEF20 + extra datasets</td><td>0.143</td><td>0.036</td></tr>
            <tr><td>5. FADA trained on PlantCLEF20 + extra datasets, with self-supervision</td><td>0.148</td><td>0.039</td></tr>
            <tr><td>6. FADA trained on PlantCLEF20 + extra datasets, with MTL from genus and family</td><td>0.161</td><td>0.037</td></tr>
            <tr><td>7. FADA trained on PlantCLEF20 + extra datasets, with self-supervision and MTL from genus and family</td><td>0.134</td><td>0.062</td></tr>
            <tr><td>8. Ensemble of runs 6 and 7</td><td>0.167</td><td>0.060</td></tr>
            <tr><td>9. Ensemble of runs 5 and 6</td><td>0.170</td><td>0.039</td></tr>
            <tr><td>10. Ensemble of runs 4, 5, 6 and 7</td><td>0.180</td><td>0.052</td></tr>
          </tbody>
        </table>
      </table-wrap>
<p>As future work, some other paths that can be explored are:
- testing different botanical or morphological information in addition to
taxonomy. For instance, whether a species is woody or non-woody may be an
additional task to be solved. This information is expected to
yield even more general features that can further boost the results;
- exploiting additional metadata contained in the dataset, like geolocation
information or individual pairs. This might require modifications to the
architectures used.</p>
    </sec>
  </body>
  <back>
    <ref-list>
<ref id="ref1">
        <mixed-citation>[1] Carranza-Rojas, J., Goeau, H., Bonnet, P., Mata-Montero, E., Joly, A.: Going deeper in the automated identification of herbarium specimens. BMC Evolutionary Biology 17(1), 181 (2017)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Goeau, H., Bonnet, P., Joly, A.: Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017). In: CLEF task overview 2017, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2017, Dublin, Ireland (2017)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Goeau, H., Bonnet, P., Joly, A.: Overview of the LifeCLEF 2020 plant identification task. In: CLEF task overview, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770-778 (2016)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Joly, A., Goeau, H., Botella, C., Glotin, H., Bonnet, P., Vellinga, W.P., Planque, R., Muller, H.: Overview of LifeCLEF 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of AI. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 247-266. Springer (2018)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Joly, A., Goeau, H., Botella, C., Kahl, S., Servajean, M., Glotin, H., Bonnet, P., Planque, R., Robert-Stoter, F., Vellinga, W.P., et al.: Overview of LifeCLEF 2019: Identification of Amazonian plants, south &amp; north American birds, and niche prediction. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 387-401. Springer (2019)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Joly, A., Goeau, H., Kahl, S., Deneu, B., Servajean, M., Cole, E., Picek, L., Ruiz De Castañeda, R., Lorieul, T., Botella, C., Glotin, H., Champ, J., Vellinga, W.P., Stoter, F.R., Dorso, A., Bonnet, P., Eggel, I., Muller, H.: Overview of LifeCLEF 2020: a system-oriented evaluation of automated species identification and species distribution prediction. In: Proceedings of CLEF 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Motiian, S., Jones, Q., Iranmanesh, S., Doretto, G.: Few-shot adversarial domain adaptation. In: Advances in Neural Information Processing Systems. pp. 6670-6680 (2017)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. pp. 69-84. Springer (2016)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Picek, L., Sulc, M., Matas, J.: Recognition of the Amazonian flora by Inception networks with test-time class prior estimation. In: CLEF working notes 2019, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2019, Lugano, Switzerland (2019)</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Sun, Y., Tzeng, E., Darrell, T., Efros, A.A.: Unsupervised domain adaptation through self-supervision. arXiv preprint arXiv:1909.11825 (2019)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>