Contrastive Representation Learning for Natural World Imagery: Habitat prediction for 30,000 species

Sachith Seneviratne¹
¹ Transport, Health and Urban Design Research Lab, Melbourne School of Design, The University of Melbourne, Parkville VIC 3010, Australia

Abstract
Recent work in contrastive representation learning has pushed the boundaries of classification tasks in computer vision, achieving state-of-the-art results on many established benchmarks. However, the performance of these methods on natural imagery tasks that fall into the category of fine-grained image classification can be further improved. In this paper, I present a methodology that explores this issue and achieves state-of-the-art results on species distribution modelling from remote sensing imagery as part of the GeoLifeCLEF2021 challenge. My method beats the current state of the art on this challenge (trained on 4 types of imagery) using only base RGB imagery. Initial experiments indicate that modifying the architecture to include additional image modalities leads to further improvements on the task of location-based species recommendation. Additionally, I introduce a consistency function, which relies on the strategy of withholding data from the model and is useful for checking model generality without relying on a validation split.

Keywords
Fine-Grained Visual Categorization, Representation Learning, Self-Supervision, Transfer Learning, Domain Adaptation

1. Introduction
Species Distribution Modelling (SDM) is the study of computational techniques to predict species distribution across both geographical locations and time using different forms of environmental data. Computer vision techniques have garnered attention in this area due to their ability to effectively incorporate contextual and geographic information to improve the modelling of species distribution [1].
Advances in this area have many implications for ecological analysis, including the ability to more effectively engage with citizens regarding wildlife preservation and education [2]. Computer vision methods that process large datasets of habitat imagery to predict the most likely species inhabiting an area allow for significant theoretical and applied improvements in this field. However, from a classification-based computer vision perspective, the key challenges of this problem are twofold: unbalanced data, and classes with only minute differences distinguishing one from another.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Email: sachith.seneviratne@unimelb.edu.au (S. Seneviratne)
ORCID: 0000-0001-9094-2736 (S. Seneviratne)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Imagery in the environment can be broadly divided into two categories: the built world and the natural world. Remote sensing datasets generally contain imagery of both types. Many challenging computer vision tasks arise in the natural world imagery domain [3]. Such tasks usually fall under the domain of "fine-grained visual categorization", an active area of research in computer vision. Image classification problems with fine distinctions between classes can be difficult for computer vision techniques to perform robustly on, especially when combined with a large number of classes with unbalanced data and certain classes being heavily under-represented. This is termed the "long-tailed class distribution" problem.
These difficulties are present in the classification problem explored in this paper, where, given a satellite image of a habitat location, the species inhabiting that location must be predicted from a list of over 30,000 candidate species. In contrast to standard classification problems, the target of classification is absent from the image in this particular task.

Contrastive representation learning techniques have been extensively explored for classification problems. However, their performance on representation learning across different data domains is less well understood [4]. This work contributes to the existing literature exploring self-supervised representation learning methods on remote sensing imagery and related data sources. These include methods exploring the performance of existing self-supervised methods on remote sensing data [5], self-supervision techniques that exploit the location and time invariance of remote sensing data to perform representation learning [6], and methods that exploit the spatiotemporal structure of remote sensing data to perform self-supervision [7].

In this paper, I detail my workflow for the winning submission to GeoLifeCLEF2021 and summarize my performance representing the University of Melbourne at this challenge. This competition¹ was organized as GeoLifeCLEF 2021 [8], as part of LifeCLEF 2021 [9] and in conjunction with the FGVC8² workshop at CVPR³ 2021. Comparisons of results are made primarily with existing benchmarks, which include the state of the art for this problem. A comparison with other competitors is not included. Additionally, I explore the details of the transformation pipeline used to improve the feature representation learned by the model, and I introduce a consistency-based model selection function. This function was useful for model selection for evaluation on the public leaderboard.
This work is derivative of a larger computer vision framework connecting aspects of the environment (built and natural). This framework draws high-level inspiration from [10], and the insights gained from both projects allowed a winning solution to be crafted for this problem. Further discussion of such insights is beyond the scope of this paper.

2. Data and Evaluation Metrics
In this section, I describe the datasets and evaluation metrics used for training and evaluating my models. An overall description of all datasets used in this work is also presented, as my workflow only uses either one or two of the available datasets for training and evaluation. Top-30 error is used for comparing different methods. A detailed discussion of the metrics used in the competition can be found in [11], with a detailed discussion of the datasets in [12].

¹ https://www.kaggle.com/c/geolifeclef-2021
² https://sites.google.com/view/fgvc8
³ IEEE/CVF Conference on Computer Vision and Pattern Recognition - http://cvpr2021.thecvf.com/

2.1. Dataset
This work builds upon the following types of imagery:
• RGB remote sensing imagery
• Altitude imagery
These imagery types have a pixel-wise correspondence in terms of geographical overlap at each location, are 256x256 pixels in size, and have a spatial resolution of 1 meter per pixel. Therefore each image covers an area of 256x256 square meters. Altitude imagery was derived using elevation data from the NASA Shuttle Radar Topography Mission⁴. RGB remote sensing imagery came from two sources: in the US, from the 2009-2011 cycle of the National Agriculture Imagery Program⁵, and in France, from the BD ORTHO® 2.0 and ORTHO HR® 1.0 databases of the French National Institute of Geographic and Forest Information⁶.

2.2. Class Distribution
One of the main difficulties in this problem arises from the unbalanced class distribution.
Interestingly, over 60% of all classes have fewer than 10 training images, and nearly 8,000 classes (about 25% of all classes) have only a single image to train on. The data distribution is shown in Figure 1, which plots the number of training records on the x-axis (as closed intervals on a discretized logarithmic scale) and the number of classes in each range on the y-axis.

Figure 1: Training dataset distribution. Most classes are heavily under-represented in the dataset.

⁴ https://lpdaac.usgs.gov/products/srtmgl1v003/
⁵ https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/
⁶ https://geoservices.ign.fr

2.3. Consistency-based Model Selection Metric
In this section, I introduce a metric which was used in lieu of a validation split on this problem. I use the proxy task of "country prediction" to derive an additional validation metric, building on the "city prediction" task introduced in [13]. Given that many of the species are endemic to one country (the US or France but not both), it is reasonable that a model with higher accuracy in species prediction would also perform better on the pseudo-task of predicting which country an observation belongs to. This follows from the model's understanding of which species can occur in a particular country. An error rate is calculated for each model corresponding to how many times the model makes an impossible prediction by assigning a species to a country that does not host that species (based on the training data). Note that this consistency check only makes sense with the "variable-withholding" strategy described in Section 3: if the model had access to any geographical information (GPS coordinates or country label), it would simply learn this information and not make such mistakes. By intentionally withholding such information from the model, I gain two advantages:
• I am able to use this consistency error as a pseudo-validation metric.
• It is possible to incorporate withheld data at a later stage of model training (for example, during ensembling of individual models trained on all co-variates) in order to further improve model performance.

The calculation of this function is straightforward:
1. Categorize each species as "fr", "us" or "both" depending on its country of occurrence.
2. At validation time, for each predicted label in the top-30 predictions for a particular image, do the following:
   • Count the number of "US" species: N_US
   • Count the number of "FR" species: N_FR
   • Count the number of "US and FR" species: N_BOTH
3. Count the number of instances where both N_US > 0 and N_FR > 0.
This count acts as the "confounder" count (or misclassification count) for that model variant, where models with fewer confounders are better. This metric was used for model checkpoint selection for submission to the leaderboard, but its effectiveness requires further exploration with respect to performance against an actual validation split of the data.

3. Methodology
The main problem explored in this paper is the overlapping value of the different data sources provided as part of the competition. Since the prediction problem was quite difficult, I focused on approaches that allowed the model to exploit all possible information present in each individual image type, starting with RGB imagery. I explore the following questions in this regard:
• Given that the data consists of base imagery (RGB) augmented by 3 co-variates (NIR, land-use and altitude) at the same location, is it possible to derive most of the information present in all 4 data types using only the base RGB imagery?
• Given the above is achievable, what further information regarding the prediction variable can be extracted from the co-variates?
• What is the best way to combine this information to improve prediction performance?

3.1. Transformations
Image transformations have often been touted as a means of providing more variety during the training process. As the input data used for training neural networks is often fixed, the model may see the same data epoch after epoch, leading to overfitting. This is especially true in fine-grained visual categorization problems where poorly represented classes (<10 images per class) make up the majority. In such cases, approaches such as adversarial training and image transformations/augmentations have been shown to provide significant improvements over baseline methods. In this section I describe the image augmentation strategy that was used to combat overfitting. A discussion of modifications for multimodal analysis can be found under Section 3.3. The transformation pipeline is as follows:
• Subtracting the per-channel ImageNet [14] mean and dividing by the per-channel ImageNet standard deviation.
• Random horizontal flip
• Random vertical flip
• RandAugment [15] with hyperparameters N set to 2 and M set to 9. N represents the number of augmentation transformations to be applied, while M controls the magnitude of all the transformations.

3.2. Unimodal Analysis
In order to explore the possibility of extracting more information from the base RGB imagery, the initial experiment focused on creating a workflow that uses only RGB imagery and ignores all other information available to the model for training and evaluation purposes. This includes co-variate images, geographic (GPS) location, country tag and environmental feature vectors. Additionally, past work [16] indicates the benefits of using pretrained feature representations for fine-grained visual categorization tasks.
MoCo [17] was used as a contrastive representation learning framework to initialize a feature representation for the model to build on, with pretraining carried out for 20 epochs on a single 4-GPU node on Spartan [18] using the hyperparameters in Table 1. The standard protocol for pretraining was followed, but all data across the US and France was combined to form a joint representation, which is required for the combined (both countries at the same time) modelling approach followed in this paper. Further training was conducted for 7 epochs in a supervised manner to finetune the feature representation. This training was performed with end-to-end finetuning of the ResNet50 using the parameters in Table 2. Checkpoints were generated each epoch, and the model with the lowest consistency error (as defined in Section 2.3) was selected as the best performing model.

Table 1: Representation Learning Parameters
Parameter            | Value    | Comments
Architecture         | ResNet50 | Smaller backbone for faster training
Batch size           | 128      |
Learning rate        | 1.5e-2   |
Softmax temperature  | 0.2      |

Table 2: Training Parameters
Parameter     | Value        | Comments
Framework     | PyTorch [19] |
Architecture  | ResNet50     | Same as above
Batch size    | 128          |
Learning rate | 1e-3         |

3.3. Multimodal Analysis
In this section I explore how multimodal imagery was incorporated into the training workflow. Only the addition of altitude imagery is covered here, with the other co-variates left for exploration in future work. This section uses the same workflow as Section 3.2, with a few key differences.

Figure 2: Generic architecture applicable to this problem. ResNet50 is used as the Deep Feature Extractor; the unimodal workflow only uses the top branch for training and analysis.

Pretraining using MoCo was carried out on altitude imagery as well, using an architecture identical to the bottom branch in Figure 2.
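The MoCo-style pretraining objective can be illustrated with a minimal sketch. Only the softmax temperature of 0.2 comes from Table 1; the toy linear encoder (standing in for the ResNet50), the momentum value, the static negative queue, and all dimensions below are simplifying assumptions relative to the full MoCo implementation.

```python
import copy
import torch
import torch.nn.functional as F

temperature = 0.2   # softmax temperature from Table 1
momentum = 0.999    # assumed key-encoder momentum
feat_dim, queue_len = 32, 256

encoder_q = torch.nn.Linear(64, feat_dim)   # query encoder (trained by backprop)
encoder_k = copy.deepcopy(encoder_q)        # key encoder (momentum-updated copy)
for p in encoder_k.parameters():
    p.requires_grad = False

# Queue of normalized negative keys; a real implementation refreshes this
# with key features from previous batches.
queue = F.normalize(torch.randn(queue_len, feat_dim), dim=1)

def moco_loss(x_q, x_k):
    """InfoNCE loss: one positive pair per sample plus queued negatives."""
    q = F.normalize(encoder_q(x_q), dim=1)
    with torch.no_grad():
        # Momentum update of the key encoder before encoding the keys.
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.mul_(momentum).add_(pq.detach(), alpha=1 - momentum)
        k = F.normalize(encoder_k(x_k), dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)   # positive logits: (B, 1)
    l_neg = q @ queue.t()                      # negative logits: (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)
```

In practice the two inputs are differently augmented views of the same image, so the loss pulls their representations together while pushing away the queued negatives.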
The architecture was modified to include an identical sister network as in the unimodal analysis, combined via concatenation at the final bottleneck layer of the ResNet50. The resulting 4096-node layer fed a 31,180-node linear layer with softmax applied in order to infer labels for the task at hand. In this regard, the architecture, shown in Figure 2, was identical to the unimodal case, with the key difference being the number of inputs to the linear layer (multimodal: 4096 vs unimodal: 2048). The single altitude channel was replicated across 3 channels to be compatible with a standard ResNet50. An advantage of this architecture is its extensibility to different image modalities, with the added ability to create separate filters for the individual image modalities and thereby combine higher-level rather than lower-level features (which was the main reason for stacking near the end of the ResNet50 architecture as opposed to near the beginning). My intuition in doing so is that the architecture is able to process more refined knowledge about the different image domains, instead of trying to learn an embedding that attempts to unify its representation of all domains combined. This has the marked disadvantage of increasing the GPU memory footprint of the architecture, which significantly impacts training time and is perhaps the key weakness of this approach. The batch size was lowered to 64 to accommodate the larger architecture, leading to a roughly 3-fold slowdown in training the model. End-to-end finetuning of the ResNet50 was only conducted for 4 epochs because of these additional computational requirements.
A Siamese-network-based representation learning approach following [20] (where weights are shared between the branches, thereby reducing the model footprint on the GPU) was considered but quickly discarded, on the basis that the image domains in this problem are too different from each other to benefit from shared knowledge at the filter level. One other key difference was the modification of the transformation pipeline to remove most augmentations during training. This is primarily an artefact of the implementation, which used two separate PyTorch Dataloaders instead of a single dataloader. As a result, horizontal and vertical flipping and other transformations would occur independently of each other, breaking the correspondence of the image patches, which no longer share the same orientation. Therefore, all transformations other than normalization (using ImageNet statistics) were removed from the dataloaders.

4. Results
Several methods from prior work in this area (including a random-forest-based approach) were used for comparison. More details on the low-level implementation of these benchmarks can be found in [21]. The multimodal approach beats existing supervised techniques by a considerable margin, while the unimodal implementation shows performance equivalent to the existing state of the art. Table 3 reports public and private leaderboard performance, corresponding to a 10% vs 90% data split respectively.

Table 3: Results of Top-30 error rate across compared models
Method                      | Public leaderboard | Private leaderboard
Random Forest               | 0.78325            | 0.79711
Supervised CNN (multimodal) | 0.75283            | 0.76680
Mine (unimodal)             | 0.75726            | 0.75188
Mine (multimodal)           | 0.73679            | 0.74838

5. Future Work
While initial analysis on this problem is promising, there are many research directions still open to exploration. The impact of transformations was not fully explored in this work.
For multimodal analysis, a better implementation may be to ensure all transformations are consistently applied across all data sources, so that the image patches propagated through the neural network correspond to the exact same geographic region (which is not the case when the transformations are applied independently across data sources). While the consistency metric introduced in this work was useful for model selection, further comparison with standard validation splits would help to evaluate its utility on this problem. Due to the absence of key ablations, it is unclear where some of the performance gains derive from, and future work could shed further light on this issue. Additionally, for the consistency function introduced in this work, it is possible that certain species may inhabit nearly identical habitats across both geographies, which may affect the broader usability of this function in different situations.

6. Conclusion
In this paper, I have presented a workflow for achieving state-of-the-art results on computer vision based SDM. I have introduced a consistency-based model selection function that relies on the strategy of withholding information from the models during the training process in order to improve performance. Additionally, this work pushes the boundaries of using contrastive visual representation learning on remote sensing imagery: an area which is currently under-represented in the research literature. This paper makes a significant contribution to the area of fine-grained visual categorization. My methods surpass the current state of the art using only a quarter of the data used by the current state-of-the-art supervised work in this area: a single data modality, whereas the current state of the art uses 4. I have also presented initial work on future research directions, providing a methodology and initial results for including further image modalities to drive increased model performance.
Acknowledgments
This project is supported by National Health and Medical Research Grant GA80134. This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200. This research was undertaken using University of Melbourne Research Computing facilities established by the Petascale Campus Initiative.

References
[1] B. Deneu, M. Servajean, P. Bonnet, C. Botella, F. Munoz, A. Joly, Convolutional neural networks improve species distribution modelling by capturing the spatial structure of the environment, PLoS Computational Biology 17 (2021) e1008856.
[2] P. Bonnet, A. Joly, J.-M. Faton, S. Brown, D. Kimiti, B. Deneu, M. Servajean, A. Affouard, J.-C. Lombardo, L. Mary, et al., How citizen scientists contribute to monitor protected areas thanks to automatic plant identification tools, Ecological Solutions and Evidence 1 (2020) e12023.
[3] G. V. Horn, E. Cole, S. Beery, K. Wilber, S. Belongie, O. M. Aodha, Benchmarking representation learning for natural world image collections, 2021. arXiv:2103.16483.
[4] E. Cole, X. Yang, K. Wilber, O. M. Aodha, S. Belongie, When does contrastive visual representation learning work?, 2021. arXiv:2105.05837.
[5] V. Stojnić, V. Risojević, Self-supervised learning of remote sensing scene representations using contrastive multiview coding, 2021. arXiv:2104.07070.
[6] O. Mañas, A. Lacoste, X. G. i Nieto, D. Vazquez, P. Rodriguez, Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data, 2021. arXiv:2103.16607.
[7] K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, S. Ermon, Geography-aware self-supervised learning, 2020. arXiv:2011.09980.
[8] T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Joly, Overview of GeoLifeCLEF 2021: Predicting species distribution from 2 million remote sensing images, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021.
[9] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, R. Ruiz De Castañeda, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Durso, H. Klinck, T. Denton, I. Eggel, P. Bonnet, H. Müller, Overview of LifeCLEF 2021: a system-oriented evaluation of automated species identification and species distribution prediction, in: Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021), 2021.
[10] S. H. Seneviratne, Automatic Code Generation for Statistical Models with Augmentation and Collapsing, Ph.D. thesis, Monash University, 2020.
[11] B. Deneu, T. Lorieul, E. Cole, M. Servajean, C. Botella, P. Bonnet, A. Joly, Overview of LifeCLEF location-based species prediction task 2020 (GeoLifeCLEF), in: CLEF 2020, 2020.
[12] E. Cole, B. Deneu, T. Lorieul, M. Servajean, C. Botella, D. Morris, N. Jojic, P. Bonnet, A. Joly, The GeoLifeCLEF 2020 dataset, arXiv preprint arXiv:2004.04192 (2020).
[13] M. Stevenson, J. Thompson, T. H. de Sá, R. Ewing, D. Mohan, R. McClure, I. Roberts, G. Tiwari, B. Giles-Corti, X. Sun, et al., Land use, transport, and population health: estimating the health benefits of compact cities, The Lancet 388 (2016) 2925–2935.
[14] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 25, Curran Associates, Inc., 2012. URL: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
[15] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, RandAugment: Practical automated data augmentation with a reduced search space, 2019. arXiv:1909.13719.
[16] M. G. Krishnan, Impact of pretrained networks for snake species classification (2020).
[17] X. Chen, H. Fan, R. Girshick, K. He, Improved baselines with momentum contrastive learning, 2020. arXiv:2003.04297.
[18] L. Lafayette, G. Sauter, L. Vu, B. Meade, Spartan performance and flexibility: An HPC-cloud chimera, OpenStack Summit, Barcelona 27 (2016).
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, arXiv preprint arXiv:1912.01703 (2019).
[20] S. Seneviratne, N. Kasthuriaarachchi, S. Rasnayaka, Multi-dataset benchmarks for masked identification using contrastive representation learning, 2021. arXiv:2106.05596.
[21] B. Deneu, M. Servajean, P. Bonnet, F. Munoz, A. Joly, Participation of LIRMM/Inria to the GeoLifeCLEF 2020 challenge (2020).