Contrastive Representation Learning for Natural World Imagery: Habitat prediction for 30,000 species

Sachith Seneviratne (sachith.seneviratne@unimelb.edu.au)
Melbourne School of Design, Transport Health and Urban Design Research Lab, The University of Melbourne, Parkville VIC 3010, Australia

Keywords: Fine Grained Visual Categorization, Representation Learning, Self Supervision, Transfer Learning, Domain adaptation
ISSN 1613-0073

Recent work in contrastive representation learning has pushed the boundaries of classification tasks in computer vision, achieving state-of-the-art results on many established benchmarks. However, the performance of these methods on natural imagery tasks that fall into the category of fine-grained image classification can be further improved. In this paper, I present a methodology that explores this issue and achieves state-of-the-art results on species distribution modelling from remote sensing imagery as part of the GeoLifeCLEF2021 challenge. My method beats the current state of the art on this challenge (which is trained on 4 types of imagery) using only base RGB imagery. Initial experiments indicate that modifying the architecture to include additional image modalities further improves performance on the task of location-based species recommendation. Additionally, I introduce a consistency function, which relies on the strategy of withholding data from the model and is useful for checking model generality without relying on a validation split.

Introduction

Species Distribution Modelling (SDM) is the study of computational techniques to predict species distribution across both geographical locations and time using different forms of environmental data. Computer vision techniques have garnered attention in this area due to their ability to effectively incorporate contextual and geographic information to improve the modelling of species distribution [1]. Advances in this area have many implications for ecological analysis, including the ability to engage more effectively with citizens regarding wildlife preservation and education [2]. Computer vision methods that can process large datasets of habitat imagery and predict the species most likely to inhabit an area enable significant theoretical and applied improvements in this field. However, from a classification-based computer vision perspective, the key challenges of this problem are twofold: unbalanced data and classes that differ from one another only in minute details.

Imagery in the environment can be broadly divided into two categories: built and natural world. Remote sensing datasets will generally contain imagery pertaining to both these types. Many challenging tasks in computer vision arise in the natural world imagery domain [3]. Such tasks usually fall under the domain of "fine-grained visual categorization", an active area of research in computer vision.

Imagery based classification problems with a fine distinction between classes can be difficult for computer vision techniques to perform robustly on, especially when combined with a large number of classes featuring unbalanced data and certain classes being heavily under-represented. This is termed the "long-tailed class distribution" problem. These difficulties are present in the classification problem explored in this paper, where using a satellite image of a habitat location, the species that inhabits that location must be predicted from a list of over 30,000 candidate species. In contrast to standard classification problems, the target candidate for classification is absent within the image in this particular task.

Contrastive representation learning techniques have been extensively explored for classification problems. However, their performance on representation learning across different data domains is less well understood [4]. This work contributes to the body of existing literature exploring self-supervised representation learning methods on remote sensing imagery and related data sources. These include methods exploring the performance of existing self-supervised methods on remote sensing data [5], self-supervision techniques which exploit location and time invariance of remote sensing data to perform representation learning [6] and methods which exploit the spatiotemporal structure of remote sensing data to perform self-supervision [7].

In this paper, I detail my workflow for the winning submission to GeoLifeCLEF2021 and summarize my performance representing the University of Melbourne at this challenge. This competition¹ was organized as GeoLifeCLEF 2021 [8], as part of LifeCLEF 2021 [9] and in conjunction with the FGVC8² workshop at CVPR³ 2021. Comparisons of results are made primarily with existing benchmarks, which include the state of the art for this problem. A comparison with other competitors is not included. Additionally, I explore the details of the transformations pipeline used for improving the feature representation learned by the model and also introduce a consistency-based model selection function. This function was useful for the purpose of model selection for evaluation on the public leaderboard. This work is derivative of a larger computer vision framework connecting aspects of the environment (built and natural). This framework draws high-level inspiration from [10], and the insights gained from both projects allowed a winning solution to be crafted for this problem. Further discussion of such insights is beyond the scope of this paper.

Data and Evaluation Metrics

In this section, I explore the datasets and evaluation metrics that are used for training and evaluating my models. An overall description of all datasets used in this work is also presented, as my workflow only uses either one or two of the available datasets for training and evaluation. Top-30 error is used for comparing different methods. A detailed discussion of the metrics used in the competition can be found in [11], and a detailed discussion of the datasets in [12].
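As a rough illustration of the top-30 error mentioned above, the following sketch counts an observation as an error when its true species is missing from the first k ranked predictions. The function name `top_k_error` and the toy labels are hypothetical; the competition's reference implementation is described in [11] and may differ in detail.

```python
from typing import Sequence


def top_k_error(true_labels: Sequence[int],
                ranked_predictions: Sequence[Sequence[int]],
                k: int = 30) -> float:
    """Fraction of observations whose true species is not among the first k predictions."""
    if len(true_labels) != len(ranked_predictions):
        raise ValueError("one ranked prediction list is needed per observation")
    if not true_labels:
        raise ValueError("at least one observation is required")
    misses = sum(
        1 for truth, ranked in zip(true_labels, ranked_predictions)
        if truth not in ranked[:k]
    )
    return misses / len(true_labels)


# Toy example (hypothetical labels) with k=2 so the lists stay short.
truth = [3, 7, 1]
preds = [[3, 5, 9], [2, 4, 7], [1, 0, 8]]
print(top_k_error(truth, preds, k=2))
```

In the toy example only the second observation is a miss, because its true label appears at rank 3 while k is 2.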

Dataset

This work builds upon the following types of imagery:

• RGB remote sensing imagery
• Altitude imagery

These imagery types have a pixel-wise correspondence in terms of geographical overlap at each location. Each image is 256x256 pixels with a spatial resolution of 1 meter per pixel, and therefore covers an area of 256 m x 256 m. Altitude imagery was derived using elevation data from the NASA Shuttle Radar Topography Mission⁴. RGB remote sensing imagery came from 2 sources: in the US, from the 2009-2011 cycle of the National Agriculture Imagery Program⁵, and in France, from the BD ORTHO® 2.0 and ORTHO HR® 1.0 databases of the French National Institute of Geographic and Forest Information⁶.

Class Distribution

One of the main difficulties in this problem arises from the unbalanced class distribution. Over 60% of all classes have fewer than 10 training images, and nearly 8,000 classes (about 25% of all classes) have only a single image to train on. The data distribution is shown in Figure 1, which plots the number of training records on the x-axis, as closed intervals on a discretized logarithmic scale, against the number of classes that fall within each interval on the y-axis.
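A histogram of this kind can be sketched as follows. The bin edges in `BINS` are hypothetical, since the text does not list the exact intervals used in Figure 1, and `classes_per_bin` is an illustrative helper rather than code from the original workflow.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical closed intervals on a log10 scale; the exact bin edges of
# Figure 1 are not given in the text.
BINS: List[Tuple[int, int]] = [(1, 1), (2, 10), (11, 100), (101, 1000), (1001, 10**9)]


def classes_per_bin(labels: Iterable[int]) -> Dict[Tuple[int, int], int]:
    """Count how many classes have a training-set size inside each closed interval."""
    images_per_class = Counter(labels)  # class id -> number of training images
    histogram = {interval: 0 for interval in BINS}
    for n_images in images_per_class.values():
        for low, high in BINS:
            if low <= n_images <= high:
                histogram[(low, high)] += 1
                break
    return histogram


# Toy labels: class 0 has 3 images, class 1 has 1 image, class 2 has 2 images.
print(classes_per_bin([0, 0, 0, 1, 2, 2]))
```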

Consistency-based Model Selection Metric

In this section, I introduce a metric that was used in lieu of a validation split on this problem. I use the proxy task of "country prediction" to derive an additional validation metric, building on the "city prediction" task introduced in [13]. Given that many of the species are endemic to one country (US or France but not both), it is reasonable to expect that a model with higher species prediction accuracy would also perform better on the pseudo-task of predicting the country. This prediction is derived from the model's understanding of which species can belong to a particular country. An error rate is calculated for each model, corresponding to how many times the model makes an impossible prediction by assigning a species to a country that does not host that species (based on the training data). Note that this consistency measure only makes sense with the "variable-withholding" strategy described in Section 3: if the model had access to any geographical information (GPS co-ordinates or country label), it would simply learn this information and not make such mistakes. By intentionally withholding such information from the model I gain two advantages:

• I am able to use this consistency error as a pseudo-validation metric.
• It is possible to incorporate withheld data at a later stage of model training (for example during ensembling of individual models trained on all co-variates) in order to further improve model performance.

The calculation of this function is straightforward:

1. Categorize each species as "US", "FR" or "both", depending on the countries in which it occurs in the training data.
2. At validation time, for the top-30 predicted labels of each image:
• Count the number of "US" species: $N_{US}$
• Count the number of "FR" species: $N_{FR}$
• Count the number of "both" species: $N_{BOTH}$
3. Count the number of images for which both $N_{US} > 0$ and $N_{FR} > 0$.

This count acts as the "confounder" count (or misclassification count) for that model variant where models with fewer confounders are better. This metric was used for model checkpoint selection for submission to the leaderboard, but its effectiveness requires further exploration with respect to performance against an actual validation split of the data.
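The calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names (`species_countries`, `confounder_count`), the lowercase country codes and the toy records are hypothetical and not taken from the original implementation.

```python
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple


def species_countries(train_records: Iterable[Tuple[Hashable, str]]) -> Dict[Hashable, Set[str]]:
    """Map each species to the set of countries ("us", "fr") where it occurs in the training data."""
    seen: Dict[Hashable, Set[str]] = {}
    for species, country in train_records:
        seen.setdefault(species, set()).add(country)
    return seen


def confounder_count(top_predictions: Sequence[Sequence[Hashable]],
                     countries: Dict[Hashable, Set[str]]) -> int:
    """Count images whose top-k prediction list mixes US-only and FR-only species."""
    confounders = 0
    for predicted in top_predictions:
        n_us = sum(1 for s in predicted if countries.get(s) == {"us"})
        n_fr = sum(1 for s in predicted if countries.get(s) == {"fr"})
        # Species seen in both countries are compatible with either country,
        # so they never make a prediction list inconsistent.
        if n_us > 0 and n_fr > 0:
            confounders += 1
    return confounders


# Toy data (hypothetical): "a" is US-only, "b" is FR-only, "c" occurs in both.
train = [("a", "us"), ("b", "fr"), ("c", "us"), ("c", "fr")]
lookup = species_countries(train)
print(confounder_count([["a", "c"], ["a", "b"], ["b", "c"]], lookup))
```

Only the second prediction list mixes a US-only and an FR-only species, so it is the only confounder in the toy data. Checkpoint selection then amounts to keeping the checkpoint with the smallest count.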

Methodology

The main problem explored in this paper is the overlapping value of the different data sources provided as part of the competition. Since the prediction problem was quite difficult, I focused on approaches that allowed the model to exploit all possible information present in each individual image type, starting with RGB imagery. I explore the following questions in this regard:

• Given that the data consists of base imagery (RGB) augmented by 3 co-variates (NIR, land-use and altitude) at the same location, is it possible to derive most of the information present in all 4 data types using only the base RGB imagery?
• Given the above is achievable, what further information regarding the prediction variable can be extracted from the co-variates?
• What is the best way to combine this information to improve prediction performance?

Transformations

Image transformations are often used to provide more variety in the training process. As the input data used for training neural networks is usually fixed, the model sees the same data epoch after epoch, which can lead to overfitting. This is especially true in fine-grained visual categorization problems where poorly represented classes (<10 images per class) make up the majority. In such cases, approaches such as adversarial training and image transformations/augmentations have been shown to provide significant improvements over baseline methods. In this section I explore the image augmentation strategy that was used to combat overfitting. A discussion of modifications for multimodal analysis can be found under Section 3.3.

The transformation pipeline is as follows:

• Subtracting the per-channel ImageNet [14] mean and dividing by the per-channel ImageNet standard deviation.
• Random horizontal flip.
• Random vertical flip.
• RandAugment [15], with hyperparameters N = 2 and M = 9. N is the number of augmentation transformations applied to each image, while M controls the magnitude of all the transformations.

Unimodal Analysis

In order to explore the possibility of extracting more information from the base RGB imagery, the initial experiment focused on a workflow that uses only RGB imagery and ignores all other information available for training and evaluation. The ignored information includes co-variate images, geographic (GPS) location, country tag and environmental feature vectors. Additionally, past work [16] indicates the benefits of using pretrained feature representations for fine-grained visual categorization tasks. MoCo [17] was therefore used as the contrastive representation learning framework to initialize a feature representation for the model to build on. Pretraining was carried out for 20 epochs on a single 4-GPU node on Spartan [18] using the hyperparameters in Table 1. The standard pretraining protocol was followed, except that all data across the US and France was combined to form a single representation, as required by the combined (both countries at the same time) modelling approach followed in this paper.
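For reference, the contrastive objective optimized by MoCo is the InfoNCE loss. It is not restated in this work, so the notation below is borrowed from the MoCo formulation: $q$ is the query embedding, $k_{+}$ the positive key, and the sum runs over the positive key and $K$ negative keys from the queue. The temperature is the softmax temperature listed in Table 1.

```latex
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_{+} / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}, \qquad \tau = 0.2
```

A lower temperature sharpens the softmax over keys, so hard negatives contribute more strongly to the gradient.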

Further training was conducted for 7 epochs in a supervised manner to finetune the feature representation. This training was performed with end-to-end finetuning of the ResNet50 using the parameters in Table 2. A checkpoint was generated each epoch, and the checkpoint with the lowest consistency error (as defined in Section 2.3) was selected as the best performing model.

Multimodal Analysis

In this section I explore how multimodal imagery was incorporated into the training workflow. Only the addition of altitude imagery is covered, with the other co-variates left as future work. This section uses the same workflow as Section 3.2, with a few key differences. Pretraining using MoCo was also carried out on altitude imagery, using an architecture identical to the bottom branch in Figure 2. The unimodal architecture was then extended with an architecturally identical sister network, and the two branches were combined by concatenation at the final bottleneck layer of the ResNet50. The resulting 4096-node layer feeds a 31180-node linear layer with softmax applied in order to infer labels for the task at hand. In this regard, the architecture, which is shown in Figure 2, is identical to the unimodal case, with the key difference being the number of inputs to the linear layer (4096 for multimodal vs 2048 for unimodal). The single altitude channel was replicated across 3 channels to be compatible with a standard ResNet50.

An advantage of this architecture is its extensibility to different image modalities, with the added ability to learn separate filters for the individual image modalities and thereby combine higher-level features rather than lower-level features (which was the main reason for stacking near the end of the ResNet50 architecture as opposed to near the beginning). My intuition is that the architecture can process more refined knowledge about the different image domains instead of trying to learn an embedding that unifies its representation of all domains combined. The marked disadvantage is an increased GPU memory footprint, which significantly impacts training time and is perhaps the key weakness of this approach. The batch size was lowered to 64 to accommodate the larger architecture, leading to a roughly 3-fold slowdown in training. End-to-end finetuning of the ResNet50 was conducted for only 4 epochs because of these additional computational requirements. A Siamese network approach to representation learning based on [20] (where weights are shared between the branches, thereby reducing the model footprint on the GPU) was considered but quickly discarded, on the basis that the image domains in this problem are too different from each other to benefit from shared knowledge at the filter level.

One other key difference was the modification of the transformations pipeline to remove most augmentations during training. This is primarily an artefact of the implementation, which used two separate PyTorch DataLoaders instead of a single one. As a result, horizontal flips, vertical flips and other transformations would be applied to each modality independently, breaking the correspondence between image patches because they would no longer share the same orientation. Therefore, all transformations other than normalization (using ImageNet statistics) were removed from the dataloaders.
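One possible way to keep the two modalities aligned, which was not done in this work, is to draw each random decision once and apply it to both patches. The sketch below shows the idea for flips only; the function name `paired_random_flips` and its parameters are hypothetical.

```python
from typing import Optional, Tuple

import torch


def paired_random_flips(rgb: torch.Tensor,
                        altitude: torch.Tensor,
                        p: float = 0.5,
                        generator: Optional[torch.Generator] = None
                        ) -> Tuple[torch.Tensor, torch.Tensor]:
    """Apply the same random horizontal and vertical flips to both (C, H, W) patches."""
    if torch.rand(1, generator=generator).item() < p:
        # Horizontal flip: reverse the width axis of both patches.
        rgb, altitude = rgb.flip(-1), altitude.flip(-1)
    if torch.rand(1, generator=generator).item() < p:
        # Vertical flip: reverse the height axis of both patches.
        rgb, altitude = rgb.flip(-2), altitude.flip(-2)
    return rgb, altitude


# Toy patches: the altitude patch copies the first RGB channel, so the two
# stay identical after any shared flip.
rgb_patch = torch.arange(48, dtype=torch.float32).reshape(3, 4, 4)
alt_patch = rgb_patch[:1].clone()
flipped_rgb, flipped_alt = paired_random_flips(rgb_patch, alt_patch)
print(torch.equal(flipped_rgb[:1], flipped_alt))
```

Such a function would be called inside a single dataset that returns both patches for a location, so that one sample carries one shared set of random draws.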

Results

Several methods (including a random-forest based approach) from prior work in this area were used for comparison. More details on the low-level implementation of these benchmarks can be found in [21]. The multimodal approach beats existing supervised techniques by a considerable margin, while the unimodal implementation performs comparably to the existing state of the art (slightly worse on the public leaderboard and better on the private leaderboard). Table 3 reports both public and private leaderboard performance, which are computed on a 10% vs 90% split of the data respectively.

Future Work

While initial analysis on this problem is promising, there are many research directions still open to exploration. The impact of transformations was not fully explored in this work. For multi-modal analysis, a better implementation may be to ensure all transformations are consistently applied across all data sources, so that the image patches propagated through the neural network correspond to the exact same geographic region (which is not the case when the transformations are applied independently across data sources). While the consistency metric introduced in this work was useful for model selection, further comparison with standard validation splits would be useful in further evaluating its utility on this problem. Due to the absence of key ablations, it is unclear where some of the performance gains are being derived, and future work could shed further light on this issue. Additionally, for the consistency function introduced in this work, it is possible that certain species may inhabit nearly identical habitats across both geographies, which may affect the broader usability of this function in different situations.

Conclusion

In this paper, I have presented a workflow for achieving state-of-the-art results on computer vision based SDM. I have introduced a consistency-based model selection function that relies on the strategy of withholding information from the models during the training process in order to improve performance. Additionally, this work pushes the boundaries of using contrastive visual representation learning on remote sensing imagery, an area which is currently underrepresented in the research literature. This paper makes a significant contribution to the area of fine-grained visual categorization. My methods are able to surpass the current state-of-the-art supervised work in this area using only a quarter of its data: a single data modality, whereas the current state of the art uses 4. I have also presented initial work on future research directions and provided a methodology and initial results for including further image modalities to drive increased model performance.

Figure 1: Training dataset distribution. Most classes are heavily under-represented in the dataset.

Figure 2: Generic architecture applicable to this problem. ResNet50 is used as the Deep Feature Extractor, and the unimodal workflow only uses the top branch for training and analysis.

Table 1: Representation Learning Parameters

| Parameter | Value | Comments |
|---|---|---|
| Architecture | ResNet50 | Smaller backbone for faster training |
| Batch size | 128 | |
| Learning rate | 1.5e-2 | |
| Softmax temperature | 0.2 | |

Table 2: Training Parameters

| Parameter | Value | Comments |
|---|---|---|
| Framework | PyTorch [19] | |
| Architecture | ResNet50 | Same as above |
| Batch size | 128 | |
| Learning rate | 1e-3 | |

Table 3: Results of Top-30 error rate across compared models

| Method | Public leaderboard | Private leaderboard |
|---|---|---|
| Random Forest | 0.78325 | 0.79711 |
| Supervised CNN (multimodal) | 0.75283 | 0.76680 |
| Mine (unimodal) | 0.75726 | 0.75188 |
| Mine (multimodal) | 0.73679 | 0.74838 |

¹ https://www.kaggle.com/c/geolifeclef-2021
² https://sites.google.com/view/fgvc8
³ IEEE/CVF Conference on Computer Vision and Pattern Recognition, http://cvpr2021.thecvf.com/
⁴ https://lpdaac.usgs.gov/products/srtmgl1v003/
⁵ https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/
⁶ https://geoservices.ign.fr

Acknowledgments

This project is supported by National Health and Medical Research Grant GA80134. This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200. This research was undertaken using University of Melbourne Research Computing facilities established by the Petascale Campus Initiative.

References

[1] B. Deneu, M. Servajean, P. Bonnet, C. Botella, F. Munoz, A. Joly, Convolutional neural networks improve species distribution modelling by capturing the spatial structure of the environment, PLoS Computational Biology 17 (2021) e1008856.
[2] P. Bonnet, A. Joly, J.-M. Faton, S. Brown, D. Kimiti, B. Deneu, M. Servajean, A. Affouard, J.-C. Lombardo, L. Mary, How citizen scientists contribute to monitor protected areas thanks to automatic plant identification tools, Ecological Solutions and Evidence 1 (2020) e12023.
[3] G. V. Horn, E. Cole, S. Beery, K. Wilber, S. Belongie, O. M. Aodha, Benchmarking representation learning for natural world image collections, arXiv:2103.16483 (2021).
[4] E. Cole, X. Yang, K. Wilber, O. M. Aodha, S. Belongie, When does contrastive visual representation learning work?, arXiv:2105.05837 (2021).
[5] V. Stojnić, V. Risojević, Self-supervised learning of remote sensing scene representations using contrastive multiview coding, arXiv:2104.07070 (2021).
[6] O. Mañas, A. Lacoste, X. G. Nieto, D. Vazquez, P. Rodriguez, Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data, arXiv:2103.16607 (2021).
[7] K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, S. Ermon, Geography-aware self-supervised learning, arXiv:2011.09980 (2020).
[8] T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Joly, Overview of GeoLifeCLEF 2021: Predicting species distribution from 2 million remote sensing images, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021.
[9] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, R. Ruiz De Castañeda, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Durso, H. Klinck, T. Denton, I. Eggel, P. Bonnet, H. Müller, Overview of LifeCLEF 2021: a system-oriented evaluation of automated species identification and species distribution prediction, in: Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021), 2021.
[10] S. H. Seneviratne, Automatic Code Generation for Statistical Models with Augmentation and Collapsing, Ph.D. thesis, Monash University, 2020.
[11] B. Deneu, T. Lorieul, E. Cole, M. Servajean, C. Botella, P. Bonnet, A. Joly, Overview of LifeCLEF location-based species prediction task 2020 (GeoLifeCLEF), in: CLEF 2020, 2020.
[12] E. Cole, B. Deneu, T. Lorieul, M. Servajean, C. Botella, D. Morris, N. Jojic, P. Bonnet, A. Joly, The GeoLifeCLEF 2020 dataset, arXiv preprint arXiv:2004.04192 (2020).
[13] M. Stevenson, J. Thompson, T. H. De Sá, R. Ewing, D. Mohan, R. Mcclure, I. Roberts, G. Tiwari, B. Giles-Corti, X. Sun, Land use, transport, and population health: estimating the health benefits of compact cities, The Lancet 388 (2016).
[14] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 25, Curran Associates, Inc., 2012.
[15] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, RandAugment: Practical automated data augmentation with a reduced search space, arXiv:1909.13719 (2019).
[16] M. G. Krishnan, Impact of pretrained networks for snake species classification, 2020.
[17] X. Chen, H. Fan, R. Girshick, K. He, Improved baselines with momentum contrastive learning, arXiv:2003.04297 (2020).
[18] L. Lafayette, G. Sauter, L. Vu, B. Meade, Spartan performance and flexibility: An HPC-cloud chimera, OpenStack Summit, Barcelona 27 (2016).
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, PyTorch: An imperative style, high-performance deep learning library, arXiv preprint arXiv:1912.01703 (2019).
[20] S. Seneviratne, N. Kasthuriaarachchi, S. Rasnayaka, Multi-dataset benchmarks for masked identification using contrastive representation learning, arXiv:2106.05596 (2021).
[21] B. Deneu, M. Servajean, P. Bonnet, F. Munoz, A. Joly, Participation of LIRMM/Inria to the GeoLifeCLEF 2020 challenge, 2020.