<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Present-Only and Present-Absent Data with Pseudo-Label Generation for Species Distribution Modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yi-Chia Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tai Peng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei-Hua Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chu-Song Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Taiwan University</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Predicting the composition of plant species at specific times and locations is crucial for biodiversity management and conservation. In this report, we leverage data from the GeoLifeCLEF 2024 challenge, which includes approximately 5 million plant occurrence records from Europe, a training set of about 90,000 plots, and a test set with 5,000 plots. These data encompass various modalities, including satellite images, climatic time series, land cover, human footprint, bioclimatic, and soil variables. Our approach combines a pseudo-label training framework based on large-scale data and multimodal pretrained deep learning models to address challenges such as multi-label learning from single positive labels, strong class imbalance, and large-scale data processing. On the private test set, our method achieved a score of 0.36837, securing second place on the leaderboard, just 0.04 points behind first place. We discuss the design of our approach and reflect on the results. Our code is available on GitHub.</p>
      </abstract>
      <kwd-group>
        <kwd>Species distribution modeling</kwd>
        <kwd>Presence-Only data</kwd>
        <kwd>Pseudo labels</kwd>
        <kwd>LifeCLEF</kwd>
        <kwd>multimodal deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Species distribution modeling (SDM) is a field of research focused on predicting the species that are most
likely to be observed at a given location and time. In recent years, the research community has collected
a vast amount of species observations from various regions, providing the opportunity to train
deep learning models to predict species distribution.</p>
      <p>
        In GeoLifeCLEF 2024 [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ], a large-scale training set is provided, but most of these samples have
only single or partial positive labels (about 5 million Presence-Only (PO) records and only 90,000
Presence-Absence (PA) samples with exhaustive labels). Therefore, effectively integrating the PO
data with the PA data is one of the key challenges.
      </p>
      <p>In this report, we propose a hybrid model that combines different CNN-based architectures for SDM.
Furthermore, we introduce the framework we employed during the competition, which effectively
utilizes the abundant PO data provided by the organizers to generate pseudo-labels. These pseudo-labels
are then combined with PA data to fine-tune our models.</p>
      <p>The rest of this report is structured as follows. Section 2 reviews related work. Section 3 provides a
detailed description of the dataset and the evaluation metric for the competition. Section 4 introduces
the proposed method. Section 5 presents the experimental results and ablation study. Finally, Section 6
concludes the report.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related works</title>
      <p>This section provides a brief overview of relevant works in the field of Single-Positive Multi-Label
Learning and SPMLL for Species Distribution Modeling.</p>
      <sec id="sec-2-1">
        <title>2.1. Single-Positive Multi-Label Learning, SPMLL</title>
        <p>Multi-label learning (MLL) [3] has had many practical applications, aiming to enable models
to classify multiple labels. However, collecting a large amount of training data with complete
multi-label annotation is quite difficult and time-consuming; therefore, SPMLL has been proposed to alleviate
the burden of multi-label annotation.</p>
        <p>Different from MLL, the goal of SPMLL is to achieve multi-label learning from samples that are
annotated with only a single positive label. In the field of computer vision, some works propose to
utilize pseudo-label generation. For example, Zhou et al. [4] propose an entropy-maximization (EM)
loss and asymmetric pseudo-labeling. Xie et al. [5] propose Label-Aware global Consistency (LAC)
regularization. Liu et al. [6] provide a theoretical guarantee for learning from pseudo-labels in SPMLL
and propose MIME, which can simultaneously train the model and update the pseudo-labels. Although
these methods are quite effective, they are not specifically designed for species distribution modeling,
and the number of categories they need to predict is smaller.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. SPMLL for Species Distribution Modeling</title>
        <p>In GeoLifeCLEF 2022, several CNN-based SDM models [7, 8] were proposed for species distribution
modeling. However, training CNN-based models for multi-label prediction tasks using samples with
single positive labels is challenging; therefore, in GeoLifeCLEF 2023, Ung et al. [9] proposed a
three-step training strategy: using PA data with BCE loss for pre-training the model, then using PO
data with cross-entropy loss for extensive training, and finally fine-tuning with PA data. Inspired by
this work, we have also designed a three-step process, aiming to
make good use of single-positive-label PO data to assist us in species distribution modeling.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data and Evaluation Metric</title>
      <p>In this section, we introduce the multimodal dataset provided by the GeoLifeCLEF 2024 competition
and the evaluation metric used for the competition.</p>
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>The GeoLifeCLEF 2024 Challenge aims to predict plant species presence at specific locations using
various related features, based on the GeoLifeCLEF 2023 multimodal dataset [10]. The dataset
encompasses 38 European countries, covering eight biogeographic regions: Alpine, Atlantic, Black Sea, Boreal,
Continental, Mediterranean, Pannonian, and Steppic. The data were collected between 2017 and 2021,
ensuring comprehensive temporal and spatial coverage across Europe. GeoLifeCLEF 2024 includes
approximately 10,000 plant species observed through both Presence-Only (PO) and Presence-Absence
(PA) surveys. The PO data consist of 5 million records extracted from trusted sources, while the PA
data comprise 90 thousand surveys conducted by botanical experts.</p>
        <p>The dataset incorporates several modalities, each providing unique insights into the
environmental conditions affecting plant species distribution:
• Satellite Raster Images:
– Sentinel-2 Images: These include RGB and Near-Infra-Red (NIR) bands, capturing data over
a 1280 meter × 1280 meter area at a 10-meter resolution, formatted into 128 × 128 pixel
patches.
– Landsat Time Series: This data spans from 2000 to 2020, offering quarterly median composites
of six spectral bands (blue, green, red, NIR, SWIR1, and SWIR2) at a 30-meter resolution.
• Climatic Data:
– Bioclimatic Rasters: Nineteen low-resolution rasters describing various climatic variables,
such as mean annual air temperature and precipitation, provided as GeoTIFF files with a
30-arcsecond resolution (1 km).
• Soil Variables:
– SoilGrids: Nine low-resolution rasters detailing soil properties such as pH, clay content, organic
carbon, nitrogen, bulk density, sand, silt, and cation exchange capacity, measured at a depth
range of 5 to 15 centimeters.
• Human Footprint:
– Sixteen rasters representing human activities and their pressures on the environment,
including population density, road networks, and night-time lights. These data are provided
for two time periods (1993 and 2009), allowing for the assessment of changes over time.
• Elevation and Land Cover:
– Elevation Data: High-resolution elevation data provided as a single GeoTIFF file with a
1-arcsecond resolution (30 m).
– Land Cover: Multi-band raster files describing land cover classes using classifications such as
IGBP and LCCS, provided at a resolution of 500 meters.</p>
        <p>The dataset matches species observations with different environmental factors commonly used in
species distribution modeling, such as climate conditions, soil characteristics, land cover, and human
impact. All data are provided at suitable spatial resolutions to support accurate modeling.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Evaluation Metric</title>
        <p>The evaluation metric for the GeoLifeCLEF 2024 competition is the samples-averaged F1-score,
calculated on the test set composed of species Presence-Absence (PA) samples. This metric addresses a
multi-label classification problem, providing an average measure of the overlap between the predicted
and actual sets of species present at specific locations and times.</p>
        <p>The samples-averaged F1-score is computed using the following formula:</p>
        <p>F1 = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + (FP_i + FN_i)/2)</p>
        <p>In this formula, N is the total number of test PA samples, TP_i (True Positives) is the number of
species correctly predicted to be present for sample i, FP_i (False Positives) is the number of species
incorrectly predicted to be present, and FN_i (False Negatives) is the number of species that are
present but not predicted.</p>
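        <p>To make the metric concrete, the computation can be sketched in plain Python (an illustrative sketch, not the official evaluation code; the function names are ours):</p>

```python
def sample_f1(pred_species, true_species):
    """F1 for one test plot: overlap between predicted and actual species sets."""
    tp = len(pred_species & true_species)   # correctly predicted present
    fp = len(pred_species - true_species)   # predicted but actually absent
    fn = len(true_species - pred_species)   # present but missed
    denom = tp + (fp + fn) / 2
    return tp / denom if denom > 0 else 0.0

def samples_averaged_f1(predictions, ground_truths):
    """Average the per-sample F1 over all N test PA samples."""
    scores = [sample_f1(p, t) for p, t in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)
```

        <p>For example, samples_averaged_f1([{1, 2, 3}], [{2, 3, 4}]) yields 2/3: two species overlap, with one false positive and one false negative.</p>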
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Method</title>
      <p>This section introduces our proposed multimodal deep learning model and pseudo-label training
framework.</p>
      <sec id="sec-4-1">
        <title>4.1. Model architectures</title>
        <p>To address the multi-label plant species prediction problem, we designed and experimented with a
multimodal ensemble neural network model. This model integrates various data sources, including
preprocessed tabular data (comprising metadata, Human Footprint, Landcover, and Soil), Landsat and
Sentinel satellite imagery, and Bioclimatic Rasters. Each data type is processed by specialized neural
networks before being fused for classification. The model architecture is illustrated in Figure 1.</p>
        <p>We use a Multi-Layer Perceptron (MLP) to extract features from the preprocessed tabular data.
The input tabular data feature vector first passes through a fully connected layer, transforming the
number of features into 1,000 neurons. This is followed by batch normalization [11] to stabilize data
distribution and accelerate the training process. The activation function used is ReLU [12], which
introduces non-linearity to enhance the model’s expressive capability. This structure is repeated three
times, resulting in three hidden layers, each containing 1,000 neurons. Finally, a fully connected layer
reduces the feature dimension to 512, outputting a 512-dimensional feature vector.</p>
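        <p>This tabular branch can be sketched in PyTorch as follows (the input width of 32 is a placeholder, since the exact tabular feature count depends on preprocessing; the layer sizes follow the text):</p>

```python
import torch
import torch.nn as nn

def make_tabular_mlp(num_features: int, hidden: int = 1000, out_dim: int = 512):
    """Three hidden layers (Linear -> BatchNorm -> ReLU) of 1,000 neurons each,
    followed by a final projection to a 512-dimensional feature vector."""
    layers, in_dim = [], num_features
    for _ in range(3):
        layers += [nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, out_dim))
    return nn.Sequential(*layers)

mlp = make_tabular_mlp(num_features=32)   # 32 is a placeholder feature count
features = mlp(torch.randn(8, 32))        # batch of 8 tabular rows -> (8, 512)
```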
        <p>The Landsat data processing module adopts a convolutional neural network structure based on
ResNet18 [13]. To fully exploit the rich temporal series of remote sensing data, we modified the ResNet18
architecture to suit the characteristics of Landsat data. Initially, we apply layer normalization [14] to
the input Landsat data to stabilize the input data distribution. We then use the ResNet18 model for
feature extraction, but modify the first convolutional layer to increase the number of input channels from
the original 3 to 6, accommodating the multi-spectral nature of Landsat data. This modification
enables the model to better capture the rich information within Landsat data. To simplify the model
structure and focus on feature extraction, we removed the max-pooling layer and the fully connected
layer from the ResNet18 model, retaining only the convolutional layers for feature extraction. This
design ensures the model can efficiently process Landsat data and output high-quality feature vectors.</p>
        <p>To effectively utilize the Bioclimatic Rasters data, we designed a deep convolutional neural network
structure based on ResNet18 [13], modified to suit the characteristics of Bioclimatic Rasters
data. Initially, layer normalization is applied to the input data to stabilize its distribution, accommodating
the diversity of Bioclim data. We also modified the first convolutional layer of the ResNet18 model,
changing it from the default 3 input channels to 4. Additionally, we removed the max-pooling layer
and the fully connected layer to ensure efficient extraction of features from the Bioclimatic Rasters data.</p>
        <p>Handling Sentinel satellite imagery data is crucial for species prediction based on geographic location.
Sentinel-2 provides multispectral images, including red, green, blue, and near-infrared (NIR) bands.
To leverage these high-resolution multispectral images, we employed a self-supervised pretrained
ResNet18 model on the SSL4EO-S12 Earth observation dataset [15]. This approach takes advantage
of the off-the-shelf model's learning capability on large-scale datasets, enhancing feature extraction
performance. We modified the first convolutional layer of ResNet18 from the default 3 channels to 4
channels to accommodate the four spectral bands of Sentinel data. Specifically, the first convolutional
layer was set with a kernel size of 7 × 7, a stride of 2, and padding of 3, enabling the extraction of more
local features while maintaining spatial resolution. To adapt to this modification, we concatenated
the convolution kernels without altering the original weight distribution. Through this design, the
Sentinel data processing module effectively extracts spatial and spectral features from high-resolution
multispectral images, providing rich feature representations for subsequent multimodal feature fusion.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Pseudo-label training framework</title>
        <p>The GeoLifeCLEF 2024 competition provides two types of data, PA (Presence-Absence) and PO
(Presence-Only), which exhibit significant differences in scale and quality. Although PO data suffers from biases
due to the lack of standardized sampling, its vast volume (approximately five million records) is of
immense value for model training, enriching the training data and enhancing prediction accuracy. The
effective utilization of PO data is undoubtedly crucial for improving performance in this competition.
To this end, we have designed a pseudo-label training framework based on both PO and PA data. This
framework comprises three steps, as shown in Figure 2.</p>
        <p>In the first step, we train our model with PA data to equip it with the initial ability to classify
multiple species. We assume that each input x from the dataset X corresponds to a label vector
y = {y_1, y_2, ..., y_C} ∈ {0, 1}^C, where C denotes the total number of classes, y_i = 1 represents
that species i is present at the given location, and y_i = 0 otherwise. The primary goal is to find a
model f(x) that can accurately predict y for each x. To achieve this, we use the common binary
cross-entropy (BCE) loss to train the model.</p>
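        <p>This first step reduces to a standard multi-label objective; a minimal PyTorch sketch (the shapes are illustrative, with C = 10 classes standing in for the roughly 10,000 species in the challenge):</p>

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                  # model outputs f(x) for 4 PA samples
y = torch.randint(0, 2, (4, 10)).float()     # multi-hot PA label vectors
# BCE on logits, averaged over all (sample, class) entries
loss = F.binary_cross_entropy_with_logits(logits, y)
```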
        <p>In the second step, we utilize the pretrained model to derive pseudo-labels for each sample in the
PO data. Given a sample (x, y), our model f predicts a label vector ỹ = {ỹ_1, ỹ_2, ..., ỹ_C} and the
corresponding probabilities p = {p_1, p_2, ..., p_C} for each class based on x. To enhance the reliability
of positive pseudo-labels, we introduce an ignore label (∅) and filter positive labels based on their
confidence scores. We define two confidence thresholds, τ− and τ+, to aid in generating the final
pseudo-labels z = {z_1, z_2, ..., z_C} ∈ {0, 1, ∅}^C, where z_i can be expressed as follows:</p>
        <p>z_i = 1 if τ+ &lt; p_i; z_i = ∅ if τ− &lt; p_i &lt; τ+; z_i = 0 if p_i &lt; τ− (1)</p>
        <p>[Figure 1: model architecture. Mixed tabular data, Sentinel-2 images (4 × 128 × 128), bioclimatic rasters (4 × 19 × 12), and Landsat time series (6 × 4 × 21) are each processed by an MLP or ResNet18 branch followed by average pooling before fusion.]</p>
        <p>With this filtering mechanism, we can obtain more reliable positive labels, as those with high
confidence scores are retained. Positive labels with uncertain confidence levels (i.e., p_i between
τ− and τ+) are not included in the loss calculation. Labels with very low confidence
levels are considered negative labels. Additionally, we also retain the original positive samples from the
PO data. Therefore, the final pseudo-label can be represented as:</p>
        <p>ŷ = z ∪ y (2)</p>
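        <p>The thresholding and union above can be sketched in plain Python (None plays the role of the ignore label ∅; the default thresholds match Section 5):</p>

```python
IGNORE = None   # stands in for the ignore label, excluded from the loss

def make_pseudo_labels(probs, po_positive_idx, t_minus=0.05, t_plus=0.4):
    """Turn per-class probabilities into {0, 1, IGNORE} pseudo-labels, then
    force the sample's original PO positive class back to 1 (the union step)."""
    z = []
    for p in probs:
        if p > t_plus:
            z.append(1)        # confident positive
        elif p < t_minus:
            z.append(0)        # confident negative
        else:
            z.append(IGNORE)   # uncertain: skipped in the loss
    z[po_positive_idx] = 1     # retain the original single positive label
    return z
```

        <p>For instance, make_pseudo_labels([0.9, 0.2, 0.01], po_positive_idx=2) returns [1, None, 1]: the first class is a confident positive, the second is ignored, and the third is kept positive because it was the sample's original PO label.</p>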
        <p>Finally, we train our model using the PO data with pseudo-labels and the original PA data with
multi-labels. This enables the model to undergo training with a larger volume of data. This three-step
process significantly improves our performance. For related ablation experiments, please refer to
Section 5.1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <sec id="sec-5-1">
        <p>In this section, we present the details of the experiments.</p>
        <p>Dataset Split: We randomly split the original training set into a training set and validation set in
an 8:2 ratio. The training set is used for model training, while the validation set is used for model
performance evaluation and hyperparameter tuning. The final model is trained on the complete training
set and evaluated on the officially provided test set.</p>
        <p>Data Preprocessing: We normalize the tabular data to have a mean of 0 and a standard deviation of
1. We do not use any data augmentation.</p>
        <p>Hyperparameters: We use the AdamW [16] optimizer with an initial learning rate of 0.00025 and a
weight decay of 0.01. The batch size is set to 64, and the total number of training epochs is 10. The
learning rate is decayed using cosine annealing. The confidence thresholds τ− and τ+ are set to 0.05
and 0.4, respectively. The model checkpoint with the highest F1 score on the validation set is selected
for testing.</p>
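        <p>This optimization setup maps directly onto standard PyTorch components (the Linear module below is only a stand-in for our full multimodal model, and the inner training loop is elided):</p>

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(16, 4)  # stand-in for the full multimodal model
optimizer = AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=10)  # anneal over the 10 epochs

for epoch in range(10):
    # ... iterate over the training set with batch size 64, backprop, then:
    optimizer.step()    # placeholder step (no gradients in this sketch)
    scheduler.step()    # cosine-decay the learning rate once per epoch
```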
        <p>Hardware Environment: Our experiments are conducted on a computer equipped with an NVIDIA
RTX 4090 GPU. We use the PyTorch deep learning framework for model training and inference.</p>
        <sec id="sec-5-1-1">
          <title>5.1. Ablation Study</title>
          <p>To better understand the impact of different components and design choices in our proposed method,
we conducted an ablation study. The results are presented in Table 3.</p>
          <p>We began with a baseline model provided by the organizers
(https://www.kaggle.com/code/picekl/sentinel-landsat-bioclim-baseline-0-31626) and gradually added
various components to observe their effect on the model's performance. The baseline achieved a Kaggle
Private Score of 0.31535. By utilizing 5-fold cross-validation, we obtained an improvement of 0.02075,
reaching a score of 0.33610. Next, we investigated the effect of different threshold values on the
model's performance, as shown in Table 2. We found that setting the threshold to 0.2 yielded the best
result, with a Kaggle Private Score of 0.34886, an improvement of 0.01276 over the previous step.</p>
          <p>Incorporating tabular data into the model provided a slight boost in performance, increasing the
score by 0.00618 to 0.35504. Adding a self-supervised pretrained ResNet (SSL pretrained ResNet) to
the model resulted in a further improvement, raising the score by 0.00520 to 0.36024. Finally, the
introduction of our pseudo-labeling technique led to a significant improvement, raising the Kaggle
Private Score to 0.36837, an increase of 0.00813 compared to the previous step.</p>
          <p>These results demonstrate that each component of our proposed method contributes to the overall
performance, with the pseudo-labeling technique being the most influential. The ablation study
highlights the effectiveness of our design choices and validates the importance of utilizing both the PA
and PO data through our pseudo-labeling framework.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This report presents our participation in the GeoLifeCLEF 2024 competition. For the multi-label plant
species prediction task, we propose a multimodal deep learning model, which consists of multiple
ResNet-based multimodal feature extractors, and we further use a pretrained model to ensure the
effectiveness of feature extraction from satellite imagery. To effectively utilize the huge
amount of PO data, we propose a pseudo-label training framework to further improve the accuracy and
robustness of the model on the task. Our experiments demonstrate that our proposed multimodal deep
learning model improves on both the public and private test sets, and we also demonstrate the
effectiveness of our proposed pseudo-label training framework through ablation experiments.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>This work was supported in part by the National Science and Technology Council, Taiwan, under grants
NSTC 112-2634-F-002-005 and NSTC 112-2634-F-006-002, and by National Taiwan University under grant
113L900902.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[2] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, et al., Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.</p>
      <p>[3] M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26 (2013) 1819–1837.</p>
      <p>[4] D. Zhou, P. Chen, Q. Wang, G. Chen, P.-A. Heng, Acknowledging the unknown for multi-label learning with single positive labels, in: European Conference on Computer Vision, Springer, 2022, pp. 423–440.</p>
      <p>[5] M.-K. Xie, J. Xiao, S.-J. Huang, Label-aware global consistency for multi-label learning with single positive labels, Advances in Neural Information Processing Systems 35 (2022) 18430–18441.</p>
      <p>[6] B. Liu, N. Xu, J. Lv, X. Geng, Revisiting pseudo-label for single-positive multi-label learning, in: International Conference on Machine Learning, PMLR, 2023, pp. 22249–22265.</p>
      <p>[7] B. Kellenberger, D. Tuia, Block label swap for species distribution modelling, in: CLEF (Working Notes), 2022, pp. 2103–2114.</p>
      <p>[8] C. Leblanc, A. Joly, T. Lorieul, M. Servajean, P. Bonnet, Species distribution modeling based on aerial images and environmental features with convolutional neural networks, in: CLEF (Working Notes), 2022, pp. 2123–2150.</p>
      <p>[9] H. Q. Ung, R. Kojima, S. Wada, Leverage samples with single positive labels to train CNN-based models for multi-label plant species prediction, Working Notes of CLEF (2023).</p>
      <p>[10] C. Botella, B. Deneu, D. Marcos, M. Servajean, J. Estopinan, T. Larcher, C. Leblanc, P. Bonnet, A. Joly, The GeoLifeCLEF 2023 dataset to evaluate plant species distribution models at high spatial resolution across Europe, arXiv preprint arXiv:2308.05121 (2023).</p>
      <p>[11] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.</p>
      <p>[12] A. F. Agarap, Deep learning using rectified linear units (ReLU), arXiv preprint arXiv:1803.08375 (2018).</p>
      <p>[13] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
      <p>[14] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).</p>
      <p>[15] Y. Wang, N. A. A. Braham, Z. Xiong, C. Liu, C. M. Albrecht, X. X. Zhu, SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation, arXiv preprint arXiv:2211.07044 (2022).</p>
      <p>[16] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Marcos</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of GeoLifeCLEF 2024:
          <article-title>Species presence prediction based on occurrence data and high-resolution remote sensing images</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>