<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tighnari v2: Mitigating Label Noise and Distribution Shift in Multimodal Plant Distribution Prediction via Mixture of Experts and Weakly Supervised Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haixu Liu</string-name>
          <email>hliu2490@uni.sydney.edu.au</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yufei Wang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianxiang Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chuancheng Shi</string-name>
          <email>cshi0459@uni.sydney.edu.au</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongsheng Xing</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Software and Microelectronics, Peking University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shandong University of Technology</institution>
          ,
          <addr-line>Zibo, Shandong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of New South Wales</institution>
          ,
          <addr-line>Sydney, New South Wales</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The University of Sydney</institution>
          ,
          <addr-line>Sydney, New South Wales</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Large-scale, cross-species plant distribution prediction plays a crucial role in biodiversity conservation, yet modeling efforts in this area still face significant challenges due to the sparsity and bias of observational data. Presence-Absence (PA) data provide accurate and noise-free labels, but are costly to obtain and limited in quantity; Presence-Only (PO) data, by contrast, offer broad spatial coverage and rich spatiotemporal distribution, but suffer from severe label noise in negative samples. To address these real-world constraints, this paper proposes a multimodal fusion framework that fully leverages the strengths of both PA and PO data. We introduce an innovative pseudo-label aggregation strategy for PO data based on the geographic coverage of satellite imagery, enabling geographic alignment between the label space and remote sensing feature space. In terms of model architecture, we adopt Swin Transformer Base as the backbone for satellite imagery, utilize the TabM network for tabular feature extraction, retain the Temporal Swin Transformer for time-series modeling, and employ a stackable serial tri-modal cross-attention mechanism to optimize the fusion of heterogeneous modalities. Furthermore, empirical analysis reveals significant geographic distribution shifts between PA training and test samples, and models trained by directly mixing PO and PA data tend to experience performance degradation due to label noise in PO data. To address this, we draw on the mixture-of-experts paradigm: test samples are partitioned according to their spatial proximity to PA samples, and different models trained on distinct datasets are used for inference and post-processing within each partition.
Experiments on the GeoLifeCLEF 2025 dataset demonstrate that our approach achieves superior predictive performance in scenarios with limited PA coverage and pronounced distribution shifts, ranking third in GeoLifeCLEF 2025 (where PA test samples exhibit geographic out-of-distribution characteristics), and surpassing the 2nd-place score on the GeoLifeCLEF 2024 leaderboard (where PA test and training samples are largely identically distributed).</p>
      </abstract>
      <kwd-group>
        <kwd>Weakly Supervised Learning</kwd>
        <kwd>Mixture of Experts</kwd>
        <kwd>Temporal Swin-Transformer</kwd>
        <kwd>Species Distribution Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Background</title>
<p>The task of predicting plant species distributions based on spatial location typically involves, at a given
set of spatiotemporal coordinates, using environmental, remote-sensing, and neighborhood features
to predict whether a particular plant is likely to occur. This approach enables managers to rapidly
identify priority conservation areas at a macro scale and to assess the expansion or contraction of
suitable habitat under climate change. However, several challenges remain: observation data are sparse
and biased toward hotspot regions and common species; Presence-Only (PO) data have a much higher
error ceiling than Presence–Absence (PA) data; and interspecific interactions are still difficult to model
explicitly.</p>
        <p>CLEF 2025 Working Notes, 9–12 September 2025, Madrid, Spain.</p>
        <p>In the current dataset, we have 88,987 PA training samples, 3,845,533 PO training samples (based on
SurveyID merged from 5,079,797 observation records), and 14,716 test samples. Each sample represents
a single survey, and its label consists of the identifiers of all plant species observed during that survey,
covering a total of 11,255 species [1, 2]. Survey samples are classified as Presence–Absence (PA) or
Presence-Only (PO) depending on whether negative cases strictly represent species that were absent.
In real-world scenarios, PA data originate from plot surveys conducted by official or professional
institutions: any species that was ever observed within the survey area is labeled as a positive sample,
and any species not observed is explicitly labeled as a negative sample, so there is effectively no label
noise. However, because professional researchers are limited in number, PA labels, while cleaner
and more accurate, are costly to obtain, and the total number of PA records is far smaller
than that of PO records. By contrast, PO data typically come from crowdsourced citizen-science efforts.
They are very large in volume and span long time periods, and they provide invaluable spatiotemporal
coverage—especially for rare or hard-to-monitor species. However, citizens tend to record only species
that interest them, while ignoring species they consider “common” or simply do not recognize. As a
result, in PO data only positive samples can be taken as fully reliable; negative samples can contain
substantial label noise. Therefore, when building species distribution models, PA and PO data each
contribute complementary strengths.</p>
        <p>Previous work integrating PA and PO data has generally followed one of two paradigms. Chen et
al.[3] proposed a three-stage training framework: first, train a network on high-quality PA data; second,
use that network to generate pseudo-labels for PO data and perform semi-supervised fine-tuning; and
finally, refine the model again using the PA data. This approach preserves the “data dividend” of PO
while mitigating its label noise and distribution shift problems. Liu et al.[4] proposed an alternative
framework based exclusively on PA data: they train a network on PA data and, under the ecological
prior that geographic proximity implies ecological similarity, extract the most frequently occurring
species among neighboring PO and PA nodes for each test sample to supplement the neural network’s
predictions. These two approaches placed second and third, respectively, in the 2024 GeoLifeCLEF
challenge with only a small margin between them [5].</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Our method</title>
        <p>To leverage PO data, we propose a novel pseudo-labeling strategy based on aggregating plant species
labels from PO survey samples within the geographic coverage of satellite image patches. Our network
is an improved version of the Tighnari model introduced in 2024. With the expansion of training data,
we scale up the backbone Swin Transformer for satellite image feature extraction to the Base size,
and adopt the TabM network [6]—proposed by Yandex—as the backbone for the tabular modality to
enhance feature representation. The Temporal Swin-Transformer for time series feature extraction
is retained. Additionally, the modality derived from neighborhood label aggregation is now set as an
optional input rather than a mandatory one, as severe label noise in PO data could propagate into the
feature space through label aggregation; thus, this modality is only utilized when all training data are
from PA samples.</p>
        <p>Finally, we improve the modality fusion module by replacing the original hierarchical cross-attention
mechanism with a stackable serial tri-modal cross-attention, allowing better fusion of heterogeneous
modalities.</p>
<p>During exploratory data analysis (EDA), we discovered distributional differences between the
geographic locations of PA test and PA training samples. According to the ecological prior that geographic
proximity implies ecological similarity, such distribution shifts can significantly degrade model
performance on test samples located far from the PA training distribution. Moreover, models jointly trained
on re-labeled PO data (hereafter referred to as PO) and PA data suffer from label noise in the PO data,
resulting in inferior inference performance on test samples overlapping with the training distribution
compared to models trained solely on PA data. To address this, inspired by the expert model paradigm,
we partition test samples based on whether a PA sample exists within a 10-kilometer radius, and assign
different models trained on distinct datasets to infer the two parts separately.</p>
        <p>Our contributions are as follows:
1. We propose a weakly supervised pseudo-labeling rule, aggregating PO sample labels within
the geographic coverage of satellite image patches. This reduces label sparsity and minimizes
negative label noise without introducing positive label noise.
2. We design a stackable cross-attention module with serial updates and shared Key/Value layers
across modalities, enhancing multimodal fusion capability.
3. We develop an efficient two-stage training procedure that allows the model to maintain accurate
discrimination of positive and negative labels under the PA training sample distribution while
retaining the recognition of positive labels learned from the large-scale PO data.
4. Following the Mixture of Experts (MoE) paradigm, we train separate models on different datasets
to predict samples from distinct geographic distributions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Exploratory Data Analysis</title>
      <sec id="sec-2-1">
        <title>2.1. Geographical Distribution of PO and PA Samples</title>
        <p>Exploratory data analysis (EDA) is crucial for uncovering the motivation behind our modeling choices.
We visualized the geographic locations of the Survey IDs for both PA (Presence-Absence) and PO
(Presence-Only) samples, aiming to investigate their spatial distribution patterns.</p>
        <p>The visualization reveals significant differences in the geographic distributions among PA training
samples, PO training samples, and PA test samples. Specifically, PA training samples are only distributed
in Western European countries such as France, the Netherlands, and Denmark. In contrast, PO training
samples cover both Western and Central Europe. The PA test samples, in addition to the regions covered
by both PA and PO training data, also include certain Eastern European countries such as Ukraine. This
indicates the existence of an out-of-distribution (OOD) scenario in the test set (shown in Figure 1).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Differences in Species Occurrence Frequency and Sample Species Counts between PO and PA</title>
        <p>We plotted three figures to illustrate the presence–absence (PA) dataset (shown in Figure 2), the
presence-only (PO) dataset obtained by merging observation records solely by SurveyID (shown in Figure 3),
and the PO dataset produced after applying our processing strategy (shown in Figure 4). For each
dataset, the figures display (i) the distribution of the total number of species recorded in each survey and
(ii) the distribution of the number of surveys in which each species appears. A substantial discrepancy
between the PA and PO datasets in the per-survey species-count distributions indicates severe label
noise, whereas an excessive divergence in the species-frequency distributions implies that the PO
samples suffer from a long-tail problem inconsistent with the true ecological situation.</p>
        <p>The blue histograms represent the distribution of the number of species present in each survey. It
can be observed that the peak of the PA training samples is at 10, while for the unprocessed PO training
data, the peak is at 1, and over 99% of the samples contain fewer than three species. This suggests
that the raw PO data contains a large amount of noise. The third figure shows the distribution after
relabeling the PO data with our pseudo-labeling strategy. Although the peak is still at 1, the distribution
becomes less steep and is closer to the distribution of PA samples.</p>
        <p>The red histograms represent the number of occurrences for each species across all surveys. We
found that 79.5% of plant species in the 88,987 PA samples occur fewer than 50 times, while more than
70% of species in the unprocessed PO samples appear fewer than 50 times, with 5,000 species appearing
only once. Although the final evaluation metric is the weighted F1 score, accurately predicting these
rare species is often the most critical aspect of species distribution modeling based on spatial and
positional data.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>In this section, we present a comprehensive description of the entire process from data preprocessing
to model development and training.</p>
      <sec id="sec-3-1">
        <title>3.1. Pseudo-Labeling Strategy</title>
        <p>We observed significant differences in the annotation procedures between PA (Presence-Absence) and
PO (Presence-Only) survey samples. The labels for PA samples are derived from quadrat surveys, in
which researchers lay out plots (quadrats) of fixed area and shape—typically square, rectangular, or
circular—within the study region and conduct detailed investigations and recordings of the species,
abundance, and other attributes within each quadrat. In contrast, PO samples are mainly sourced from
citizen science crowdsourcing platforms, where volunteers photograph observed species and upload
their geotagged observations. Consequently, PO labels are much sparser compared to PA labels.</p>
        <p>If we aggregate (i.e., merge and deduplicate) all PO survey sample labels within a fixed geographic
area, it is equivalent to a group of citizen scientists collaboratively conducting a less professional quadrat
survey within that area. Thus, the labeling strategy for aggregated PO surveys becomes analogous to
that of PA surveys.</p>
        <p>Moreover, each survey corresponds to a satellite image patch—a 64 × 64 grid of pixels, each with a
spatial resolution of 10 m × 10 m—centered on the survey site, forming a 640 m × 640 m square region.
For any PO sample (hereafter referred to as the primary sample for aggregation), if we aggregate the
labels of all PO samples located within its corresponding satellite image patch (the 640 m × 640 m
area), the resulting patch effectively contains the satellite image pixels of all aggregated samples, thus
preventing the introduction of image feature noise that could arise when the satellite pixels of aggregated
PO samples fall outside the patch of the primary sample. Compared to training a teacher model with
PA data and then using it to pseudo-label PO samples, this aggregation-based strategy significantly
reduces negative label noise without introducing positive label noise.</p>
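As a concrete illustration, the aggregation rule above can be sketched as follows. This is a minimal sketch assuming projected coordinates in metres; the field names x, y, and species are hypothetical, and the 640 m × 640 m patch is taken as ±320 m around the primary sample's centre:

```python
# Patch-based PO label aggregation sketch (hypothetical data layout): each PO
# survey has projected (x, y) coordinates in metres and a list of species IDs.
# A primary sample aggregates (merges and deduplicates) the labels of every PO
# survey whose centre falls inside its 640 m x 640 m satellite-image patch.

def aggregate_po_labels(primary, surveys, half_side=320.0):
    """Merge and deduplicate species labels of all surveys inside the patch."""
    px, py = primary["x"], primary["y"]
    merged = set(primary["species"])
    for s in surveys:
        if abs(s["x"] - px) <= half_side and abs(s["y"] - py) <= half_side:
            merged |= set(s["species"])
    return sorted(merged)

surveys = [
    {"x": 100.0, "y": 100.0, "species": [3, 7]},
    {"x": 350.0, "y": 120.0, "species": [7, 9]},   # inside the patch
    {"x": 900.0, "y": 100.0, "species": [42]},     # outside the patch
]
primary = {"x": 100.0, "y": 100.0, "species": [3, 7]}
print(aggregate_po_labels(primary, surveys))  # [3, 7, 9]
```

Because the aggregated surveys all lie inside the primary sample's own image patch, the merged label set never pulls in species whose satellite pixels fall outside the patch.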
        <p>Another critical issue is that the initial PO dataset comprises 3,845,533 surveys. If every survey were
to be pseudo-labeled using this aggregation strategy, the distribution of species occurrences would
become even more imbalanced. Conversely, if all PO samples that participate in label aggregation are
subsequently excluded from serving as primary samples for further aggregation, the occurrences of rare
species would remain scarce. Worse still, since some surveys containing rare species are geographically
adjacent, their labels may be merged into a single PO sample, further decreasing their occurrence count.</p>
        <p>To address this, we propose a reserved-sample strategy: if a PO survey sample does not contain
any species whose overall occurrence count across all surveys is less than 100, then this sample will
no longer serve as a primary sample for subsequent aggregation. Otherwise, it remains eligible to
aggregate labels from nearby PO samples. We designate the three filtering strategies for PO survey
samples as loose filtering, balanced filtering, and strict filtering (shown in Algorithm 1).</p>
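The reserved-sample rule can be sketched as below; the function and field names are hypothetical, and the toy example uses a small cutoff for brevity (the text uses an occurrence cutoff of 100):

```python
from collections import Counter

def eligible_primaries(surveys, rare_cutoff=100):
    """Keep a survey eligible as a primary aggregation sample only if it
    contains at least one globally rare species (count < rare_cutoff)."""
    counts = Counter(sp for s in surveys for sp in s["species"])
    return [s["id"] for s in surveys
            if any(counts[sp] < rare_cutoff for sp in s["species"])]

surveys = [
    {"id": "a", "species": [1]},
    {"id": "b", "species": [1, 99]},   # species 99 occurs only once -> rare
    {"id": "c", "species": [1]},
]
print(eligible_primaries(surveys, rare_cutoff=2))  # ['b']
```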
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Backbone Networks</title>
        <p>Our network is an improved version of the Tighnari model proposed in 2024. Given the increased
training data, we upgraded the size of the Swin Transformer [7] backbone used for satellite image feature
extraction from Tiny to Base. For the temporal modality, we retained the Temporal Swin-Transformer
architecture from the Tighnari model. Specifically, when handling temporal cubes cropped to sizes
(4, 18, 12) and (6, 4, 20), we set the patch sizes to (3, 3) and (2, 5), and the window sizes to (3, 2) and
(2, 3), respectively. We stacked two Swin-Transformer stages: one with depth 2 and 12 attention heads,
and another with depth 6 and 24 heads, for the two types of temporal cubes. This configuration of the
Swin Transformer backbone showed highly competitive accuracy and stability in our ablation studies;
only a ResNet18 with a reduced input convolution size achieved comparable performance.</p>
        <p>The effectiveness of Swin-Transformer-based models in feature extraction for plant species
distribution prediction is well-supported. In such tasks, large color regions (low-frequency information)
in images or temporal cubes are generally more important than textures and edges (high-frequency
information). Thanks to its alternating window and shifted window attention, Swin Transformer
naturally has a much larger receptive field than standard convolutions and is thus better at extracting
low-frequency information. Moreover, its use of smaller patch sizes and local windowed attention,
compared to ViT, avoids the shortcomings of global attention with large patches and enhances its
ability to capture high-frequency details—such as environmental boundaries in satellite image patches
or seasonal transitions in temporal cubes.</p>
        <p>For tabular modality feature extraction, we replaced the original MLP backbone with the TabM
network proposed by Yandex to enhance representational capacity. TabM is based on the BatchEnsemble
method: on top of a single main model’s weights, it introduces a pair of learnable rank-1 scaling
factors for each sub-model, enabling parallel inference and ensemble-like output fusion with minimal
extra parameters. TabM does not rely on attention mechanisms, yet achieves superior accuracy and
generalization on standard benchmarks and open tabular data competitions compared to leading models
such as TabNet (attention-based) and CatBoost (ensemble-based).</p>
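The rank-1 BatchEnsemble idea behind TabM can be illustrated with a toy sketch. This is pure Python with illustrative sizes and weights, not TabM's actual implementation: k ensemble members share the main weight matrix W and differ only in rank-1 input/output scaling vectors:

```python
def linear(x, W):
    """y = x @ W for a single input vector x (W given as a list of rows)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def batch_ensemble(x, W, R, S):
    """One output per sub-model k: ((x * r_k) @ W) * s_k, with W shared."""
    outs = []
    for r, s in zip(R, S):
        scaled_in = [xi * ri for xi, ri in zip(x, r)]   # rank-1 input scaling
        h = linear(scaled_in, W)                        # shared main weights
        outs.append([hj * sj for hj, sj in zip(h, s)])  # rank-1 output scaling
    return outs

W = [[1.0, 0.0], [0.0, 1.0]]   # shared 2x2 main weights (identity, for clarity)
R = [[1.0, 1.0], [2.0, 2.0]]   # per-sub-model input scalers (k = 2)
S = [[1.0, 1.0], [0.5, 0.5]]   # per-sub-model output scalers
print(batch_ensemble([3.0, 4.0], W, R, S))  # [[3.0, 4.0], [3.0, 4.0]]
```

The extra parameters per sub-model are only the two scaling vectors, which is why the ensemble adds minimal cost over the single shared model.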
        <p>Furthermore, the neighborhood label aggregation-based modality is set as an optional input rather
than a mandatory one. Since PO data contains severe label noise, using neighborhood label aggregation as a
feature would introduce further noise. This modality is only available when the training data consists
solely of PA samples (shown in Algorithm 2).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Modality Fusion</title>
        <p>We trained separate models for the three modalities (four inputs) and observed considerable differences
in accuracy: the satellite image modality yielded the best performance, while the bioclimatic time series
and tabular modalities lagged behind. This suggests that the modalities contribute unequally to overall
performance, with a dominant modality often present.</p>
        <p>In parallel cross-attention fusion, the module cannot recognize the dominant modality’s prior
importance, causing its contribution to be diluted by weaker modalities. Additionally, when each modality
maintains independent sets of Q, K, and V vectors, incompatibility between their K and V spaces
complicates modality alignment, making attention weights less effective. The cross-attention
implementation in the original Tighnari model also concatenated all attention representations at the output,
preventing stacking of modules and thereby limiting the extraction of higher-order features.</p>
        <p>To address these limitations, we designed a stackable serial three-modality cross-attention mechanism.
First, we use independent linear layers to map the three modalities’ inputs to a shared hidden dimension.
Each modality generates its own queries (Q), while all concatenated modality features share a single
key/value (K/V) projection, greatly reducing the parameter count and improving GPU utilization. This also
unifies the K/V vectors from different modalities into the same semantic space.</p>
        <p>Unlike traditional cross-attention modules that compute attention in parallel for each modality, our
design lets modality A attend to the latest features of B and C (as keys/values) and normalizes its
output; then, the updated A and C are used to update B; finally, the updated A and B are used to update
C, forming an ordered iterative process. This approach allows the dominant modality to compute
attention representations first, helping the model focus on key information, while serial multi-step
attention updates help the module capture higher-order dependencies. Unlike previous designs, the
outputs are not simply concatenated but are mapped back to the original dimension and combined with
the input via residual connections, allowing the module to be stably stacked for extracting high-order
features (shown in Algorithm 3).</p>
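A minimal single-head sketch of this serial update order with shared key/value projections is given below (NumPy, one token per modality; all dimensions and weights are illustrative, and the per-modality FFNs and normalization are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # shared hidden dimension (illustrative)

def attend(q_tokens, kv_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: q_tokens attend to kv_tokens."""
    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    logits = Q @ K.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return w @ V

# One token per modality for brevity; A stands for the dominant modality.
A = rng.normal(size=(1, d))              # e.g. satellite-image features
B = rng.normal(size=(1, d))              # e.g. time-series features
C = rng.normal(size=(1, d))              # e.g. tabular features

Wq_A, Wq_B, Wq_C = (rng.normal(size=(d, d)) for _ in range(3))  # per-modality Q
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))       # shared K/V

# Ordered serial updates with residual connections:
A = A + attend(A, np.vstack([B, C]), Wq_A, Wk, Wv)  # A attends to B, C
B = B + attend(B, np.vstack([A, C]), Wq_B, Wk, Wv)  # B sees the updated A
C = C + attend(C, np.vstack([A, B]), Wq_C, Wk, Wv)  # C sees updated A and B
```

Because each step ends with a residual back to the input dimension, blocks of this form can be stacked.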
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Loss Function</title>
        <p>Exploratory Data Analysis (EDA) reveals that the mode of the number of species recorded per survey is
12, while the total number of species is 11,255. Thus, negative classes vastly outnumber positive classes.
Traditional Binary Cross-Entropy (BCE) Loss treats positive and negative samples equally, which leads
to overfitting on majority (negative) classes and insufficient learning for minority (positive) classes.
Therefore, we aim to assign higher weights to difficult examples.</p>
        <p>Moreover, negative labels are inherently noisy: an “absent” label does not guarantee that the species
is truly absent. If the model correctly predicts a high probability of presence but is heavily penalized due
to a mislabeled negative, effective feature learning is impeded. In addition, the geographical distribution
of PA test samples differs from the training data, rendering Threshold Top-K methods ineffective due
to their reliance on identical distributions between training and test sets. In such scenarios, fixing
the binary classification threshold at 0.5 becomes particularly important, as it provides a stable and
reproducible standard for positive/negative decisions, independent of sample distribution.</p>
        <p>Algorithm 3 Tri-Modal Cross Attention
Require: Inputs: modalities A, B, C; hidden dimension d; number of heads h
Ensure: Outputs: updated modalities A′, B′, C′
1: Project inputs A, B, C to hidden dimension d
2: Compute attention queries Q_A, Q_B, Q_C
3: Compute shared key and value projections K, V
4: Update A: cross-attention with B, C
5: Update B: cross-attention with updated A and C
6: Update C: cross-attention with updated A and B
7: Apply FFN and normalization to updated modalities A, B, C
8: Project updated hidden representations back to original dimensions and add residual connections
9: return A′, B′, C′</p>
        <p>Ultimately, we adopted Asymmetric Loss (ASL) [8] as a replacement for BCE. The ASL formula is as
follows:

ℒ_ASL = −(1/N) ∑_{i=1}^{N} [ y_i (1 − p_i)^{γ+} log(p_i) + (1 − y_i) p_{m,i}^{γ−} log(1 − p_{m,i}) ],  where p_{m,i} = max(p_i − m, 0).   (1)</p>
        <p>Here, N denotes the total number of labels; y_i ∈ {0, 1} is the ground truth for the i-th label; p_i ∈ [0, 1]
is the predicted probability that the i-th label is positive; γ+ and γ− are the focusing parameters for
positive and negative samples, respectively; and m is the clipping threshold applied to negative samples.</p>
        <p>ASL introduces several improvements over BCE:
• Asymmetric focusing: Different focusing parameters (γ+ for positives, γ− for negatives) adjust
the contribution of hard/easy examples for each class separately.
• Negative sample clipping: The negative loss term is suppressed when the predicted probability
p_i is below the threshold m, which reduces the impact of potentially noisy negative labels on the
gradient.
• Robustness: By mitigating the dominance of noisy negatives, ASL improves the model’s
robustness and overall feature learning capability.</p>
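A minimal sketch of ASL for one multi-label sample follows; the focusing parameters and clipping threshold shown are illustrative defaults, not the paper's tuned values:

```python
import math

def asl_loss(y, p, gamma_pos=0.0, gamma_neg=4.0, m=0.05, eps=1e-8):
    """Asymmetric Loss over the labels of one sample (illustrative defaults)."""
    total = 0.0
    for yi, pi in zip(y, p):
        if yi == 1:
            # positive term: focal down-weighting of easy positives
            total += (1 - pi) ** gamma_pos * math.log(pi + eps)
        else:
            # negative term: probability shifting/clipping, so that negatives
            # with pi <= m contribute no loss (robust to noisy "absent" labels)
            pm = max(pi - m, 0.0)
            total += pm ** gamma_neg * math.log(1 - pm + eps)
    return -total / len(y)

# A negative label with predicted probability below m is fully clipped:
print(asl_loss([0], [0.04]) == 0.0)  # True
```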
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Training Strategy</title>
        <p>In this section we describe the training strategy for the two expert models in the MoE architecture.
We trained two separate models: one for inference on test samples geographically close to PA training
samples (referred to as the in-distribution test set), and another for test samples in regions without any
nearby PA training samples (the out-of-distribution test set). For in-distribution inference, we followed
the same training procedures and post-processing strategies as the Tighnari model proposed in 2024.</p>
        <p>For out-of-distribution inference, we adopted a two-stage training strategy. In the first stage, we
mixed PA training samples with pseudo-labeled PO samples and pre-trained the model for three epochs
with a relatively high learning rate (0.0001). This enabled the model to encounter and learn associations
for many species present only in PO data. In the second stage, we fine-tuned the model for five additional
epochs with clean PA data and a lower learning rate, further enhancing the model’s predictive capability
for species in the PA set, while minimizing catastrophic forgetting for species only present in PO data.
The number of epochs was determined by monitoring overfitting on PO data (as training set) while
validating on PA data: since PO labels are much noisier, the model tends to overfit negatives faster. If
PA and PO were trained together from the beginning, persistent loss reduction on the PA validation set
would obscure the point when the PO set begins to overfit.</p>
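The two-stage schedule above can be summarized as data; the epoch counts and the stage-1 learning rate follow the text, while the stage-2 rate is only described as "lower", so the value here is an assumption:

```python
def build_schedule():
    """Two-stage training schedule sketch (stage-2 lr of 1e-5 is assumed)."""
    return [
        {"stage": 1, "data": "PA + pseudo-labelled PO", "epochs": 3, "lr": 1e-4},
        {"stage": 2, "data": "PA only", "epochs": 5, "lr": 1e-5},
    ]

for stage in build_schedule():
    print(stage["stage"], stage["data"], stage["epochs"], stage["lr"])
```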
        <p>Overall, this training strategy balances rare species recognition with robust generalization on
high-quality labeled samples.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Post-Processing Strategy</title>
        <p>We implemented a divide-and-conquer inference process based on whether the test sample is
geographically close to any PA training sample, separating the in-distribution and out-of-distribution test sets.
The post-processing steps are adapted from the Tighnari model.</p>
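The routing rule can be sketched with a brute-force nearest-PA check; the haversine formula is standard, while the paper does not specify the spatial index actually used:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (standard haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def is_in_distribution(test_pt, pa_points, radius_km=10.0):
    """Route a test sample to the in-distribution expert if any PA training
    sample lies within radius_km (brute-force sketch over (lat, lon) pairs)."""
    return any(haversine_km(test_pt[0], test_pt[1], p[0], p[1]) <= radius_km
               for p in pa_points)
```

In practice one would precompute this partition once and then dispatch each partition to its expert model.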
        <p>First, we used the Threshold Top-K method to select predicted species, then merged these predictions
with high-probability species lists from geographically adjacent samples, deduplicating the results. For
the in-distribution test model, the threshold and K parameters in Threshold Top-K can be directly
optimized via grid search based on validation set performance. We further selected the five nearest PA
training samples to each test point, counting as present any species with over 80% observed frequency.</p>
        <p>For the out-of-distribution model, determining the optimal threshold for Threshold Top-K was less
straightforward. With the help of the ASL loss, we controlled the optimal threshold to around 0.45–0.5,
and further tuned this value using Kaggle’s submission feedback. Afterward, we selected the six closest
PO training samples (obtained using the strict pseudo-label filtering strategy), and considered any
species present in more than 50% of them to adjust the model’s predictions.</p>
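Both post-processing steps can be sketched as follows; the threshold, K, and voting fraction are illustrative, and the function names are hypothetical:

```python
from collections import Counter

def threshold_top_k(probs, threshold=0.5, k=25):
    """Keep at most k species whose predicted probability exceeds the
    threshold (threshold and k values here are illustrative)."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [i for i in ranked if probs[i] > threshold][:k]

def neighbour_vote(neighbour_label_sets, min_frac=0.5):
    """Species present in more than min_frac of the nearest surveys."""
    counts = Counter(sp for labels in neighbour_label_sets for sp in set(labels))
    n = len(neighbour_label_sets)
    return {sp for sp, c in counts.items() if c / n > min_frac}

# Merge model predictions with the neighbour vote, deduplicating the result:
probs = [0.9, 0.2, 0.7, 0.55]
preds = set(threshold_top_k(probs)) | neighbour_vote([[0, 5], [5], [5, 9]])
print(sorted(preds))  # [0, 2, 3, 5]
```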
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Metrics</title>
        <p>To demonstrate the optimality of our chosen backbone networks, we conducted extensive comparative
experiments. We used the weighted F1 Score and Binary Cross-Entropy (BCE) Loss as the evaluation
metrics. The formulas are as follows:</p>
        <p>For each PA test sample i, let Y_i denote the ground-truth set of species (as species IDs), and Ŷ_i denote
the predicted set. For every test sample, the submitted prediction must provide a list of predicted species.
The micro F1-score is then computed as:

F1 = (1/N) ∑_{i=1}^{N} TP_i / (TP_i + (FP_i + FN_i)/2),   (2)

where

TP_i = number of predicted species truly present, i.e., |Ŷ_i ∩ Y_i|,
FP_i = number of species predicted but absent, i.e., |Ŷ_i ∖ Y_i|,
FN_i = number of species not predicted but present, i.e., |Y_i ∖ Ŷ_i|,

and N is the total number of test samples. This metric averages the per-sample F1-score over all samples.</p>
        <p>The Binary Cross-Entropy (BCE) Loss is defined as:

ℒ_BCE = −(1/N) ∑_{i=1}^{N} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ],   (3)

where N is the number of samples, y_i ∈ {0, 1} is the ground-truth label, and p_i is the predicted
probability for sample i.</p>
        <p>The model was trained for approximately 10 hours on an H20 GPU for each run.</p>
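The per-sample F1 computation can be illustrated with a short sketch over hypothetical species-ID sets:

```python
def sample_f1(true_set, pred_set):
    """Per-sample F1 from the TP/FP/FN counts of one survey."""
    tp = len(true_set & pred_set)   # predicted species truly present
    fp = len(pred_set - true_set)   # predicted but absent
    fn = len(true_set - pred_set)   # present but not predicted
    return tp / (tp + (fp + fn) / 2) if (tp + fp + fn) else 1.0

print(sample_f1({1, 2, 3}, {2, 3, 4}))  # 2 / (2 + (1 + 1)/2) = 0.666...
```

The leaderboard score then averages this quantity over all test samples.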
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Comparative Experiments</title>
        <p>Our comparative experiments demonstrate the superiority of our selected backbone networks and
modality fusion approach. In the comparative experiments, we adopted the Tighnari model from 2024
as our base model, in which satellite image features were extracted using Swin Transformer Tiny,
while the time-series cubes features were extracted by a modified Swin Transformer with adjusted
hyperparameters such as patch size, window size, and depth. Tabular features were extracted using
fully-connected layers.</p>
        <p>We used the PO dataset merged through the strategy described above as the training set, and all PA
samples (the 2024 PA training set) as the validation set, to evaluate the capability of various architectures
to learn effectively from label-noisy data.</p>
        <p>In Table 1, we fixed the time-series feature extraction network and experimented by replacing image
extraction networks of different types and sizes. Through extensive comparisons, we found that the
Swin Transformer Base significantly improved performance (shown in Table 1).</p>
        <p>In Table 2, we fixed the Swin Transformer Base as the satellite image feature extraction network and
tested several visually-based backbone networks with hyperparameter tuning for extracting features
from the time-series cubes. We found that our Temporal Swin Transformer, designed in 2024, achieved
the best performance (shown in Table 2).</p>
        <p>In Table 3, we compared direct feature concatenation, two other stackable cross-attention
methods, and our proposed sequential attention. We ultimately found that our proposed cross-attention
mechanism significantly enhanced performance (shown in Table 3).</p>
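<p>As a rough illustration of the idea rather than the exact architecture, sequential cross-attention fuses one modality at a time instead of concatenating all features at once. The single-head NumPy sketch below assumes all modality features have already been projected to a common dimension:</p>

```python
import numpy as np

def cross_attention(q_feat, kv_feat):
    """Single-head scaled dot-product cross-attention (minimal sketch)."""
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv_feat

def sequential_fusion(img_feat, ts_feat, tab_feat):
    """Fuse modalities one after another (order here is an assumption)."""
    fused = cross_attention(img_feat, ts_feat)  # image tokens attend to time series
    fused = cross_attention(fused, tab_feat)    # result then attends to tabular features
    return fused
```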
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Studies</title>
        <p>Additionally, to demonstrate that the MoE approach can effectively enhance the model’s performance
on OOD datasets, we conducted an ablation study using the 2024 leaderboard (where test samples and
PA training samples are in-distribution) and the 2025 leaderboard (with OOD samples). Specifically, we
tested three scenarios: using only PA data, using both PA and PO data to train a single model, and our
previously mentioned MoE model.</p>
        <p>The results show that the MoE model achieved the best scores on both the 2024 and 2025 test sets.
Moreover, the performance improvement brought by the MoE model was particularly prominent on
datasets containing OOD samples (shown in Table 4).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <title>5.1. Limitations and Reflections</title>
        <p>In our framework, pseudo-labels are generated based on prior rules, so the number of
pseudo-labeled species is less than or equal to the actual number of species. Moreover, no semi-supervised
process is applied for incremental iteration in later stages.</p>
        <p>We recognize that PO data inherently contains label noise that is class-dependent; that is, some species
labeled as negative samples may in fact have been present during the corresponding spatio-temporal
survey. However, our current work does not incorporate strategies to explicitly model or mitigate such
label noise.</p>
        <p>Our species distribution model is essentially a multi-label binary classification task. The large number
of labels leads to overly generalized feature extraction by the backbone network, negatively impacting
model performance. In addition, the vast majority of labels appear in only a few samples (i.e., a
long-tailed distribution), making the model prone to overfitting on the few common labels while ignoring
the rare ones.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Future Work</title>
        <p>Visualization results indicate significant differences in the number of species observed per survey.
Instead of using a fixed K as in the Threshold Top-K method, we propose allowing the model to predict
the appropriate K for each sample based on tabular features.</p>
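<p>A minimal sketch of this per-sample K selection, assuming an auxiliary head has already produced a K estimate for each sample (the function and argument names are hypothetical):</p>

```python
import numpy as np

def adaptive_top_k(probs, k_per_sample):
    """Select a sample-specific number of species instead of a fixed K."""
    preds = []
    for p, k in zip(probs, k_per_sample):
        k = max(1, int(round(k)))                   # at least one species per survey
        top = np.argsort(np.asarray(p))[::-1][:k]   # indices of the k highest scores
        preds.append({int(i) for i in top})
    return preds
```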
        <p>For binary classification of each species, we plan to anchor positive labels as immutable based on
prior knowledge, while allowing negative labels to transition to positives with a certain probability.
Specifically, we intend to estimate the probability of negative-to-positive transitions in a transition
matrix and use this matrix to down-weight the loss penalty for misclassified negatives.</p>
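<p>One way such transition estimates could enter the loss is to scale the negative-label term by the complement of the flip probability. Here `flip_prob` is a hypothetical estimate of the chance that a negative label is actually positive, not a quantity computed in this work:</p>

```python
import numpy as np

def weighted_bce(labels, probs, flip_prob, eps=1e-12):
    """BCE whose negative-label penalty is scaled by (1 - flip_prob),
    i.e., the estimated chance that the negative label is correct."""
    y = np.asarray(labels, dtype=float)
    p = np.asarray(probs, dtype=float)
    q = np.asarray(flip_prob, dtype=float)  # P(negative label should be positive)
    pos = y * np.log(p + eps)
    neg = (1.0 - y) * (1.0 - q) * np.log(1.0 - p + eps)  # down-weighted negatives
    return float(-np.mean(pos + neg))
```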
        <p>Furthermore, we aim to exploit inter-label correlations by constructing a graph structure, treating each
species label as a node and connecting nodes that co-occur in the same observation. After traversing all
samples, this yields a species co-occurrence graph. Based on this graph, we can apply graph clustering
to partition species nodes, or use the structure to initialize a Graph Neural Network (GNN) layer at the
output stage, thereby leveraging label co-occurrence as a prior to refine predictions.</p>
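<p>Building the species co-occurrence graph from observation label sets is straightforward; the sketch below (names are illustrative) counts co-occurrences as edge weights, which could then seed graph clustering or a GNN layer:</p>

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(label_sets):
    """Map edge (a, b) -> number of observations in which species a and b co-occur."""
    edges = defaultdict(int)
    for labels in label_sets:
        # Each unordered species pair in one observation adds one co-occurrence.
        for a, b in combinations(sorted(labels), 2):
            edges[(a, b)] += 1
    return dict(edges)
```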
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The data for this paper is organized and published by INRIA. We express our gratitude to all the
institutions and individuals involved in data collection and processing, including but not limited to the
Global Biodiversity Information Facility (GBIF, www.gbif.org), NASA, Soilgrids, and the Ecodatacube
platform. We also express our sincere appreciation to the LifeCLEF and GeoLifeCLEF teams for
organizing the challenges and providing timely support. All authors contributed helpful ideas during
the course of the competition and participated in writing and revising the paper, so all authors are
co-first authors. All of the authors, as corresponding authors, are obliged to reply to emails to provide
readers with the relevant code and data of this work and to explain the details of the work. Among them,
Haixu Liu, as the first corresponding author, is responsible for the necessary communication for the
publication of the article.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-4.5 and ChatGPT-o3 in order to: Text
Translation and Formatting assistance. After using these tools/services, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of GeoLifeCLEF 2025:
          <article-title>Plant species presence prediction with environmental and high-resolution remote sensing data</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          , J. S. Cañas, G. Martellucci,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of LifeCLEF 2025:
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Combining present-only and present-absent data with pseudolabel generation for species distribution modeling</article-title>
          , Working Notes of CLEF (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Tighnari: Multi-modal plant species prediction based on hierarchical cross-attention using graph-based and vision backbone-extracted features</article-title>
          ,
          <source>arXiv preprint arXiv:2501.02649</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          , et al.,
          <source>Overview of GeoLifeCLEF</source>
          <year>2024</year>
          :
          <article-title>Species composition prediction with high spatial resolution at continental scale using remote sensing</article-title>
          ,
          <source>CEUR-WS</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gorishniy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kotelnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Babenko</surname>
          </string-name>
          , TabM:
          <article-title>Advancing tabular deep learning with parameter-efficient ensembling</article-title>
          ,
          <source>arXiv preprint arXiv:2410.24210</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ridnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ben-Baruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Noy</surname>
          </string-name>
          , I. Friedman,
          <string-name>
            <given-names>M.</given-names>
            <surname>Protter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zelnik-Manor</surname>
          </string-name>
          ,
          <article-title>Asymmetric loss for multi-label classification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          • 2024 GeoLifeCLEF, •
          <year>2025</year>
          GeoLifeCLEF.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>• GLC25 PA and PO Tabular, • GLC25 PO Satellite Image, • GLC25 PO Time Series Cube, • GLC25 PO Bioclimatic, • Four Channel Timm Backbone</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>