<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tile Compression and Embeddings for Multi-Label Classification in GeoLifeCLEF 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patcharapong Aphiwetsa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark McDuffie</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We explore methods to solve the multi-label classification task posed by the GeoLifeCLEF 2024 competition with the DS@GT team, which aims to predict the presence and absence of plant species at specific locations using spatial and temporal remote sensing data. Our approach uses frequency-domain coefficients via the Discrete Cosine Transform (DCT) to compress and pre-compute the raw input data for convolutional neural networks. We also investigate nearest-neighbor models via locality-sensitive hashing (LSH) for prediction and to aid in the self-supervised contrastive learning of embeddings through tile2vec. Our best competition model utilized geolocation features with a leaderboard score of 0.152 and a best post-competition score of 0.161. Source code and models are available at https://github.com/dsgt-kaggle-clef/geolifeclef-2024.</p>
      </abstract>
      <kwd-group>
        <kwd>GeoLifeCLEF</kwd>
        <kwd>LifeCLEF</kwd>
        <kwd>remote sensing</kwd>
        <kwd>contrastive learning</kwd>
        <kwd>multi-label classification</kwd>
        <kwd>tile2vec</kwd>
        <kwd>discrete cosine transform</kwd>
        <kwd>locality-sensitive hashing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>The GeoLifeCLEF 2023 competition had seven submissions along with baseline results by the organizers [3]. Most
participants focused on bioclimatic rasters and satellite imagery, leveraging Convolutional Neural
Networks (CNNs) such as ResNets [4] for feature extraction. Participants combined rasters and trained
separate models for prediction [5]. Spatial coordinates (longitude/latitude) were commonly used with
models like K-Nearest Neighbors (KNN) and Random Forest, yielding surprisingly good results. However,
few participants combined the diverse modalities provided in the dataset, and only one participant utilized
time-series data with a 1D Convolutional Network.</p>
    </sec>
    <sec id="sec-overview">
      <title>3. Overview</title>
      <p>
        <preformat>{
  "type": "Polygon",
  "coordinates": [
    [
      [-32.26344, 26.63842],
      [-32.26344, 72.18392],
      [35.58677, 72.18392],
      [35.58677, 26.63842],
      [-32.26344, 26.63842]
    ]
  ]
}</preformat>
      </p>
      <p>The competition dataset has three main components. The first is the metadata associated
with the competition, comprising a presence-only training set, a presence-absence training set, and a
presence-absence test set. The metadata provides a mapping between location and the species labels available for
supervised training. The second is the remote-sensing and raster data provided in pixel format. The
final component is time-series data containing quarterly environmental data over a 20-year period.</p>
      <p>The presence-only training dataset comprises 5,079,797 examples over 3,845,533 survey sites
distributed across Western Europe. The dataset is drawn from crowd-sourced data with potential gaps
in the reported species. The presence-absent dataset has stricter semantics: species not included in
the survey are presumed absent. The training set has 1,483,637 examples over 88,987 sites, while the
test set has 4,716 sites. The datasets include an identifier for the survey site alongside latitude and
longitude, as per the schema in Listing 1. We compute a projection into EPSG:3035, which allows us to
measure Euclidean distance between sites in units of meters.</p>
      <p>The majority of available data are raster and satellite imagery. The provided GeoTIFF files contain
various measures such as elevation, roads, population, and soil. The GeoTIFF files are bounded by a
GeoJSON polygon that covers Western Europe as seen in Figures 1 and 2. RGB-NIR satellite imagery is
directly available as 128 × 128 pixel tiles associated with each survey site.</p>
      <p>
        <preformat>|-- dataset: string (nullable = true)
|-- surveyId: integer (nullable = true)
|-- lat_proj: double (nullable = true)
|-- lon_proj: double (nullable = true)
|-- lat: double (nullable = true)
|-- lon: double (nullable = true)
|-- year: integer (nullable = true)
|-- geoUncertaintyInM: double (nullable = true)
|-- speciesId: double (nullable = true)</preformat>
Listing 1: Metadata schema for the competition.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Processing Pipeline</title>
      <p>We explore several solutions for the multi-label classification problem. We use Luigi [6] as our workflow
management tool, which provides idempotent directed acyclic graphs (DAGs) of tasks. We use Spark
[7] to perform data extraction, transformation, and loading (ETL) from tarred images and CSV files to
columnar parquet files. We use PyTorch as our deep learning framework and use PyTorch-Lightning to
simplify the training and inference process. We use Petastorm to preprocess and load data into Torch.
We use Weights and Biases to log hyperparameters and metrics.</p>
      <sec id="sec-3-1">
        <title>4.1. Satellite and Raster Data</title>
        <p>The competition organizers provide point data for each survey site in a pre-computed train CSV file. Our
experiments focus on 128x128 pixel tiles extracted from the provided GeoTIFF files for use in a supervised
learning setting. Tiling incurs significant in-memory overhead because we need to store a 128x128 matrix
of integers or floats for each survey site. Memory access patterns can cause significant slow-downs if
we need to fetch tiles from disk often.</p>
        <p>We fork the official plantnet/GeoLifeCLEF data loaders to pre-compute tiles for each of the
provided GeoTIFF files under 1GB. Certain rasters do not fit into memory (e.g., the elevation raster at 11GB);
therefore, we omit them from our experiments. We only compute the tiles associated with the survey
identifiers in our metadata, which helps limit the size of the resulting dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Tile Compression via Discrete Cosine Transform (DCT)</title>
        <p>We compute the 2D-DCT on the resulting tile images and keep the low-frequency coefficients as features
for downstream modeling. We implement a PySpark wrapper around the ND-DCT to supplement
the standard library 1D-DCT implementation for feature pre-processing. We lose significant spatial
information if we perform filtering in 1D coefficient-space, as seen in Figure 5.</p>
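        <p>The compression step can be illustrated with a minimal pure-Python sketch. It assumes an orthonormal DCT-II applied separably to rows and columns and keeps the top-left block of coefficients; the function names and the toy 16x16 tile are hypothetical and do not reproduce the PySpark wrapper described above.</p>
        <preformat>
```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(tile):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in tile]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def compress(tile, k=8):
    """Keep the k x k block of lowest-frequency coefficients."""
    coeffs = dct_2d(tile)
    return [row[:k] for row in coeffs[:k]]


# A constant tile stores all of its energy in the first (DC) coefficient.
flat_tile = [[3.0] * 16 for _ in range(16)]
features = compress(flat_tile, k=8)
```
        </preformat>
        <p>With k = 8, a 128x128 tile would shrink from 16,384 values to 64 per layer, which is the trade-off between storage and spatial detail that motivates the approach.</p>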
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Time-Series Data</title>
        <p>Time-series data is treated as another layer in the network by pre-processing the data to obtain DCT
coefficients. We have access to quarterly time-series data for each survey site over 20 years. Some sites
have missing data, which are padded with zeros. We compute the 1D DCT on the time-series data and
keep the first 64 coefficients in the transformed space, which parallels the 8x8 2D-DCT coefficients
extracted from the raster data. The original time-series data and its DCT are displayed side by side in
Figure 6.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Data Augmentation</title>
        <p>We apply augmentations to our data to encourage model invariance to rotation and reflection. Such
transforms include rotating and flipping images before sending them into a model. Equivalent
augmentations exist in frequency space. For example, transposing an image in pixel space is equivalent
to transposing its 2D-DCT coefficients. We can flip the image in pixel space by alternating the
signs of the 2D-DCT coefficients along a given axis, and a 90-degree rotation is the composition of a
transpose and a flip. Rotations and flips along the axes give us enough
flexibility to implement useful symmetries in the data to improve model generalization.</p>
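        <p>A short sketch of these frequency-space augmentations follows. It assumes DCT-II coefficients stored as a list of rows, for which reversing a signal multiplies coefficient k by (-1)^k; the helper names are hypothetical.</p>
        <preformat>
```python
def flip_in_dct(coeffs, axis):
    """Flip the pixel-space image along `axis` by alternating coefficient signs.

    For the DCT-II, reversing a signal multiplies coefficient k by (-1)**k.
    axis=0 flips top-to-bottom, axis=1 flips left-to-right.
    """
    return [
        [value * (-1) ** (u if axis == 0 else v) for v, value in enumerate(row)]
        for u, row in enumerate(coeffs)
    ]


def transpose_in_dct(coeffs):
    """Transpose the pixel-space image by transposing its coefficients."""
    return [list(col) for col in zip(*coeffs)]


def rotate90_in_dct(coeffs):
    """Rotate 90 degrees counter-clockwise: transpose, then flip top-to-bottom."""
    return flip_in_dct(transpose_in_dct(coeffs), axis=0)


coeffs = [[1.0, 2.0], [3.0, 4.0]]
flipped = flip_in_dct(coeffs, axis=1)  # [[1.0, -2.0], [3.0, -4.0]]
```
        </preformat>
        <p>Because each operation only permutes or negates coefficients, the same functions apply unchanged to a truncated square block of low-frequency coefficients.</p>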
      </sec>
      <sec id="sec-3-5">
        <title>4.5. Locality-Sensitive Hashing for Nearest Neighbor Queries</title>
        <p>Sites that are close together should intuitively have similar distributions of plant species. The
projected latitude and longitude have physical meaning via Euclidean distance, so we can build a
nearest-neighbor model and rank species per survey site by frequency in a neighborhood. We use
locality-sensitive hashing (LSH) with random hyperplane projections to build a k-NN model [8], with
the hyper-parameters for bucket length and number of hash tables set to 20 and 5, respectively. We can
find the top-k nearest survey sites in linear time for any given survey site using the LSH model. The
approximate nearest neighbor self-join is performed with a 50km cutoff, and the results are stored on disk
for downstream modeling.</p>
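        <p>The hashing scheme can be sketched in pure Python as a bucketed random projection: each hash table projects a point onto a random unit vector and quantizes the result by the bucket length. This is an illustrative assumption about the scheme rather than the library code used in our pipeline; the function names, the seed, and the toy coordinates are hypothetical, while the bucket length, table count, and cutoff mirror the settings stated above.</p>
        <preformat>
```python
import math
import random
from collections import defaultdict


def make_hash_functions(dim, num_tables, seed=0):
    """Draw one random unit vector per hash table."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(num_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors


def hash_point(point, vectors, bucket_length):
    """Project onto each vector and quantize into buckets of fixed width."""
    return tuple(
        math.floor(sum(p * c for p, c in zip(point, v)) / bucket_length)
        for v in vectors
    )


def build_index(points, vectors, bucket_length):
    """Map (table, bucket) keys to the ids of the points in that bucket."""
    index = defaultdict(set)
    for pid, point in points.items():
        for table, bucket in enumerate(hash_point(point, vectors, bucket_length)):
            index[(table, bucket)].add(pid)
    return index


def query(point, points, index, vectors, bucket_length, k, cutoff):
    """Approximate top-k: exact distances on candidates sharing any bucket."""
    candidates = set()
    for table, bucket in enumerate(hash_point(point, vectors, bucket_length)):
        candidates |= index.get((table, bucket), set())
    scored = sorted((math.dist(point, points[c]), c) for c in candidates)
    return [(c, d) for d, c in scored if cutoff >= d][:k]


# Toy projected coordinates in meters (hypothetical survey sites).
sites = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (90000.0, 0.0)}
vectors = make_hash_functions(dim=2, num_tables=5)
index = build_index(sites, vectors, bucket_length=20.0)
neighbors = query((0.0, 0.0), sites, index, vectors, 20.0, k=10, cutoff=50000.0)
```
        </preformat>
        <p>More hash tables raise the chance that true neighbors share at least one bucket, at the cost of more candidates to score exactly.</p>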
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Models</title>
      <sec id="sec-4-1">
        <title>5.1. Nearest Neighbor Model</title>
        <p>We generate predictions by querying the LSH model built on survey site locations across presence-only
and presence-absent datasets in a 50km radius. For each survey site, we limit either the number of
neighbors or the distance to the nearest neighbor. Our nearest neighbor models (NN) consider all
neighbors within a 5km, 10km, and 50km radius. The k nearest neighbors model (k-NN) considers the
top 10 neighbors for each survey site and all of the species reported in the neighbors.</p>
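        <p>The ranking step described above can be sketched as follows. The sketch assumes the neighbor list for a site has already been retrieved and sorted by distance; the function name, argument names, and toy data are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def rank_species(neighbors, species_by_site, radius_m=None, top_neighbors=None, top_species=20):
    """Rank species by how often they occur among a site's neighbors.

    neighbors: list of (site_id, distance_m) pairs sorted by distance.
    species_by_site: mapping from site_id to the species reported there.
    """
    if radius_m is not None:
        neighbors = [(s, d) for s, d in neighbors if radius_m >= d]
    if top_neighbors is not None:
        neighbors = neighbors[:top_neighbors]
    counts = Counter()
    for site_id, _ in neighbors:
        counts.update(species_by_site.get(site_id, ()))
    return [species for species, _ in counts.most_common(top_species)]


neighbors = [(1, 100.0), (2, 4000.0), (3, 30000.0)]
species_by_site = {1: [10, 11], 2: [10], 3: [12]}
nn_5km = rank_species(neighbors, species_by_site, radius_m=5000.0)    # [10, 11]
knn_top2 = rank_species(neighbors, species_by_site, top_neighbors=2)  # [10, 11]
```
        </preformat>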
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Geolocation Model</title>
        <p>We model the relationship between geo-spatial metadata (i.e., projected latitude and longitude) and the
species labels. We train linear and non-linear multi-label classifiers, and treat these models
as a baseline for remote-sensing based models. The linear model uses a single linear layer that maps the
feature space into the output space. The non-linear model uses two layers, with an ℝ<sup>256</sup> latent
space in the first layer and a linear layer to map to the output space. We add random noise to latitude
and longitude with a mean of 5km to increase generalization.</p>
        <sec id="sec-4-2-1">
          <title>5.2.1. Tile CNN Model</title>
          <p>The processed remote-sensing data is represented by a multi-dimensional array with one indexing
dimension for the layer type and two spatial dimensions, making this a natural fit for CNNs. We
pre-process satellite and raster data by generating 128x128 pixel tiles, then applying the 2D-DCT and
keeping the 8x8 coefficients representing the lowest spatial frequencies. We use the coefficients as inputs
to a CNN model that learns a mapping to a multi-label classification task. We additionally run a few
experiments where we apply the inverse 2D-DCT to the coefficients to recover a multi-layer 128x128
pixel image. We learn models on the presence-absent dataset, since the dataset is representative of the
test set.</p>
          <p>We build CNN models that convolve input layers with a 3x3 kernel with padding, followed by a 1x1
convolution and a linear layer to map into latent space. We map latent space to a linear layer with the
number of classes as the output space. Batch normalization is applied at every convolution for numerical
stability, and ReLU activation provides non-linearities. We also experiment with alternative parameterizations
of the network, notably replacing the custom CNN with a pre-packaged efficientnetv2 backbone.</p>
          <p>We experiment with several models that take in different input data. The simplest model uses RGB-NIR
satellite imagery in four channels. We then incorporate the 13 MODIS landcover layers and 19
bioclimatic rasters. We then hand-choose specific layers from the rasters to remove redundancy, specifically
layers 9, 10, and 11 from MODIS and the years 2001, 2010, and 2019 from the bio-climatic rasters. The
other layers of the MODIS dataset correspond to legacy classification schemes and confidence bands,
which are likely not useful to our model. We choose three bands from bio-climatic rasters since the
years are evenly spaced across the set.</p>
          <p>We build a model using the provided time-series RGB-SWIR data. We reuse the same architecture as
the satellite imagery data by reshaping the first 64 coefficients of the time-series DCT into an 8x8 tensor
and then applying the same convolutional layers. The semantics are not necessarily the same as the
2D-DCT coefficients, but we hypothesize that the model can learn structure from this basis despite the lack of spatial
symmetry.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>5.2.2. Tile2Vec Model</title>
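          <p>Tile2Vec [9] is a self-supervised learning technique that learns embeddings of tiles of satellite imagery.
The Tile2Vec model utilizes a spatially-aware sampling procedure and triplet loss to learn a
low-dimensional embedding that preserves metric distances via the triangle inequality. The triplet loss is
L(t<sub>a</sub>, t<sub>n</sub>, t<sub>d</sub>) over an anchor tile t<sub>a</sub>, a neighbor tile t<sub>n</sub>, and a distant tile t<sub>d</sub>, with a margin m, where f<sub>θ</sub> maps data to a d-dimensional vector of real numbers using a
model with parameters θ.</p>
          <p>
            <disp-formula>
              <label>(1)</label>
              <tex-math><![CDATA[L(t_a, t_n, t_d) = \left[ \lVert f_\theta(t_a) - f_\theta(t_n) \rVert_2 - \lVert f_\theta(t_a) - f_\theta(t_d) \rVert_2 + m \right]_+]]></tex-math>
            </disp-formula>
          </p>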
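          <p>As a small worked example of the triplet loss, the sketch below evaluates it on hand-written embedding vectors. The embeddings, the margin of 1.0, and the function name are illustrative assumptions; a trained model would produce the embeddings from tiles.</p>
          <preformat>
```python
import math


def triplet_loss(anchor, neighbor, distant, margin=1.0):
    """Hinge on the gap between anchor-neighbor and anchor-distant distances."""
    d_near = math.dist(anchor, neighbor)
    d_far = math.dist(anchor, distant)
    return max(0.0, d_near - d_far + margin)


# The neighbor is already much closer than the distant tile: loss is zero.
loss_satisfied = triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 4.0], margin=1.0)

# Swapping the roles violates the margin: 5.0 - 0.1 + 1.0 = 5.9.
loss_violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.1, 0.0], margin=1.0)
```
          </preformat>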
          <p>We obtain triplets from the presence-only dataset by querying the LSH model to sample one million
pairs of tiles within 100km of each other. We generate a distant neighbor by randomly
selecting a tile from the dataloader batch. We train a tile2vec model using the CNN architecture described
in section 5.2.1 without a classifier head, in an ℝ<sup>256</sup> latent space. We experiment with a multi-objective
loss incorporating the sum of triplet and ASL losses, using labels for each survey site by aggregating
all species within each site’s radius. The classifier adds a linear layer to the learned latent space and
is trained with the ASL loss on the presence-absent dataset.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Model Evaluation and Loss Functions</title>
        <p>The competition uses the F1-micro score to evaluate models, and we use the same metric in
training. We evaluate the model during training and validation
using MulticlassF1Score(average="micro") from the torchmetrics library, with a 90-10
train-validation split of the presence-absent dataset.</p>
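        <p>For reference, the micro-averaged F1 score pools true positives, false positives, and false negatives over all sites and labels before computing the ratio. The sketch below is a plain-Python illustration on sets of species identifiers and is not the torchmetrics implementation; the function name and toy labels are hypothetical.</p>
        <preformat>
```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over per-site sets of true and predicted species."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        truth, pred = set(truth), set(pred)
        tp += len(truth.intersection(pred))
        fp += len(pred - truth)
        fn += len(truth - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y_true = [{1, 2, 3}, {4}]
y_pred = [{1, 2}, {4, 5}]
score = micro_f1(y_true, y_pred)  # tp=3, fp=1, fn=1, so 6 / 8 = 0.75
```
        </preformat>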
        <sec id="sec-4-3-1">
          <title>5.3.1. Binary Cross-Entropy</title>
          <p>Binary cross-entropy is a loss function used for binary classification. In a multi-label setting, each label
is treated as an independent binary classification problem. We use the loss function that accepts logits as input
for numerical stability, which is necessary to achieve acceptable convergence.</p>
          <p>
            <disp-formula>
              <label>(2)</label>
              <tex-math><![CDATA[\mathcal{L}_{\mathrm{BCE}} = -\sum_{c=1}^{C} y_{i,c} \log(p_{i,c})]]></tex-math>
            </disp-formula>
          </p>
        </sec>
        <sec id="sec-4-3-2">
          <title>5.3.2. Asymmetric Loss (ASL)</title>
          <p>The asymmetric loss [10] penalizes false positives and false negatives differently than the binary
cross-entropy loss. The loss is defined in terms of the probability of the network output p and the
hyperparameters γ<sub>+</sub> and γ<sub>−</sub>. Setting γ<sub>−</sub> &gt; γ<sub>+</sub> emphasizes positive examples, while setting both terms
to 0 yields binary cross-entropy. Easy negative samples are dynamically down-weighted, and hard
thresholded samples are ignored.</p>
          <p>
            <disp-formula>
              <label>(3)</label>
              <tex-math><![CDATA[\mathcal{L}_{\mathrm{ASL}} = \begin{cases} L_{+} = (1 - p)^{\gamma_{+}} \log(p) \\ L_{-} = (p_m)^{\gamma_{-}} \log(1 - p_m) \end{cases}]]></tex-math>
            </disp-formula>
          </p>
          <p>Here p<sub>m</sub> = max(p − m, 0) is the shifted probability with margin m, as defined in [10]. We sweep over parameters γ<sub>+</sub> ∈ {0, 1} and γ<sub>−</sub> ∈ {0, 2, 4}. The default values are γ<sub>+</sub> = 1 and
γ<sub>−</sub> = 4.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>5.4. Hill Loss</title>
        <p>The Hill loss [11] is a loss function designed for robust multi-label classification with missing labels.
The loss is defined as a weighted mean-squared error (MSE), where the weight modulates potential
false negatives.</p>
        <p>
          <disp-formula>
            <label>(4)</label>
            <tex-math><![CDATA[\mathcal{L}^{-}_{\mathrm{Hill}} = -w(p) \times \mathrm{MSE} = -(\lambda - p)\, p^{2}]]></tex-math>
          </disp-formula>
        </p>
        <p>The authors' implementation takes the following form, with default values of
λ = 1.5 and γ = 2.</p>
        <p>
          <disp-formula>
            <label>(5)</label>
            <tex-math><![CDATA[\mathcal{L}_{\mathrm{Hill}} = y \times (1 - p)^{\gamma} \log(p) + (1 - y) \times -(\lambda - p)\, p^{2}]]></tex-math>
          </disp-formula>
        </p>
        <sec id="sec-4-4-1">
          <title>5.4.1. sigmoidF1</title>
          <p>The sigmoidF1 loss [12] optimizes the F1 score directly by creating a differentiable approximation of
the F1 score. We first define the terms true positive, false positive, false negative, and true negative as a
function of the sigmoid function.</p>
          <p>
            <disp-formula>
              <label>(6)</label>
              <tex-math><![CDATA[\begin{aligned} \widetilde{tp} &= \sum S(\hat{\mathbf{y}}) \odot \mathbf{y} & \widetilde{fp} &= \sum S(\hat{\mathbf{y}}) \odot (1 - \mathbf{y}) \\ \widetilde{fn} &= \sum (1 - S(\hat{\mathbf{y}})) \odot \mathbf{y} & \widetilde{tn} &= \sum (1 - S(\hat{\mathbf{y}})) \odot (1 - \mathbf{y}) \end{aligned}]]></tex-math>
            </disp-formula>
          </p>
          <p>where S(ŷ) is the sigmoid function applied to the model’s output ŷ.</p>
          <p>
            <disp-formula>
              <label>(7)</label>
              <tex-math><![CDATA[S(u; \beta, \eta) = \frac{1}{1 + \exp(-\beta (u + \eta))}]]></tex-math>
            </disp-formula>
          </p>
          <p>Then we define the F1 score as a function of the true positive, false positive, and false negative terms.</p>
          <p>
            <disp-formula>
              <label>(8)</label>
              <tex-math><![CDATA[\mathcal{L}_{\widetilde{F1}} = 1 - \widetilde{F1}, \quad \text{where} \quad \widetilde{F1} = \frac{2\,\widetilde{tp}}{2\,\widetilde{tp} + \widetilde{fn} + \widetilde{fp}}]]></tex-math>
            </disp-formula>
          </p>
          <p>We are given two hyper-parameters, β = −S and η = E. For tuning, we sweep over parameters
S ∈ {−1, −15, −30} and E ∈ {0, 1} as suggested in the authors’ experiments. The default values are
S = −1 and E = 0.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
      <p>We report the performance of our models using the hidden test set on the Kaggle competition leaderboard.
For torch-based models, predictions are made in two ways: top-k and threshold. The top-k method
selects the top-k species with the highest probability, while the threshold method selects species with a
probability greater than a threshold. We set k to 20 and the threshold to 0.5 for all relevant models.</p>
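      <p>The two prediction rules can be sketched as follows, operating on a list of per-species probabilities for one site. The function names and the toy probabilities are hypothetical; the defaults of k = 20 and a threshold of 0.5 follow the settings stated above.</p>
      <preformat>
```python
def predict_top_k(probs, k=20):
    """Return the ids of the k species with the highest probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def predict_threshold(probs, threshold=0.5):
    """Return the ids of all species whose probability exceeds the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]


probs = [0.9, 0.2, 0.6, 0.4]
by_rank = predict_top_k(probs, k=3)   # [0, 2, 3]
by_value = predict_threshold(probs)   # [0, 2]
```
      </preformat>
      <p>Top-k always emits a fixed number of species per site, whereas thresholding can emit none when the model is poorly calibrated, which is one reason the two rules can score differently.</p>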
      <sec id="sec-5-1">
        <title>6.1. Nearest Neighbor Model</title>
        <p>We report nearest neighbor models in table 1. The k-NN models perform better than our NN models,
likely because limiting the number of neighbors filters out noise at each distance threshold. We observe that when we do not limit the
number of neighbors, larger distance thresholds lead to worse performance, with the score dropping from 0.10 to
0.08 when going from 5km to 50km. The opposite is true once we keep the top 10 neighbors, where
there is a small improvement in score as we increase the distance threshold.</p>
      </sec>
      <sec id="sec-5-2">
        <title>6.2. Geolocation Model</title>
        <p>The projected latitude and longitude provide a relatively strong signal for the multi-label species
classification tasks, with a score of 0.161 on the public leaderboard when using the top-k results in
table 3. Our experiments show that the linear model performs poorly as per table 2. However, adding a
non-linear layer to the model increases the performance by a large margin. For example, models trained
with BCE loss without class weights go from 0.03 to 0.14 when adding a non-linear layer.</p>
        <p>During training of BCE models, we observed that while the loss was decreasing on both the training
and validation sets, the validation F1 score would peak on the first epoch and then decrease over
time. We suspect that BCE loss has difficulty with the class imbalance even when given explicit class
weights. ASL increases the validation scores by a wide margin, likely because of the dynamic weighting
behavior. While the Hill and sigmoidF1 losses report improvements over ASL in the experimental
settings in the literature, we find that their default hyper-parameters perform worse than BCE loss. It is possible that
hyper-parameter tuning could increase performance over ASL. For the rest of the experiments, we focus
on ASL as the primary loss due to its performance and robust parameterization.</p>
        <p>For BCE models, we find that adding a class weight that is proportional to the normalized frequency
in the dataset does not improve the performance of the model. It is possible that these weights are not
computed correctly, but ASL provides a dynamic weighting mechanism that is far more effective when
the number of classes is large.</p>
        <p>For our ASL models, we find that the best hyper-parameters are γ<sub>−</sub> = 0 and γ<sub>+</sub> = 0 on the public
leaderboard, but the default values of γ<sub>−</sub> = 4 and γ<sub>+</sub> = 1 work just as well. This configuration performs better
than the BCE model, despite claims that setting γ<sub>−</sub> and γ<sub>+</sub> to zero would lead to a model that
is equivalent to the BCE model. We note that if we had to choose hyperparameters for ASL through a
validation set, it is possible that we could choose values that would be sub-optimal for the test set. We
choose to use the default values for the rest of our experimentation.</p>
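        <p>The expected equivalence between ASL with both focusing parameters at zero and BCE can be checked numerically for a single label. The sketch below is a scalar illustration and not the implementation used in our experiments; the function names and the margin argument are assumptions, and the loss is written with the leading minus sign so that lower is better.</p>
        <preformat>
```python
import math


def asl(p, y, gamma_pos=1.0, gamma_neg=4.0, margin=0.0):
    """Asymmetric loss for one label with probability p and target y in {0, 1}."""
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    p_m = max(p - margin, 0.0)  # shifted probability for negatives
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m)


def bce(p, y):
    """Binary cross-entropy for one label."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


# With both gammas at zero and no margin, ASL matches BCE.
same = math.isclose(asl(0.3, 0, 0.0, 0.0, 0.0), bce(0.3, 0))

# An easy negative (low p) is strongly down-weighted under the defaults.
easy_negative = asl(0.1, 0)

# A non-zero margin zeroes out negatives below it, even with zero gammas.
shifted = asl(0.04, 0, 0.0, 0.0, margin=0.05)
```
        </preformat>
        <p>If an implementation applies a non-zero probability margin by default, zero focusing parameters no longer reduce exactly to BCE, which is one possible source of the gap we observe.</p>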
        <p>The Hill loss performs between BCE and ASL in the non-linear model, and so we do not consider it
further given the performance of ASL. The sigmoidF1 loss performs the worst out of all of the losses,
despite the tuning in the ranges provided by the literature.</p>
      </sec>
      <sec id="sec-5-3">
        <title>6.3. Tile CNN Models</title>
        <p>We report minor improvements over the performance of the geolocation model, with our best model
utilizing satellite imagery at 0.161 on the public leaderboard in table 3. However, training a model
on the presence-absent dataset is difficult because the model converges to a local minimum. This behavior is
most prevalent with the efficientnetv2 backbone, where the larger parameter space and domain-specific
pre-processing distortion lead to sub-optimal convergence.</p>
        <p>Adding the first 13 channels of the landcover raster to the RGB-NIR channels does not converge to a useful
model. We find that our performance drops to 0.02 when we try to utilize these features. Keeping
only the subset of features from landcover yields a model that performs relatively well, but lacks large
improvements over the RGB-NIR model.</p>
        <p>The time-series model learns some structure despite the strange input representation with a score of
0.10 on the public leaderboard.</p>
      </sec>
      <sec id="sec-5-4">
        <title>6.4. Tile2Vec Model</title>
        <p>Tile2vec learns a useful representation that leads to smooth convergence of downstream classifiers. We
observe convergence occurring within four epochs in our experimental setting, and the
F1 metric for both validation and training sets increases monotonically on the classifier. In contrast, the
models without the tile2vec backbone have validation F1 scores that fluctuate, typically within the first
five epochs, and then decrease over time. The learned predictions are marginally less effective than
learning the CNN model directly on the presence-absent dataset.</p>
        <p>When we trained the model with ASL as part of the optimization objective, we observed that the triplet
loss term no longer decreased monotonically over time. Instead, it sharply decreased to a minimum,
increased, and then decreased slowly over time. The behavior is likely due to the difference in magnitude of
the triplet loss and the ASL loss. The triplet loss is normalized, while the ASL loss is not, so the ASL
term dominates the gradient updates. This version of the model performs better on the
transfer learning task to presence-absent data than the triplet loss alone.</p>
      </sec>
      <sec id="sec-5-5">
        <title>6.5. Competition Performance</title>
        <p>Our best models are reported against the public leaderboard in table 4. The best score of the competition
is 0.4089, while baseline models provided by the competition organizers lie around 0.25. Our models
lie between the granular and coarse-grain frequency-based models.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Discussion</title>
      <p>We had difficulty overcoming basic baselines in the competition. In particular, frequency-based baseline
submissions can be significantly more effective than the solutions proposed in our research of the
problem. These baselines predict the top 25 species at varying levels of locality (e.g.,
globally or regionally) and by dataset. However, we find that latitude and longitude are surprisingly
predictive of plant species in the dataset given an appropriate loss function. Using these geospatial
features provides a useful diagnostic for more complex datasets, since the number of input features is
small and the models are easier to debug. One possible limitation of our methodology is that we do not utilize the
presence-only dataset with the exception of pre-training the tile2vec model.</p>
      <sec id="sec-6-1">
        <title>7.1. Alternative Methods</title>
        <p>Learning a relationship between latitude and longitude and the species labels with classical machine
learning techniques and off-the-shelf libraries is computationally intractable. We note alternative
approaches that were explored but did not produce results for various reasons.</p>
        <sec id="sec-6-1-1">
          <title>7.1.1. Classical Supervised Learning</title>
          <p>Instead of using a neural network to learn a mapping from location to species, we tried learning the
mapping via logistic regression. This numerically simple model can be learned using Spark via stochastic
gradient descent (SGD). As a validation, we build a model to predict the 10 most frequent species in
the dataset per site using only the location features. This achieves an F1-macro score of 0.09 when
splitting the sites into a 90-10 train-validation split, which is better than random but roughly equivalent
to always choosing the most frequent species.</p>
          <p>We run into out-of-memory (OOM) issues when learning on 5 million rows and 10,358 species with
scikit-learn or statsmodels. When we run the same procedure in Spark via distributed stochastic gradient
descent (SGD), we find it will run for over 48 hours on a GCP n1-standard-8 instance (8 vCPU, 16GB
RAM, 350GB NVMe SSD) using 3-fold cross-validation (CV). We suspect this is due to the size of the
coefficients involving J features and K output classes. Presuming an 8-byte double, the coefficients alone
will be at least 8MB, larger than the typical CPU cache.</p>
          <p>We investigate other algorithms for modeling multi-label classification, including Naive Bayes, SVM,
Random Forests, and Factorization Machines. Naive Bayes assumes non-negative count data. SVMs
are not tractable for our problem, and are slower to solve than linear/logistic regression for other
problems in the Spark toolbox. Random Forests only support up to 100 classes in Spark, likely due
to the branching factor to support each class. Factorization machines suffer from a similar issue to logistic
regression and SVMs. Our final attempt to model multi-class classification via classical supervised
techniques is through XGBoost [13], which maintains a Spark binding. We run out of memory when
trying to model many classes.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>7.1.2. Low-rank Multilabel-Space Regression</title>
          <p>Instead of trying to learn the mapping between features and label-space directly, it is possible to learn a
relationship between features and a low-rank multi-label space instead [14]. It takes 30 minutes to learn
a regression between location and a single binary response using either linear regression or XGBoost.
Given these constraints, we would like to constrain our model to 4-8 response variables.</p>
          <p>We try reducing the label-space via the DCT since the relationship is trivially invertible in the
machine learning pipeline. We find that this is untenable since we need many more coefficients than
are available in our budget to represent discontinuities in species presence.</p>
          <p>Another approach is to use singular value decomposition (SVD) to compute a projection of label-space
into the first few eigenvectors, and then learn the relationship between the features and the projection
[15]. Then, predictions are quantized using nearest neighbors in the projection of the label-space. This
process is similar to latent semantic indexing (LSI), and would allow a model to take into consideration
co-occurrences between labels. While interesting, this approach requires significant engineering effort
for results that are no more interpretable than neural networks.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>7.1.3. Node2Vec</title>
          <p>Node2vec [16] learns to preserve properties of network nodes using biased random walks. Using the
K-NN graph, we attempt embedding the survey sites using the co-occurrences of species among sites.
We could then use the survey site embedding as a feature for the classification task. The survey node
embedding is intractable due to the network size of 4 million nodes and 1 billion edges. A species node
embedding can be computed in 20 minutes, which results in a vector representation of species that can
be used for clustering or classification.</p>
          <p>A survey node embedding would be useful as a feature for the classification task, since it would
require no further processing to go from survey site to species. To take advantage of the species
embeddings, we would need to compute some average of the embedding vectors before passing into a
supervised classification model.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>8. Future Work</title>
      <p>We have explored various techniques for finding useful representations to model species distribution.
One area for future work is to capture better nuance associated with the self-supervised representation
learning of the tiles. We quickly reached a limit in how well the model could represent our training data,
so it would be helpful to rigorously explore alternative model parameterizations and hyperparameters
for the various loss functions. Additionally, it is unclear how best to incorporate the many raster layers
provided in the competition. Ideally we would be able to determine which layers are most important to
the multi-label classification task, possibly through extensive ablation testing of the features.</p>
      <p>We would also like to continue exploring network or graph models of the surveys and species. A rich
interconnection exists between sites and species, where sparse co-occurrences could be exploited through
spatial locality. One way to do this is to construct node features through message passing
among survey site nearest neighbors. Graph neural networks could also be an effective mechanism for
generating embedding spaces by propagating information via diffusion. However, implementing these
techniques could be challenging, especially since we failed to generate a survey embedding through the
survey-species bipartite network due to computational constraints.</p>
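      <p>As a rough illustration of message passing over nearest neighbors, the sketch below performs one round of mean aggregation: each site keeps its own feature vector and appends the mean of its neighbors' vectors. The function name, the site identifiers, and the feature values are hypothetical; in practice the neighbor lists might come from a nearest-neighbor index such as the LSH model, which is an assumption here.</p>
      <preformat>
```python
def aggregate_neighbors(features, neighbors):
    """One round of mean-aggregation message passing (illustrative sketch).

    features: dict mapping site id to a list of floats
    neighbors: dict mapping site id to a list of neighboring site ids
    """
    updated = {}
    for site, own in features.items():
        messages = [features[n] for n in neighbors.get(site, []) if n in features]
        if messages:
            pooled = [sum(column) / len(messages) for column in zip(*messages)]
        else:
            # Isolated site: use zeros so every output has the same width.
            pooled = [0.0] * len(own)
        # Concatenate the site's own features with the neighborhood mean.
        updated[site] = list(own) + pooled
    return updated


# Toy example with three sites and made-up features.
features = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [1.0, 1.0]}
neighbors = {"s1": ["s2", "s3"], "s2": ["s1"], "s3": []}
updated = aggregate_neighbors(features, neighbors)
print(updated["s1"])
```
      </preformat>
      <p>Repeating the aggregation would propagate information across multi-hop neighborhoods, at the cost of doubling the feature width each round in this simple form.</p>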
      <p>Our findings indicate significant variation in label frequency: some labels have
fewer than 10 data points, while others have more than 10,000. This imbalance biases the
training. A possible solution would be to bin the labels by frequency so that the labels in each bin
have a comparable number of data points. This would also allow us to utilize XGBoost, since it
would reduce the number of classes that need to be classified.</p>
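      <p>One possible binning scheme, shown only as a sketch, groups labels by the order of magnitude of their occurrence count, so that labels seen 10 to 99 times share a bin, labels seen 100 to 999 times share another, and so on. The function name and the toy survey lists are hypothetical, and the order-of-magnitude rule is one assumption among many possible choices of bin edges.</p>
      <preformat>
```python
from collections import Counter


def bin_labels_by_frequency(label_lists):
    """Group labels by the order of magnitude of their occurrence count."""
    counts = Counter(label for labels in label_lists for label in labels)
    bins = {}
    for label, count in counts.items():
        # Number of digits minus one gives floor(log10(count)) exactly.
        magnitude = len(str(count)) - 1
        bins.setdefault(magnitude, []).append(label)
    return {magnitude: sorted(labels) for magnitude, labels in bins.items()}


# Toy data: sp_a occurs 15 times, sp_b 12 times, sp_c once.
surveys = [["sp_a", "sp_b"]] * 12 + [["sp_a"]] * 3 + [["sp_c"]]
bins = bin_labels_by_frequency(surveys)
print(bins)
```
      </preformat>
      <p>Each bin could then be handled by its own model, so that no single classifier has to cover the full label space.</p>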
      <p>It would also be interesting to implement a proper model of the dynamics of the remote sensing
data. We can build a manifold representation of satellite imagery, as demonstrated by our experiments with
tile2vec. It should be possible to model the linearized dynamics of a system by learning a Koopman
operator that advances the state from one timestep to the next. We hypothesize that this could be
done by conditioning the tile embeddings on state evolutions, e.g., the 20 years of bio-climatic rasters
and quarterly time series data. One potential way to do this is to learn a spatiotemporal embedding of
the tiles via an explicit sequence model like an LSTM or transformer, alongside methods to enforce the
geographical distributional semantics afforded by tile embeddings. Another approach is to perform
data-driven system identification on the bio-climatic rasters that have been
embedded into the space, recovering the governing equations of the system with a method like
SINDy [17].</p>
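      <p>As a sketch of the linearized model, assume each tile has an embedding <inline-formula><tex-math>z_t \in \mathbb{R}^d</tex-math></inline-formula> at timestep <inline-formula><tex-math>t = 1, \dots, T</tex-math></inline-formula>; this notation is introduced here for illustration only. A finite-dimensional approximation of the Koopman operator could then be fit by least squares, as in dynamic mode decomposition:</p>
      <disp-formula>
        <tex-math>z_{t+1} \approx K z_t, \qquad \hat{K} = \arg\min_{K} \sum_{t=1}^{T-1} \left\| z_{t+1} - K z_t \right\|_2^2 = Z' Z^{\dagger}</tex-math>
      </disp-formula>
      <p>Here <inline-formula><tex-math>Z = [z_1, \dots, z_{T-1}]</tex-math></inline-formula> and <inline-formula><tex-math>Z' = [z_2, \dots, z_T]</tex-math></inline-formula> stack the embeddings as columns, and <inline-formula><tex-math>Z^{\dagger}</tex-math></inline-formula> denotes the Moore-Penrose pseudoinverse. Whether tile embeddings are suitable observables for such a model remains an open question.</p>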
    </sec>
    <sec id="sec-8">
      <title>9. Conclusions</title>
      <p>In this study, we addressed the multi-label classification challenge of GeoLifeCLEF 2024, which aims to
predict the presence or absence of plant species at specific locations based on spatial and temporal remote
sensing data. We explored using a compressed version of the remote sensing data to train deep learning
models, with varying levels of success. We took advantage of the geospatial nature of the data by building
a neighborhood model with locality-sensitive hashing. Predictions from the neighborhood model
perform better than some of the simplest frequency models made by the competition organizers. The
neighborhood model is also used as part of a self-supervised embedding model that learns a low-dimensional
representation of the data that is effective for classification. Despite poor performance on the leaderboard,
some of the ideas presented in this working note have not been fully explored and have potential for
future work. Source code and models are available at https://github.com/dsgt-kaggle-clef/geolifeclef-2024.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>Thank you to Professor Patricio Vela for supervising the project for Anthony Miyaguchi’s ECE8903
Special Problems course at Georgia Tech. Thank you to the DS@GT CLEF group for access to cloud
computing resources through Google Cloud Platform, and for a supportive environment for
collaboration.</p>
      <p>[9] N. Jean, S. Wang, A. Samar, G. Azzari, D. Lobell, S. Ermon, Tile2vec: Unsupervised representation
learning for spatially distributed data, in: Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, 2019, pp. 3967–3974.</p>
      <p>[10] T. Ridnik, E. Ben-Baruch, N. Zamir, A. Noy, I. Friedman, M. Protter, L. Zelnik-Manor, Asymmetric
loss for multi-label classification, in: Proceedings of the IEEE/CVF international conference on
computer vision, 2021, pp. 82–91.</p>
      <p>[11] Y. Zhang, Y. Cheng, X. Huang, F. Wen, R. Feng, Y. Li, Y. Guo, Simple and robust loss design for
multi-label learning with missing labels, arXiv preprint arXiv:2112.07368 (2021).</p>
      <p>[12] G. Bénédict, V. Koops, D. Odijk, M. de Rijke, sigmoidF1: A smooth F1 score surrogate loss for
multilabel classification, arXiv preprint arXiv:2108.10566 (2021).</p>
      <p>[13] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM
SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 785–794.</p>
      <p>[14] A. Dasgupta, S. Katyan, S. Das, P. Kumar, Review of extreme multilabel classification, arXiv
preprint arXiv:2302.05971 (2023).</p>
      <p>[15] F. Tai, H.-T. Lin, Multilabel classification with principal label space transformation, Neural
Computation 24 (2012) 2508–2542.</p>
      <p>[16] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the
22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp.
855–864.</p>
      <p>[17] S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse
identification of nonlinear dynamical systems, Proceedings of the national academy of sciences
113 (2016) 3932–3937.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Marcos</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of GeoLifeCLEF 2024: Species presence prediction based on occurrence data and high-resolution remote sensing images</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of GeoLifeCLEF 2023: Species composition prediction with high spatial resolution at continental scale using remote sensing</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          . URL: https://api.semanticscholar.org/CorpusID:264441401.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H. Q.</given-names>
            <surname>Ung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wada</surname>
          </string-name>
          ,
          <article-title>Leverage samples with single positive labels to train cnn-based models for multi-label plant species prediction</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          . URL: https://api.semanticscholar.org/CorpusID:264441458.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rouhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buchfuhrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bernhardsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Freider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Poulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stadther</surname>
          </string-name>
          , Honnix,
          <string-name>
            <given-names>U.</given-names>
            <surname>Barbans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krasnukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Whiting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Crobak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rieger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiosidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nyman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Demaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kukul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raposo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McGinty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peksag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brausewetter</surname>
          </string-name>
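          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tavory</surname>
          </string-name>
          , hirosassa,
          <string-name>
            <given-names>G.</given-names>
            <surname>Balaraman</surname>
          </string-name>
          ,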
          <string-name>
            <given-names>T.</given-names>
            <surname>Engström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grainger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Czygan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arapé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caswell</surname>
          </string-name>
          , spotify/luigi,
          <year>2024</year>
          . URL: https://github.com/spotify/luigi.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaftan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          , et al.,
          <article-title>Spark sql: Relational data processing in spark</article-title>
          ,
          <source>in: Proceedings of the 2015 ACM SIGMOD international conference on management of data</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1383</fpage>
          -
          <lpage>1394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          ,
          <article-title>Mining of massive data sets</article-title>
          , Cambridge university press,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Azzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lobell</surname>
          </string-name>
          , S. Ermon, Tile2vec: Unsupervised representation
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>