<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weak-Supervision Based on Label Proportions for Earth Observation Applications from Optical and Hyperspectral Imagery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura E. Cué La Rosa</string-name>
          <email>lauracuerosa@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dário A. Borges Oliveira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sam Thiele</string-name>
          <email>sam.thiele01@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedram Ghamisi</string-name>
          <email>p.ghamisi@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Gloaguen</string-name>
          <email>r.gloaguen@hzdr.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science in Earth Observation, Technical University of Munich (TUM)</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology</institution>
          ,
          <addr-line>Freiberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Advanced Research in Artificial Intelligence (IARAI)</institution>
          ,
          <addr-line>1030 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Pontifical Catholic University of Rio de Janeiro (PUC-Rio)</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>School of Applied Mathematics</institution>
          ,
          <addr-line>Getulio Vargas Foundation, Rio de Janeiro</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we assess a weakly supervised approach that employs weak constraints in the form of class proportions to train a neural network capable of performing pixel-wise classification for Earth Observation (EO) applications. The approach combines self-supervised contrastive clustering with a constraint on cluster proportions in an online fashion, which allows its application to large-scale EO images. The methodology is based on generating simple augmented views of input image tiles and on a loss function that performs contrastive learning, so that the results are invariant to these augmentations while simultaneously following the cluster proportions constraint. In many EO applications, information about class proportions is available through expert knowledge or, e.g., governmental census data. This weak information allows a classifier to be trained without pixel-level class information, alleviating the burden of manual annotation. In this context, crop and geological mapping from EO data are two crucial applications in the search for sustainable ways of resource management. We tested the approach on optical and hyperspectral data, achieving promising results and demonstrating the method's applicability across different applications and data sources.</p>
      </abstract>
      <kwd-group>
        <kwd>Weak supervision</kwd>
        <kwd>Learning from proportions</kwd>
        <kwd>Multi-source</kwd>
        <kwd>Crop mapping</kwd>
        <kwd>Geological mapping</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Self-supervised learning [1, 2, 3, 4] has recently emerged as a powerful tool in computer vision applications. Among the existing self-supervised methods, contrastive learning can be considered the most promising one. This type of approach is based on the generation of augmented versions of the input image and on a twin network for feature extraction that, combined with a contrastive loss function, learns to deliver consistent results across these augmentations. The contrastive loss function is expected to increase the similarity among augmentations of the same image while decreasing the similarity among augmentations of different images. The main characteristic of these methods is their capability of learning meaningful feature representations in an unsupervised fashion.</p>
      <p>This capability has opened new avenues in research fields beyond computer vision, such as Earth Observation (EO) applications. In this context, crop and geological mapping from EO data are two crucial applications for agricultural monitoring and modern mining, where training information is frequently limited or non-existent. In EO, self-supervised methods have been employed successfully for image classification, object detection and semantic segmentation [5, 6, 7, 8, 9]. Some of these works employ geolocation and spatio-temporal information to learn a more discriminative set of features for remote sensing applications [5, 10]. Hyperspectral image classification and clustering using contrastive learning have also been the focus of recent publications [9, 8]. However, all the approaches mentioned above need positive and negative sample pairs to compute the contrastive loss, which is computationally intensive.</p>
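      <p>The contrastive principle described above can be illustrated with a generic InfoNCE-style loss. The following is a minimal NumPy sketch under our own assumptions (cosine similarity, a temperature of 0.1, the function name <italic>info_nce</italic>); it is not the loss used by LLP-Co, which avoids explicit negative pairs. It also makes the cost mentioned above visible: every batch of B pairs requires a B × B matrix of pairwise similarities.</p>
      <preformat preformat-type="code">
```python
import numpy as np


def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE-style contrastive loss for a batch of B paired views.

    Row k of z1 and row k of z2 are two augmentations of the same image
    (the positive pair); every other row of z2 serves as a negative for z1[k].
    """
    # L2-normalise so that dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    # (B, B) similarity matrix: the pairwise cost that grows with the batch size
    logits = z1 @ z2.T / temperature
    # numerically stable row-wise log-softmax
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; the loss rewards a high log-probability there
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # matching views
mismatched = info_nce(z, z[::-1])                            # positives paired with the wrong image
print(aligned, mismatched)  # the aligned batch should give the lower loss
```
      </preformat>
      <p>Pulling together views of the same image and pushing apart views of different images is exactly what lowers the loss for the aligned batch.</p>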
      <p>One of the most important contrastive learning methods is Swapping Assignments between Multiple Views (SwAV) [2], which performs self-supervised learning and clustering in an online fashion. The method employs an optimal transport (OT) solver to assign the image feature vectors to cluster centroids by means of an equipartition constraint, which ensures that all samples within a batch of images are equally assigned to the predefined number of clusters.</p>
      <p>An advantage of SwAV over previously proposed contrastive learning frameworks is that the OT solver with the equipartition constraint makes pairwise comparisons unnecessary. Recently, weak information in the form of class proportions was introduced as a constraint in SwAV to train a classifier in a weakly supervised fashion. This method, called Learning from Label Proportions with Prototypical Contrastive Clustering (LLP-Co) [11], replaces the equipartition constraint in the OT solver with a cluster proportions constraint.</p>
      <p>Using information about class proportions to train a classifier has gained attention in recent years [12, 13, 14, 15]. Given a set of images, the Learning from Label Proportions (LLP) approach focuses on learning an instance-level classifier using as reference signal only the class proportions observed in the set. In EO applications, with a large amount of available data and no pixel-level annotations, using priors such as class proportions is an attractive solution. In many real-life scenarios, these proportions can be obtained from governmental census data or even from expert knowledge. Examples of governmental agencies that record statistics about agriculture, forestry, and natural resources, among others, are the National Agricultural Statistics Service of the United States Department of Agriculture (https://www.nass.usda.gov/), the Brazilian Institute of Geography and Statistics (IBGE) in Brazil (https://www.ibge.gov.br/), Forest Research in the United Kingdom (https://www.forestresearch.gov.uk/tools-and-resources/statistics/forestry-statistics/), and the European Statistics website (https://ec.europa.eu/eurostat).</p>
      <p>This paper assesses the viability of using contrastive learning combined with LLP to train a pixel-wise classifier based only on prior information about global class proportions for EO applications. We tested the LLP-Co methodology on two datasets: the first focuses on crop type mapping using optical data, and the second on geological mapping using hyperspectral data. This allows us to assess the model's applicability across different applications and data sources. Hence, the main contribution of this study is a weakly supervised deep clustering method that employs label proportions as priors and can easily be applied to large-scale EO data from different sources for significantly different applications.</p>
    </sec>
    <sec id="sec-1a">
      <title>2.1. LLP and Optimal Transport</title>
      <p>In this work, we assess the LLP-Co approach in a scenario where only the global class proportions are available to train the network. To implement LLP, the training samples are split into m disjoint bags of image tiles, where ℬ<sub>i</sub> is the i-th bag, which consists of a set of n<sub>i</sub> image tiles randomly cropped from the large-scale input EO image. Here, <inline-formula><tex-math>\mathcal{B}_i = \{x_{i,j}\}_{j=1}^{n_i}</tex-math></inline-formula>, where x<sub>i,j</sub> is the j-th image tile within the bag. The final training set is then expressed as <inline-formula><tex-math>\mathcal{D} = \{(\mathcal{B}_i, w)\}_{i=1}^{m}</tex-math></inline-formula>, where the vector of global label proportions w is the same for all bags. For a multi-class problem with C classes, w ∈ Δ<sub>C</sub> (the probability simplex), with <inline-formula><tex-math>\sum_{c=1}^{C} w_c = 1</tex-math></inline-formula>, where the element w<sub>c</sub> is the proportion of tiles that belong to class c.</p>
      <p>In the methodology, a neural network acts as the feature extractor, followed by a softmax layer that delivers the class probability vector <inline-formula><tex-math>\tilde{p}_{i,j} = p(y \mid x_{i,j}, \theta)</tex-math></inline-formula>, where θ represents the network parameters [16]. The estimated label proportions for each bag are then expressed as</p>
      <disp-formula id="eq-w-hat">
        <tex-math>\hat{w}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} \tilde{p}_{i,j},</tex-math>
      </disp-formula>
      <p>and a standard cross-entropy loss can be used to train the network:</p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>\mathcal{L}(\hat{w}, w) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{c=1}^{C} w_c \log \hat{w}_{i,c}.</tex-math>
      </disp-formula>
      <p>The above equation is reformulated by encoding the label proportions as a posterior distribution q(y | x<sub>i,j</sub>) [1, 17, 11]:</p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math>\mathcal{L}(q, p) = -\frac{1}{N}\sum_{i=1}^{m}\sum_{j=1}^{n_i}\sum_{y=1}^{C} q(y \mid x_{i,j}) \log p(y \mid x_{i,j}, \theta),</tex-math>
      </disp-formula>
      <p>where N = Σ<sub>i</sub> n<sub>i</sub> is the total number of tiles, which delivers the LLP optimization objective:</p>
      <disp-formula id="eq3">
        <label>(3)</label>
        <tex-math>\min_{(q, p)} \mathcal{L}(q, p) \quad \text{s.t.} \quad \forall y:\; q(y \mid x_{i,j}) \in [0, 1],</tex-math>
      </disp-formula>
      <disp-formula id="eq4">
        <label>(4)</label>
        <tex-math>\sum_{j=1}^{n_i} q(y \mid x_{i,j}) = n_i\, w_y,</tex-math>
      </disp-formula>
      <p>where the global proportion constraint ensures that each label y contains n<sub>i</sub>w<sub>y</sub> samples overall. This problem is an instance of the regularized optimal transport problem and is solved using the Sinkhorn-Knopp algorithm [1, 17, 11]. Here, P<sub>y,j</sub> = p(y | x<sub>i,j</sub>, θ)/n<sub>i</sub> is the probability matrix estimated by the network and Q<sub>y,j</sub> = q(y | x<sub>i,j</sub>)/n<sub>i</sub> is the matrix of assigned probabilities for bag ℬ<sub>i</sub>. In the LLP-Co approach, Q splits the samples within the bag following the global label proportions. The objective function of the OT solver is then defined as</p>
      <disp-formula id="eq5">
        <label>(5)</label>
        <tex-math>\min_{Q \in U(w, a)} \langle Q, -\log P \rangle + h(Q),</tex-math>
      </disp-formula>
      <p>where h(Q) is the entropic regularization term, U(w, a) is the space of possible solutions for the i-th bag (the nonnegative C × n<sub>i</sub> matrices whose rows sum to w and whose columns sum to a), and a = (1/n<sub>i</sub>)<bold>1</bold> is a normalizing constraint [18].</p>
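      <p>The OT step in Eq. (5) can be sketched with a log-domain Sinkhorn-Knopp iteration in NumPy. This is our own illustration, not the LLP-Co reference code: the kernel exponent <italic>lam</italic> (the inverse weight of the entropic term), the iteration count, the temperature, and all function names are assumptions; the hyper-parameters actually used follow [11]. The row marginal is the global proportion vector w and the column marginal is a = (1/n)<bold>1</bold>, so rescaling the solution by n gives soft codes q(y | x<sub>j</sub>) whose class totals equal n times w, the per-bag proportion constraint of Eqs. (3) and (4). The second function shows, again as a hedged sketch, how such codes could feed the SwAV-style swapped prediction that LLP-Co [11] relies on: the code of each view is the target for the prediction made from the other view.</p>
      <preformat preformat-type="code">
```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis (keeps dims)."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def sinkhorn_proportions(log_p, w, lam=1.0, n_iter=500):
    """Entropy-regularised OT for one bag (a sketch of Eq. 5).

    log_p : (C, n) log-probabilities log p(y | x_j) for the n tiles of the bag
    w     : (C,) global class proportions summing to 1 (row marginal)
    lam   : inverse weight of the entropic term (hypothetical default)
    Returns Q (C, n) whose rows sum to w and whose columns sum to 1/n.
    """
    C, n = log_p.shape
    log_K = lam * log_p                      # log of the Gibbs kernel P**lam
    log_w = np.log(w)[:, None]
    log_a = np.full((1, n), -np.log(n))      # column marginal a = (1/n) * ones
    f = np.zeros((C, 1))                     # log row scalings
    g = np.zeros((1, n))                     # log column scalings
    for _ in range(n_iter):
        f = log_w - logsumexp(log_K + g, axis=1)   # match row sums to w
        g = log_a - logsumexp(log_K + f, axis=0)   # match column sums to a
    return np.exp(f + log_K + g)


def swapped_loss(z1, z2, V, w, temperature=0.1):
    """SwAV-style swapped prediction with proportion-constrained codes.

    z1, z2 : (n, D) L2-normalised features of two views of the same n tiles
    V      : (D, C) prototype matrix; w : (C,) global class proportions
    """
    n = z1.shape[0]
    s1, s2 = z1 @ V / temperature, z2 @ V / temperature   # (n, C) prototype scores
    log_p1 = s1 - logsumexp(s1, axis=1)                   # log-softmax per tile
    log_p2 = s2 - logsumexp(s2, axis=1)
    # codes q(y | x_j) = n * Q, transposed back to (n, C); a training framework
    # would detach them from the gradient and use them as fixed targets
    c1 = n * sinkhorn_proportions(log_p1.T, w).T
    c2 = n * sinkhorn_proportions(log_p2.T, w).T
    # cross-entropy of each view's prediction against the other view's code
    loss_12 = -np.mean(np.sum(c2 * log_p1, axis=1))
    loss_21 = -np.mean(np.sum(c1 * log_p2, axis=1))
    return loss_12 + loss_21
```
      </preformat>
      <p>Working in log space keeps the iteration stable when some class probabilities are tiny; the same result could be obtained with the multiplicative form of Sinkhorn-Knopp for well-conditioned inputs.</p>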
      <sec id="sec-1-1">
        <title>2.2. Learning from Global Label Proportions with Prototypical Contrastive Clustering</title>
        <p>LLP-Co [11] is a self-supervised contrastive method that performs online clustering by means of a convolutional neural network that delivers consistent cluster assignments between augmentations of the same input. At the same time, the cluster assignments must follow cluster size constraints that are provided as weak information. Given a user-defined number of views of the same input image tile, the algorithm employs the OT solver in Eq. (5) to compute soft targets, or codes. These targets are then treated as true labels to compute the cross-entropy with the network's predictions for the other views. For two augmented views and C classes, the pipeline is as follows. First, each image tile x<sub>j</sub> within a bag is transformed into two augmented versions, which are fed to an encoder network that extracts the feature vectors z<sub>j,1</sub> and z<sub>j,2</sub>. These features are then mapped to C trainable prototypes V to compute the code assignments c<sub>j,1</sub> and c<sub>j,2</sub> of each view using the OT solver. A "swapped" contrastive loss is then applied, which predicts the assignment of one view from the code of the other. The optimization is conducted by minimizing the following loss for all samples j within bag ℬ<sub>i</sub>:</p>
        <disp-formula id="eq6">
          <label>(6)</label>
          <tex-math>\mathcal{L}(z_{j,1}, z_{j,2}) = \ell(z_{j,1}, c_{j,2}) + \ell(z_{j,2}, c_{j,1}),</tex-math>
        </disp-formula>
        <p>where each term is the cross-entropy loss between the code and the probability obtained after applying a softmax function to the dot product between the features Z and the prototypes V. For more information about the LLP-Co method, see [11].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
      <sec id="sec-2-1">
        <title>3.1. Campo Verde dataset (CV)</title>
        <p>The first study site is in Campo Verde, an agricultural region located in Mato Grosso, Brazil, at a latitude of 15°32′48″ south and a longitude of 55°10′08″ west (Fig. 1). Campo Verde (CV) [19] is a public dataset, available from IEEE Dataport at https://ieee-dataport.org/documents/campo-verde-database, that provides pre-processed SAR and optical images acquired between October 2015 and July 2016. The major crops found in the region are soybean, maize and cotton. Other crop and non-crop categories are beans, sorghum, Non-Commercial Crops (NCC), pasture, eucalyptus, turfgrass, cerrado and soil. This work focuses on the second seeding period for the major crops maize and cotton, covering the months from March to July. The reference data consisted of 608 parcels. Table 1 gives the percentages of the overall area planted with the major crops according to the annotated parcels; we use this information as the global vector of class proportions for our experiments.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Corta Atalaya dataset (CA)</title>
        <sec id="sec-2-2-1">
          <p>The second study area is located at Rio Tinto, Spain, 70 km north of Huelva in the Iberian Pyrite Belt (IPB), a belt extending from southern Portugal into southern Spain (Fig. 2). Our data were collected at Corta Atalaya (CA), an open-pit mine with a size of 1200 × 900 m and a depth of ca. 350 m. This pit exposes basaltic to intermediate volcanic rocks along its northern part, and overlying felsic volcanic rocks, slate, and conglomerate, which are exposed in the western part of the mine. We tested our approach using ground-based hyperspectral imagery collected with a tripod-mounted Specim AisaFENIX sensor, which covers the visible and near-infrared (VNIR) and short-wave infrared (SWIR) ranges. A labeled reference image was created based on field mapping, fifty-seven hand samples, and a supervised classification followed by manual interpretation of the hyperspectral data [20]. The lithologies interpreted at CA are as follows: oxidised, massive sulphide, two varieties of chlorite, two sericitic units, shale and purple shale. In this study, we grouped the lithologies into two major categories, chlorite schist and mineralised volcanics; in addition, weathered material and vegetation were grouped into a category named others. Table 1 gives the percentages of the overall area covered by these two major lithologies according to the labeled reference image; we use this information as the global vector of class proportions for our experiments.</p>
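          <p>Deriving a global proportion vector w from a labeled reference image amounts to counting the pixels of each grouped class and normalizing. A minimal NumPy sketch follows; the integer codes, the grouping lists, and the value marking unlabeled pixels are hypothetical, since neither the encoding of the reference image nor the exact assignment of each lithology to a category is given here.</p>
          <preformat preformat-type="code">
```python
import numpy as np


def global_proportions(label_map, groups, ignore=None):
    """Share of labelled pixels falling in each class group, usable as w.

    label_map : (H, W) integer map of fine-grained units (hypothetical codes)
    groups    : groups[c] lists the fine-grained codes merged into class c
    ignore    : optional code of unlabelled pixels, excluded from the count
    Pixels whose code belongs to no group are not counted either.
    """
    labels = label_map.ravel() if ignore is None else label_map[label_map != ignore]
    counts = np.array([np.isin(labels, g).sum() for g in groups], dtype=float)
    return counts / counts.sum()


# Toy 4 x 4 reference map: codes 0-3 are fine-grained units, -1 marks unlabelled pixels
toy = np.array([[0, 0, 1, 1],
                [2, 2, 2, 2],
                [3, 3, -1, -1],
                [0, 1, 2, 3]])
w = global_proportions(toy, groups=[[0, 1], [2], [3]], ignore=-1)
print(w)  # 6/14, 5/14 and 3/14, i.e. approx. 0.429, 0.357, 0.214
```
          </preformat>
          <p>Grouping before counting mirrors the merging of fine-grained units into a few major categories; the resulting vector sums to one by construction.</p>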
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Experimental Protocol</title>
        <sec id="sec-3-1-1">
          <p>Our experiments focused on the major categories found in both datasets. To assess the methodology's robustness to different data sources, we employed optical data for the CV dataset and hyperspectral data for the CA dataset. For the CV dataset, we considered the cloud-free optical image available for May 2016. For the CA dataset, we stacked the VNIR and SWIR data into a single data cube. We evaluated the LLP-Co method in a scenario that uses global class proportions to identify the major categories in the target regions. Unlike traditional LLP training schemes, which calculate the class proportions for each …</p>
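          <p>Stacking the VNIR and SWIR data into one cube can be sketched as a concatenation along the band axis. This NumPy version assumes that both cubes are already co-registered on the same spatial grid; the per-band standardization and the function name are our own assumptions, not steps stated in the text.</p>
          <preformat preformat-type="code">
```python
import numpy as np


def stack_vnir_swir(vnir, swir, standardise=True):
    """Stack co-registered VNIR and SWIR cubes of shape (H, W, bands) into one cube.

    Assumes both cubes already share the same spatial grid; resampling and
    co-registration are outside this sketch.
    """
    if vnir.shape[:2] != swir.shape[:2]:
        raise ValueError("VNIR and SWIR cubes must have the same height and width")
    cube = np.concatenate([vnir, swir], axis=-1).astype(np.float32)
    if standardise:
        # per-band zero mean and unit variance (our assumption, not stated in the text)
        mean = cube.mean(axis=(0, 1), keepdims=True)
        std = cube.std(axis=(0, 1), keepdims=True)
        cube = (cube - mean) / (std + 1e-8)
    return cube


rng = np.random.default_rng(0)
cube = stack_vnir_swir(rng.random((64, 48, 50)), rng.random((64, 48, 30)))
print(cube.shape)  # (64, 48, 80)
```
          </preformat>
          <p>The band counts in the usage line are toy values; the spectral sampling of the real sensor is not reproduced here.</p>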
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Implementation Details</title>
        <sec id="sec-3-2-1">
          <p>Considering the different data sources, we employed a modified ResNet18 and a modified ResNet10 as the backbone architectures for the CV and CA datasets, respectively. To process the hyperspectral data cube in both the spatial and spectral domains, we also added two 3D convolutional layers at the beginning of the ResNet10 network for the CA dataset. The ResNet architecture is followed by a projection head that projects the features to a 1024-dimensional space. We trained the models for 100 epochs using stochastic gradient descent with cosine learning rate decay [21]. The image tile size was set to 21 × 21 pixels for both datasets. For each dataset, we randomly selected 200,000 image tiles on the fly to create the random bags. The augmentations used to obtain the two views include random rotations, mirroring, and random resizing. For the OT solver, we set the hyper-parameters as in [11]. The number of clusters for both models was set to the number of categories found in the respective dataset. We quantitatively assessed the method using three metrics: cluster accuracy, macro-averaged F1-score (F1-score), and normalized mutual information (NMI). Since we use the class proportion information, we report the classification metrics considering the cluster assigned by the network at inference time. We also report the confusion matrices.</p>
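          <p>The on-the-fly bag construction and the two-view augmentation described above can be sketched as follows. This NumPy illustration rests on our own assumptions: rotations are restricted to multiples of 90°, random resizing is left out, and the bag size, scene size and function names are hypothetical.</p>
          <preformat preformat-type="code">
```python
import numpy as np

TILE = 21  # tile size used for both datasets (21 x 21 pixels)


def sample_bag(image, n_tiles, rng, tile=TILE):
    """Randomly crop n_tiles tiles of shape (tile, tile, bands) from image (H, W, bands)."""
    h, w = image.shape[:2]
    rows = rng.integers(0, h - tile + 1, size=n_tiles)
    cols = rng.integers(0, w - tile + 1, size=n_tiles)
    return np.stack([image[r:r + tile, c:c + tile] for r, c in zip(rows, cols)])


def augment(tile, rng):
    """Random multiple-of-90-degree rotation plus random mirroring.

    Random resizing, also listed in the text, is omitted from this sketch.
    """
    out = np.rot90(tile, k=int(rng.integers(4)), axes=(0, 1))
    if rng.random() > 0.5:
        out = out[:, ::-1]  # mirror along the width axis
    return out


def two_views(bag, rng):
    """Two independently augmented views of every tile in a bag."""
    return (np.stack([augment(t, rng) for t in bag]),
            np.stack([augment(t, rng) for t in bag]))


rng = np.random.default_rng(0)
image = rng.random((500, 400, 4))              # stand-in for a large EO scene
bag = sample_bag(image, n_tiles=64, rng=rng)   # hypothetical bag size
view1, view2 = two_views(bag, rng)
print(bag.shape, view1.shape)  # (64, 21, 21, 4) (64, 21, 21, 4)
```
          </preformat>
          <p>Rotations and mirroring only rearrange pixels, so each view keeps the spectral content of its tile, which is what the swapped prediction asks the network to be invariant to.</p>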
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Baseline method</title>
        <p>We adopted the original SwAV method with the equipartition constraint as the baseline method. This constraint ensures that samples are equally partitioned among the clusters, and for good performance the SwAV authors recommend a number of clusters at least three times higher than the expected number of categories. In preliminary experiments, we found that 30 clusters delivered a good performance for the CV dataset, while 10 clusters delivered an acceptable performance for the CA dataset. The backbone network for SwAV is the same as the LLP-Co backbone network for each dataset. To evaluate the model, we applied k-means clustering to the features z generated by the backbone network.</p>
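        <p>The baseline evaluation clusters the backbone features z with k-means. A plain NumPy version of Lloyd's algorithm is sketched below for illustration; in practice a library implementation would normally be used, and the number of clusters, the random initialization and the stopping rule shown here are our assumptions rather than the settings used in the experiments.</p>
        <preformat preformat-type="code">
```python
import numpy as np


def kmeans(z, k, n_iter=100, seed=0):
    """Lloyd's k-means on feature vectors z of shape (N, D); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # initialise centroids with k distinct random samples
    centroids = z[rng.choice(len(z), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared Euclidean distance from every sample to every centroid, shape (N, k)
        d = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its members (keep it if the cluster is empty)
        new = np.array([z[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
                        for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids


rng = np.random.default_rng(1)
z = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, centroids = kmeans(z, k=2)
print(labels[:3], labels[-3:])  # the two blobs end up in different clusters
```
        </preformat>
        <p>The pairwise distance array is N × k × D, which is fine for a sketch but should be computed in chunks for hundreds of thousands of tiles.</p>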
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>Table 2 shows the performance for both datasets in terms of cluster accuracy, F1-score, and NMI. The model achieved competitive results, with accuracies of 94.1% and 91.6% for the CV and CA datasets, respectively. A similar performance was observed in terms of F1-score for the CV dataset, with 93.8%. In contrast, a lower F1-score of 76.9% was observed for the CA dataset, mainly due to the class others. The cluster quality metric NMI reached values of 0.76 and 0.66 for CV and CA, respectively. Considering these metrics, the CV dataset yielded better results than the CA dataset. This may be due to the different types of application and data, since geological mapping from hyperspectral data is a more challenging task owing to significant confounding data variance and often subtle distinctions between the features of interest.</p>
      <p>Comparing LLP-Co with the baseline model, we observe that, as expected, including priors in the training process was crucial for good classification performance. LLP-Co outperformed SwAV by ∼20% and ∼30% in terms of accuracy for the CV and CA datasets, respectively. A similar improvement was observed for the F1-score, with gains of ∼27% and ∼30% for the CV and CA datasets, respectively.</p>
      <p>Table 3 presents the confusion matrices. As expected, the per-class accuracy was high for the major categories, with values above 91% for both datasets. However, in the CA dataset, 48% of the class others was misclassified as chlorite schist, demonstrating the challenge of this task. Another possible explanation for this drop in performance is the class distribution. A more balanced vector of class proportions that still differs significantly among the classes (as in the CV dataset, with w = (45.3, 35.8, 18.9)) delivers much better performance, allowing the model to learn a more discriminative and relevant set of features. In contrast, with a highly unbalanced vector of proportions, the model tends to favor the majority classes, as observed for the CA dataset.</p>
      <p>Finally, Fig. 3 presents the classification maps for each dataset. Here, we can observe classification errors between the class maize and the other two classes for the CV dataset, and between the class mineralised volcanics and the class others for the CA dataset. In addition, it is worth pointing out the quality of the predictions for both datasets, where no salt-and-pepper effect was observed.</p>
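      <p>The three metrics can be computed from a confusion matrix. The NumPy sketch below rests on our assumptions: cluster accuracy is taken as the accuracy under the best one-to-one mapping between clusters and classes (found here by brute force; the Hungarian method [22] solves the same assignment problem efficiently), macro F1 takes cluster index c as class c, as for the LLP-Co outputs, and NMI uses the arithmetic-mean normalization, which may differ from the normalization behind Table 2. The square confusion matrix assumes as many clusters as classes, which holds for LLP-Co but not for the SwAV baseline with 30 or 10 clusters.</p>
      <preformat preformat-type="code">
```python
import itertools

import numpy as np


def confusion(y_true, y_pred, n):
    """n x n confusion matrix: rows are true classes, columns predicted clusters."""
    m = np.zeros((n, n), dtype=np.int64)
    np.add.at(m, (y_true, y_pred), 1)
    return m


def cluster_accuracy(y_true, y_pred, n):
    """Accuracy under the best one-to-one cluster-to-class mapping (brute force over n! mappings)."""
    m = confusion(y_true, y_pred, n)
    best = max(sum(m[perm[c], c] for c in range(n))
               for perm in itertools.permutations(range(n)))
    return best / len(y_true)


def macro_f1(y_true, y_pred, n):
    """Unweighted mean of per-class F1, taking cluster index c as class c."""
    m = confusion(y_true, y_pred, n).astype(float)
    tp = np.diag(m)
    pred_tot, true_tot = m.sum(axis=0), m.sum(axis=1)
    precision = np.divide(tp, pred_tot, out=np.zeros(n), where=pred_tot > 0)
    recall = np.divide(tp, true_tot, out=np.zeros(n), where=true_tot > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros(n), where=denom > 0)
    return f1.mean()


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()


def nmi(y_true, y_pred, n):
    """Mutual information normalised by the arithmetic mean of the two entropies."""
    joint = confusion(y_true, y_pred, n) / len(y_true)
    pt, pp = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pt, pp)[nz])).sum()
    return mi / (0.5 * (entropy(pt) + entropy(pp)))


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])  # a perfect clustering with permuted cluster ids
print(cluster_accuracy(y_true, y_pred, 3), nmi(y_true, y_pred, 3))  # both 1.0 up to rounding
```
      </preformat>
      <p>The toy example shows why cluster accuracy and NMI are invariant to cluster ids, whereas macro F1 computed on raw ids is not (here it would be 1/3).</p>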
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>This work evaluates a recently proposed weakly supervised method that combines contrastive learning with class proportion constraints to train a classifier without pixel-level labels in the context of Earth Observation (EO) applications. The approach achieved reasonable accuracy across different tasks and data sources, demonstrating its robustness and applicability to large-scale EO data. Overall accuracies above 90% were obtained for both the crop and the geological mapping applications, considering the major categories found in the target regions. However, the approach failed to identify classes with very small proportions. Remedies for this imbalance, such as a weighted cross-entropy or a focal loss, could also be incorporated into the method. The success of the methodology opens a new path in the use of weak information to help alleviate the burden of manual annotation in EO.</p>
      <sec id="sec-5-1">
        <title>References</title>
        <p>[18] …, Differentiable deep clustering with cluster size constraints, arXiv preprint arXiv:1910.09036 (2019).
[19] I. D. Sanches, R. Q. Feitosa, P. M. A. Diaz, M. D. Soares, A. J. B. Luiz, B. Schultz, L. E. P. Maurano, Campo Verde database: Seeking to improve agricultural remote sensing of tropical areas, IEEE Geoscience and Remote Sensing Letters 15 (2018) 369–373.
[20] S. T. Thiele, S. Lorenz, M. Kirsch, I. C. C. Acosta, L. Tusa, E. Herrmann, R. Möckel, R. Gloaguen, Multi-scale, multi-sensor data integration for automated 3-D geological mapping, Ore Geology Reviews 136 (2021) 104252.
[21] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, arXiv preprint arXiv:1608.03983 (2016).
[22] H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1955) 83–97.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>