Weak-Supervision Based on Label Proportions for Earth
Observation Applications from Optical and Hyperspectral
Imagery
Laura E. Cué La Rosa¹,², Dário A. Borges Oliveira³,⁴, Sam Thiele², Pedram Ghamisi²,⁵ and Richard Gloaguen²

¹ Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Brazil
² Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology, Freiberg, Germany
³ Data Science in Earth Observation, Technical University of Munich (TUM), Munich, Germany
⁴ School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil
⁵ Institute of Advanced Research in Artificial Intelligence (IARAI), 1030 Vienna, Austria


Abstract
In this paper, we assess a weakly supervised approach that employs weak constraints in the form of class proportions to train a neural network capable of performing pixel-wise classification for Earth Observation (EO) applications. The approach combines self-supervised contrastive clustering and a constraint on cluster proportions in an online fashion, allowing its application to large-scale EO images. The methodology is based on the generation of simple augmented views of input image tiles and on a loss function that performs contrastive learning to achieve results that are invariant to these augmentations while simultaneously following the cluster proportions constraint. In many EO applications, information about class proportions is available through expert knowledge or, e.g., a governmental census. This weak information allows training a classifier without class information at the pixel level, alleviating the burden of manual annotation. In this context, crop and geological mapping from EO data are two crucial applications in the search for sustainable ways of resource management. We tested the approach on optical and hyperspectral data, achieving promising results that indicate the method's applicability across different applications and data sources.

Keywords
Weak-supervision, Learning from proportions, Multi-source, Crop mapping, Geological mapping.


CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
Contact: lauracuerosa@gmail.com (L. E. C. L. Rosa); darioaugusto@gmail.com (Dário A. B. Oliveira); sam.thiele01@gmail.com (S. Thiele); p.ghamisi@gmail.com (P. Ghamisi); r.gloaguen@hzdr.de (R. Gloaguen)
ORCID: 0000-0002-6284-9494 (L. E. C. L. Rosa); 0000-0002-0674-5332 (Dário A. B. Oliveira); 0000-0003-4169-0207 (S. Thiele); 0000-0003-1203-741X (P. Ghamisi); 0000-0002-4383-473X (R. Gloaguen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Self-supervised learning [1, 2, 3, 4] has recently emerged as a powerful tool in computer vision applications. Among the existing self-supervised methods, contrastive learning can be considered the most promising one. This type of approach generates augmented versions of the input image and uses a twin network for feature extraction which, combined with a loss function, performs contrastive learning to achieve consistent results between these augmentations. The contrastive loss function is expected to increase the similarity among the augmentations of the same image while decreasing the similarity between augmentations of different images. The main characteristic of these methods is the capability of learning meaningful feature representations in an unsupervised fashion. This capability has opened new avenues in other research fields beyond computer vision, such as Earth Observation (EO) applications. In this context, crop and geological mapping from EO data are two crucial applications for agricultural monitoring and modern mining, where training information is frequently limited or non-existent.

Considering EO applications, self-supervised methods have been employed with success in tasks including image classification, object detection and semantic segmentation [5, 6, 7, 8, 9]. Some of these works employ geolocation and spatio-temporal information to learn a more discriminative set of features for remote sensing applications [5, 10]. Hyperspectral image classification and clustering using contrastive learning have also been the focus of recent publications [9, 8]. However, all the approaches mentioned above need positive and negative sample pairs to compute the contrastive loss, which is computationally intensive.

One of the most important contrastive-learning methods is the Swapping Assignments between Multiple Views (SwAV) [2], which performs self-supervised learning and clustering in an online fashion. The method employs an
optimal transport (OT) solver to assign the image feature vectors to cluster centroids by means of an equipartition constraint that ensures that all samples within a batch of images are equally assigned to the predefined number of clusters.

An advantage of the SwAV method over the previously proposed contrastive learning frameworks is that the use of the OT solver with the equipartition constraint allows disregarding pairwise comparisons. Recently, weak information in the form of class proportions was introduced as a constraint in SwAV to train a classifier in a weakly supervised fashion. The method, called Learning from Label Proportions with Prototypical Contrastive Clustering (LLP-Co) [11], replaces the equipartition constraint in the OT solver with a cluster proportions constraint.

Using information about class proportions to train a classifier has gained more attention in recent years [12, 13, 14, 15]. Given a set of images, the Learning from Label Proportions (LLP) approach focuses on learning an instance-level classifier using as reference signal only the class proportions observed in this set. In EO applications, with a large amount of available data and the unavailability of pixel-level annotations, the use of priors like class proportions is an attractive solution. In many real-life scenarios, these proportions can be obtained from a governmental census or even expert knowledge. Examples of governmental agencies that record statistics about agriculture, forestry, and natural resources, among others, are the National Agricultural Statistics Service of the United States Department of Agriculture¹, the Brazilian Institute of Geography and Statistics (IBGE) in Brazil², Forest Research in the United Kingdom³, and the European Statistics website⁴.

This paper focuses on assessing the viability of using contrastive learning combined with LLP to train a pixel-wise classifier based only on prior information about global class proportions for EO applications. We tested the LLP-Co methodology on two datasets: the first focuses on crop type mapping using optical data and the second on geological mapping using hyperspectral data. This allows assessing the model's applicability across different applications and data sources. Hence, the main contribution of this study is to propose a weakly supervised deep clustering method that employs label proportions as priors and can be easily applied to large-scale EO data from different sources for significantly different applications.

¹ https://www.nass.usda.gov/
² https://www.ibge.gov.br/
³ https://www.forestresearch.gov.uk/tools-and-resources/statistics/forestry-statistics/
⁴ https://ec.europa.eu/eurostat

2. METHOD

2.1. LLP and Optimal Transport

In this work, we assess the LLP-Co approach in a scenario where only the global class proportions are available to train the network. To implement LLP, the training samples are split into $S$ disjoint bags of image tiles, where $\mathcal{B}_i$ is the $i$th bag, which consists of a set of $s_i$ randomly cropped image tiles from the large-scale input EO image. Here, $\mathcal{B}_i = \{\mathbf{x}_{i,j}\}_{j=1}^{s_i}$, where $\mathbf{x}_{i,j}$ is the image tile $j$ within bag $i$. The final training set is then expressed as $\mathcal{T} = \{(\mathcal{B}_i, \mathbf{w})\}_{i=1}^{S}$, where $\mathbf{w}$ is a vector of global label proportions, which is the same for all bags $\mathcal{B}_i$. In a multi-class problem with $K$ classes, $\mathbf{w} \in \Delta^K$ s.t. $\sum_{k=1}^{K} w^k = 1$, where the element $w^k$ is the proportion of tiles that belong to class $k$. In the methodology, a neural network acts as the feature extractor, followed by a layer that delivers the class probabilities vector $\tilde{\mathbf{p}}_{i,j} = p_\theta(y \mid \mathbf{x}_{i,j})$, where $\theta$ represents the network parameters [16]. Then, the estimated global label proportions for each bag are expressed as:

\[ \hat{\mathbf{w}}_i = \frac{1}{s_i} \sum_{j=1}^{s_i} \tilde{\mathbf{p}}_{i,j}, \]

and to train the network a standard cross-entropy loss function can be used

\[ L(\hat{\mathbf{w}}, \mathbf{w}) = -\frac{1}{S} \sum_{i=1}^{S} \mathbf{w} \log \hat{\mathbf{w}}_i. \tag{1} \]

The above equation is reformulated by encoding the label proportions as a posterior distribution [1, 17, 11]

\[ L(p, q) = -\frac{1}{S} \sum_{i=1}^{S} \sum_{j=1}^{s_i} \sum_{k=1}^{K} \frac{q(y^k \mid \mathbf{x}_{i,j})}{s_i} \log p_\theta(y^k \mid \mathbf{x}_{i,j}) \tag{2} \]

delivering the LLP optimization objective as:

\[ \min_{(p,q)} L(p, q), \quad \text{s.t.} \quad \forall y : q(y^k \mid \cdot) \in [0, 1] \tag{3} \]

\[ \sum_{j=1}^{s_i} q(y^k \mid \mathbf{x}_{i,j}) = w^k s_i, \tag{4} \]

where the global proportion constraint ensures that each label $k$ contains overall $w^k s_i$ samples. This equation is an instance of the regularized optimal transport problem and is solved using the Sinkhorn-Knopp algorithm [1, 17, 11]. Here $\mathbf{P}^{y}_{i,j} = p_\theta(y \mid \mathbf{x}_{i,j}) \frac{1}{n_i}$ is the probabilities matrix estimated by the network and $\mathbf{Q}^{y}_{i,j} = q(y \mid \mathbf{x}_{i,j}) \frac{1}{n_i}$ is the matrix of assigned probabilities for bag $\mathcal{B}_i$. In the LLP-Co approach, $\mathbf{Q}_i$ splits the samples within the bag following the global label proportions. Then the objective function as an OT solver is defined as

\[ \min_{\mathbf{Q}_i \in U(\mathbf{w}, \mathbf{a}_i)} \langle \mathbf{Q}_i, -\log \mathbf{P}_i \rangle + \varepsilon h(\mathbf{Q}_i), \tag{5} \]
where $U(\mathbf{w}, \mathbf{a}_i)$ is the matrix space of possible solutions for the $i$-th bag, and $\mathbf{a}_i = (1/n_i)\mathbf{1}_{n_i}$ is a normalizing constraint [18].

2.2. Learning from Global Label Proportions with Prototypical Contrastive Clustering

LLP-Co [11] is a self-supervised contrastive method that performs online clustering by means of a convolutional neural network that delivers consistent cluster assignments between augmentations of the same input. At the same time, the cluster assignment must follow certain cluster size constraints that are provided as weak information. Given a user-defined number of views of the same input image tile, the algorithm employs the OT solver in Eq. 5 to compute soft targets or codes. These targets are then treated as true labels to calculate the cross-entropy with respect to the network's prediction for the other views. The methodology pipeline for two augmented views and $K$ classes is the following. First, each image tile $j$ within a bag is transformed into two augmented versions, which are fed to an encoder network that extracts the feature vectors $\mathbf{z}^{t_1}_{i,j}$, $\mathbf{z}^{t_2}_{i,j}$. These features are then mapped to one of $K$ trainable prototypes $\mathbf{V}$ to perform the code assignments for each view, $\mathbf{c}^{t_1}_{i,j}$ and $\mathbf{c}^{t_2}_{i,j}$, using the OT solver. From then on, a "swapped" contrastive loss is applied to predict the assignment of one feature from the code of the other. The optimization process is then conducted by minimizing the loss for all samples $j$ within bag $i$:

\[ L_{swap}(\mathbf{z}^{t_1}_{i,j}, \mathbf{z}^{t_2}_{i,j}) = \ell(\mathbf{z}^{t_1}_{i,j}, \mathbf{c}^{t_2}_{i,j}) + \ell(\mathbf{z}^{t_2}_{i,j}, \mathbf{c}^{t_1}_{i,j}), \tag{6} \]

where each term is the cross-entropy loss between the code and the probability obtained after applying a softmax function on the dot product between the features $\mathbf{Z}_i$ and the prototypes $\mathbf{V}$. For more information about the LLP-Co method, see [11].

3. Datasets

3.1. Campo Verde dataset (CV)

The first study site is in Campo Verde, an agricultural region located in Mato Grosso, Brazil, at a latitude of 15°32′48″ south and a longitude of 55°10′08″ west (Fig. 1). Campo Verde (CV) [19] is a public dataset⁵ that provides pre-processed SAR and optical images between October 2015 and July 2016. The major crops found in the region are soybean, maize and cotton. Other crop and non-crop categories are beans, sorghum, Non-Commercial Crops (NCC), pasture, eucalyptus, turfgrass, cerrado and soil. This work focuses on the second seeding period for the major crops maize and cotton, covering the months from March to July. The reference data consisted of 608 parcels. Table 1 gives the percentages of the overall area planted with major crops according to the annotated parcels; we use this information as the global vector of class proportions for our experiments.

⁵ The CV database is available from IEEE Dataport at https://ieee-dataport.org/documents/campo-verde-database.

Figure 1: Overview map of Brazil, Mato Grosso state, and the Campo Verde region where the images were acquired.

3.2. Corta Atalaya dataset (CA)

The second study area is located at Rio Tinto, Spain. Rio Tinto is located 70 km north of Huelva in the Iberian Pyrite Belt (IPB), a belt extending from southern Portugal into southern Spain (Fig. 2). Our data was collected from Corta Atalaya (CA), an open-pit mine with a size of 1200 × 900 m and a depth of ca. 350 m. This pit exposes basaltic to intermediate volcanic rocks along the northern part of the pit, and overlying felsic volcanic rocks, slate, and conglomerate, which are exposed in the western part of the mine. We tested our approach using ground-based hyperspectral imagery collected using a tripod-mounted Specim AisaFENIX sensor, which covers the visible-near and short-wave infrared range. A labeled reference image was created based on field mapping, fifty-seven hand samples, and combined supervised classification followed by manual interpretation of the hyperspectral data [20]. The lithologies interpreted at CA are as follows: oxidised, massive sulphide, two varieties of chlorite, two sericitic units, shale and purple shale. In this study, we grouped the lithologies into two major categories, chlorite schist and mineralised volcanics. In addition, weathered material and vegetation were grouped in a category named others. Table 1 gives the percentages of the overall area with these two major lithologies according to the labeled reference image; we use this information as the global vector of class proportions for our experiments. For more
information about the dataset, we refer the reader to [20].

Figure 2: Overview map of the Iberian Pyrite Belt (a) with locations of the main volcano-sedimentary units (green). The geology of the Corta Atalaya and Cerro Colorado open pits is also shown (b). Maps taken with permission from [20].

Table 1
Global class proportions (%) for each dataset according to the reference data. Cs stands for chlorite schist and Mv stands for mineralised volcanics.

              CV                           CA
    Cotton   Maize   Others       Cs      Mv     Others
     45.3    35.8     18.9       38.7    57.7     3.6

4. Experiments

4.1. Experimental Protocol

Our experiments focused on the major categories found in both datasets. To assess the methodology's robustness to different data sources, we employed optical data for the CV dataset and hyperspectral data for the CA dataset. For the CV dataset, we considered the cloud-free optical image available for May 2016. For the CA dataset, we stacked VNIR and SWIR data into a single data cube. We evaluated the LLP-Co method under a scenario that uses global class proportions to identify the major categories in the target regions. Unlike the traditional LLP training schemes, which calculate the class proportion for each bag of samples independently in a supervised way, our proposal uses only weak information.

In our experiments, we used as prior information the global proportions reported in Table 1. Given the bag size $s_i$, we defined the training bag $\mathcal{B}_i$ by randomly cropping $s_i$ image tiles from the large-scale images. The tiles were cropped from the annotated area and we used the class of the central pixel of the tile. As the bag size increases, the class proportions within the bag converge to the global class proportions found in the dataset; hence we adopted a large bag size of $n_i = 2048$ for both datasets.

4.2. Implementation Details

Considering the different data sources, we employed a modified ResNet18 and ResNet10 as the backbone architecture for the CV and CA datasets, respectively. To process the hyperspectral data cube in both spatial and spectral domains, we also added two 3D convolutional layers at the beginning of the ResNet10 network for the CA dataset. The ResNet architecture is then followed by a projection head that projects the features to a 1024-dimensional space. We trained the models for 100 epochs using stochastic gradient descent with cosine learning rate decay [21]. The image tile size was set to 21 × 21 for both datasets. For each dataset, we randomly selected 200,000 image tiles on the fly to create the random bags. The list of augmentations includes random rotations, mirroring, and random resizing to obtain two views. For the OT solver, we set the hyper-parameters as in [11]. The number of clusters for both models was set to the number of categories found in the datasets. We quantitatively assessed the method using three metrics: cluster accuracy (Acc), macro average F1-score (F1-score), and normalized mutual information (NMI). Since we use the class proportion information, we reported the classification metrics by considering the cluster assigned by the network at inference time. We also report the confusion matrices.

4.3. Baseline method

We adopted the original SwAV method with the equipartition constraint as the baseline method. This constraint ensures that samples are equally partitioned among the clusters, and for good performance the authors recommend a number of clusters at least three times higher than the expected number of categories. In preliminary experiments we found that 30 clusters delivered a good performance for the CV dataset, while 10 clusters delivered an acceptable performance for the CA dataset. The backbone network for SwAV is the same as the LLP-Co backbone network for each dataset. To evaluate the model we used the feature $\mathbf{z}$ generated by the backbone network followed by a $k$-means clustering.
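The difference between the baseline's equipartition constraint and the LLP-Co proportion constraint (Eq. 5) can be illustrated with a small NumPy sketch of Sinkhorn-Knopp scaling. This is only a sketch under stated assumptions, not the authors' implementation: the function name `sinkhorn_codes`, the values of `epsilon` and `n_iters`, the kernel form exp(scores / epsilon), and the synthetic scores are all hypothetical. The only thing that changes between the two settings is the target cluster marginal handed to the solver.

```python
import numpy as np


def sinkhorn_codes(scores, cluster_marginal, epsilon=0.05, n_iters=50):
    """Soft cluster codes for one bag (hypothetical helper, not from the paper).

    scores: (n_samples, K) array, e.g. log-probabilities or prototype similarities.
    cluster_marginal: length-K target share of each cluster
        (uniform for equipartition, the prior w for label proportions).
    Returns an (n_samples, K) array whose column means equal the target shares
    and whose rows sum to approximately 1 once the scaling has converged.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    w = np.asarray(cluster_marginal, dtype=float)
    w = w / w.sum()                      # normalise shares so they sum to 1
    a = np.full(n, 1.0 / n)              # every sample carries the same mass
    # Gibbs kernel; subtracting the max only rescales it and avoids overflow.
    kernel = np.exp((scores - scores.max()) / epsilon)
    v = np.ones(k)
    for _ in range(n_iters):
        u = a / (kernel @ v)             # match the per-sample marginal
        v = w / (kernel.T @ u)           # match the per-cluster marginal
    plan = u[:, None] * kernel * v[None, :]
    return plan * n                      # rescale so each row is a soft code


# Synthetic, low-contrast scores for a bag of 2048 tiles and K = 3 clusters.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 0.1, size=(2048, 3))

codes_llp = sinkhorn_codes(scores, [45.3, 35.8, 18.9])   # CV proportions, Table 1
codes_swav = sinkhorn_codes(scores, np.ones(3))          # equipartition baseline
```

With the proportion marginal, the soft codes allocate roughly 45.3%, 35.8% and 18.9% of the bag mass to the three clusters; with the uniform marginal, each cluster receives one third, which is why the baseline needs a separate clustering-and-matching step at evaluation time.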
A Hungarian match [22] between the true categories and the $k$-means result delivered the final accuracy.

Figure 3: Maps of the class output for the CV and CA datasets (rows: CV, CA; columns: Reference, LLP-Co Prediction, SwAV Prediction). Crop types for CV dataset: maize, cotton, others. Lithologies for CA dataset: chlorite schist, mineralised volcanics, others.

Table 2
Test performance for the CV and CA datasets.

                    LLP-Co               SwAV
    Metric        CV       CA         CV       CA
    Acc          94.1%    91.6%      74.4%    61.0%
    F1-score     93.8%    76.9%      66.0%    47.5%
    NMI          0.76     0.66       0.50     0.38

5. Results

Table 2 shows the performance for both datasets in terms of Acc, F1-score, and NMI. The model achieved competitive results, with accuracies of 94.1% and 91.6% for the CV and CA datasets, respectively. Similar performance was observed in terms of F1-score for the CV dataset, with 93.8%. In contrast, for the CA dataset, a lower F1-score of 76.9% was observed, due principally to class others. The cluster quality metric NMI reported values of 0.76 and 0.66 for CV and CA, respectively. Considering these metrics, the CV dataset reported better results than the CA dataset. This may be due to the different types of application and data, since geological mapping from hyperspectral data is a more challenging task due to significant confounding data variance and often subtle distinctions between the features of interest.

Comparing LLP-Co with the baseline model, we observe that, as expected, the inclusion of priors into the training process was crucial for a good classification performance. LLP-Co outperformed SwAV by ∼20 and ∼30 percentage points in terms of accuracy for the CV and CA datasets, respectively. A similar improvement was observed for the F1-score, with gains of ∼27 and ∼30 percentage points for the CV and CA datasets, respectively.

Table 3 presents the confusion matrices. As expected, the per-class accuracy was high for the major categories, with values of 91% or higher for both datasets. However, in the CA dataset, 48% of class others was misclassified as chlorite schist, demonstrating the challenge of this task. Another possible explanation for this drop in performance is the distribution of the classes: a more balanced vector of class proportions whose entries still differ significantly among the classes (as in the CV dataset, with w = (45.3, 35.8, 18.9)) delivers much better performance, allowing the model to learn a more discriminative and relevant set of features. In contrast, for a highly unbalanced vector of proportions, the model will favor the majority classes, as we observed for the CA dataset.

Finally, Fig. 3 presents the classification maps for each dataset. Here we can observe classification errors between class maize and the other two classes for the CV dataset, and between class mineralised volcanics and class others for the CA dataset. In addition, it is worth pointing out the quality of the predictions for both datasets, where no salt-and-pepper effect was observed.
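As a rough consistency check on the reported numbers, the overall accuracy and macro F1-score can be approximately reconstructed from the row-normalised confusion matrices in Table 3. The sketch below assumes, which the paper does not state, that the test samples follow the global class proportions of Table 1; the helper name `metrics_from_row_normalised` is hypothetical, and the percentages in Table 3 are rounded, so small deviations from Table 2 are expected.

```python
import numpy as np


def metrics_from_row_normalised(conf_pct, class_share_pct):
    """Overall accuracy and per-class F1 from a row-normalised confusion matrix.

    conf_pct: (K, K) percentages, rows = true class, columns = predicted class.
    class_share_pct: length-K share of each true class (ASSUMED test distribution).
    """
    conf = np.asarray(conf_pct, dtype=float) / 100.0
    share = np.asarray(class_share_pct, dtype=float)
    share = share / share.sum()
    joint = conf * share[:, None]                  # joint frequency of (true, predicted)
    recall = np.diag(conf)                         # per-class accuracy from Table 3
    precision = np.diag(joint) / joint.sum(axis=0)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = np.trace(joint)
    return accuracy, f1


# Table 3 rows and columns are ordered Maize, Cotton, Others for CV,
# so the Table 1 shares are reordered accordingly.
cv_conf = [[91, 5, 4], [2, 96, 2], [2, 3, 95]]
cv_share = [35.8, 45.3, 18.9]

# Table 3 rows and columns are ordered Cs, Mv, Others for CA.
ca_conf = [[94, 5, 1], [2, 93, 5], [48, 0, 52]]
ca_share = [38.7, 57.7, 3.6]

cv_acc, cv_f1 = metrics_from_row_normalised(cv_conf, cv_share)
ca_acc, ca_f1 = metrics_from_row_normalised(ca_conf, ca_share)

print(f"CV: accuracy {cv_acc:.3f}, macro F1 {cv_f1.mean():.3f}")
print(f"CA: accuracy {ca_acc:.3f}, macro F1 {ca_f1.mean():.3f}, per-class F1 {ca_f1.round(2)}")
```

Under this assumption the reconstruction gives roughly 94.0% accuracy and 93.5% macro F1 for CV, and roughly 91.9% accuracy and 77.0% macro F1 for CA, close to Table 2. For CA, the per-class F1 of class others comes out at about 0.43, which is consistent with the observation that this class is what pulls the macro F1-score down while barely affecting the overall accuracy.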
Table 3
LLP-Co confusion matrices for the CV and CA datasets for major categories and
class others. Rows are true classes, columns are predicted classes.

CV (true \ predicted)    Maize    Cotton    Others
Maize                     91%       5%        4%
Cotton                     2%      96%        2%
Others                     2%       3%       95%

CA (true \ predicted)      Cs       Mv      Others
Cs                        94%       5%        1%
Mv                         2%      93%        5%
Others                    48%       0%       52%

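As a reading aid for Table 3, the following minimal Python sketch recomputes the per-class accuracy (the diagonal of each row-normalised matrix) and shows how an overall accuracy would follow from the per-class accuracies once the class proportions are known. The matrices are transcribed from Table 3; the function names and the class proportions passed to `weighted_accuracy` are illustrative assumptions and are not values reported in this paper.

```python
# Row-normalised confusion matrices transcribed from Table 3 (values in percent).
# Rows are true classes, columns are predicted classes.
TABLE3 = {
    "CV": (["Maize", "Cotton", "Others"],
           [[91, 5, 4],
            [2, 96, 2],
            [2, 3, 95]]),
    "CA": (["Cs", "Mv", "Others"],
           [[94, 5, 1],
            [2, 93, 5],
            [48, 0, 52]]),
}


def per_class_accuracy(matrix):
    """Share of each true class that was predicted correctly (row diagonal / row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def mean_per_class_accuracy(matrix):
    """Unweighted average of the per-class accuracies."""
    acc = per_class_accuracy(matrix)
    return sum(acc) / len(acc)


def weighted_accuracy(matrix, proportions):
    """Overall accuracy implied by per-class accuracies and true class proportions."""
    acc = per_class_accuracy(matrix)
    return sum(a * p for a, p in zip(acc, proportions)) / sum(proportions)


if __name__ == "__main__":
    for name, (classes, matrix) in TABLE3.items():
        for cls, acc in zip(classes, per_class_accuracy(matrix)):
            print(f"{name} {cls}: {acc:.2f}")
        print(f"{name} mean per-class accuracy: {mean_per_class_accuracy(matrix):.3f}")

    # Hypothetical proportions (NOT from the paper): class others is rare.
    assumed_proportions = [0.50, 0.45, 0.05]
    ca_matrix = TABLE3["CA"][1]
    print(f"CA weighted accuracy: {weighted_accuracy(ca_matrix, assumed_proportions):.4f}")
```

When class others is given a small share, the weighted figure stays close to the accuracies of the major classes. This is consistent with a high overall accuracy coexisting with the 52% accuracy of class others in the CA dataset.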
6. Conclusions

This work evaluates a recently proposed weak-supervised method that combines
contrastive learning with class proportions constraints to train a classifier
without the need for labels at the pixel level in the context of Earth
Observation (EO) applications. The approach was able to achieve reasonable
accuracy values across different tasks and data sources, demonstrating its
robustness and applicability to large-scale EO data. An overall accuracy of
90% was reported for crop and geological mapping applications considering the
major categories found in the target regions. However, the approach failed to
identify classes with very small proportions. Several ways of dealing with
this problem, such as weighted cross-entropy or focal loss, could be
incorporated into our method. The success of the methodology opens a new path
in the use of weak information to help alleviate the burden of manual
annotation in EO.
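
The two remedies named above, weighted cross-entropy and focal loss, can be sketched for a single prediction as follows. This is a minimal illustration in plain Python, not part of the evaluated method: the function names, the inverse-proportion weighting, the example proportions and probabilities, and the default `gamma=2.0` are all assumptions, and `target` stands for whatever per-pixel assignment the training objective provides. How such a term would be combined with the contrastive and proportion losses is not specified here.

```python
import math


def weighted_cross_entropy(probs, target, class_weights):
    """Cross-entropy for one sample, scaled by the weight of its target class."""
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -class_weights[target] * math.log(p_t)


def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    alpha_t = 1.0 if alpha is None else alpha[target]
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def inverse_proportion_weights(proportions):
    """Assumed weighting scheme: inverse class proportion, rescaled to mean 1."""
    inv = [1.0 / p for p in proportions]
    mean_inv = sum(inv) / len(inv)
    return [w / mean_inv for w in inv]


if __name__ == "__main__":
    proportions = [0.50, 0.45, 0.05]  # hypothetical class proportions
    weights = inverse_proportion_weights(proportions)
    probs = [0.60, 0.30, 0.10]        # hypothetical softmax output for one pixel
    for target in range(len(probs)):
        wce = weighted_cross_entropy(probs, target, weights)
        fl = focal_loss(probs, target, gamma=2.0, alpha=weights)
        print(f"class {target}: weighted CE = {wce:.3f}, focal = {fl:.3f}")
```

Both terms increase the contribution of rare classes: the class weights do so directly, while the factor `(1 - p_t)**gamma` reduces the loss of samples that are already predicted with high confidence. With `gamma=0` and no `alpha`, the focal loss reduces to the ordinary cross-entropy.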
                                                             [13] G. Dulac-Arnold, N. Zeghidour, M. Cuturi, L. Beyer,
References                                                        J.-P. Vert, Deep multi-class learning from label
                                                                  proportions, arXiv preprint, arXiv:1905.12909
 [1] Y. M. Asano, C. Rupprecht, A. Vedaldi, Self-labelling        (2019).
      via simultaneous clustering and representation [14] Y. Shi, J. Liu, B. Wang, Z. Qi, Y. Tian, Deep learning
      learning, arXiv preprint, arXiv:1911.05371 (2019).          from label proportions with labeled samples, Neural
 [2] M. Caron, I. Misra, J. Mairal, P. Goyal,                     Networks 128 (2020) 73–81.
      P. Bojanowski, A. Joulin, Unsupervised learning of [15] C. Scott, J. Zhang, Learning from label proportions:
      visual features by contrasting cluster assignments,         A mutual contamination framework, Advances in
      Advances in Neural Information Processing                   Neural Information Processing Systems 33 (2020)
      Systems 33 (2020) 9912–9924.                                22256–22267.
 [3] J. Li, P. Zhou, C. Xiong, R. Socher, S. C. Hoi, [16] J. Liu, B. Wang, Z. Qi, Y. Tian, Y. Shi, Learning
      Prototypical contrastive learning of unsupervised           from label proportions with generative adversarial
      representations, arXiv preprint, arXiv:2005.04966           networks,       Advances in Neural Information
      (2020).                                                     Processing Systems 32 (2019) 7169–7179.
 [4] C. Li, X. Li, L. Zhang, B. Peng, M. Zhou, J. Gao, [17] J. Liu, B. Wang, X. Shen, Z. Qi, Y. Tian, Two-stage
      Self-supervised pre-training with hard examples             training for learning from label proportions, arXiv
      improves visual representations, arXiv preprint,            preprint, arXiv:2105.10635 (2021).
      arXiv:2012.13493 (2020).                               [18] A. Genevay, G. Dulac-Arnold, J.-P. Vert,
     Differentiable deep clustering with cluster
     size constraints, arXiv preprint, arXiv:1910.09036
     (2019).
[19] I. D. Sanches, R. Q. Feitosa, P. M. A. Diaz,
     M. D. Soares, A. J. B. Luiz, B. Schultz, L. E. P.
     Maurano, Campo Verde database: Seeking to
     improve agricultural remote sensing of tropical
     areas, IEEE Geoscience and Remote Sensing Letters
     15 (2018) 369–373.
[20] S. T. Thiele, S. Lorenz, M. Kirsch, I. C. C. Acosta,
     L. Tusa, E. Herrmann, R. Möckel, R. Gloaguen,
     Multi-scale, multi-sensor data integration for
     automated 3-d geological mapping, Ore Geology
     Reviews 136 (2021) 104252.
[21] I. Loshchilov, F. Hutter, Sgdr: Stochastic gradient
     descent with warm restarts,         arXiv preprint,
     arXiv:1608.03983 (2016).
[22] H. W. Kuhn, The Hungarian method for the
     assignment problem, Naval Research Logistics
     Quarterly 2 (1955) 83–97.