
Weak-Supervision Based on Label Proportions for Earth Observation Applications from Optical and Hyperspectral Imagery

Laura E. Cué La Rosa (1, 3), lauracuerosa@gmail.com
Dário A. Borges Oliveira (0, 4)
Sam Thiele (1), sam.thiele01@gmail.com
Pedram Ghamisi (1, 2), p.ghamisi@gmail.com
Richard Gloaguen (1), r.gloaguen@hzdr.de

(0) Data Science in Earth Observation, Technical University of Munich (TUM), Munich, Germany
(1) Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology, Freiberg, Germany
(2) Institute of Advanced Research in Artificial Intelligence (IARAI), 1030 Vienna, Austria
(3) Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Brazil
(4) School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil

In this paper, we assess a weakly supervised approach that employs weak constraints in the form of class proportions to train a neural network capable of performing pixel-wise classification for Earth Observation (EO) applications. The approach combines self-supervised contrastive clustering and a constraint on cluster proportions in an online fashion, allowing its application to large-scale EO images. The methodology is based on the generation of simple augmented views of input image tiles and the use of a loss function that performs contrastive learning to achieve results that are invariant to these augmentations and simultaneously follow the cluster proportions constraint. In many EO applications, information about class proportions is available through expert knowledge or, e.g., governmental census. This weak information allows training a classifier without class information at the pixel level, alleviating the burden of manual annotation. In this context, crop and geological mapping from EO data are two crucial applications in the search for sustainable ways of resource management. We tested the approach on optical and hyperspectral data, achieving promising results and demonstrating the method's applicability across different applications and data sources.

Keywords: Weak supervision, Learning from proportions, Multi-source, Crop mapping, Geological mapping

1. Introduction

Self-supervised learning [1, 2, 3, 4] has recently emerged as a powerful tool in computer vision applications. Among the existing self-supervised methods, contrastive learning can be considered the most promising one. This type of approach is based on the generation of augmented versions of the input image and the use of a twin network for feature extraction which, combined with a loss function, performs contrastive learning to achieve consistent results between these augmentations. The contrastive loss function is expected to increase the similarity among the augmentations of the same image while decreasing the similarity between augmentations of different images. The main characteristic of these methods is the capability of learning meaningful feature representations in an unsupervised fashion. This capability has opened new avenues in research fields beyond computer vision, such as Earth Observation (EO) applications. In this context, crop and geological mapping from EO data are two crucial applications for agricultural monitoring and modern mining, where training information is frequently limited or non-existent. In EO applications, self-supervised methods have been employed with success for tasks including image classification, object detection and semantic segmentation [5, 6, 7, 8, 9]. Some of these works employ geolocation and spatio-temporal information to learn a more discriminative set of features for remote sensing applications [5, 10]. Hyperspectral image classification and clustering using contrastive learning have also been the focus of recent publications [9, 8]. However, all the approaches mentioned above need positive and negative sample pairs to compute the contrastive loss, which is computationally intensive.

One of the most important contrastive-learning methods is Swapping Assignments between Multiple Views (SwAV) [2], which performs self-supervised learning and clustering in an online fashion. The method employs an optimal transport (OT) solver to assign the image feature vectors to cluster centroids by means of an equipartition constraint, which ensures that all samples within a batch of images are equally assigned to the predefined number of clusters.

An advantage of the SwAV method over the previously proposed contrastive learning frameworks is that the use of the OT solver with the equipartition constraint allows disregarding pairwise comparisons. Recently, weak information in the form of class proportions was introduced as a constraint in SwAV to train a classifier in a weakly supervised fashion. The method, called Learning from Label Proportions with Prototypical Contrastive Clustering (LLP-Co) [11], replaces the equipartition constraint in the OT solver with a cluster proportions constraint.

Using information about class proportions to train a classifier has gained attention in recent years [12, 13, 14, 15]. Given a set of images, the Learning from Label Proportions (LLP) approach focuses on learning an instance-level classifier using as reference signal only the class proportions observed in this set. In EO applications, with a large amount of available data and the unavailability of pixel-level annotations, the use of priors like class proportions is an attractive solution. In many real-life scenarios, these proportions can be obtained from governmental census or even expert knowledge. Examples of governmental agencies that record statistics about agriculture, forestry, and natural resources, among others, are the National Agricultural Statistics Service of the United States Department of Agriculture¹, the Brazilian Institute of Geography and Statistics (IBGE) in Brazil², Forest Research in the United Kingdom³, and the European Statistics website⁴.

This paper focuses on assessing the viability of using contrastive learning combined with LLP to train a pixel-wise classifier based only on prior information about global class proportions for EO applications. We tested the LLP-Co methodology on two datasets: the first focuses on crop type mapping using optical data and the second on geological mapping using hyperspectral data. This allows assessing the model's applicability across different applications and data sources. Hence, the main contribution of this study is to propose a weakly supervised deep clustering method that employs label proportions as priors and can be easily applied to large-scale EO data from different sources for significantly different applications.

¹ https://www.nass.usda.gov/
² https://www.ibge.gov.br/
³ https://www.forestresearch.gov.uk/tools-and-resources/statistics/forestry-statistics/
⁴ https://ec.europa.eu/eurostat

2.1. LLP and Optimal Transport

In this work, we assess the LLP-Co approach in a scenario where only the global class proportions are available to train the network. To implement LLP, the training samples are split into disjoint bags of image tiles, where $\mathcal{B}_k$ is the $k$-th bag, which consists of a set of randomly cropped image tiles from the large-scale input EO image. Here, $\mathcal{B}_k = \{x_{k,i}\}_{i=1}^{N}$, where $x_{k,i}$ is the $i$-th image tile within the bag. The final training set is then expressed as $\mathcal{D} = \{(\mathcal{B}_k, \mathbf{w})\}_{k=1}^{M}$. In a multi-class problem with $K$ classes, $\mathbf{w} \in \Delta^{K}$ is a vector of global label proportions, which is the same for all bags, s.t. $\sum_{j=1}^{K} w_j = 1$, where the element $w_j$ is the proportion of tiles that belong to class $j$. In the methodology, a neural network acts as the feature extractor, followed by a layer that delivers the class probabilities vector $\tilde{\mathbf{p}}_{k,i} = p(y \mid x_{k,i}, \theta)$, where $\theta$ represents the network parameters [16]. Then, the estimated global label proportions for each bag are expressed as

$$\hat{\mathbf{w}}_k = \frac{1}{N} \sum_{i=1}^{N} \tilde{\mathbf{p}}_{k,i},$$

and to train the network a standard cross-entropy loss function can be used:

$$\mathcal{L}(\hat{\mathbf{w}}, \mathbf{w}) = -\frac{1}{M} \sum_{k=1}^{M} \mathbf{w}^{\top} \log \hat{\mathbf{w}}_k. \tag{1}$$

The above equation is reformulated by encoding the label proportions as a posterior distribution [1, 17, 11],

$$E(p, q) = -\frac{1}{MN} \sum_{k=1}^{M} \sum_{i=1}^{N} \sum_{j=1}^{K} q(j \mid x_{k,i}) \log p(j \mid x_{k,i}, \theta), \tag{2}$$

delivering the LLP optimization objective as:

$$\min_{(p,q)} E(p, q), \quad \text{s.t.} \quad \forall j: q(j \mid \cdot) \in [0, 1], \tag{3}$$

$$\sum_{i=1}^{N} q(j \mid x_{k,i}) = N w_j, \tag{4}$$

where the global proportion constraint ensures that each label $j$ contains overall $N w_j$ samples. This problem is an instance of the regularized optimal transport problem and is solved using the Sinkhorn-Knopp algorithm [1, 17, 11]. Here $P_{j,i} = p(j \mid x_{k,i}, \theta)\,\frac{1}{N}$ is the probabilities matrix estimated by the network and $Q_{j,i} = q(j \mid x_{k,i})\,\frac{1}{N}$ is the matrix of assigned probabilities for bag $\mathcal{B}_k$. In the LLP-Co approach, $Q$ splits the samples within the bag following the global label proportions. Then the objective function as an OT solver is defined as

$$\min_{Q \in U(\mathbf{w}, \mathbf{a})} \langle Q, -\log P \rangle + h(Q), \tag{5}$$

where $U(\mathbf{w}, \mathbf{a})$ is the matrix space of possible solutions for the $k$-th bag, and $\mathbf{a} = (1/N)\mathbf{1}$ is a normalizing constraint [18].
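As an illustration of how a proportion-constrained OT solver of this kind can be computed, the following is a minimal Sinkhorn-Knopp sketch in plain Python. It is not the authors' implementation: the function name `sinkhorn_codes`, the regularisation weight `eps`, the number of sweeps `n_iters`, and the toy bag are all assumptions made for the example. The iterations alternately rescale the rows of the assignment matrix so that they sum to the class proportions and the columns so that every tile carries the same mass.

```python
import math


def sinkhorn_codes(log_p, w, eps=0.5, n_iters=500):
    """Return Q (K rows, N columns) with row sums w and column sums 1/N.

    log_p   : N rows of K log-probabilities predicted for the tiles of one bag.
    w       : K global class proportions that sum to 1.
    eps     : entropic regularisation weight (illustrative value, an assumption).
    n_iters : number of Sinkhorn-Knopp sweeps (illustrative value, an assumption).
    """
    n, k = len(log_p), len(w)
    # Shift by the largest log-probability so the largest kernel entry is 1,
    # which limits numerical underflow for small eps.
    top = max(max(row) for row in log_p)
    # Gibbs kernel: entry (j, i) is proportional to p(j | x_i) ** (1 / eps).
    q = [[math.exp((log_p[i][j] - top) / eps) for i in range(n)] for j in range(k)]
    for _ in range(n_iters):
        # Row scaling: class j receives a total mass of w[j].
        for j in range(k):
            row_sum = sum(q[j])
            q[j] = [v * w[j] / row_sum for v in q[j]]
        # Column scaling: every tile carries the same mass 1/N.
        for i in range(n):
            col_sum = sum(q[j][i] for j in range(k))
            for j in range(k):
                q[j][i] /= col_sum * n
    return q


# Toy bag: six tiles, three classes; all numbers are made up for illustration.
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
]
w = [0.5, 0.3, 0.2]
log_p = [[math.log(v) for v in row] for row in probs]
q = sinkhorn_codes(log_p, w)

row_sums = [sum(row) for row in q]
col_sums = [sum(q[j][i] for j in range(len(w))) for i in range(len(probs))]
# Multiplying a column by N turns it into a soft target (code) that sums to 1.
codes = [[len(probs) * q[j][i] for j in range(len(w))] for i in range(len(probs))]
```

After convergence, `row_sums` matches the proportion vector `w` and each entry of `col_sums` equals 1/N, so the soft targets in `codes` respect the proportion constraint over the bag. Replacing `w` with a uniform vector would recover an equipartition constraint.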

2.2. Learning from Global Label Proportions with Prototypical Contrastive Clustering

LLP-Co [11] is a self-supervised contrastive method that performs online clustering by means of a convolutional neural network that delivers consistent cluster assignments between augmentations of the same input. At the same time, the cluster assignment must follow certain cluster size constraints that are provided as weak information. Given a user-defined number of views of the same input image tile, the algorithm employs the OT solver in Eq. 5 to compute soft targets, or codes. These targets are then treated as true labels to calculate the cross-entropy against the network's prediction for the other views. The methodology pipeline for two augmented views and $K$ classes is the following. First, each image tile within a bag is transformed into two augmented versions, which are fed to an encoder network that extracts the feature vectors $\mathbf{z}_{i,1}$ and $\mathbf{z}_{i,2}$. These features are then mapped to one of the trainable prototypes $V$ to perform the code assignments for each view, $\mathbf{c}_{i,1}$ and $\mathbf{c}_{i,2}$, using the OT solver. From then on, a "swapped" contrastive loss is applied to predict the assignment of one feature from the code of the other. The optimization process is then conducted by minimizing the loss for all samples within the bag:

$$L(\mathbf{z}_{i,1}, \mathbf{z}_{i,2}) = \ell(\mathbf{z}_{i,1}, \mathbf{c}_{i,2}) + \ell(\mathbf{z}_{i,2}, \mathbf{c}_{i,1}), \tag{6}$$

where each term is the cross-entropy loss between the code and the probability obtained after applying a softmax function to the dot product between the features $Z$ and the prototypes $V$. For more information about the LLP-Co method, see [11].

3. Datasets

3.1. Campo Verde dataset (CV)

The first study site is in Campo Verde, an agricultural region located in Mato Grosso, Brazil, at a latitude of 15°32′48″ south and a longitude of 55°10′08″ west (Fig. 1). Campo Verde (CV) [19] is a public dataset⁵ that provides pre-processed SAR and optical images between October 2015 and July 2016. The major crops found in the region are soybean, maize and cotton. Other crop and non-crop categories are beans, sorghum, Non-Commercial Crops (NCC), pasture, eucalyptus, turfgrass, cerrado and soil. This work focuses on the second seeding period for the major crops maize and cotton, covering the months between March and July. The reference data consisted of 608 parcels. Table 1 gives the percentages of the overall area planted with major crops according to the annotated parcels; we use this information as the global vector of class proportions for our experiments.

⁵ The CV database is available from IEEE Dataport at https://ieee-dataport.org/documents/campo-verde-database.

3.2. Corta Atalaya dataset (CA)

The second study area is located at Rio Tinto, Spain. Rio Tinto is located 70 km north of Huelva in the Iberian Pyrite Belt (IPB), a belt extending from southern Portugal into southern Spain (Fig. 2). Our data was collected from Corta Atalaya (CA), an open-pit mine with a size of 1200 × 900 m and a depth of ca. 350 m. This pit exposes basaltic to intermediate volcanic rocks along the northern part of the pit, and overlying felsic volcanic rocks, slate, and conglomerate, which are exposed in the western part of the mine. We tested our approach using ground-based hyperspectral imagery collected with a tripod-mounted Specim AisaFENIX sensor, which covers the visible-near and short-wave infrared range. A labeled reference image was created based on field mapping, fifty-seven hand samples, and combined supervised classification followed by manual interpretation of the hyperspectral data [20]. The lithologies interpreted at CA are as follows: oxidised, massive sulphide, two varieties of chlorite, two sericitic units, shale and purple shale. In this study, we grouped the lithologies into two major categories, chlorite schist and mineralised volcanics. In addition, weathered material and vegetation were grouped in a category named others. Table 1 gives the percentages of the overall area covered by these two major lithologies according to the labeled reference image; we use this information as the global vector of class proportions for our experiments.

4. Experiments

4.1. Experimental Protocol

Our experiments focused on the major categories found in both datasets. To assess the methodology's robustness to different data sources, we employed optical data for the CV dataset and hyperspectral data for the CA dataset. For the CV dataset, we considered the cloud-free optical image available for May 2016. For the CA dataset, we stacked VNIR and SWIR data into a single data cube. We evaluated the LLP-Co method under a scenario that uses global class proportions to identify the major categories in the target regions. Unlike the traditional LLP training schemes, which calculate the class proportion for each

4.2. Implementation Details

Considering the different data sources, we employed a modified ResNet18 and ResNet10 as the backbone architecture for the CV and CA datasets, respectively. To process the hyperspectral data cube in both the spatial and spectral domains, we also added two 3D convolutional layers at the beginning of the ResNet10 network for the CA dataset. The ResNet architecture is then followed by a projection head that projects the features to a 1024-dimensional space. We trained the models for 100 epochs using stochastic gradient descent with cosine learning rate decay [21]. The image tile size was set to 21 × 21 for both datasets. For each dataset, we randomly selected 200,000 image tiles on the fly to create the random bags. The list of augmentations includes random rotations, mirroring, and random resizing to obtain two views. For the OT solver, we set the hyper-parameters as in [11]. The number of clusters for both models was set to the number of categories found in the datasets. We quantitatively assessed the method using three metrics: cluster accuracy, macro-averaged F1-score (F1-score), and normalized mutual information (NMI). Since we use the class proportion information, we report the classification metrics by considering the cluster assigned by the network at inference time. We also report the confusion matrices.
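The three evaluation metrics can be illustrated with a small, dependency-free sketch. The definitions below follow common conventions and are assumptions rather than details confirmed by the text: `cluster_accuracy` takes the best one-to-one mapping between clusters and classes (found here by brute force over permutations, which is only practical for a handful of classes), `macro_f1` reads the cluster index directly as the class index, and `nmi` normalises by the arithmetic mean of the two entropies. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cluster_accuracy(y_true, y_pred, n_classes):
    """Accuracy under the best one-to-one cluster-to-class mapping."""
    counts = Counter(zip(y_pred, y_true))
    best = max(
        sum(counts[(c, perm[c])] for c in range(n_classes))
        for perm in permutations(range(n_classes))
    )
    return best / len(y_true)


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1, with cluster index read as class index."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    p_true, p_pred = Counter(y_true), Counter(y_pred)
    mi = sum(
        c / n * math.log(c * n / (p_true[t] * p_pred[p]))
        for (t, p), c in joint.items()
    )

    def entropy(counter):
        return -sum(c / n * math.log(c / n) for c in counter.values())

    mean_h = (entropy(p_true) + entropy(p_pred)) / 2
    return mi / mean_h if mean_h > 0 else 1.0


# Toy labels: the clustering is perfect, but clusters 0 and 1 are swapped.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [1, 1, 1, 0, 0, 2]
acc = cluster_accuracy(y_true, y_pred, 3)
f1 = macro_f1(y_true, y_pred, 3)
score = nmi(y_true, y_pred)
```

In this toy case the matched accuracy and NMI are both perfect, while the macro F1 computed without any matching is penalised by the swapped cluster indices. This shows why it matters whether cluster indices are read directly as classes or matched first.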

4.3. Baseline method

We adopted the original SwAV method with the equipartition constraint as the baseline method. This constraint ensures that samples are equally partitioned among the clusters, and for good performance the authors recommend a number of clusters at least three times higher than the expected number of categories. In preliminary experiments we found that 30 clusters delivered good performance for the CV dataset, while 10 clusters delivered acceptable performance for the CA dataset. The backbone network for SwAV is the same as the LLP-Co backbone network for each dataset. To evaluate the model, we used the features z generated by the backbone network followed by k-means clustering.
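For readers unfamiliar with the clustering step used to evaluate the baseline, the following is a minimal sketch of Lloyd's k-means algorithm in plain Python. The toy two-dimensional vectors stand in for backbone features, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic. Both are assumptions of this sketch. How the resulting clusters are mapped to classes is not covered here.

```python
def kmeans(points, k, n_iters=20):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D feature vectors forming two well-separated groups.
features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
labels, centroids = kmeans(features, k=2)
```

On this toy input the points alternate between the two groups, so the returned labels alternate as well, and each centroid ends up at the mean of its group.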

5. Results

Table 2 shows the performance for both datasets in terms of cluster accuracy, F1-score, and NMI. The model delivered competitive results, achieving accuracies of 94.1% and 91.6% for the CV and CA datasets, respectively. Similar performance was observed in terms of F1-score for the CV dataset, with 93.8%. In contrast, a lower F1-score of 76.9% was observed for the CA dataset, due principally to class others. The cluster quality metric NMI reported values of 0.76 and 0.66 for CV and CA, respectively. Considering these metrics, the CV dataset reported better results than the CA dataset. This may be due to the different types of application and data, since geological mapping from hyperspectral data is a more challenging task due to significant confounding data variance and often subtle distinctions between the features of interest.

Comparing LLP-Co with the baseline model, we observe that, as expected, the inclusion of priors in the training process was crucial for good classification performance. LLP-Co outperformed SwAV by ∼20% and ∼30% in terms of accuracy for the CV and CA datasets, respectively. A similar improvement was observed for the F1-score, with an enhancement of ∼27% and ∼30% for the CV and CA datasets, respectively.

Table 3 presents the confusion matrices. As expected, the per-class accuracy was high for the major categories, with values above 91% for both datasets. However, in the CA dataset, 48% of class others was misclassified as chlorite schist, demonstrating the challenge of this task. Another possible explanation for this drop in performance is the distribution of the classes: a more balanced vector of class proportions whose entries still differ significantly among the classes (as in the CV dataset, with w = (45.3, 35.8, 18.9)) delivers much better performance, allowing the model to learn a more discriminative and relevant set of features. In contrast, for a highly unbalanced vector of proportions, the model will favor the majority classes, as we observed for the CA dataset. Finally, Fig. 3 presents the classification maps for each dataset. Here we can observe classification errors between class maize and the other two classes for the CV dataset, and between class mineralised volcanics and class others for the CA dataset. In addition, it is worth pointing out the quality of the predictions for both datasets, where no salt-and-pepper effect was observed.

6. Conclusions

This work evaluates a recently proposed weakly supervised method that combines contrastive learning with class proportion constraints to train a classifier without the need for labels at the pixel level in the context of Earth Observation (EO) applications. The approach was able to achieve reasonable accuracy values across different tasks and data sources, demonstrating its robustness and applicability to large-scale EO data. Overall accuracies above 90% were obtained for crop and geological mapping applications considering the major categories found in the target regions. However, the approach failed to identify classes with very small proportions. Several ways of dealing with this problem, such as weighted cross-entropy or focal loss, could be incorporated into the method. The success of the methodology opens a new path in the use of weak information to help alleviate the burden of manual annotation in EO.

References

Differentiable deep clustering with cluster size constraints, arXiv preprint, arXiv:1910.09036 (2019).
[19] I. D. Sanches, R. Q. Feitosa, P. M. A. Diaz, M. D. Soares, A. J. B. Luiz, B. Schultz, L. E. P. Maurano, Campo Verde database: Seeking to improve agricultural remote sensing of tropical areas, IEEE Geoscience and Remote Sensing Letters 15 (2018) 369–373.
[20] S. T. Thiele, S. Lorenz, M. Kirsch, I. C. C. Acosta, L. Tusa, E. Herrmann, R. Möckel, R. Gloaguen, Multi-scale, multi-sensor data integration for automated 3-D geological mapping, Ore Geology Reviews 136 (2021) 104252.
[21] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, arXiv preprint, arXiv:1608.03983 (2016).
[22] H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1955) 83–97.