<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge-Aware Cross-Modal Text-Image Retrieval for Remote Sensing Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Li Mi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siran Li</string-name>
          <email>siran.li@epfl.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christel Chappuis</string-name>
          <email>christel.chappuis@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Devis Tuia</string-name>
          <email>devis.tuia@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Environmental Computational Science and Earth Observation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)</institution>
          ,
          <addr-line>1950 Sion</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Section of Electrical and Electronics Engineering, École Polytechnique Fédérale de Lausanne (EPFL)</institution>
          ,
          <addr-line>1015 Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Image-based retrieval in large Earth observation archives is difficult, because one needs to navigate across thousands of candidate matches with only the query image as a guide. By using text as a query language, the retrieval system gains in usability, but at the same time faces difficulties due to the diversity of visual signals that cannot be summarized by a short caption only. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Cross-modal Retrieval (KCR) method for remote sensing text-image retrieval. By mining relevant information from an external knowledge graph, KCR enriches the text scope available in the search query and alleviates the information gaps between texts and images for better matching. Experimental results on two commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method outperforms state-of-the-art methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-modal Retrieval</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Remote Sensing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent advances in satellite data acquisition and storage have led to a rapid development of remote sensing image archives. To explore them, image retrieval has received increasing attention [1, 2]. However, retrieving images using example images would limit the versatility of the retrieval system, since with the query image only, one cannot specify which elements are essential for the query or what the retrieval objective is. As a solution, text-image retrieval [3, 4] has been introduced to make the retrieval targets explicit in a semantic way. Text-image retrieval aims at recalling an image based on a text or, in reverse, retrieving a text according to an image. As a bridge between vision and language research, it provides a possibility to explore the growing amount of cross-modal remote sensing data.</p>
      <p>When regarding text as the query, the prospective retrieval system gains in usability among cross-modal data, but at the same time faces the problem of information asymmetry between texts and images [5]. When dealing with very high-resolution remote sensing images, the image content can be very diverse, hence it is difficult to summarize it comprehensively in natural language, especially with a short caption. On one hand, human captions can only describe the image from one or a few specific aspects, focusing on the most dominant information. For example, one image could receive the following caption text: There is a lake. Nevertheless, there might be trees and mountains around the lake which are ignored by humans or caption generators. In addition, different people will describe the image from subjective perspectives, resulting in a variety of text information for a single image, which may confuse the matching model. Therefore, strategies to handle lacunary captions, nuances and synonyms are needed for the task, and a balance between objectivity and completeness must be achieved.</p>
      <p>Knowledge graphs [6] present relationships and proximities among concepts through graph structures. By providing the experience and commonsense from human understanding, knowledge graphs have been recognized as effective prior knowledge in many vision-and-language research works [7, 8] to reveal commonsense and alleviate ambiguities. In this paper, we propose a Knowledge-aware Cross-modal Retrieval (KCR) method for remote sensing text-image retrieval. With the help of external knowledge graphs, KCR extends the text scope to obtain a more robust text representation. More specifically, taking the objects mentioned in a sentence as starting points, KCR proposes to mine the expanded nodes and edges in a knowledge graph and embed them as features to enrich those extracted from the text content alone. As such, KCR integrates commonsense knowledge and leads to competitive performance on two commonly used remote sensing text-image retrieval benchmarks.</p>
      <p>[Figure 1: Architecture of KCR. An image encoder and a knowledge-aware text encoder produce image and text embeddings, which are compared by a similarity measurement module.]</p>
      <sec id="sec-1-1">
        <title>Image</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Due to the emergence of multi-modal remote sensing data, vision-language research, such as image captioning [9], visual question answering [10], and cross-modal retrieval [4], has attracted increasing attention [11]. Recent advances in remote sensing text-image retrieval have mainly focused on: 1) learning a more representative image feature by fusing local and global image features [12] or features from different feature extractors [4]; 2) learning a distinguishable joint embedding space by fusing the cross-modal features and leveraging a ranking-based loss function [4]. Departing from previous efforts that base the retrieval on the image characteristics and the caption only, we propose to enrich the latter with a knowledge graph that extends the text content and alleviates ambiguities for a more robust text representation.</p>
      <p>Consisting of various nodes as concepts, knowledge graphs encode commonsense knowledge about the world [6, 13, 14]. By exploring knowledge graphs, vision-and-language research has been promoted thanks to the priors they provide for visual understanding [7, 8]. In remote sensing research, Li et al. [15] constructed a remote sensing knowledge graph to support zero-shot remote sensing image scene classification. Their efforts in exploiting a remote sensing knowledge graph for image understanding focused on using the graph embedding as an overall representation of an image. Different from this work, the proposed KCR explores the fine-grained object-level connections between nodes in the graph and words in sentences.</p>
    </sec>
    <sec id="sec-2a">
      <title>3. Knowledge-aware Cross-modal Retrieval Method</title>
      <p>The proposed text-image retrieval system comprises three main components: an image encoder, a text encoder and a similarity measurement module (Figure 1). The image encoder is designed to extract image features with a pre-trained feature extractor and a self-attention block. The text encoder embeds a sentence and its related external knowledge, extracted from a knowledge graph, into a joint feature space. Finally, the image and text features are both used within the similarity measurement module to compute the similarity score between text queries and candidate images, which are then ranked according to their relevance. The model can also be applied in reverse, where the best captions to summarize an image are retrieved.</p>
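      <p>As an illustration only, the following minimal PyTorch-style sketch shows how the three components could be composed at retrieval time. The function and module names (retrieve, image_encoder, text_encoder) are hypothetical placeholders, not the authors' released code.</p>
      <preformat>
# Hypothetical sketch of the KCR retrieval pipeline (not the official implementation).
import torch

def retrieve(query_texts, candidate_images, image_encoder, text_encoder, k=10):
    """Rank candidate images for each text query by similarity score."""
    with torch.no_grad():
        f_img = image_encoder(candidate_images)   # (N_img, d) overall image features
        f_text = text_encoder(query_texts)        # (N_txt, d) knowledge-aware text features
    # Similarity is the negative pairwise Euclidean distance (Section 3.3).
    scores = -torch.cdist(f_text, f_img, p=2)     # (N_txt, N_img)
    # A higher score means a smaller distance; keep the top-k images per query.
    return scores.topk(k, dim=1).indices
      </preformat>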
      <sec id="sec-2a-1">
        <title>3.1. Image encoder</title>
        <p>The image encoder is a pre-trained feature extractor with a self-attention block [16, 17]. Two sets of image features are extracted from the image encoder:</p>
        <p>High-level image feature. We use ResNet-101 [18] as a backbone and the last Fully Connected (FC) layer is retrained. For an image I, the output of the retrained FC layer is regarded as the high-level image feature f_high.</p>
        <p>Mid-level image feature. The output of ResNet block 3, denoted as f, is sent to an additional self-attention block to further capture the long-range dependencies among pixels and provide more detailed information at a relatively mid-level. The self-attention block can be defined as: f_out = softmax((W_q f)(W_k f)^T)(W_v f) + f, (1) where W_q, W_k and W_v are weight matrices of 1 × 1 convolutions to embed the feature. Followed by a 2D pooling layer and a flattening operation, the mid-level image feature is extracted as a vector: f_mid = pool2d(f_out). (2)</p>
        <p>To project the image feature and the text feature into the same dimension, the concatenated high-level and mid-level image features are sent to a final FC layer to obtain the overall image representation: f_img = FC(concat(f_high, f_mid)). (3)</p>
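        <p>For concreteness, a minimal PyTorch-style sketch of such an image encoder is given below. The layer choices, pooling operation and dimensions are illustrative assumptions (following the values reported in Section 4.1) rather than the exact configuration used by the authors.</p>
        <preformat>
# Illustrative image encoder: ResNet backbone + self-attention + FC projection.
import torch
import torch.nn as nn
import torchvision.models as models

class SelfAttentionBlock(nn.Module):
    def __init__(self, in_dim=1024, mid_dim=512):
        super().__init__()
        self.w_q = nn.Conv2d(in_dim, mid_dim, kernel_size=1)   # W_q: 1x1 conv
        self.w_k = nn.Conv2d(in_dim, mid_dim, kernel_size=1)   # W_k: 1x1 conv
        self.w_v = nn.Conv2d(in_dim, in_dim, kernel_size=1)    # W_v: 1x1 conv

    def forward(self, f):
        b, c, h, w = f.shape
        q = self.w_q(f).flatten(2).transpose(1, 2)              # (b, hw, mid)
        k = self.w_k(f).flatten(2)                              # (b, mid, hw)
        v = self.w_v(f).flatten(2).transpose(1, 2)              # (b, hw, c)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)           # (b, hw, hw)
        out = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return out + f                                          # residual connection, Eq. (1)

class ImageEncoder(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        resnet = models.resnet101(weights=None)  # pre-trained weights would be loaded in practice
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                                  resnet.layer1, resnet.layer2, resnet.layer3)  # up to block 3
        self.block4 = resnet.layer4
        self.avgpool = resnet.avgpool
        self.fc_high = nn.Linear(2048, out_dim)     # retrained last FC: f_high
        self.attn = SelfAttentionBlock(1024, 512)
        self.pool_mid = nn.AdaptiveAvgPool2d(1)     # assumed 2D pooling for f_mid
        self.fc_out = nn.Linear(out_dim + 1024, out_dim)

    def forward(self, x):
        f3 = self.stem(x)                                                   # block-3 feature map
        f_high = self.fc_high(self.avgpool(self.block4(f3)).flatten(1))     # high-level feature
        f_mid = self.pool_mid(self.attn(f3)).flatten(1)                     # Eq. (2)
        return self.fc_out(torch.cat([f_high, f_mid], dim=1))               # Eq. (3)
        </preformat>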
      <p>(3)
3.2. Knowledge-aware text encoder
Knowledge representation. For a sentence  with
 words:  = {1, 2, ..., } ( ≥
is used to separate every word and divide the
part-ofspeech (e.g. noun, verb, adjective, adverb, etc.) for them.</p>
      <p>Based on the part-of-speech tags, all the nouns can be
appended into a word list. Then we extract a sentence
graph  based on the word list only. The sentence</p>
      <p>1), a tokenizer
mote sensing knowledge graph [15], . More
specifically, the nouns in the word list are regarded as the initial
nodes. Starting from those nodes, all the one-step
neighbours with the connected edges in  are included in
. Note that the sentence graph is a directed graph,
which means the edge between two nodes is a one-way
relationship. In the sentence graph, each edge can be
represented as a relationship triplet (, , ), shown as
&lt; −  −  &gt;, which can be regarded
as a short sentence with three words. Mining all the edges
of the sentence graph might be redundant, so we decide
to randomly select  triplets from all the available ones.</p>
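        <p>This triplet mining step can be pictured with the short sketch below. The graph representation (a dictionary of outgoing edges) and the NLTK-based part-of-speech tagging are assumptions made for illustration, not necessarily the authors' exact tooling.</p>
        <preformat>
# Illustrative mining of knowledge triplets for one caption (assumed data structures).
import random
import nltk  # assumes the tokenizer and POS tagger resources are installed

def mine_triplets(sentence, kg_edges, k=10):
    """kg_edges: dict mapping a node to a list of (predicate, object) outgoing edges."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    nouns = [w.lower() for w, tag in tagged if tag.startswith("NN")]  # initial nodes
    triplets = []
    for s in nouns:
        for p, o in kg_edges.get(s, []):       # one-step neighbours in the directed graph
            triplets.append((s, p, o))         # relationship triplet (s, p, o)
    random.shuffle(triplets)
    return triplets[:k]                        # randomly keep k triplets
        </preformat>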
        <p>Text encoder. We use Sentence-Transformer [19] as the text encoder for the sentence feature f_sen, as well as for the external knowledge representation f_know. Sentence-Transformer is a modification of the pretrained BERT network using siamese and triplet network structures to derive semantically meaningful sentence embeddings. The encoding process of a sentence and the corresponding knowledge can be formulated as: f_sen = SenTrans(s{w_1, w_2, ..., w_n}), f_know = FC((1/k) Σ_i SenTrans(r_i{s_i, p_i, o_i})). (4)</p>
        <p>After being embedded in the feature space, the sentence representation and the external knowledge representation are concatenated and sent to a final FC layer to obtain the overall representation of the text: f_text = FC(concat(f_sen, f_know)). (5)</p>
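        <p>As a rough illustration of Eqs. (4) and (5), the following sketch uses the sentence-transformers library. The model name and feature dimensions are placeholders and may differ from those used in the paper.</p>
        <preformat>
# Illustrative knowledge-aware text encoding (Eqs. 4-5), under assumed dimensions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class KnowledgeAwareTextEncoder(nn.Module):
    def __init__(self, model_name="all-MiniLM-L6-v2", out_dim=512):
        super().__init__()
        self.sent_model = SentenceTransformer(model_name)   # SenTrans in Eq. (4)
        emb_dim = self.sent_model.get_sentence_embedding_dimension()
        self.fc_know = nn.Linear(emb_dim, emb_dim)           # FC over averaged triplet embeddings
        self.fc_text = nn.Linear(2 * emb_dim, out_dim)       # final FC in Eq. (5)

    def forward(self, sentence, triplets):
        f_sen = torch.as_tensor(self.sent_model.encode(sentence))               # sentence embedding
        # Each triplet (s, p, o) is treated as a three-word sentence and embedded.
        trip_emb = torch.as_tensor(self.sent_model.encode([" ".join(t) for t in triplets]))
        f_know = self.fc_know(trip_emb.mean(dim=0))                              # Eq. (4)
        return self.fc_text(torch.cat([f_sen, f_know], dim=0))                   # Eq. (5)
        </preformat>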
      <sec id="sec-2-1">
        <title>3.3. Similarity Measurement</title>
        <p>Similarity score. The similarity score S is defined as the negative pairwise Euclidean distance between two features: S = −dis(f_img, f_text). With a smaller distance to the query feature, the similarity score is larger and the target ranks higher.</p>
        <p>Loss function. Triplet loss is commonly used in the text-image retrieval task [3, 20, 4]. It constrains the similarity score of the matched image-text pairs to be larger than the similarity score of the unmatched ones by a margin. Meanwhile, previous research [3] discovered that using the hardest negative in a batch during training rather than all negative samples can boost performance. Therefore, the loss function can be formulated as: L(I, T) = max(0, α − S(I, T) + S(I, T′)) + max(0, α − S(I, T) + S(I′, T)), (6) where α is a margin parameter and image I and sentence T are the corresponding pair. Sentence T′ is the top-1 text retrieval result with query image I, and image I′ is the top-1 image retrieval result with query text T.</p>
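        <p>A compact sketch of this hardest-negative triplet loss, in the spirit of VSE++ [3], is given below. The batch layout (matching pairs on the diagonal of the score matrix) is an assumption of the example, not a statement about the authors' implementation.</p>
        <preformat>
# Illustrative hardest-negative triplet loss over a batch (Eq. 6), assuming the
# i-th image and i-th text in the batch form the matched pair.
import torch

def triplet_loss_hardest(f_img, f_text, margin=0.2):
    """f_img, f_text: (B, d) image and text features of matched pairs."""
    scores = -torch.cdist(f_img, f_text, p=2)            # S = -Euclidean distance, (B, B)
    pos = scores.diag()                                   # S(I, T) for matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg_text = scores.masked_fill(mask, float("-inf")).max(dim=1).values  # hardest T' per image
    neg_img = scores.masked_fill(mask, float("-inf")).max(dim=0).values   # hardest I' per text
    loss_t = torch.clamp(margin - pos + neg_text, min=0)  # first term of Eq. (6)
    loss_i = torch.clamp(margin - pos + neg_img, min=0)   # second term of Eq. (6)
    return (loss_t + loss_i).mean()
        </preformat>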
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Experimental details</title>
        <p>Datasets. We perform experiments on two commonly used RS text-image datasets: the RSICD dataset and the UCM-Captions dataset. The RSICD dataset [9] contains 10,921 images of size 224 × 224 pixels. The UCM-Captions dataset [23], which is based on the UC Merced Land Use dataset [24], contains remote sensing images categorized into 21 land use classes, with 100 samples for each class. For each sample in both datasets, there are 5 sentences describing the image content. We follow the train-test split of previous work [4], randomly selecting 80%, 10% and 10% of the dataset as the training set, validation set and test set, respectively.</p>
        <p>Metrics. To evaluate the model performance, we exploit the standard evaluation metrics in retrieval tasks, R@k and mR [3, 25]. For different values of k, R@k is the fraction of queries for which the most relevant item is ranked among the top-k retrievals. mR represents the average of all R@k over both text-image retrieval and image-text retrieval. In our experiments, we report the results for k = 1, k = 5, and k = 10.</p>
        <p>[Table 1: Text-image and image-text retrieval results (R@1, R@5, R@10 and mR) on the RSICD dataset for VSE++ [3], SCAN [20], MTFN [21], CAMP [22], AMFMN [12], GaLR [4] and KCR with its sub-component ablations, using ResNet18 and ResNet101 backbones.]</p>
        <p>[Table 2: Text-image and image-text retrieval results (R@1, R@5, R@10 and mR) on the UCM-Caption dataset for VSE++ [3], SCAN [20], MTFN [21], CAMP [22], AMFMN [12] and KCR with its sub-component ablations, using ResNet18 and ResNet101 backbones.]</p>
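        <p>To make the metric concrete, a small sketch of the R@k and mR computation is shown below. It assumes the ground-truth item for query i sits at index i of the candidate set, which is a simplification of the five-captions-per-image setting.</p>
        <preformat>
# Illustrative R@k / mR computation, assuming query i matches candidate i.
import torch

def recall_at_k(scores, ks=(1, 5, 10)):
    """scores: (N_query, N_candidate) similarity matrix."""
    ranks = scores.argsort(dim=1, descending=True)        # best candidates first
    gt = torch.arange(scores.size(0)).view(-1, 1)
    gt_rank = (ranks == gt).int().argmax(dim=1)            # 0-based rank of the ground truth
    return {k: torch.lt(gt_rank, k).float().mean().item() * 100 for k in ks}

def mean_recall(scores_t2i, scores_i2t, ks=(1, 5, 10)):
    # mR is the mean of all R@k values over both retrieval directions.
    vals = list(recall_at_k(scores_t2i, ks).values()) + list(recall_at_k(scores_i2t, ks).values())
    return sum(vals) / len(vals)
        </preformat>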
        <p>Hyper-parameters. In all experiments, the margin of the triplet loss function is set to 0.2, following previous work [4]. For the image encoder, the input and intermediate dimensions of the self-attention block are set to 1024 and 512, respectively, according to [16]. In terms of the text encoder, the number of selected triplets k is set to 10 in our experiments. Other feature dimensions are annotated in Figure 1. In addition, to achieve a fair comparison with the competing methods, results with the ResNet18 backbone are also reported. For the ResNet18 backbone, the dimension of the mid-level feature is 256 and the other parameters are the same as for the model with the ResNet101 backbone.</p>
        <p>Implementation details. For the training process, we train and evaluate the model in mini-batches with a batch size of 100. The optimizer is Adam with a weight decay of 5e-4 and an initial learning rate of 0.001. Every 10 epochs, the learning rate drops by 10%. All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. The maximum number of training epochs is 150 and 200 for the UCM-Caption and RSICD datasets, respectively.</p>
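        <p>This optimization setup could be reproduced with a few lines such as the sketch below; the model object and the exact scheduler choice are placeholders used for illustration.</p>
        <preformat>
# Illustrative optimizer and learning-rate schedule matching Section 4.1
# (Adam, lr 0.001, weight decay 5e-4, 10% decay every 10 epochs); 'model' is a placeholder.
import torch

def build_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
    # Multiply the learning rate by 0.9 every 10 epochs (a 10% drop).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
    return optimizer, scheduler
        </preformat>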
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Experimental results</title>
        <p>Comparison methods. We compare the proposed
method with the following state-of-the-art methods in
text-image retrieval, especially those for remote sensing
text-image retrieval.</p>
        <p>• VSE++ [3] uses a CNN and a Gated Recurrent Unit (GRU) [26] to capture image and text features, respectively.
• SCAN [20] exploits fine-grained interplay between images and texts by inferring the semantic alignment between them.
• CAMP [22] proposes a cross-modal message passing method to explore the image-text interactions before calculating similarities.
• MTFN [21] introduces a rank-based fusion model to avoid finding the common embedding space for cross-modal data.
• AMFMN [12] employs a multi-scale visual self-attention module to extract the visual features and guide the text representation.
• GaLR [4] utilizes an attention-based multi-level information dynamic module to fuse global and local features extracted by a CNN and a Graph Neural Network (GCN), respectively. In addition, GaLR involves a post-processing stage based on a plug-and-play multivariate rerank algorithm.</p>
        <sec id="sec-3-2-1">
          <title>Text-Image Retrieval</title>
          <p>This is a sparse
residential area with a
villa surrounded by
lush plants</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Image-Text Retrieval</title>
          <p>Baseline
KCR
Baseline</p>
          <p>A busy intersection
with many cars.</p>
          <p>An intersection only
with some plants at
the corner.</p>
          <p>A busy intersection
only with some
houses and plants at
the corner.</p>
          <p>An intersection with
some plants at the
corner.</p>
          <p>An overpass with a
road go across
another two vertically.</p>
          <p>An intersection with
two roads vzertical to
each other.</p>
          <p>An intersection with
two roadszvertical to
each other.</p>
          <p>It is a parking lot with
many cars parked
neatly and some
parking spots are free.</p>
          <p>An intersection with
cars on the road.</p>
          <p>Some cars are on the
freeways.</p>
        <p>Results on the RSICD dataset (Table 1). KCR achieves the best performance with the exception of image-text retrieval, where GaLR is the best performing method. With the ResNet18 backbone, KCR outperforms GaLR on mR by 0.18%. In terms of text-image retrieval, the improvements are 1.15%, 2.83% and 3.99% for R@1, R@5, and R@10, respectively. For image-text retrieval, KCR achieves close performance compared to GaLR and outperforms the other competitors. Note that compared to GaLR, which has multiple image feature extractors and a post-processing stage, the structure of KCR is less conceptually heavy. Experimental results on the sub-component analysis of KCR (e.g. running the model without the knowledge embedding and the self-attention module) show that incorporating commonsense knowledge can extend the sentence content and alleviate the information gap, since the model performance is significantly improved. The combination of external knowledge brings an extra 0.77%, 2.33%, and 2.59% on the three metrics in text-image retrieval. For image-text retrieval, external knowledge improves the model performance by 1.83%, 0.19%, and 0.28% on R@1, R@5, and R@10, respectively. Removing the self-attention module and the mid-level feature degrades the results by 2.03% on mR, which indicates the importance of a representative mid-level image feature.</p>
        <p>Results on the UCM-Caption dataset (Table 2). For text-image retrieval, KCR significantly outperforms state-of-the-art methods on R@1, R@5, and R@10. With the ResNet18 backbone, KCR achieves the best performance, with an average improvement of 3.27% compared to AMFMN. For image-text retrieval, KCR gains 2.86% on both R@5 and R@10. The overall improvement on mR is 1.79%. As shown in the sub-component analysis, the self-attention block and the mid-level feature improve the model performance on mR by 1.79%. External knowledge improves the model performance, especially for image-text retrieval: the average improvements on the three metrics are 5.24%, 6.66%, and 4.28%, respectively, and mR gains a 3.46% increase thanks to the introduction of relevant knowledge from the knowledge graph. Meanwhile, compared with the RSICD dataset, the knowledge embedding brings a more obvious improvement on the UCM-Caption dataset, indicating that the information gap might be larger on the smaller dataset. Examples of the top-5 retrieval results of KCR and KCR w/o KG are shown in Figure 2. In addition, we observe that ResNet101 is slightly more effective than ResNet18, with observed improvements of 0.75% on average for the RSICD dataset and 1.27% on average for the UCM-Caption dataset.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>Retrieving remote sensing images from text queries is appealing but complex, since retrieval needs to be both visual and semantic. To address the information asymmetry between images and texts, we propose a Knowledge-aware Cross-modal Retrieval (KCR) method. By integrating relevant information from an external knowledge graph, the model enriches the text scope to better match texts and images. Despite its conceptual simplicity, KCR shows improved performance with respect to all competitors, which indicates potential generalization capabilities of the knowledge-aware method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref-1"><label>[1]</label><mixed-citation>W. Zhou, S. Newsam, C. Li, Z. Shao, PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval, ISPRS Journal of Photogrammetry and Remote Sensing 145 (2018) 197–209.</mixed-citation></ref>
      <ref id="ref-2"><label>[2]</label><mixed-citation>G. Hoxha, F. Melgani, B. Demir, Toward remote sensing image retrieval under a deep image captioning perspective, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13 (2020) 4462–4475.</mixed-citation></ref>
      <ref id="ref-3"><label>[3]</label><mixed-citation>F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, VSE++: Improving visual-semantic embeddings with hard negatives, arXiv preprint arXiv:1707.05612 (2017).</mixed-citation></ref>
      <ref id="ref-4"><label>[4]</label><mixed-citation>Z. Yuan, W. Zhang, C. Tian, X. Rong, Z. Zhang, H. Wang, K. Fu, X. Sun, Remote sensing cross-modal text-image retrieval based on global and local information, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–16.</mixed-citation></ref>
      <ref id="ref-5"><label>[5]</label><mixed-citation>K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, arXiv preprint arXiv:1607.06215 (2016).</mixed-citation></ref>
      <ref id="ref-6"><label>[6]</label><mixed-citation>F. Ilievski, P. Szekely, B. Zhang, CSKG: The commonsense knowledge graph, in: ESWC, 2021, pp. 680–696.</mixed-citation></ref>
      <ref id="ref-7"><label>[7]</label><mixed-citation>W. Yang, X. Wang, A. Farhadi, A. Gupta, R. Mottaghi, Visual semantic navigation using scene priors, arXiv preprint arXiv:1810.06543 (2018).</mixed-citation></ref>
      <ref id="ref-8"><label>[8]</label><mixed-citation>Y. Fang, K. Kuan, J. Lin, C. Tan, V. Chandrasekhar, Object detection meets knowledge graphs, in: IJCAI, 2017, pp. 1661–1667.</mixed-citation></ref>
      <ref id="ref-9"><label>[9]</label><mixed-citation>X. Lu, B. Wang, X. Zheng, X. Li, Exploring models and data for remote sensing image caption generation, IEEE Transactions on Geoscience and Remote Sensing 56 (2017) 2183–2195.</mixed-citation></ref>
      <ref id="ref-10"><label>[10]</label><mixed-citation>S. Lobry, D. Marcos, J. Murray, D. Tuia, RSVQA: Visual question answering for remote sensing data, IEEE Transactions on Geoscience and Remote Sensing 58 (2020) 8555–8566.</mixed-citation></ref>
      <ref id="ref-11"><label>[11]</label><mixed-citation>D. Tuia, R. Roscher, J. D. Wegner, N. Jacobs, X. Zhu, G. Camps-Valls, Toward a collective agenda on AI for earth science data analysis, IEEE Geoscience and Remote Sensing Magazine 9 (2021) 88–104.</mixed-citation></ref>
      <ref id="ref-12"><label>[12]</label><mixed-citation>Z. Yuan, W. Zhang, K. Fu, X. Li, C. Deng, H. Wang, X. Sun, Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval, arXiv preprint arXiv:2204.09868 (2022).</mixed-citation></ref>
      <ref id="ref-13"><label>[13]</label><mixed-citation>M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, Y. Choi, ATOMIC: An atlas of machine commonsense for if-then reasoning, in: AAAI, 2019, pp. 3027–3035.</mixed-citation></ref>
      <ref id="ref-14"><label>[14]</label><mixed-citation>R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in: AAAI, 2017, pp. 4444–4451.</mixed-citation></ref>
      <ref id="ref-15"><label>[15]</label><mixed-citation>Y. Li, D. Kong, Y. Zhang, Y. Tan, L. Chen, Robust deep alignment network with remote sensing knowledge graph for zero-shot and generalized zero-shot remote sensing image scene classification, ISPRS Journal of Photogrammetry and Remote Sensing 179 (2021) 145–158.</mixed-citation></ref>
      <ref id="ref-16"><label>[16]</label><mixed-citation>X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: CVPR, 2018, pp. 7794–7803.</mixed-citation></ref>
      <ref id="ref-17"><label>[17]</label><mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, volume 30, 2017.</mixed-citation></ref>
      <ref id="ref-18"><label>[18]</label><mixed-citation>K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.</mixed-citation></ref>
      <ref id="ref-19"><label>[19]</label><mixed-citation>N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: EMNLP, 2019, pp. 671–688.</mixed-citation></ref>
      <ref id="ref-20"><label>[20]</label><mixed-citation>K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: ECCV, 2018, pp. 201–216.</mixed-citation></ref>
      <ref id="ref-21"><label>[21]</label><mixed-citation>T. Wang, X. Xu, Y. Yang, A. Hanjalic, H. T. Shen, J. Song, Matching images and text with multi-modal tensor fusion and re-ranking, in: ACM MM, 2019, pp. 12–20.</mixed-citation></ref>
      <ref id="ref-22"><label>[22]</label><mixed-citation>Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, J. Shao, CAMP: Cross-modal adaptive message passing for text-image retrieval, in: CVPR, 2019, pp. 5764–5773.</mixed-citation></ref>
      <ref id="ref-23"><label>[23]</label><mixed-citation>B. Qu, X. Li, D. Tao, X. Lu, Deep semantic understanding of high resolution remote sensing image, in: CITS, 2016, pp. 1–5.</mixed-citation></ref>
      <ref id="ref-24"><label>[24]</label><mixed-citation>Y. Yang, S. Newsam, Bag-of-visual-words and spatial extensions for land-use classification, in: SIGSPATIAL, 2010, pp. 270–279.</mixed-citation></ref>
      <ref id="ref-25"><label>[25]</label><mixed-citation>X. Huang, Y. Peng, Deep cross-media knowledge transfer, in: CVPR, 2018, pp. 8837–8846.</mixed-citation></ref>
      <ref id="ref-26"><label>[26]</label><mixed-citation>K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259 (2014).</mixed-citation></ref>
    </ref-list>
  </back>
</article>