<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge-Aware Cross-Modal Text-Image Retrieval for Remote Sensing Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Li Mi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siran Li</string-name>
          <email>siran.li@epfl.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christel Chappuis</string-name>
          <email>christel.chappuis@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Devis Tuia</string-name>
          <email>devis.tuia@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Environmental Computational Science and Earth Observation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)</institution>
          ,
          <addr-line>1950 Sion</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Section of Electrical and Electronics Engineering, École Polytechnique Fédérale de Lausanne (EPFL)</institution>
          ,
          <addr-line>1015 Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Image-based retrieval in large Earth observation archives is difficult, because one needs to navigate across thousands of candidate matches with only the query image as a guide. By using text as a query language, the retrieval system gains in usability, but at the same time faces difficulties due to the diversity of visual signals that cannot be summarized by a short caption only. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Cross-modal Retrieval (KCR) method for remote sensing text-image retrieval. By mining relevant information from an external knowledge graph, KCR enriches the text scope available in the search query and alleviates the information gaps between texts and images for better matching. Experimental results on two commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method outperforms state-of-the-art methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-modal Retrieval</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Remote Sensing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent advances in satellite data acquisition and storage have led to a rapid development of remote sensing image archives. To explore them, image retrieval has received increasing attention [1, 2]. However, retrieving images using example images would limit the versatility of the retrieval system, since with the query image only, one cannot specify which elements are essential for the query or what the retrieval objective is. As a solution, text-image retrieval [3, 4] has been introduced to make the retrieval targets explicit in a semantic way. Text-image retrieval aims at recalling an image based on a text or, in reverse, retrieving a text according to an image. As a bridge between vision and language research, it provides a possibility to explore the growing amount of cross-modal remote sensing data.</p>
      <p>When regarding text as the query, the prospective retrieval system gains in usability among cross-modal data, but at the same time faces the problem of information asymmetry between texts and images [5]. When dealing with very high-resolution remote sensing images, the image content can be very diverse, hence it is difficult to summarize it comprehensively in natural language, especially with a short caption. On one hand, human captions can only describe the image from one or a few specific aspects, focusing on the most dominant information. For example, one image could receive the following caption text: There is a lake. Nevertheless, there might be trees and mountains around the lake which are ignored by humans or caption generators. In addition, different people will describe the image from subjective perspectives, resulting in a variety of text information for a single image, which may confuse the matching model. Therefore, strategies to handle lacunary captions, nuances and synonyms are needed for the task, and a balance between objectivity and completeness must be achieved.</p>
      <p>Knowledge graphs [6] present relationships and proximities among concepts through graph structures. By providing the experience and commonsense from human understanding, knowledge graphs have been recognized as effective prior knowledge in many vision-and-language research works [7, 8] to reveal commonsense and alleviate ambiguities. In this paper, we propose a Knowledge-aware Cross-modal Retrieval (KCR) method for remote sensing text-image retrieval. With the help of external knowledge graphs, KCR extends the text scope to obtain a more robust text representation. More specifically, taking the objects mentioned in a sentence as starting points, KCR proposes to mine the expanded nodes and edges in a knowledge graph and embed them as features to enrich those extracted from the text content alone. As such, KCR integrates commonsense knowledge and leads to competitive performance on two commonly used remote sensing text-image retrieval benchmarks.</p>
      <p>[Figure 1: Architecture of KCR. An image encoder and a knowledge-aware text encoder produce image and text embeddings, which are compared by a similarity measurement module.]</p>
      <sec id="sec-1-1">
        <title>Image</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Due to the emergence of multi-modal remote sensing data, vision-language research, such as image captioning [9], visual question answering [10], and cross-modal retrieval [4], has attracted increasing attention [11]. Recent advances in remote sensing text-image retrieval have mainly focused on: 1) learning a more representative image feature by fusing local and global image features [12] or features from different feature extractors [4]; 2) learning a distinguishable joint embedding space by fusing the cross-modal features and leveraging a ranking-based loss function [4]. Departing from previous efforts that base the retrieval on the image characteristics and the caption only, we propose to enrich the latter with a knowledge graph that extends the text content and alleviates ambiguities for a more robust text representation.</p>
      <p>Consisting of various nodes as concepts, knowledge graphs encode commonsense knowledge about the world [6, 13, 14]. By exploring knowledge graphs, vision-and-language research has been promoted thanks to the priors they provide for visual understanding [7, 8]. In remote sensing research, Li et al. [15] constructed a remote sensing knowledge graph to support zero-shot remote sensing image scene classification. Their efforts in exploiting a remote sensing knowledge graph for image understanding focused on using the graph embedding as an overall representation of an image. Different from this work, the proposed KCR explores the fine-grained object-level connections between nodes in the graph and words in sentences.</p>
    </sec>
    <sec id="sec-2a">
      <title>3. Knowledge-aware Cross-modal Retrieval Method</title>
      <p>The proposed text-image retrieval system comprises three main components: an image encoder, a text encoder and a similarity measurement module (Figure 1). The image encoder is designed to extract image features with a pre-trained feature extractor and a self-attention block. The text encoder embeds a sentence and its related external knowledge, extracted from a knowledge graph, into a joint feature space. Finally, the image and text features are both used within the similarity measurement module to compute the similarity score between text queries and candidate images, which are then ranked according to their relevance. The model can also be applied in reverse, where the best captions to summarize an image are retrieved.</p>
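      <p>As an illustration only, the following minimal PyTorch-style sketch shows how the three components could be composed at retrieval time. The function and module names (retrieve, image_encoder, text_encoder) are hypothetical placeholders, not the authors' released code.</p>
      <preformat>
# Hypothetical sketch of the KCR retrieval pipeline (not the official implementation).
import torch

def retrieve(query_texts, candidate_images, image_encoder, text_encoder, k=10):
    """Rank candidate images for each text query by similarity score."""
    with torch.no_grad():
        f_img = image_encoder(candidate_images)   # (N_img, d) overall image features
        f_text = text_encoder(query_texts)        # (N_txt, d) knowledge-aware text features
    # Similarity is the negative pairwise Euclidean distance (Section 3.3).
    scores = -torch.cdist(f_text, f_img, p=2)     # (N_txt, N_img)
    # A higher score means a smaller distance; keep the top-k images per query.
    return scores.topk(k, dim=1).indices
      </preformat>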
      <sec id="sec-2a-1">
        <title>3.1. Image encoder</title>
        <p>The image encoder is a pre-trained feature extractor with a self-attention block [16, 17]. Two sets of image features are extracted from the image encoder:</p>
        <p>High-level image feature. We use ResNet-101 [18] as a backbone and the last Fully Connected (FC) layer is retrained. For an image I, the output of the retrained FC layer is regarded as the high-level image feature f_high.</p>
        <p>Mid-level image feature. The output of ResNet block 3, denoted as f, is sent to an additional self-attention block to further capture the long-range dependencies among pixels and provide more detailed information at a relatively mid-level. The self-attention block can be defined as: f_out = softmax((W_q f)(W_k f)^T)(W_v f) + f, (1) where W_q, W_k and W_v are weight matrices of 1 × 1 convolutions to embed the feature. Followed by a 2D pooling layer and a flattening operation, the mid-level image feature is extracted as a vector: f_mid = pool2d(f_out). (2)</p>
        <p>To project the image feature and the text feature into the same dimension, the concatenated high-level and mid-level image features are sent to a final FC layer to obtain the overall image representation: f_img = FC(concat(f_high, f_mid)). (3)</p>
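        <p>For concreteness, a minimal PyTorch-style sketch of such an image encoder is given below. The layer choices, pooling operation and dimensions are illustrative assumptions (following the values reported in Section 4.1) rather than the exact configuration used by the authors.</p>
        <preformat>
# Illustrative image encoder: ResNet backbone + self-attention + FC projection.
import torch
import torch.nn as nn
import torchvision.models as models

class SelfAttentionBlock(nn.Module):
    def __init__(self, in_dim=1024, mid_dim=512):
        super().__init__()
        self.w_q = nn.Conv2d(in_dim, mid_dim, kernel_size=1)   # W_q: 1x1 conv
        self.w_k = nn.Conv2d(in_dim, mid_dim, kernel_size=1)   # W_k: 1x1 conv
        self.w_v = nn.Conv2d(in_dim, in_dim, kernel_size=1)    # W_v: 1x1 conv

    def forward(self, f):
        b, c, h, w = f.shape
        q = self.w_q(f).flatten(2).transpose(1, 2)              # (b, hw, mid)
        k = self.w_k(f).flatten(2)                              # (b, mid, hw)
        v = self.w_v(f).flatten(2).transpose(1, 2)              # (b, hw, c)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)           # (b, hw, hw)
        out = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return out + f                                          # residual connection, Eq. (1)

class ImageEncoder(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        resnet = models.resnet101(weights=None)  # pre-trained weights would be loaded in practice
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                                  resnet.layer1, resnet.layer2, resnet.layer3)  # up to block 3
        self.block4 = resnet.layer4
        self.avgpool = resnet.avgpool
        self.fc_high = nn.Linear(2048, out_dim)     # retrained last FC: f_high
        self.attn = SelfAttentionBlock(1024, 512)
        self.pool_mid = nn.AdaptiveAvgPool2d(1)     # assumed 2D pooling for f_mid
        self.fc_out = nn.Linear(out_dim + 1024, out_dim)

    def forward(self, x):
        f3 = self.stem(x)                                                   # block-3 feature map
        f_high = self.fc_high(self.avgpool(self.block4(f3)).flatten(1))     # high-level feature
        f_mid = self.pool_mid(self.attn(f3)).flatten(1)                     # Eq. (2)
        return self.fc_out(torch.cat([f_high, f_mid], dim=1))               # Eq. (3)
        </preformat>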
      <p>(3)
3.2. Knowledge-aware text encoder
Knowledge representation. For a sentence  with
 words:  = {1, 2, ..., } ( ≥
is used to separate every word and divide the
part-ofspeech (e.g. noun, verb, adjective, adverb, etc.) for them.</p>
      <p>Based on the part-of-speech tags, all the nouns can be
appended into a word list. Then we extract a sentence
graph  based on the word list only. The sentence</p>
      <p>1), a tokenizer
mote sensing knowledge graph [15], . More
specifically, the nouns in the word list are regarded as the initial
nodes. Starting from those nodes, all the one-step
neighbours with the connected edges in  are included in
. Note that the sentence graph is a directed graph,
which means the edge between two nodes is a one-way
relationship. In the sentence graph, each edge can be
represented as a relationship triplet (, , ), shown as
&lt; −  −  &gt;, which can be regarded
as a short sentence with three words. Mining all the edges
of the sentence graph might be redundant, so we decide
to randomly select  triplets from all the available ones.</p>
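        <p>This triplet mining step can be pictured with the short sketch below. The graph representation (a dictionary of outgoing edges) and the NLTK-based part-of-speech tagging are assumptions made for illustration, not necessarily the authors' exact tooling.</p>
        <preformat>
# Illustrative mining of knowledge triplets for one caption (assumed data structures).
import random
import nltk  # assumes the tokenizer and POS tagger resources are installed

def mine_triplets(sentence, kg_edges, k=10):
    """kg_edges: dict mapping a node to a list of (predicate, object) outgoing edges."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    nouns = [w.lower() for w, tag in tagged if tag.startswith("NN")]  # initial nodes
    triplets = []
    for s in nouns:
        for p, o in kg_edges.get(s, []):       # one-step neighbours in the directed graph
            triplets.append((s, p, o))         # relationship triplet (s, p, o)
    random.shuffle(triplets)
    return triplets[:k]                        # randomly keep k triplets
        </preformat>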
        <p>Text encoder. We use Sentence-Transformer [19] as the text encoder for the sentence feature f_sen, as well as for the external knowledge representation f_know. Sentence-Transformer is a modification of the pretrained BERT network using siamese and triplet network structures to derive semantically meaningful sentence embeddings. The encoding process of a sentence and the corresponding knowledge can be formulated as: f_sen = SenTrans(s{w_1, w_2, ..., w_n}), f_know = FC((1/k) Σ_i SenTrans(r_i{s_i, p_i, o_i})). (4)</p>
        <p>After being embedded in the feature space, the sentence representation and the external knowledge representation are concatenated and sent to a final FC layer to obtain the overall representation of the text: f_text = FC(concat(f_sen, f_know)). (5)</p>
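        <p>As a rough illustration of Eqs. (4) and (5), the following sketch uses the sentence-transformers library. The model name and feature dimensions are placeholders and may differ from those used in the paper.</p>
        <preformat>
# Illustrative knowledge-aware text encoding (Eqs. 4-5), under assumed dimensions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class KnowledgeAwareTextEncoder(nn.Module):
    def __init__(self, model_name="all-MiniLM-L6-v2", out_dim=512):
        super().__init__()
        self.sent_model = SentenceTransformer(model_name)   # SenTrans in Eq. (4)
        emb_dim = self.sent_model.get_sentence_embedding_dimension()
        self.fc_know = nn.Linear(emb_dim, emb_dim)           # FC over averaged triplet embeddings
        self.fc_text = nn.Linear(2 * emb_dim, out_dim)       # final FC in Eq. (5)

    def forward(self, sentence, triplets):
        f_sen = torch.as_tensor(self.sent_model.encode(sentence))               # sentence embedding
        # Each triplet (s, p, o) is treated as a three-word sentence and embedded.
        trip_emb = torch.as_tensor(self.sent_model.encode([" ".join(t) for t in triplets]))
        f_know = self.fc_know(trip_emb.mean(dim=0))                              # Eq. (4)
        return self.fc_text(torch.cat([f_sen, f_know], dim=0))                   # Eq. (5)
        </preformat>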
      <sec id="sec-2-1">
        <title>3.3. Similarity Measurement</title>
        <p>Similarity score. The similarity score S is defined as the negative pairwise Euclidean distance between two features: S = −dis(f_img, f_text). With a smaller distance to the query feature, the similarity score is larger and the target ranks higher.</p>
        <p>Loss function. Triplet loss is commonly used in the text-image retrieval task [3, 20, 4]. It constrains the similarity score of the matched image-text pairs to be larger than the similarity score of the unmatched ones by a margin. Meanwhile, previous research [3] discovered that using the hardest negative in a batch during training rather than all negative samples can boost performance. Therefore, the loss function can be formulated as: L(I, T) = max(0, α − S(I, T) + S(I, T′)) + max(0, α − S(I, T) + S(I′, T)), (6) where α is a margin parameter and image I and sentence T are the corresponding pair. Sentence T′ is the top-1 text retrieval result with query image I, and image I′ is the top-1 image retrieval result with query text T.</p>
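        <p>A compact sketch of this hardest-negative triplet loss, in the spirit of VSE++ [3], is given below. The batch layout (matching pairs on the diagonal of the score matrix) is an assumption of the example, not a statement about the authors' implementation.</p>
        <preformat>
# Illustrative hardest-negative triplet loss over a batch (Eq. 6), assuming the
# i-th image and i-th text in the batch form the matched pair.
import torch

def triplet_loss_hardest(f_img, f_text, margin=0.2):
    """f_img, f_text: (B, d) image and text features of matched pairs."""
    scores = -torch.cdist(f_img, f_text, p=2)            # S = -Euclidean distance, (B, B)
    pos = scores.diag()                                   # S(I, T) for matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg_text = scores.masked_fill(mask, float("-inf")).max(dim=1).values  # hardest T' per image
    neg_img = scores.masked_fill(mask, float("-inf")).max(dim=0).values   # hardest I' per text
    loss_t = torch.clamp(margin - pos + neg_text, min=0)  # first term of Eq. (6)
    loss_i = torch.clamp(margin - pos + neg_img, min=0)   # second term of Eq. (6)
    return (loss_t + loss_i).mean()
        </preformat>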
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Experimental details</title>
        <p>Datasets. We perform experiments on two commonly used RS text-image datasets: the RSICD dataset and the UCM-Captions dataset. The RSICD dataset [9] contains 10,921 images of size 224 × 224 pixels. The UCM-Captions dataset [23], which is based on the UC Merced Land Use dataset [24], contains remote sensing images categorized into 21 land use classes, with 100 samples for each class. For each sample in both datasets, there are 5 sentences describing the image content. We follow the train-test split of previous work [4], randomly selecting 80%, 10% and 10% of the dataset as the training set, validation set and test set, respectively.</p>
        <p>Metrics. To evaluate the model performance, we exploit the standard evaluation metrics in retrieval tasks, R@k and mR [3, 25]. For different values of k, R@k is the fraction of queries for which the most relevant item is ranked among the top-k retrievals. mR represents the average of all R@k over both text-image retrieval and image-text retrieval. In our experiments, we report the results for k = 1, k = 5, and k = 10.</p>
        <p>[Table 1: Text-image and image-text retrieval results (R@1, R@5, R@10 and mR) on the RSICD dataset for VSE++ [3], SCAN [20], MTFN [21], CAMP [22], AMFMN [12], GaLR [4] and KCR with its sub-component ablations, using ResNet18 and ResNet101 backbones.]</p>
        <p>[Table 2: Text-image and image-text retrieval results (R@1, R@5, R@10 and mR) on the UCM-Caption dataset for VSE++ [3], SCAN [20], MTFN [21], CAMP [22], AMFMN [12] and KCR with its sub-component ablations, using ResNet18 and ResNet101 backbones.]</p>
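        <p>To make the metric concrete, a small sketch of the R@k and mR computation is shown below. It assumes the ground-truth item for query i sits at index i of the candidate set, which is a simplification of the five-captions-per-image setting.</p>
        <preformat>
# Illustrative R@k / mR computation, assuming query i matches candidate i.
import torch

def recall_at_k(scores, ks=(1, 5, 10)):
    """scores: (N_query, N_candidate) similarity matrix."""
    ranks = scores.argsort(dim=1, descending=True)        # best candidates first
    gt = torch.arange(scores.size(0)).view(-1, 1)
    gt_rank = (ranks == gt).int().argmax(dim=1)            # 0-based rank of the ground truth
    return {k: torch.lt(gt_rank, k).float().mean().item() * 100 for k in ks}

def mean_recall(scores_t2i, scores_i2t, ks=(1, 5, 10)):
    # mR is the mean of all R@k values over both retrieval directions.
    vals = list(recall_at_k(scores_t2i, ks).values()) + list(recall_at_k(scores_i2t, ks).values())
    return sum(vals) / len(vals)
        </preformat>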
        <p>Hyper-parameters. In all experiments, the margin of the triplet loss function is set to 0.2, following previous work [4]. For the image encoder, the input and intermediate dimensions of the self-attention block are set to 1024 and 512, respectively, according to [16]. In terms of the text encoder, the number of selected triplets k is set to 10 in our experiments. Other feature dimensions are annotated in Figure 1. In addition, to achieve a fair comparison with the competing methods, results with the ResNet18 backbone are also reported. For the ResNet18 backbone, the dimension of the mid-level feature is 256 and the other parameters are the same as for the model with the ResNet101 backbone.</p>
        <p>Implementation details. For the training process, we train and evaluate the model in mini-batches with a batch size of 100. The optimizer is Adam with a weight decay of 5e-4 and an initial learning rate of 0.001. Every 10 epochs, the learning rate drops by 10%. All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. The maximum number of training epochs is 150 and 200 for the UCM-Caption and RSICD datasets, respectively.</p>
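        <p>This optimization setup could be reproduced with a few lines such as the sketch below; the model object and the exact scheduler choice are placeholders used for illustration.</p>
        <preformat>
# Illustrative optimizer and learning-rate schedule matching Section 4.1
# (Adam, lr 0.001, weight decay 5e-4, 10% decay every 10 epochs); 'model' is a placeholder.
import torch

def build_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
    # Multiply the learning rate by 0.9 every 10 epochs (a 10% drop).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
    return optimizer, scheduler
        </preformat>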
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Experimental results</title>
        <p>Comparison methods. We compare the proposed
method with the following state-of-the-art methods in
text-image retrieval, especially those for remote sensing
text-image retrieval.</p>
        <p>• VSE++ [3] uses a CNN and a Gated Recurrent Unit (GRU) [26] to capture image and text features, respectively.
• SCAN [20] exploits fine-grained interplay between images and texts by inferring the semantic alignment between them.
• CAMP [22] proposes a cross-modal message passing method to explore the image-text interactions before calculating similarities.
• MTFN [21] introduces a rank-based fusion model to avoid finding the common embedding space for cross-modal data.
• AMFMN [12] employs a multi-scale visual self-attention module to extract the visual features and guide the text representation.
• GaLR [4] utilizes an attention-based multi-level information dynamic module to fuse global and local features extracted by a CNN and a Graph Neural Network (GCN), respectively. In addition, GaLR involves a post-processing stage based on a plug-and-play multivariate rerank algorithm.</p>
        <sec id="sec-3-2-1">
          <title>Text-Image Retrieval</title>
          <p>This is a sparse
residential area with a
villa surrounded by
lush plants</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Image-Text Retrieval</title>
          <p>Baseline
KCR
Baseline</p>
          <p>A busy intersection
with many cars.</p>
          <p>An intersection only
with some plants at
the corner.</p>
          <p>A busy intersection
only with some
houses and plants at
the corner.</p>
          <p>An intersection with
some plants at the
corner.</p>
          <p>An overpass with a
road go across
another two vertically.</p>
          <p>An intersection with
two roads vzertical to
each other.</p>
          <p>An intersection with
two roadszvertical to
each other.</p>
          <p>It is a parking lot with
many cars parked
neatly and some
parking spots are free.</p>
          <p>An intersection with
cars on the road.</p>
          <p>Some cars are on the
freeways.</p>
        <p>Results on the RSICD dataset (Table 1). KCR achieves the best performance with the exception of image-text retrieval, where GaLR is the best performing method. With the ResNet18 backbone, KCR outperforms GaLR on mR by 0.18%. In terms of text-image retrieval, the improvements are 1.15%, 2.83% and 3.99% for R@1, R@5, and R@10, respectively. For image-text retrieval, KCR achieves close performance compared to GaLR and outperforms the other competitors. Note that compared to GaLR, which has multiple image feature extractors and a post-processing stage, the structure of KCR is less conceptually heavy. Experimental results on the sub-component analysis of KCR (e.g. running the model without the knowledge embedding and the self-attention module) show that incorporating commonsense knowledge can extend the sentence content and alleviate the information gap, since the model performance is significantly improved. The combination of external knowledge brings an extra 0.77%, 2.33%, and 2.59% on the three metrics in text-image retrieval. For image-text retrieval, external knowledge improves the model performance by 1.83%, 0.19%, and 0.28% on R@1, R@5, and R@10, respectively. Removing the self-attention module and the mid-level feature degrades the results by 2.03% on mR, which indicates the importance of a representative mid-level image feature.</p>
        <p>Results on the UCM-Caption dataset (Table 2). For text-image retrieval, KCR significantly outperforms state-of-the-art methods on R@1, R@5, and R@10. With the ResNet18 backbone, KCR achieves the best performance, with an average improvement of 3.27% compared to AMFMN. For image-text retrieval, KCR gains 2.86% on both R@5 and R@10. The overall improvement on mR is 1.79%. As shown in the sub-component analysis, the self-attention block and the mid-level feature improve the model performance on mR by 1.79%. External knowledge improves the model performance, especially for image-text retrieval: the average improvements on the three metrics are 5.24%, 6.66%, and 4.28%, respectively, and mR gains a 3.46% increase thanks to the introduction of relevant knowledge from the knowledge graph. Meanwhile, compared with the RSICD dataset, the knowledge embedding brings a more obvious improvement on the UCM-Caption dataset, indicating that the information gap might be larger on the smaller dataset. Examples of the top-5 retrieval results of KCR and KCR w/o KG are shown in Figure 2. In addition, we observe that ResNet101 is slightly more effective than ResNet18, with observed improvements of 0.75% on average for the RSICD dataset and 1.27% on average for the UCM-Caption dataset.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>Retrieving remote sensing images from text queries is appealing but complex, since retrieval needs to be both visual and semantic. To address the information asymmetry between images and texts, we propose a Knowledge-aware Cross-modal Retrieval (KCR) method. By integrating relevant information from an external knowledge graph, the model enriches the text scope to better match texts and images. Despite its conceptual simplicity, KCR shows improved performance with respect to all competitors, which indicates potential generalization capabilities of the knowledge-aware method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref-1"><label>[1]</label><mixed-citation>W. Zhou, S. Newsam, C. Li, Z. Shao, PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval, ISPRS Journal of Photogrammetry and Remote Sensing 145 (2018) 197–209.</mixed-citation></ref>
      <ref id="ref-2"><label>[2]</label><mixed-citation>G. Hoxha, F. Melgani, B. Demir, Toward remote sensing image retrieval under a deep image captioning perspective, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13 (2020) 4462–4475.</mixed-citation></ref>
      <ref id="ref-3"><label>[3]</label><mixed-citation>F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, VSE++: Improving visual-semantic embeddings with hard negatives, arXiv preprint arXiv:1707.05612 (2017).</mixed-citation></ref>
      <ref id="ref-4"><label>[4]</label><mixed-citation>Z. Yuan, W. Zhang, C. Tian, X. Rong, Z. Zhang, H. Wang, K. Fu, X. Sun, Remote sensing cross-modal text-image retrieval based on global and local information, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–16.</mixed-citation></ref>
      <ref id="ref-5"><label>[5]</label><mixed-citation>K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, arXiv preprint arXiv:1607.06215 (2016).</mixed-citation></ref>
      <ref id="ref-6"><label>[6]</label><mixed-citation>F. Ilievski, P. Szekely, B. Zhang, CSKG: The commonsense knowledge graph, in: ESWC, 2021, pp. 680–696.</mixed-citation></ref>
      <ref id="ref-7"><label>[7]</label><mixed-citation>W. Yang, X. Wang, A. Farhadi, A. Gupta, R. Mottaghi, Visual semantic navigation using scene priors, arXiv preprint arXiv:1810.06543 (2018).</mixed-citation></ref>
      <ref id="ref-8"><label>[8]</label><mixed-citation>Y. Fang, K. Kuan, J. Lin, C. Tan, V. Chandrasekhar, Object detection meets knowledge graphs, in: IJCAI, 2017, pp. 1661–1667.</mixed-citation></ref>
      <ref id="ref-9"><label>[9]</label><mixed-citation>X. Lu, B. Wang, X. Zheng, X. Li, Exploring models and data for remote sensing image caption generation, IEEE Transactions on Geoscience and Remote Sensing 56 (2017) 2183–2195.</mixed-citation></ref>
      <ref id="ref-10"><label>[10]</label><mixed-citation>S. Lobry, D. Marcos, J. Murray, D. Tuia, RSVQA: Visual question answering for remote sensing data, IEEE Transactions on Geoscience and Remote Sensing 58 (2020) 8555–8566.</mixed-citation></ref>
      <ref id="ref-11"><label>[11]</label><mixed-citation>D. Tuia, R. Roscher, J. D. Wegner, N. Jacobs, X. Zhu, G. Camps-Valls, Toward a collective agenda on AI for earth science data analysis, IEEE Geoscience and Remote Sensing Magazine 9 (2021) 88–104.</mixed-citation></ref>
      <ref id="ref-12"><label>[12]</label><mixed-citation>Z. Yuan, W. Zhang, K. Fu, X. Li, C. Deng, H. Wang, X. Sun, Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval, arXiv preprint arXiv:2204.09868 (2022).</mixed-citation></ref>
      <ref id="ref-13"><label>[13]</label><mixed-citation>M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, Y. Choi, ATOMIC: An atlas of machine commonsense for if-then reasoning, in: AAAI, 2019, pp. 3027–3035.</mixed-citation></ref>
      <ref id="ref-14"><label>[14]</label><mixed-citation>R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in: AAAI, 2017, pp. 4444–4451.</mixed-citation></ref>
      <ref id="ref-15"><label>[15]</label><mixed-citation>Y. Li, D. Kong, Y. Zhang, Y. Tan, L. Chen, Robust deep alignment network with remote sensing knowledge graph for zero-shot and generalized zero-shot remote sensing image scene classification, ISPRS Journal of Photogrammetry and Remote Sensing 179 (2021) 145–158.</mixed-citation></ref>
      <ref id="ref-16"><label>[16]</label><mixed-citation>X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: CVPR, 2018, pp. 7794–7803.</mixed-citation></ref>
      <ref id="ref-17"><label>[17]</label><mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, volume 30, 2017.</mixed-citation></ref>
      <ref id="ref-18"><label>[18]</label><mixed-citation>K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.</mixed-citation></ref>
      <ref id="ref-19"><label>[19]</label><mixed-citation>N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: EMNLP, 2019, pp. 671–688.</mixed-citation></ref>
      <ref id="ref-20"><label>[20]</label><mixed-citation>K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: ECCV, 2018, pp. 201–216.</mixed-citation></ref>
      <ref id="ref-21"><label>[21]</label><mixed-citation>T. Wang, X. Xu, Y. Yang, A. Hanjalic, H. T. Shen, J. Song, Matching images and text with multi-modal tensor fusion and re-ranking, in: ACM MM, 2019, pp. 12–20.</mixed-citation></ref>
      <ref id="ref-22"><label>[22]</label><mixed-citation>Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, J. Shao, CAMP: Cross-modal adaptive message passing for text-image retrieval, in: CVPR, 2019, pp. 5764–5773.</mixed-citation></ref>
      <ref id="ref-23"><label>[23]</label><mixed-citation>B. Qu, X. Li, D. Tao, X. Lu, Deep semantic understanding of high resolution remote sensing image, in: CITS, 2016, pp. 1–5.</mixed-citation></ref>
      <ref id="ref-24"><label>[24]</label><mixed-citation>Y. Yang, S. Newsam, Bag-of-visual-words and spatial extensions for land-use classification, in: SIGSPATIAL, 2010, pp. 270–279.</mixed-citation></ref>
      <ref id="ref-25"><label>[25]</label><mixed-citation>X. Huang, Y. Peng, Deep cross-media knowledge transfer, in: CVPR, 2018, pp. 8837–8846.</mixed-citation></ref>
      <ref id="ref-26"><label>[26]</label><mixed-citation>K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259 (2014).</mixed-citation></ref>
    </ref-list>
  </back>
</article>