<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Network for Bitemporal Remote Sensing Image Change Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shizhen Chang</string-name>
          <email>shizhen.chang@iarai.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kopp</string-name>
          <email>michael.kopp@iarai.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedram Ghamisi</string-name>
          <email>pedram.ghamisi@iarai.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Machine Learning Group</institution>
          ,
          <addr-line>09599 Freiberg</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Advanced Research in Artificial Intelligence (IARAI)</institution>
          ,
          <addr-line>1030 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The task of bitemporal change detection aims to identify the surface changes of specific scenes at two different points in time. In recent years, we have increasingly witnessed the success of deep learning in a variety of applications in remote sensing, including change detection and monitoring. In this paper, a novel deep feature retrieval neural network architecture for change detection is proposed that uses a trainable associative memory component to exploit potential similarities and connections of the deep features between image pairs. A key ingredient in our novel architecture is the use of a continuous modern Hopfield network component. The proposed method beats the current state of the art on the well-known LEVIR-CD data set. The code of this work will soon be available online (https://github.com/ShizhenChang). Keywords: remote sensing, change detection, modern Hopfield network, deep learning, Siamese network, convolutional neural network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rapid development of technologies for Earth observation, a large amount of very high resolution (VHR) remote sensing data has become available for geographical analysis and image processing [1]. VHR images can provide detailed information about land surfaces, and images collected at different time epochs from the same scene are able to record changes regularly. Therefore, as one of the most important remote sensing tasks, change detection has been widely applied in many areas of land-use and land-cover analysis, such as environmental monitoring, urban growth, deforestation assessment, shifting cultivation evaluation, and so on.</p>
      <p>A variety of deep neural networks, such as the convolutional neural network (CNN) [3], recurrent neural networks [4], generative adversarial networks (GANs) [5], and deep belief networks (DBNs) [6], have been successfully utilized for remote sensing change detection over the last few years. Among them, CNN-based methods can make full use of the spatial information of VHR remote sensing images and, thus, can better extract high-level deep features and abstract semantic contents to learn discriminative differences between the periods.</p>
      <p>Strategies that have been applied to extract deep features of the inputs can be broadly divided into two categories: early fusion [7, 8] and late fusion [9] networks.</p>
      <p>The early-fusion networks first concatenate the multitemporal images into a unified data cube and then extract deep features from the fused input, whereas the late-fusion networks usually learn single-temporal features individually and share the parameters by using a Siamese network. Compared to early-fusion networks, late-fusion methods can better utilize the features of the inputs and return clearer contours of the changed objects. However, the features of the shallower layers may not be sufficiently learned and utilized due to the gradient vanishing problem. Therefore, learning information from both shallow and deep layers is very important to effectively detect changes using deep-learning-based approaches.</p>
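      <p>As a minimal illustration of the two strategies above, the following sketch contrasts early fusion with a weight-sharing Siamese late fusion. A toy linear map stands in for a real CNN backbone; all names and sizes are illustrative and not taken from the paper.</p>

```python
import numpy as np

def extract(x, W):
    """Stand-in feature extractor (one linear map + ReLU); a real
    network would be a deep CNN such as VGG-16."""
    return np.maximum(0.0, W @ x)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(6)   # flattened time-1 image (toy size)
x2 = rng.standard_normal(6)   # flattened time-2 image

# Early fusion: concatenate the two inputs into one cube, extract once.
W_early = rng.standard_normal((4, 12))
feat_early = extract(np.concatenate([x1, x2]), W_early)

# Late fusion (Siamese): one shared set of weights, each input is
# encoded separately, then fused, e.g., by feature difference.
W_shared = rng.standard_normal((4, 6))
feat_late = extract(x1, W_shared) - extract(x2, W_shared)
```

      <p>The shared-weight path is what a Siamese network does; the choice of difference versus concatenation as the fusion step corresponds to the FC-Siam-dif and FC-Siam-conc variants discussed later.</p>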
      <p>In order to accurately extract features, deeper and more complex networks include architecture components such as Long Short-Term Memory (LSTM) [10] and attention mechanisms (self-attention [11], spatial attention [12], and channel attention [8]). The successful combination of CNNs and other networks has shown that discriminative features within the image pairs can be better extracted and the detection accuracy can be greatly improved. However, limited by the architecture of CNNs, as the high-level features are only related to the shallower layers through larger receptive fields, the global and temporal information between the image pairs is still not sufficiently utilized.</p>
      <p>To address this issue, we design a Hopfield pooling block to interactively retrieve the high-level concepts of changes. This idea is inspired by the successful application of the modern Hopfield network for continuous pattern retrieval [13]. Our assumption is that the semantic information between the image pairs in deeper layers can be represented using a common matrix, i.e., a query, that can be learned during the training process. We use this query to retrieve related semantic features between the given images. These retrieved features reflect a common spatio-temporal context and are used by subsequent layers in our network. Concretely, we incorporate a Hopfield network block into a Siamese fully convolutional network (FCN), resulting in the design of our proposed deep feature retrieved network (FrNet) for bitemporal remote sensing change detection. It should be noted that, different from previous change detection models, both semantic and temporal information can be fully considered; this is our first attempt at using modern Hopfield networks in the remote sensing community.</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>Fig. 1. The Hopfield layer retrieves patterns for the queries, each being a linear combination of stored patterns lying in the convex hull of the simplex spanned by the stored patterns. (a) The Hopfield layer associates two sets R and Y to propagate sets of vectors. (b) The Hopfield pooling layer performs a pooling operation on the set Y via learned queries. (c) The Hopfield layer learns a new set of stored patterns based on the input R.</p>
      <p>The rest of this article is organized as follows. Section 2 briefly reviews continuous modern Hopfield networks. Section 3 describes the proposed method. Experiments are conducted and discussed in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Continuous Modern Hopfield Network</title>
      <p>Binary modern Hopfield networks are associative memories on binary data that can retrieve exponentially many stored patterns [14, 15], this being the key distinguishing feature to their classical binary counterparts [16, 17]. These binary modern Hopfield networks have been generalized to continuous modern Hopfield networks that, crucially, are differentiable and can thus be embedded in deep learning architectures trained by gradient descent [13, 18]. Moreover, continuous modern Hopfield networks retain the key ability to store exponentially many patterns, and they can furthermore retrieve patterns in only one update step.</p>
      <p>Given a matrix X of shape d × N formed of column vectors {x_1, ..., x_N} ⊂ ℝ^d, a query pattern ξ ∈ ℝ^d, also a column vector, seeks to retrieve the best pattern in the convex hull of the simplex spanned by the {x_1, ..., x_N}, such that the following energy function is minimized:</p>
      <p>E = −β^{−1} log(∑_{i=1}^{N} exp(β x_i^⊤ ξ)) + (1/2) ξ^⊤ ξ + β^{−1} log N + (1/2) M²,</p>
      <p>where M is the largest norm of the {x_1, ..., x_N} in ℝ^d. As shown in [13, 18], the retrieval is defined by the following update rule:</p>
      <p>ξ^new = f(ξ; X, β) = X softmax(β X^⊤ ξ), (1)</p>
      <p>which converges globally, almost always, to a local minimum of the energy function in essentially one update step. Moreover, equation (1) is closely related to the well-known transformer attention mechanism, showing that retrieval in modern Hopfield networks and transformer attention coincide [13, 18].</p>
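      <p>A minimal numerical sketch of the update rule (1), assuming the notation X for the stored patterns, ξ for the query, and β for the inverse temperature; the function names are our own, not from the paper:</p>

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_retrieve(X, xi, beta):
    """One step of the continuous modern Hopfield update rule (Eq. 1):
    xi_new = X softmax(beta * X^T xi).
    X has shape (d, N), N stored patterns as columns; xi has shape (d,)."""
    return X @ softmax(beta * (X.T @ xi))

# Toy example: store three patterns and retrieve from a noisy query.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # d = 8, N = 3 stored patterns
query = X[:, 0] + 0.1 * rng.standard_normal(8)
retrieved = hopfield_retrieve(X, query, beta=8.0)
# With a sufficiently large beta, the single update sharpens the query
# toward the nearest stored pattern (here, the first column of X).
```

      <p>Replacing the single query vector with a matrix of queries turns this retrieval into exactly the transformer-style attention mentioned above.</p>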
      <p>With changeable structures in deeper networks (as shown in Fig. 1), continuous modern Hopfield networks have great application prospects in deep learning. They have been successfully applied to solving large-scale multi-instance learning tasks [19], to few- and zero-shot chemical reaction template prediction [20], to creating new reinforcement learning algorithms [21, 22], to improving contrastive learning of joint image and text embedding representations [23], and to tabular data [24].</p>
      <p>Inspired by continuous modern Hopfield networks,
we design a Siamese Hopfield pooling layer and attempt
to capture deep feature diferences for remote sensing
bitemporal change detection.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Deep Feature Retrieved Network for Change Detection</title>
      <sec id="sec-4-1">
        <title>3.1. Overview</title>
        <p>As shown in Fig. 2, the proposed deep feature retrieved network (FrNet) is a Siamese network that contains three main parts.</p>
        <p>Fig. 2 legend: change map; backbone blocks; 1 × 1 convolution; upsampling; matrix difference; matrix multiplication; concatenation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Hopfield Pooling Block</title>
        <p>The Hopfield layer is proven to be capable of retrieving key features of the input through one update. For the proposed bitemporal change detection task, the question is: "how can we obtain the most typical information of the changes between the image pairs?"</p>
        <p>Let us assume two temporal VHR images are denoted by X_t ∈ ℝ^{3×h×w}, where t = {1, 2} represents the t-th time period and h and w are the height and width of the images, respectively. Features obtained by the backbone are denoted as F_t ∈ ℝ^{c×h̃×w̃}, where c, h̃, and w̃ represent the channel size, height, and width of the features, respectively. For the proposed VGG-16 feature extractor, the channel size c of F_t is 512, and the height and width of the features are 1/32 of those of the original image.</p>
        <p>In the Hopfield pooling block, the features F_t are first reshaped into matrices of row-wise vectors in ℝ^{c×h̃w̃}. Then, for the time-1 image, we introduce a trainable weight matrix Q ∈ ℝ^{n×h̃w̃} to retrieve the related deep features of F_1 related to the 2nd period. The output can be written as:</p>
        <p>Z_1 = softmax(Q F_1^⊤) F_2. (2)</p>
        <p>The number of rows n in Q is set to 2 in this paper, which represents the change/unchange semantic information we retrieve.</p>
        <p>Similarly, the common weight matrix Q is utilized to retrieve F_2 related to the 1st period:</p>
        <p>Z_2 = softmax(Q F_2^⊤) F_1. (3)</p>
        <p>It should be noted that the retrieved outputs Z_1 and Z_2 have the same size and contain both global and temporal information of the image pairs.</p>
        <p>We concatenate the retrieved outputs together, Z = [Z_1; Z_2], restore their spatial dimensions, and feed them into a 1 × 1 2D convolutional layer with 16 filters to generate a new feature map. After bilinear interpolation, the output features of the Hopfield pooling block are finally derived.</p>
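        <p>Under the matrix shapes reconstructed above (symbols Q, F_t, Z_t; all sizes are illustrative stand-ins, not the paper's exact configuration), the retrieval steps (2) and (3) can be sketched in NumPy as follows:</p>

```python
import numpy as np

def softmax_rows(z):
    # Row-wise numerically stable softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Backbone features F1, F2 reshaped to (c, hw); a shared trainable
# query matrix Q of shape (n, hw) with n = 2 change/unchange rows.
c, hw, n = 512, 64, 2            # hw = h~ * w~, e.g., an 8 x 8 feature map
rng = np.random.default_rng(1)
F1 = rng.standard_normal((c, hw))
F2 = rng.standard_normal((c, hw))
Q = rng.standard_normal((n, hw)) # learned by gradient descent in the model

Z1 = softmax_rows(Q @ F1.T) @ F2   # Eq. (2): query F1, read out from F2
Z2 = softmax_rows(Q @ F2.T) @ F1   # Eq. (3): query F2, read out from F1
Z = np.concatenate([Z1, Z2], axis=0)  # concatenated retrieval, shape (2n, hw)
```

        <p>Because the attention weights are computed against one period while the values come from the other, each Z_t mixes global context from both acquisition times, which is what the subsequent 1 × 1 convolution and upsampling consume.</p>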
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Data Set</title>
        <p>In the experimental part, the LEVIR-CD data set [26] is utilized to compare the change detection methods. The LEVIR-CD data set is composed of 637 VHR (0.5 m/pixel) Google Earth (GE) image pairs with a size of 1024 × 1024 pixels. These image pairs were captured in different periods of 5 to 14 years and cover a total of 31,333 individual buildings for the task of building growth assessment. With the ratio of 7:1:2, these image pairs are split into the training set, validation set, and testing set. Following the initial settings, we crop each image into 16 non-overlapping small patches with a size of 256 × 256 pixels. Thus, there are a total of 7120 image pairs for training, 1024 for validation, and 2048 for testing.</p>
        <p>Precision = TP / (TP + FP), (5)</p>
        <p>Recall = TP / (TP + FN), (6)</p>
        <p>F1 = 2 · Precision · Recall / (Precision + Recall), (7)</p>
        <p>OA = (TP + TN) / (TP + TN + FP + FN), (8)</p>
        <p>where TP (True Positive) represents the number of pixels of real changes that are correctly detected, FP (False Positive) represents the number of pixels of unchanged objects that are falsely detected as changed objects, TN (True Negative) denotes the number of pixels of unchanged objects that are correctly regarded as non-change, and FN (False Negative) denotes the number of changed pixels that are not detected as changed objects.</p>
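        <p>The four metrics (5)-(8) can be computed from a pair of binary change maps as in the following sketch; the function name and the toy data are our own, not from the paper:</p>

```python
import numpy as np

def change_metrics(pred, gt):
    """Pixel-wise precision, recall, F1, and OA (Eqs. 5-8) for binary
    change maps, where 1 marks changed pixels and 0 unchanged ones."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    tp = np.sum(pred & gt)     # changed pixels correctly detected
    fp = np.sum(pred & ~gt)    # unchanged pixels flagged as changed
    tn = np.sum(~pred & ~gt)   # unchanged pixels correctly rejected
    fn = np.sum(~pred & gt)    # changed pixels that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, oa

# Toy 2x4 change maps: 2 TPs, 1 FP, 1 FN, 4 TNs.
pred = [[1, 1, 1, 0], [0, 0, 0, 0]]
gt   = [[1, 1, 0, 1], [0, 0, 0, 0]]
p, r, f1, oa = change_metrics(pred, gt)
# p = 2/3, r = 2/3, f1 = 2/3, oa = 6/8
```

        <p>Note that OA is dominated by the abundant unchanged pixels, which is why F1 is the more discriminative score on LEVIR-CD.</p>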
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Comparative Method and Evaluation Metrics</title>
        <p>To verify the effectiveness of the proposed FrNet method, four representative deep-learning-based change detection networks are taken into consideration. The FC-EF [7] is an early-fusion method based on U-Net that concatenates the bitemporal image pairs as the input. Its extended versions, the FC-Siam-dif and FC-Siam-conc [7], use Siamese networks with shared weights to extract multi-level features and use feature difference and concatenation, respectively, to fuse bitemporal information. The bitemporal image transformer (BIT) network [12] designs a context-information-based enhancer to extract related concepts in the token-based space-time and projects the context-rich tokens back to the original features for prediction.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Experimental Results and Analysis</title>
        <p>In our experiments, the proposed FrNet is implemented on the PyTorch platform using a single NVIDIA A100 GPU (with 40-GB RAM). During the training stage, the Adam optimizer with a weight decay of 1e−5 was employed. The batch size is set to 32, and the learning rate is initially set to 1e−4 and linearly reduced to 0 over 50,000 iterations. The β of the Hopfield layer is set to 1/√d.</p>
        <p>The quantitative results for the precision, recall, F1 score, and OA of all models are summarized in Table 1. It can be found that FC-EF obtains the lowest F1 score (75.25%) and OA (96.78%) among all the models. The FC-Siam-conc and FC-Siam-dif perform slightly better than FC-EF, which indicates that the Siamese network and feature difference/concatenation have benefits for the preservation of useful information. The F1 score and OA of the BIT model are 83.22% and 98.06%, respectively, better than those of the other FC-based models. This demonstrates that the tokens in space-time can effectively capture the temporal changes and enhance the context information. The proposed FrNet achieves the highest F1 and OA among all the studied methods and performs better than our base model. The improvements prove that the Hopfield layer helps retrieve the deep features and that the shared query matrix can learn important information as part of the inputs for the decoder.</p>
        <p>Fig. 3 illustrates change detection maps obtained by different methods, where TPs, TNs, FPs, and FNs are represented in yellow, black, red, and green, respectively. We can observe that FrNet achieves the best results among all the models. Firstly, FrNet can better distinguish small-sized changed buildings that have relatively regular shapes by reducing false alarms compared with the other methods (e.g., the 1st, 2nd, and 3rd rows of Fig. 3). When the shapes of the buildings are complex, our model can also preserve the boundaries of the objects (e.g., the 4th, 5th, and 6th rows of Fig. 3).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>Inspired by the successful application of continuous modern Hopfield networks for pattern retrieval, we propose a deep feature retrieved network (FrNet) for bitemporal change detection. Our Hopfield pooling block introduces a trainable weight matrix that aims to retrieve the global changes of interest for high-level features and capture the discriminative representations of one period related to the other. To evaluate the effectiveness of the proposed model, experiments are conducted on the LEVIR-CD data set. Our empirical evidence confirms the superiority of the proposed FrNet in comparison with other state-of-the-art methods.</p>
      <p>Acknowledgments: The authors would like to thank the contributors of the LEVIR-CD data set for making it publicly available, and the authors of the FC-EF, FC-Siam-conc, FC-Siam-dif, and BIT methods for releasing their codes.</p>
      <p>2205.12258.
[22] M. Widrich, M. Hofmarcher, V. P. Patil, A. Bitto-Nemling, S. Hochreiter, Modern Hopfield networks for return decomposition for delayed rewards, in: Deep RL Workshop NeurIPS 2021, 2021. URL: https://openreview.net/forum?id=t0PQSDcqAiy.
[23] A. Fürst, E. Rumetshofer, J. Lehner, V. Tran, F. Tang, H. Ramsauer, D. Kreil, M. Kopp, G. Klambauer, A. Bitto-Nemling, S. Hochreiter, CLOOB: Modern Hopfield networks with InfoLOOB outperform CLIP, 2021. URL: https://arxiv.org/abs/2110.11316. doi:10.48550/ARXIV.2110.11316.
[24] B. Schäfl, L. Gruber, A. Bitto-Nemling, S. Hochreiter, Hopular: Modern Hopfield networks for tabular data, 2022. URL: https://openreview.net/forum?id=3zJVXU311-Q.
[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[26] H. Chen, Z. Shi, A spatial-temporal attention-based method and a new dataset for remote sensing image change detection, Remote Sensing 12 (2020) 1662.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>