A Deep Feature Retrieved Network for Bitemporal Remote
Sensing Image Change Detection
Shizhen Chang¹, Michael Kopp¹ and Pedram Ghamisi¹,²
¹ Institute of Advanced Research in Artificial Intelligence (IARAI), 1030 Vienna, Austria
² Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Machine Learning Group, 09599 Freiberg, Germany


                                       Abstract
                                       The task of bitemporal change detection aims to identify the surface changes of specific scenes at two different points in
                                       time. In recent years, we have increasingly witnessed the success of deep learning in a variety of applications in remote
                                       sensing, including change detection and monitoring. In this paper, a novel deep feature retrieval neural network architecture
                                       for change detection is proposed that uses a trainable associative memory component to exploit potential similarities and
                                       connections of the deep features between image pairs. A key ingredient in our novel architecture is the use of a continuous
                                       modern Hopfield network component. On the well-known LEVIR-CD data set, the proposed method achieves a higher
                                       F1 score and overall accuracy than the compared state-of-the-art methods. The code of this work will soon be
                                       available online (https://github.com/ShizhenChang).

                                       Keywords
                                       Remote sensing, change detection, modern Hopfield network, deep learning, Siamese network, convolutional neural network.


CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
Email: shizhen.chang@iarai.ac.at (S. Chang); michael.kopp@iarai.ac.at (M. Kopp); pedram.ghamisi@iarai.ac.at (P. Ghamisi)
ORCID: 0000-0002-9785-7937 (S. Chang); 0000-0002-1385-1109 (M. Kopp); 0000-0003-1203-741X (P. Ghamisi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

With the rapid development of technologies for Earth observation, an ever-growing amount of very
high resolution (VHR) remote sensing data has become available for geographical analysis and image
processing [1]. VHR images provide detailed information about land surfaces, and images of the same
scene collected at different time epochs can record changes regularly. Therefore, as one of the
most important remote sensing tasks, change detection has been widely applied in many areas of
land-use and land-cover analysis, such as environmental monitoring, urban growth, deforestation
assessment, and shifting cultivation evaluation.

A variety of deep neural networks, such as convolutional neural networks (CNNs) [2], autoencoders
(AEs) [3], recurrent neural networks [4], generative adversarial networks (GANs) [5], and deep
belief networks (DBNs) [6], have been successfully utilized for remote sensing change detection
over the last few years. Among them, CNN-based methods can make full use of the spatial information
of VHR remote sensing images and can thus better extract high-level deep features and abstract
semantic content to learn discriminative differences between the periods.

Strategies that have been applied to extract deep features of the inputs can be broadly divided
into two categories: early-fusion [7, 8] and late-fusion [9] networks. Early-fusion networks first
concatenate multitemporal images into a unified data cube, and then the parameters are
hierarchically fine-tuned. Late-fusion networks usually learn single-temporal features individually
and share the parameters by using a Siamese network. Compared to early-fusion networks, late-fusion
methods can better utilize the features of the inputs and return clearer contours of the changed
objects. However, the features of shallower layers may not be sufficiently learned and utilized due
to the vanishing gradient problem. Therefore, learning information from both shallow and deep
layers is very important for effectively detecting changes with deep-learning-based approaches.

In order to accurately extract features, deeper and more complex CNN-based networks have been
designed that include architecture components such as Long Short-Term Memory (LSTM) [10] and
attention mechanisms (self-attention [11], spatial attention [12], and channel attention [8]). The
successful combination of CNNs and other networks has shown that discriminative features within the
image pairs can be better extracted and the detection accuracy can be greatly improved. However,
limited by the architecture of CNNs, as the high-level features are only related to the shallower
layers through larger receptive fields, the global and temporal information between the image pairs
is still not sufficiently utilized.

To address this issue, we design a Hopfield pooling block to interactively retrieve the high-level
concepts of changes. This idea is inspired by the successful application of the modern Hopfield
network for continuous pattern retrieval [13]. Our assumption is that the semantic information
between the image pairs in deeper layers

Figure 1: A brief illustration of three types of Hopfield layers for deep learning [13], where both the stored patterns Y and
the query patterns R can be obtained from the previous layers or the input, or can be learned. The outputs Z are the retrieved
patterns for the queries, each being a linear combination of stored patterns lying in the convex hull of the simplex spanned by
the stored patterns. (a) The Hopfield layer associates two sets R and Y to propagate sets of vectors. (b) The Hopfield pooling
layer performs a pooling operation on the set Y via learned queries. (c) The Hopfield layer learns a new set of stored patterns
based on the input R.
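
The retrieval step shared by the three layers in Fig. 1 can be illustrated with a small, dependency-free Python sketch. It assumes the patterns are plain lists of floats and uses made-up toy vectors; the function names `softmax` and `hopfield_retrieve` are illustrative and are not taken from [13] or from the authors' code. Each output is a softmax-weighted average of the stored patterns Y, which is why it lies in their convex hull, and the inverse temperature `beta` controls how sharply a single pattern is selected.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(queries, stored, beta):
    """Return one retrieved pattern per query.

    queries: list of query vectors (the set R in Fig. 1)
    stored:  list of stored pattern vectors (the set Y in Fig. 1)
    Each output equals sum_j p_j * stored[j], where
    p = softmax(beta * <query, stored[j]>), i.e. a convex combination of Y.
    """
    dim = len(stored[0])
    outputs = []
    for q in queries:
        scores = [beta * sum(a * b for a, b in zip(q, y)) for y in stored]
        weights = softmax(scores)
        outputs.append(
            [sum(w * y[k] for w, y in zip(weights, stored)) for k in range(dim)]
        )
    return outputs


# Toy data, made up for illustration: three stored patterns in R^2.
Y = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
R = [[0.9, 0.1]]  # a noisy version of the first stored pattern

Z_sharp = hopfield_retrieve(R, Y, beta=20.0)  # large beta: nearly one pattern
Z_soft = hopfield_retrieve(R, Y, beta=0.1)    # small beta: nearly the mean of Y
```

With a large `beta` the query is mapped almost exactly onto the closest stored pattern, whereas a small `beta` returns something close to the average of all stored patterns.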


can be represented using a common matrix, i.e., a query, that can be learned during the training
process. We use this query to retrieve related semantic features between given images. These
retrieved features reflect a common spatio-temporal context and are used by subsequent layers in
our network. Concretely, we incorporate a Hopfield network block into a Siamese fully convolutional
network (FCN), resulting in the design of our proposed deep feature retrieved network (FrNet) for
bitemporal remote sensing change detection. It should be noted that, different from previous change
detection models, both semantic and temporal information can be fully considered, and that this is
our first attempt at using modern Hopfield networks in the remote sensing community.

The rest of this article is organized as follows. Section 2 briefly reviews continuous modern
Hopfield networks. Section 3 describes the proposed method. Experiments are conducted and discussed
in Section 4.

2. Continuous Modern Hopfield Network

Binary modern Hopfield networks are associative memories on binary data that can retrieve data of
exponentially many stored patterns [14, 15], which is the key feature distinguishing them from
their classical binary counterparts [16, 17]. These binary modern Hopfield networks have been
generalized to continuous modern Hopfield networks that, crucially, are differentiable and can thus
be embedded in deep learning architectures trained by gradient descent [13, 18]. Moreover,
continuous modern Hopfield networks retain the key ability to store exponentially many patterns,
and they can furthermore retrieve patterns in only one update step.

Given a matrix $X$ of shape $d \times N$ formed of column vectors $\{x_1, ..., x_N\} \in \mathbb{R}^d$,
a query pattern $\xi$, also a column vector, seeks to retrieve the best pattern in the convex hull
of the simplex spanned by the $\{x_1, ..., x_N\}$, such that the following energy function is
minimized:

    $$E = -\beta^{-1} \log\left(\sum_{i=1}^{N} \exp\left(\beta x_i^{\top} \xi\right)\right) + \beta^{-1} \log N + \frac{1}{2} \xi^{\top} \xi + \frac{1}{2} M^{2},$$

where $M$ is the largest norm of the $\{x_1, ..., x_N\}$ in $\mathbb{R}^d$. As shown in [13, 18],
$\xi^{\mathrm{new}}$ is defined by the following update rule:

    $$\xi^{\mathrm{new}} = f(\xi; X, \beta) = X \, \mathrm{softmax}(\beta X^{\top} \xi), \tag{1}$$

which converges globally, almost always, to a local minimum of the energy function in essentially
one update step. Moreover, equation (1) is closely related to the well-known transformer attention
mechanism, showing that retrieval in modern Hopfield networks and transformer attention coincide
[13, 18].

With changeable structures in deeper networks (as shown in Fig. 1), continuous modern Hopfield
networks have greater application prospects in deep learning. They have been successfully applied
to solving large-scale multi-instance learning tasks [19], to few- and zero-shot chemical reaction
template prediction [20], to creating new reinforcement learning algorithms [21, 22], to improving
contrastive learning of joint image and text embedding representations [23], and to tabular data
[24].

Inspired by continuous modern Hopfield networks, we design a Siamese Hopfield pooling layer and
attempt to capture deep feature differences for remote sensing bitemporal change detection.

3. Deep Feature Retrieved Network for Change Detection

3.1. Overview

As shown in Fig. 2, the proposed deep feature retrieved network (FrNet) is a Siamese network that
contains three
[Figure 2 diagram. Recoverable labels: T1 Image and T2 Image inputs of size ℎ × 𝑤; backbone blocks with shared weights producing feature maps with 64 to 512 channels at scales down to ℎ/32 × 𝑤/32; Reshape to ℎ𝑤/32² × 512; Query of size 2 × ℎ𝑤/32²; Softmax; decoder ending in 2 channels that give the Change Map. Legend: backbone blocks, 1×1 convolution, upsampling, matrix difference, matrix multiplication, concatenation (C).]


Figure 2: Flowchart of the proposed FrNet.
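
The path through the Hopfield pooling block in Fig. 2 (reshape, query, softmax, matrix multiplication, concatenation) can be traced with the following shape-level sketch of Eqs. (2) and (3). It is not the authors' implementation: it uses tiny toy sizes and random numbers in place of VGG-16 features and a trained query, the helper names are hypothetical, and the 1×1 convolution and upsampling of Eq. (4) are left out. For the product of W_Q and the transposed feature matrix in Eq. (2) to be defined, the sketch holds each feature map as a matrix with one row per channel and one column per spatial position; that layout is an assumption made here for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def hopfield_pool(w_q, f_key, f_value, beta):
    """Z = softmax(beta * W_Q F_key^T) F_value, following Eqs. (2) and (3).

    w_q:     (c_q, n) query matrix, n = number of spatial positions
    f_key:   (c, n) features used to compute the association weights
    f_value: (c, n) features of the other period that are pooled
    returns: (c_q, n) retrieved features
    """
    scores = matmul(w_q, transpose(f_key))  # (c_q, c)
    scores = [[beta * v for v in row] for row in scores]
    return matmul(softmax_rows(scores), f_value)


random.seed(0)
c, n, c_q = 8, 6, 2  # toy sizes; the paper uses c = 512 and c_q = 2
beta = 1.0 / math.sqrt(c_q)

# Random stand-ins for the backbone features of the two periods.
F1 = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(c)]
F2 = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(c)]
# The query is trainable in the paper; here it is simply random.
W_Q = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(c_q)]

Z1 = hopfield_pool(W_Q, F1, F2, beta)  # Eq. (2)
Z2 = hopfield_pool(W_Q, F2, F1, beta)  # Eq. (3)
Z = Z1 + Z2  # concatenation along the channel axis: 2 * c_q rows
```

Under this layout the softmax runs over the channel axis, so every row of `Z1` is a convex combination of the channels of `F2`, which matches the description of pooling many channels into a few.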


parts: a feature extractor, a Hopfield pooling block, and a decoder. Bitemporal change detection
can be viewed as a segmentation task for image pairs that record the same geographic information at
different times. Since the shapes and sizes of changed objects vary a lot, deeper layers of
CNN-based approaches (e.g., U-Net and U-Net++) can effectively extract semantic features and retain
details with a larger receptive field. To extract useful information from bitemporal images, a
Siamese network with consistent architectures and shared weights is utilized as the feature
extractor in our implementation (shown with green blocks in Fig. 2). VGG-16 [25] with
ImageNet-pretrained parameters is chosen as the backbone network. Then the spatial dimensions of
the deep features are flattened and input into the Hopfield pooling block, where the deep features
of the two periods are pooled and retrieved. After that, we feed the concatenation of the
bitemporal retrieved features and the feature differences from shallower layers into the decoder
and obtain the change map. The decoding modules are shown in the right part of Fig. 2.

3.2. Hopfield Pooling Block

The Hopfield layer is proven to be capable of retrieving key features of the input through one
update. For the proposed bitemporal change detection task, the question is: “how can we obtain the
most typical information that is related to changed objects from the bitemporal deep features?”. We
design a Hopfield pooling block to pool the features of various channels into fewer channels and,
at the same time, attempt to interactively retrieve semantic information during the period of
changes using the Hopfield update rule.

Let us assume two temporal VHR images are denoted by $X_i \in \mathbb{R}^{3 \times h \times w}$,
where $i \in \{1, 2\}$ represents the $i$-th time period and $h$ and $w$ are the height and width
of the images, respectively. Features obtained by the backbone are denoted as
$F_i \in \mathbb{R}^{c \times \tilde{h} \times \tilde{w}}$, where $c$, $\tilde{h}$, and $\tilde{w}$
represent the number of channels, height, and width of the feature, respectively. For the proposed
VGG-16 feature extractor, the channel size of $F_i$ is 512, and the height and width of the
features are 1/32 of those of the original image.

In the Hopfield pooling block, the features are first reshaped into
$\mathbb{R}^{\tilde{h}\tilde{w} \times c}$ of row-wise vectors. Then, for the time 1 image, we
introduce a trainable weight matrix $W_Q \in \mathbb{R}^{c_Q \times \tilde{h}\tilde{w}}$ to retrieve
the deep features of $F_1$ related to the 2nd period. The output can be written as:

    $$Z_1 = \mathrm{softmax}(\beta W_Q F_1^{\top}) F_2. \tag{2}$$

The number of rows $c_Q$ in $W_Q$ is set to 2 in this paper, which represents the change/unchange
semantic information we retrieved.

Similarly, the common weight matrix $W_Q$ is utilized to
retrieve $F_2$ related to the 1st period:

    $$Z_2 = \mathrm{softmax}(\beta W_Q F_2^{\top}) F_1. \tag{3}$$

It should be noted that the retrieved outputs $Z_1$ and $Z_2$ have the same size and contain both
global and temporal information of the image pairs.

We concatenate the retrieved outputs together: $Z = [Z_1; Z_2]$, restore their spatial dimensions,
and feed them into a 1 × 1 2D convolutional layer with 16 filters to generate a new feature map.
After bilinear interpolation, the feature produced by the Hopfield pooling block is finally
derived:

    $$H = U(g(W * Z + b)), \tag{4}$$

where $W$ and $b$ represent the weight matrix and bias vector of the convolutional layer, $*$
denotes the 2D convolution operation, $g(\cdot)$ denotes batch normalization with ReLU activation,
and $U(\cdot)$ denotes bilinear interpolation with an upsampling rate of 2.

4. Experiments

4.1. Data Set

In the experimental part, the LEVIR-CD data set [26] is utilized to compare the change detection
methods. The LEVIR-CD data set is composed of 637 VHR (0.5 m/pixel) Google Earth (GE) image pairs
with a size of 1024×1024 pixels. These image pairs have been captured over periods of 5 to 14 years
and cover a total of 31,333 individual buildings for the task of building growth assessment. With a
ratio of 7:1:2, these image pairs are split into a training set, a validation set, and a testing
set. Following the initial settings, we crop each image into 16 non-overlapping patches with a size
of 256×256 pixels. Thus, there are a total of 7120 image pairs for training, 1024 for validation,
and 2048 for testing.

4.2. Comparative Methods and Evaluation Metrics

To verify the effectiveness of the proposed FrNet method, four representative deep-learning-based
change detection networks are taken into consideration. FC-EF [7] is an early-fusion method based
on U-Net that concatenates the bitemporal image pairs as the input. Its extended versions,
FC-Siam-diff and FC-Siam-conc [7], use Siamese networks with shared weights to extract multi-level
features and use feature difference and concatenation, respectively, to fuse bitemporal
information. The bitemporal image transformer (BIT) network [12] designs a
context-information-based enhancer to extract related concepts in the token-based space-time, and
projects the context-rich tokens back to the original features for prediction. To validate the
effectiveness of the proposed FrNet, we also set a base model that consists of the CNN backbone
(VGG-16) and the decoder for comparison.

For the evaluation part, the precision (Pre), recall (Rec), F1 score, and overall accuracy (OA) are
employed to quantitatively evaluate the performance of the studied methods. These metrics are
calculated as follows:

    $$Pre = \frac{TP}{TP + FP} \tag{5}$$

    $$Rec = \frac{TP}{TP + FN} \tag{6}$$

    $$F1 = \frac{2 \, Pre \cdot Rec}{Pre + Rec} \tag{7}$$

    $$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

where TP (True Positive) represents the number of pixels of real changes that are correctly
detected, FP (False Positive) represents the number of pixels of unchanged objects that are falsely
detected as changed objects, TN (True Negative) denotes the number of pixels of unchanged objects
that are correctly regarded as non-change, and FN (False Negative) denotes the number of changed
pixels that are not detected as changed objects.

4.3. Experimental Results and Analysis

In our experiments, the proposed FrNet is implemented with the PyTorch platform using a single
NVIDIA A100 GPU (with 40 GB of RAM). During the training stage, the Adam optimizer with a weight
decay of 1e−5 was employed. The batch size is set to 32, and the learning rate is initially set to
1e−4 and linearly reduced to 0 over 50,000 iterations. The $\beta$ of the Hopfield layer is set to
$1/\sqrt{c_Q}$.

Table 1
Quantitative analysis of different networks on the LEVIR-CD data set. The best value in each column
is marked with an asterisk.

    Methods         Pre (%)    Rec (%)    F1 (%)    OA (%)
    FC-EF            61.86      96.05      75.25     96.78
    FC-Siam-conc     67.87      97.53*     80.04     97.52
    FC-Siam-diff     71.37      95.42      81.66     97.82
    BIT              80.82      92.86      86.42     98.51
    Base Model       85.24      92.26      88.61     98.79
    FrNet            86.32*     92.10      89.12*    98.85*

The quantitative results for the precision, recall, F1 score, and OA of all models are summarized
in Table 1. It can be found that FC-EF obtains the lowest F1 score (75.25%) and OA (96.78%) among
all the models. FC-Siam-conc and FC-Siam-diff perform slightly better than FC-EF, which indicates
that the Siamese network and feature difference/concatenation have benefits for the preservation of
useful information. The F1 score and OA of the


Figure 3: Visualization results of different methods using the LEVIR-CD data set. (a) T1 Image; (b) T2 Image; (c) Ground-truth;
(d) FC-EF; (e) FC-Siam-conc; (f) FC-Siam-diff; (g) BIT; (h) Base Model; (i) FrNet. Yellow, black, red, and green represent TP, TN,
FP, and FN, respectively.
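
The colour coding of Fig. 3 and the metrics of Eqs. (5) to (8) both follow from a per-pixel comparison of a predicted change mask with the ground truth. The sketch below is a minimal illustration on a made-up 2 × 4 mask, not the evaluation script used for Table 1; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here rather than something the paper specifies.

```python
# Colours used in Fig. 3 for each pixel outcome.
COLOURS = {"TP": "yellow", "TN": "black", "FP": "red", "FN": "green"}


def pixel_outcome(pred, truth):
    """Classify one pixel; 1 means 'changed', 0 means 'unchanged'."""
    if pred and truth:
        return "TP"
    if pred and not truth:
        return "FP"
    if not pred and truth:
        return "FN"
    return "TN"


def confusion_counts(pred_mask, truth_mask):
    """Count TP, FP, TN and FN over two equally sized 2D binary masks."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for pred_row, truth_row in zip(pred_mask, truth_mask):
        for p, t in zip(pred_row, truth_row):
            counts[pixel_outcome(p, t)] += 1
    return counts


def change_metrics(counts):
    """Precision, recall, F1 and overall accuracy as in Eqs. (5)-(8).

    Assumes at least one pixel was counted.
    """
    tp, fp, tn, fn = counts["TP"], counts["FP"], counts["TN"], counts["FN"]
    # Returning 0.0 for an empty denominator is a convention assumed here.
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    oa = (tp + tn) / (tp + tn + fp + fn)
    return {"Pre": pre, "Rec": rec, "F1": f1, "OA": oa}


def colour_map(pred_mask, truth_mask):
    """Return the Fig. 3 colour name for every pixel."""
    return [
        [COLOURS[pixel_outcome(p, t)] for p, t in zip(pred_row, truth_row)]
        for pred_row, truth_row in zip(pred_mask, truth_mask)
    ]


# Made-up toy masks: 1 = changed, 0 = unchanged.
truth = [[1, 1, 0, 0],
         [0, 1, 0, 0]]
pred = [[1, 0, 0, 1],
        [0, 1, 0, 0]]

counts = confusion_counts(pred, truth)
scores = change_metrics(counts)
colours = colour_map(pred, truth)
```

In this toy case one changed pixel is missed (green) and one unchanged pixel is flagged (red), so precision and recall are both 2/3 while the overall accuracy is 6/8, which shows how OA can stay high when most pixels are unchanged.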


BIT model are 86.42% and 98.51%, respectively, better than those of the FC-based models. This
demonstrates that the tokens in space-time can effectively capture the temporal changes and enhance
the context information. The proposed FrNet achieves the highest F1 and OA among all the studied
methods and performs better than our base model. The improvements indicate that the Hopfield layer
helps retrieve the deep features and that the shared query matrix can learn important information
as part of the inputs for the decoder.

Fig. 3 illustrates change detection maps obtained by different methods, where TPs, TNs, FPs, and
FNs are represented in yellow, black, red, and green, respectively. We can observe that FrNet
achieves the best results among all the models. First, FrNet can better distinguish small-sized
changed buildings that have relatively regular shapes by reducing false alarms compared with other
methods (e.g., the 1st, 2nd, and 3rd rows of Fig. 3). Second, when the shapes of buildings are
complex, our model can also preserve the boundary of the objects (e.g., the 4th, 5th, and 6th rows
of Fig. 3).

5. Conclusion

Inspired by the successful application of continuous modern Hopfield networks for pattern
retrieval, we propose a deep feature retrieved network (FrNet) for bitemporal change detection. Our
Hopfield pooling block introduces a trainable weight matrix that aims to retrieve the global
changes of interest for high-level features and capture the discriminative representations of one
period related to the other. To evaluate the effectiveness of the proposed model, experiments are
conducted on the LEVIR-CD data set. Our empirical results show that the proposed FrNet outperforms
the compared state-of-the-art methods in terms of F1 score and OA.

Acknowledgments

The authors would like to thank the contributors of the LEVIR-CD data set for making it publicly
available, and the authors of the FC-EF, FC-Siam-conc, FC-Siam-diff, and BIT methods for releasing
their code.

References

 [1] P. Ghamisi, B. Rasti, N. Yokoya, Q. Wang, B. Höfle,
     L. Bruzzone, F. Bovolo, M. Chi, K. Anders,
     R. Gloaguen, et al., Multisource and multitemporal
     data fusion in remote sensing: A comprehensive
     review of the state of the art, IEEE Geoscience and
     Remote Sensing Magazine 7 (2019) 6–39.
 [2] Z. Li, F. Lu, H. Zhang, L. Tu, J. Li, X. Huang,
     C. Robinson, N. Malkin, N. Jojic, P. Ghamisi, et al.,
     The outcome of the 2021 IEEE GRSS data fusion
     contest—track MSD: Multitemporal semantic change
     detection, IEEE Journal of Selected Topics in Applied
     Earth Observations and Remote Sensing 15 (2022)
     1643–1655.
 [3] Y. Wu, J. Li, Y. Yuan, A. Qin, Q.-G. Miao, M.-G. Gong,
     Commonality autoencoder: Learning common features
     for change detection from heterogeneous images,
     IEEE Transactions on Neural Networks and Learning
     Systems (2021).
 [4] B. Bai, W. Fu, T. Lu, S. Li, Edge-guided recurrent
     convolutional neural network for multitemporal
     remote sensing image building change detection,
     IEEE Transactions on Geoscience and Remote Sensing
     (2021).
 [5] X. Li, Z. Du, Y. Huang, Z. Tan, A deep translation
     (GAN) based change detection network for optical
     and SAR remote sensing images, ISPRS Journal of
     Photogrammetry and Remote Sensing 179 (2021)
     14–34.
 [6] F. Samadi, G. Akbarizadeh, H. Kaabi, Change
     detection in SAR images using deep belief network:
     a new training approach based on morphological
     images, IET Image Processing 13 (2019) 2255–2264.
 [7] R. C. Daudt, B. Le Saux, A. Boulch, Fully
     convolutional siamese networks for change detection,
     in: 2018 25th IEEE International Conference on Image
     Processing (ICIP), IEEE, 2018, pp. 4063–4067.
 [8] X. Peng, R. Zhong, Z. Li, Q. Li, Optical remote
     sensing image change detection based on attention
     mechanism and image difference, IEEE Transactions
     on Geoscience and Remote Sensing 59 (2020)
     7296–7307.
 [9] B. Hou, Q. Liu, H. Wang, Y. Wang, From W-Net
     to CDGAN: Bitemporal change detection via deep
     learning techniques, IEEE Transactions on Geoscience
     and Remote Sensing 58 (2019) 1790–1802.
[10] H. Chen, C. Wu, B. Du, L. Zhang, L. Wang, Change
     detection in multisource VHR images via deep
     siamese convolutional multiple-layers recurrent
     neural network, IEEE Transactions on Geoscience
     and Remote Sensing 58 (2019) 2848–2864.
[11] J. Chen, Z. Yuan, J. Peng, L. Chen, H. Huang, J. Zhu,
     Y. Liu, H. Li, DASNet: Dual attentive fully
     convolutional siamese networks for change detection
     in high-resolution satellite images, IEEE Journal of
     Selected Topics in Applied Earth Observations and
     Remote Sensing 14 (2020) 1194–1206.
[12] H. Chen, Z. Qi, Z. Shi, Remote sensing image
     change detection with transformers, IEEE
     Transactions on Geoscience and Remote Sensing (2021).
[13] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl,
     M. Widrich, T. Adler, L. Gruber, M. Holzleitner,
     M. Pavlović, G. K. Sandve, et al., Hopfield networks
     is all you need, arXiv preprint arXiv:2008.02217
     (2020).
[14] M. Demircigil, J. Heusel, M. Löwe, S. Upgang,
     F. Vermet, On a model of associative memory
     with huge storage capacity, Journal of Statistical
     Physics 168 (2017) 288–299.
     URL: https://doi.org/10.1007%2Fs10955-017-1806-y.
     doi:10.1007/s10955-017-1806-y.
[15] D. Krotov, J. J. Hopfield, Dense associative memory
     for pattern recognition, Advances in Neural
     Information Processing Systems 29 (2016).
[16] J. J. Hopfield, Neural networks and physical systems
     with emergent collective computational abilities,
     Proceedings of the National Academy of Sciences 79
     (1982) 2554–2558.
[17] J. J. Hopfield, Neurons with graded response
     have collective computational properties like those
     of two-state neurons, Proceedings of the National
     Academy of Sciences 81 (1984) 3088–3092.
     URL: https://www.pnas.org/doi/pdf/10.1073/pnas.81.10.3088.
     doi:10.1073/pnas.81.10.3088.
[18] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl,
     M. Widrich, L. Gruber, M. Holzleitner, T. Adler,
     D. Kreil, M. K. Kopp, G. Klambauer, J. Brandstetter,
     S. Hochreiter, Hopfield networks is all you need,
     in: International Conference on Learning
     Representations, 2021.
     URL: https://openreview.net/forum?id=tL89RnzIiCd.
[19] M. Widrich, B. Schäfl, H. Ramsauer, M. Pavlović,
     L. Gruber, M. Holzleitner, J. Brandstetter, G. K.
     Sandve, V. Greiff, S. Hochreiter, G. Klambauer,
     Modern Hopfield networks and attention for immune
     repertoire classification (2020).
     URL: https://arxiv.org/abs/2007.13505.
     doi:10.48550/ARXIV.2007.13505.
[20] P. Seidl, P. Renz, N. Dyubankova, P. Neves,
     J. Verhoeven, M. Segler, J. K. Wegner, S. Hochreiter,
     G. Klambauer, Modern Hopfield networks for
     few- and zero-shot reaction template prediction,
     2021. URL: https://arxiv.org/abs/2104.03279.
     doi:10.48550/ARXIV.2104.03279.
[21] F. Paischer, T. Adler, V. Patil, A. Bitto-Nemling,
     M. Holzleitner, S. Lehner, H. Eghbal-zadeh,
     S. Hochreiter, History compression via language
     models in reinforcement learning, 2022.
     URL: https://arxiv.org/abs/2205.12258.
     doi:10.48550/ARXIV.2205.12258.
[22] M. Widrich, M. Hofmarcher, V. P. Patil,
     A. Bitto-Nemling, S. Hochreiter, Modern Hopfield
     networks for return decomposition for delayed
     rewards, in: Deep RL Workshop NeurIPS 2021, 2021.
     URL: https://openreview.net/forum?id=t0PQSDcqAiy.
[23] A. Fürst, E. Rumetshofer, J. Lehner, V. Tran, F. Tang,
     H. Ramsauer, D. Kreil, M. Kopp, G. Klambauer,
     A. Bitto-Nemling, S. Hochreiter, CLOOB: Modern
     Hopfield networks with InfoLOOB outperform CLIP,
     2021. URL: https://arxiv.org/abs/2110.11316.
     doi:10.48550/ARXIV.2110.11316.
[24] B. Schäfl, L. Gruber, A. Bitto-Nemling, S. Hochreiter,
     Hopular: Modern Hopfield networks for tabular
     data, 2022.
     URL: https://openreview.net/forum?id=3zJVXU311-Q.
[25] K. Simonyan, A. Zisserman, Very deep convolu-
     tional networks for large-scale image recognition,
     arXiv preprint arXiv:1409.1556 (2014).
[26] H. Chen, Z. Shi, A spatial-temporal attention-based
     method and a new dataset for remote sensing image
     change detection, Remote Sensing 12 (2020) 1662.