A Deep Feature Retrieved Network for Bitemporal Remote Sensing Image Change Detection

Shizhen Chang¹, Michael Kopp¹ and Pedram Ghamisi¹,²

¹ Institute of Advanced Research in Artificial Intelligence (IARAI), 1030 Vienna, Austria
² Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Machine Learning Group, 09599 Freiberg, Germany

Abstract
The task of bitemporal change detection aims to identify the surface changes of specific scenes at two different points in time. In recent years, we have increasingly witnessed the success of deep learning in a variety of applications in remote sensing, including change detection and monitoring. In this paper, a novel deep feature retrieval neural network architecture for change detection is proposed that uses a trainable associative memory component to exploit potential similarities and connections of the deep features between image pairs. A key ingredient in our novel architecture is the use of a continuous modern Hopfield network component. The proposed method outperforms the compared state-of-the-art methods on the well-known LEVIR-CD data set. The code of this work will be made available online (https://github.com/ShizhenChang).

Keywords
Remote sensing, change detection, modern Hopfield network, deep learning, Siamese network, convolutional neural network.

CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
Email: shizhen.chang@iarai.ac.at (S. Chang); michael.kopp@iarai.ac.at (M. Kopp); pedram.ghamisi@iarai.ac.at (P. Ghamisi)
ORCID: 0000-0002-9785-7937 (S. Chang); 0000-0002-1385-1109 (M. Kopp); 0000-0003-1203-741X (P. Ghamisi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

With the rapid development of technologies for Earth observation, an ever-growing amount of very high resolution (VHR) remote sensing data has become available for geographical analysis and image processing [1]. VHR images can provide detailed information about land surfaces, and images of the same scene collected at different times can record changes regularly. Therefore, as one of the most important remote sensing tasks, change detection has been widely applied in many areas of land-use and land-cover analysis, such as environmental monitoring, urban growth, deforestation assessment, and shifting cultivation evaluation.

A variety of deep neural networks, such as convolutional neural networks (CNNs) [2], autoencoders (AEs) [3], recurrent neural networks [4], generative adversarial networks (GANs) [5], and deep belief networks (DBNs) [6], have been successfully utilized for remote sensing change detection over the last few years. Among them, CNN-based methods can make full use of the spatial information of VHR remote sensing images and can thus better extract high-level deep features and abstract semantic content to learn discriminative differences between the periods.

Strategies for extracting deep features of the inputs can be broadly divided into two categories: early-fusion [7, 8] and late-fusion [9] networks. The early-fusion networks first concatenate multitemporal images into a unified data cube, and then the parameters are hierarchically fine-tuned. The late-fusion networks usually learn single-temporal features individually and share the parameters by using a Siamese network. Compared to early-fusion networks, late-fusion methods can better utilize the features of the inputs and return clearer contours of the changed objects. However, the features of shallower layers may not be sufficiently learned and utilized due to the vanishing gradient problem. Therefore, learning information from both shallow and deep layers is very important for effectively detecting changes with deep-learning-based approaches.

In order to accurately extract features, deeper and more complex CNN-based networks have been designed that include architecture components such as Long Short-Term Memory (LSTM) [10] and attention mechanisms (self-attention [11], spatial attention [12], and channel attention [8]). The successful combination of CNNs and other networks has shown that discriminative features within the image pairs can be better extracted and the detection accuracy can be greatly improved. However, limited by the architecture of CNNs, in which the high-level features are related to the shallower layers only through larger receptive fields, the global and temporal information between the image pairs is still not sufficiently utilized.

To address this issue, we design a Hopfield pooling block to interactively retrieve the high-level concepts of changes. This idea is inspired by the successful application of the modern Hopfield network for continuous pattern retrieval [13]. Our assumption is that the semantic information between the image pairs in deeper layers can be represented using a common matrix, i.e., a query, that can be learned during the training process. We use this query to retrieve related semantic features between given images. These retrieved features reflect a common spatio-temporal context and are used by subsequent layers in our network. Concretely, we incorporate a Hopfield network block into a Siamese fully convolutional network (FCN), resulting in the design of our proposed deep feature retrieved network (FrNet) for bitemporal remote sensing change detection. It should be noted that, unlike previous change detection models, both semantic and temporal information can be fully considered, and that this is our first attempt at using modern Hopfield networks in the remote sensing community.

The rest of this article is organized as follows. Section 2 briefly reviews continuous modern Hopfield networks. Section 3 describes the proposed method. Experiments are conducted and discussed in Section 4.

2. Continuous Modern Hopfield Network

Binary modern Hopfield networks are associative memories on binary data that can store and retrieve exponentially many patterns [14, 15], which is the key feature distinguishing them from their classical binary counterparts [16, 17]. These binary modern Hopfield networks have been generalized to continuous modern Hopfield networks that, crucially, are differentiable and can thus be embedded in deep learning architectures trained by gradient descent [13, 18]. Moreover, continuous modern Hopfield networks retain the key ability to store exponentially many patterns, and they can furthermore retrieve patterns in only one update step.

Given a matrix $X$ of shape $d \times N$ formed of column vectors $\{x_1, \ldots, x_N\} \subset \mathbb{R}^d$, a query pattern $\xi$, also a column vector, seeks to retrieve the best pattern in the convex hull of $\{x_1, \ldots, x_N\}$, such that the following energy function is minimized:

$$E = -\beta^{-1} \log\left(\sum_{i=1}^{N} \exp\left(\beta x_i^\top \xi\right)\right) + \beta^{-1} \log N + \frac{1}{2} \xi^\top \xi + \frac{1}{2} M^2,$$

where $M$ is the largest norm of the $\{x_1, \ldots, x_N\}$ in $\mathbb{R}^d$. As shown in [13, 18], $\xi^{new}$ is defined by the following update rule:

$$\xi^{new} = f(\xi; X, \beta) = X \, \mathrm{softmax}(\beta X^\top \xi), \tag{1}$$

which converges globally, almost always, to a local minimum of the energy function in essentially one update step. Moreover, equation (1) is closely related to the well-known transformer attention mechanism, showing that retrieval in modern Hopfield networks and transformer attention coincide [13, 18].

With changeable structures in deeper networks (as shown in Fig. 1), continuous modern Hopfield networks have great application prospects in deep learning. They have been successfully applied to solving large-scale multi-instance learning tasks [19], to few- and zero-shot chemical reaction template prediction [20], to creating new reinforcement learning algorithms [21, 22], to improving contrastive learning of joint image and text embedding representations [23], and to tabular data [24].

Figure 1: A brief illustration of three types of Hopfield layers for deep learning [13], where both the stored patterns Y and the query patterns R can be obtained from the previous layers or the input, or can be learned. The outputs Z are the retrieved patterns for the queries, each being a linear combination of stored patterns lying in the convex hull of the stored patterns. (a) The Hopfield layer associates two sets R and Y to propagate sets of vectors. (b) The Hopfield pooling layer performs a pooling operation on the set Y via learned queries. (c) The Hopfield layer learns a new set of stored patterns based on the input R.

Inspired by continuous modern Hopfield networks, we design a Siamese Hopfield pooling layer and attempt to capture deep feature differences for remote sensing bitemporal change detection.

3. Deep Feature Retrieved Network for Change Detection

3.1. Overview

As shown in Fig. 2, the proposed deep feature retrieved network (FrNet) is a Siamese network that contains three parts: a feature extractor, a Hopfield pooling block, and a decoder. Bitemporal change detection can be viewed as a segmentation task for image pairs that record the same geographic information at different times. Since the shapes and sizes of changed objects vary a lot, deeper layers of CNN-based approaches (e.g., U-Net and U-Net++) can effectively extract semantic features and retain details with a larger receptive field. To extract useful information from bitemporal images, a Siamese network with consistent architectures and shared weights is utilized as the feature extractor in our implementation (shown with green blocks in Fig. 2). VGG-16 [25] with ImageNet-pretrained parameters is chosen as the backbone network. Then the spatial dimensions of the deep features are flattened and input into the Hopfield pooling block, where the deep features of the two periods are pooled and retrieved. After that, we feed the concatenation of the bitemporal retrieved features and the feature differences from shallower layers into the decoder and obtain the change map. The decoding modules are shown in the right part of Fig. 2.

Figure 2: Flowchart of the proposed FrNet. (Diagram labels: T1 Image, T2 Image, Shared Weight, Reshape to hw/32² × 512, Query of size 2 × hw/32², Softmax, Change Map. Legend: backbone blocks, 1 × 1 convolution, upsampling, matrix difference, matrix multiplication, C: concatenation.)

3.2. Hopfield Pooling Block

The Hopfield layer has been shown to be capable of retrieving key features of the input through one update. For the proposed bitemporal change detection task, the question is: "How can we obtain the most typical information that is related to changed objects from the bitemporal deep features?" We design a Hopfield pooling block to pool the features of various channels into fewer channels and, at the same time, attempt to interactively retrieve semantic information during the period of changes using the Hopfield update rule.

Let us assume two temporal VHR images are denoted by $X_i \in \mathbb{R}^{3 \times h \times w}$, where $i \in \{1, 2\}$ represents the $i$-th time period and $h$ and $w$ are the height and width of the images, respectively. Features obtained by the backbone are denoted as $F_i \in \mathbb{R}^{c \times \tilde{h} \times \tilde{w}}$, where $c$, $\tilde{h}$, and $\tilde{w}$ represent the number of channels, height, and width of the feature, respectively. For the proposed VGG-16 feature extractor, the channel size of $F_i$ is 512, and the height and width of the features are 1/32 of those of the original image.

In the Hopfield pooling block, the features are first reshaped into $\mathbb{R}^{\tilde{h}\tilde{w} \times c}$ of row-wise vectors. Then, for the time 1 image, we introduce a trainable weight matrix $W_Q \in \mathbb{R}^{c_Q \times \tilde{h}\tilde{w}}$ to retrieve the related deep features of $F_1$ related to the 2nd period. The output can be written as:

$$Z_1 = \mathrm{softmax}(\beta W_Q F_1^\top) F_2. \tag{2}$$

The number of rows $c_Q$ in $W_Q$ is set to 2 in this paper, which represents the change/unchange semantic information we retrieve.

Similarly, the common weight matrix $W_Q$ is utilized to retrieve $F_2$ related to the 1st period:

$$Z_2 = \mathrm{softmax}(\beta W_Q F_2^\top) F_1. \tag{3}$$
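As a rough illustration of Eqs. (2) and (3), the following NumPy sketch evaluates the two retrievals on random data. It is not the authors' implementation. It assumes that each $F_i$ is flattened to shape $(c, \tilde{h}\tilde{w})$, so that $F_i^\top$ holds the $\tilde{h}\tilde{w} \times c$ row-wise vectors and the product with $W_Q$ is well defined, and that the softmax is taken over the last axis. The names `softmax` and `hopfield_pool`, the random stand-in for the learned query, and the 8 × 8 feature map (a 256 × 256 input at 1/32 scale) are likewise illustrative assumptions; $\beta = 1/\sqrt{c_Q}$ is the value reported for the experiments.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def hopfield_pool(W_Q, F_query, F_stored, beta):
    """Retrieval of the form Z = softmax(beta * W_Q F_query^T) F_stored.

    Assumed shapes (an interpretation, not stated explicitly in the text):
      W_Q:      (c_Q, n)  trainable query, with n = h~ * w~
      F_query:  (c, n)    flattened backbone features of one period
      F_stored: (c, n)    flattened backbone features of the other period
    Returns an array of shape (c_Q, n).
    """
    weights = softmax(beta * W_Q @ F_query.T, axis=-1)  # (c_Q, c), rows sum to 1
    return weights @ F_stored                            # (c_Q, n)


rng = np.random.default_rng(0)
c, h_t, w_t, c_Q = 512, 8, 8, 2              # assumed 8x8 map at 1/32 scale
n = h_t * w_t
F1 = rng.standard_normal((c, n))             # stand-in for period-1 features
F2 = rng.standard_normal((c, n))             # stand-in for period-2 features
W_Q = 0.01 * rng.standard_normal((c_Q, n))   # random stand-in for a learned query
beta = 1.0 / np.sqrt(c_Q)

Z1 = hopfield_pool(W_Q, F1, F2, beta)        # Eq. (2)
Z2 = hopfield_pool(W_Q, F2, F1, beta)        # Eq. (3)
print(Z1.shape, Z2.shape)                    # (2, 64) (2, 64)
```

Under these assumptions each retrieved output has $c_Q$ rows of length $\tilde{h}\tilde{w}$, so it can be reshaped back to a $c_Q \times \tilde{h} \times \tilde{w}$ map, and every row is a convex combination of the channel rows of the other period's features.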
It should be noted that the retrieved outputs $Z_1$ and $Z_2$ have the same size and contain both global and temporal information of the image pairs.

We concatenate the retrieved outputs together, $Z = [Z_1; Z_2]$, restore their spatial dimensions, and feed them into a 1 × 1 2D convolutional layer with 16 filters to generate a new feature map. After bilinear interpolation, the output feature of the Hopfield pooling block is finally obtained:

$$H = U(g(W * Z + b)), \tag{4}$$

where $W$ and $b$ represent the weight matrix and bias vector of the convolutional layer, $*$ denotes the 2D convolution operation, $g(\cdot)$ denotes batch normalization with ReLU activation, and $U(\cdot)$ denotes bilinear interpolation with an upsampling rate of 2.

4. Experiments

4.1. Data Set

In the experimental part, the LEVIR-CD data set [26] is utilized to compare the change detection methods. The LEVIR-CD data set is composed of 637 VHR (0.5 m/pixel) Google Earth (GE) image pairs with a size of 1024 × 1024 pixels. These image pairs were captured with time spans of 5 to 14 years and cover a total of 31,333 individual buildings for the task of building growth assessment. The image pairs are split into a training set, a validation set, and a testing set with a ratio of 7:1:2. Following the initial settings, we crop each image into 16 non-overlapping patches with a size of 256 × 256 pixels. Thus, there are a total of 7120 image pairs for training, 1024 for validation, and 2048 for testing.

4.2. Comparative Methods and Evaluation Metrics

To verify the effectiveness of the proposed FrNet, four representative deep-learning-based change detection networks are taken into consideration. FC-EF [7] is an early-fusion method based on U-Net that concatenates the bitemporal image pairs as the input. Its extended versions, FC-Siam-diff and FC-Siam-conc [7], use Siamese networks with shared weights to extract multi-level features and use feature difference and concatenation, respectively, to fuse bitemporal information. The bitemporal image transformer (BIT) network [12] designs a context-information-based enhancer to extract related concepts in the token-based space-time and projects the context-rich tokens back to the original features for prediction. To validate the effectiveness of the proposed FrNet, we also set up a base model that consists of the CNN backbone (VGG-16) and the decoder for comparison.

For the evaluation, the precision (Pre), recall (Rec), F1 score, and overall accuracy (OA) are employed to quantitatively evaluate the performance of the studied methods. These metrics are calculated as follows:

$$Pre = \frac{TP}{TP + FP}, \tag{5}$$

$$Rec = \frac{TP}{TP + FN}, \tag{6}$$

$$F1 = \frac{2 \, Pre \cdot Rec}{Pre + Rec}, \tag{7}$$

$$OA = \frac{TP + TN}{TP + TN + FP + FN}, \tag{8}$$

where TP (True Positive) represents the number of pixels of real changes that are correctly detected, FP (False Positive) represents the number of pixels of unchanged objects that are falsely detected as changed objects, TN (True Negative) denotes the number of pixels of unchanged objects that are correctly regarded as non-change, and FN (False Negative) denotes the number of changed pixels that are not detected as changed objects.

4.3. Experimental Results and Analysis

In our experiments, the proposed FrNet is implemented on the PyTorch platform using a single NVIDIA A100 GPU (with 40 GB of memory). During the training stage, the Adam optimizer with a weight decay of 1e-5 is employed. The batch size is set to 32, and the learning rate is initially set to 1e-4 and linearly reduced to 0 over 50,000 iterations. The $\beta$ of the Hopfield layer is set to $1/\sqrt{c_Q}$.

The quantitative results for the precision, recall, F1 score, and OA of all models are summarized in Table 1. It can be seen that FC-EF obtains the lowest F1 score (75.25%) and OA (96.78%) among all the models. FC-Siam-conc and FC-Siam-diff perform better than FC-EF (F1 scores of 80.04% and 81.66%), which indicates that the Siamese network and feature difference/concatenation help preserve useful information. The F1 score and OA of the BIT model are 86.42% and 98.51%, respectively, better than those of the FC-based models. This suggests that the tokens in space-time can effectively capture the temporal changes and enhance the context information. The proposed FrNet achieves the highest F1 and OA among all the studied methods and performs better than our base model. The improvements indicate that the Hopfield layer helps retrieve the deep features and that the shared query matrix can learn important information as part of the inputs for the decoder.

Table 1: Quantitative analysis of different networks on the LEVIR-CD data set. The best value in each column is marked with *.

Methods        Pre (%)   Rec (%)   F1 (%)   OA (%)
FC-EF          61.86     96.05     75.25    96.78
FC-Siam-conc   67.87     97.53*    80.04    97.52
FC-Siam-diff   71.37     95.42     81.66    97.82
BIT            80.82     92.86     86.42    98.51
Base Model     85.24     92.26     88.61    98.79
FrNet          86.32*    92.10     89.12*   98.85*

Fig. 3 illustrates the change detection maps obtained by different methods, where TPs, TNs, FPs, and FNs are represented in yellow, black, red, and green, respectively. We can observe that FrNet achieves the best results among all the models. First, FrNet can better distinguish small-sized changed buildings that have relatively regular shapes by reducing false alarms compared with other methods (e.g., the 1st, 2nd, and 3rd rows of Fig. 3). Second, when the shapes of buildings are complex, our model can also preserve the boundary of the objects (e.g., the 4th, 5th, and 6th rows of Fig. 3).

Figure 3: Visualization results of different methods using the LEVIR-CD data set. (a) T1 Image; (b) T2 Image; (c) Ground-truth; (d) FC-EF; (e) FC-Siam-conc; (f) FC-Siam-diff; (g) BIT; (h) Base Model; (i) FrNet. Yellow, black, red, and green represent TP, TN, FP, and FN, respectively.

5. Conclusion

Inspired by the successful application of continuous modern Hopfield networks for pattern retrieval, we propose a deep feature retrieved network (FrNet) for bitemporal change detection. Our Hopfield pooling block introduces a trainable weight matrix that aims to retrieve the global changes of interest from high-level features and capture the discriminative representations of one period related to the other. To evaluate the effectiveness of the proposed model, experiments are conducted on the LEVIR-CD data set. Our empirical results show that the proposed FrNet outperforms the other compared state-of-the-art methods on this data set.

Acknowledgments

The authors would like to thank the contributors of the LEVIR-CD data set for making it publicly available, and the authors of the FC-EF, FC-Siam-conc, FC-Siam-diff, and BIT methods for releasing their code.

References

[1] P. Ghamisi, B. Rasti, N. Yokoya, Q. Wang, B. Hofle, L. Bruzzone, F. Bovolo, M. Chi, K. Anders, R. Gloaguen, et al., Multisource and multitemporal data fusion in remote sensing: A comprehensive review of the state of the art, IEEE Geoscience and Remote Sensing Magazine 7 (2019) 6–39.
[2] Z. Li, F. Lu, H. Zhang, L. Tu, J. Li, X. Huang, C. Robinson, N. Malkin, N. Jojic, P. Ghamisi, et al., The outcome of the 2021 IEEE GRSS data fusion contest – track MSD: Multitemporal semantic change detection, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15 (2022) 1643–1655.
[3] Y. Wu, J. Li, Y. Yuan, A. Qin, Q.-G. Miao, M.-G. Gong, Commonality autoencoder: Learning common features for change detection from heterogeneous images, IEEE Transactions on Neural Networks and Learning Systems (2021).
[4] B. Bai, W. Fu, T. Lu, S. Li, Edge-guided recurrent convolutional neural network for multitemporal remote sensing image building change detection, IEEE Transactions on Geoscience and Remote Sensing (2021).
[5] X. Li, Z. Du, Y. Huang, Z. Tan, A deep translation (GAN) based change detection network for optical and SAR remote sensing images, ISPRS Journal of Photogrammetry and Remote Sensing 179 (2021) 14–34.
[6] F. Samadi, G. Akbarizadeh, H. Kaabi, Change detection in SAR images using deep belief network: a new training approach based on morphological images, IET Image Processing 13 (2019) 2255–2264.
[7] R. C. Daudt, B. Le Saux, A. Boulch, Fully convolutional siamese networks for change detection, in: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, 2018, pp. 4063–4067.
[8] X. Peng, R. Zhong, Z. Li, Q. Li, Optical remote sensing image change detection based on attention mechanism and image difference, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 7296–7307.
[9] B. Hou, Q. Liu, H. Wang, Y. Wang, From W-Net to CDGAN: Bitemporal change detection via deep learning techniques, IEEE Transactions on Geoscience and Remote Sensing 58 (2019) 1790–1802.
[10] H. Chen, C. Wu, B. Du, L. Zhang, L. Wang, Change detection in multisource VHR images via deep siamese convolutional multiple-layers recurrent neural network, IEEE Transactions on Geoscience and Remote Sensing 58 (2019) 2848–2864.
[11] J. Chen, Z. Yuan, J. Peng, L. Chen, H. Huang, J. Zhu, Y. Liu, H. Li, Dasnet: Dual attentive fully convolutional siamese networks for change detection in high-resolution satellite images, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2020) 1194–1206.
[12] H. Chen, Z. Qi, Z. Shi, Remote sensing image change detection with transformers, IEEE Transactions on Geoscience and Remote Sensing (2021).
[13] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, et al., Hopfield networks is all you need, arXiv preprint arXiv:2008.02217 (2020).
[14] M. Demircigil, J. Heusel, M. Löwe, S. Upgang, F. Vermet, On a model of associative memory with huge storage capacity, Journal of Statistical Physics 168 (2017) 288–299. URL: https://doi.org/10.1007%2Fs10955-017-1806-y. doi:10.1007/s10955-017-1806-y.
[15] D. Krotov, J. J. Hopfield, Dense associative memory for pattern recognition, Advances in Neural Information Processing Systems 29 (2016).
[16] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences 79 (1982) 2554–2558.
[17] J. J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proceedings of the National Academy of Sciences 81 (1984) 3088–3092. URL: https://www.pnas.org/doi/pdf/10.1073/pnas.81.10.3088. doi:10.1073/pnas.81.10.3088.
[18] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, L. Gruber, M. Holzleitner, T. Adler, D. Kreil, M. K. Kopp, G. Klambauer, J. Brandstetter, S. Hochreiter, Hopfield networks is all you need, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=tL89RnzIiCd.
[19] M. Widrich, B. Schäfl, H. Ramsauer, M. Pavlović, L. Gruber, M. Holzleitner, J. Brandstetter, G. K. Sandve, V. Greiff, S. Hochreiter, G. Klambauer, Modern hopfield networks and attention for immune repertoire classification (2020). URL: https://arxiv.org/abs/2007.13505. doi:10.48550/ARXIV.2007.13505.
[20] P. Seidl, P. Renz, N. Dyubankova, P. Neves, J. Verhoeven, M. Segler, J. K. Wegner, S. Hochreiter, G. Klambauer, Modern hopfield networks for few- and zero-shot reaction template prediction, 2021. URL: https://arxiv.org/abs/2104.03279. doi:10.48550/ARXIV.2104.03279.
[21] F. Paischer, T. Adler, V. Patil, A. Bitto-Nemling, M. Holzleitner, S. Lehner, H. Eghbal-zadeh, S. Hochreiter, History compression via language models in reinforcement learning, 2022. URL: https://arxiv.org/abs/2205.12258. doi:10.48550/ARXIV.2205.12258.
[22] M. Widrich, M. Hofmarcher, V. P. Patil, A. Bitto-Nemling, S. Hochreiter, Modern hopfield networks for return decomposition for delayed rewards, in: Deep RL Workshop NeurIPS 2021, 2021. URL: https://openreview.net/forum?id=t0PQSDcqAiy.
[23] A. Fürst, E. Rumetshofer, J. Lehner, V. Tran, F. Tang, H. Ramsauer, D. Kreil, M. Kopp, G. Klambauer, A. Bitto-Nemling, S. Hochreiter, Cloob: Modern hopfield networks with infoloob outperform clip, 2021. URL: https://arxiv.org/abs/2110.11316. doi:10.48550/ARXIV.2110.11316.
[24] B. Schäfl, L. Gruber, A. Bitto-Nemling, S. Hochreiter, Hopular: Modern hopfield networks for tabular data, 2022. URL: https://openreview.net/forum?id=3zJVXU311-Q.
[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[26] H. Chen, Z. Shi, A spatial-temporal attention-based method and a new dataset for remote sensing image change detection, Remote Sensing 12 (2020) 1662.