<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Network for Bitemporal Remote Sensing Image Change Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shizhen Chang</string-name>
          <email>shizhen.chang@iarai.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kopp</string-name>
          <email>michael.kopp@iarai.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedram Ghamisi</string-name>
          <email>pedram.ghamisi@iarai.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Machine Learning Group</institution>
          ,
          <addr-line>09599 Freiberg</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Advanced Research in Artificial Intelligence (IARAI)</institution>
          ,
          <addr-line>1030 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The task of bitemporal change detection aims to identify the surface changes of specific scenes at two different points in time. In recent years, we have increasingly witnessed the success of deep learning in a variety of applications in remote sensing, including change detection and monitoring. In this paper, a novel deep feature retrieval neural network architecture for change detection is proposed that uses a trainable associative memory component to exploit potential similarities and connections of the deep features between image pairs. A key ingredient in our novel architecture is the use of a continuous modern Hopfield network component. The proposed method beats the current state of the art on the well-known LEVIR-CD data set. The code of this work will soon be available online (https://github.com/ShizhenChang). Keywords: remote sensing, change detection, modern Hopfield network, deep learning, Siamese network, convolutional neural network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rapid development of technologies for Earth observation, a large amount of very high resolution (VHR) remote sensing data has become available for geographical analysis and image processing [1]. VHR images can provide detailed information about land surfaces, and images collected at different time epochs from the same scene are able to record changes regularly. Therefore, as one of the most important remote sensing tasks, change detection has been widely applied in many areas of land-use and land-cover analysis, such as environmental monitoring, urban growth, deforestation assessment, shifting cultivation evaluation, and so on.</p>
      <p>A variety of deep neural networks, such as the convolutional neural network (CNN) [3], recurrent neural networks [4], generative adversarial networks (GANs) [5], and deep belief networks (DBNs) [6], have been successfully utilized for remote sensing change detection over the last few years. Among them, CNN-based methods can make full use of the spatial information of VHR remote sensing images and, thus, can better extract high-level deep features and abstract semantic contents to learn discriminative differences between the periods.</p>
      <p>Strategies that have been applied to extract deep features of the inputs can be broadly divided into two categories: early fusion [7, 8] and late fusion [9] networks.</p>
      <p>The early-fusion networks first concatenate the multitemporal images into a unified data cube and then extract deep features from the fused input, whereas the late-fusion networks usually learn single-temporal features individually and share the parameters by using a Siamese network. Compared to early-fusion networks, late-fusion methods can better utilize the features of the inputs and return clearer contours of the changed objects. However, the features of the shallower layers may not be sufficiently learned and utilized due to the gradient vanishing problem. Therefore, learning information from both shallow and deep layers is very important to effectively detect changes using deep-learning-based approaches.</p>
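      <p>As a minimal illustration of the two strategies above, the following sketch contrasts early fusion with a weight-sharing Siamese late fusion. A toy linear map stands in for a real CNN backbone; all names and sizes are illustrative and not taken from the paper.</p>

```python
import numpy as np

def extract(x, W):
    """Stand-in feature extractor (one linear map + ReLU); a real
    network would be a deep CNN such as VGG-16."""
    return np.maximum(0.0, W @ x)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(6)   # flattened time-1 image (toy size)
x2 = rng.standard_normal(6)   # flattened time-2 image

# Early fusion: concatenate the two inputs into one cube, extract once.
W_early = rng.standard_normal((4, 12))
feat_early = extract(np.concatenate([x1, x2]), W_early)

# Late fusion (Siamese): one shared set of weights, each input is
# encoded separately, then fused, e.g., by feature difference.
W_shared = rng.standard_normal((4, 6))
feat_late = extract(x1, W_shared) - extract(x2, W_shared)
```

      <p>The shared-weight path is what a Siamese network does; the choice of difference versus concatenation as the fusion step corresponds to the FC-Siam-dif and FC-Siam-conc variants discussed later.</p>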
      <p>In order to accurately extract features, deeper and more complex networks include architecture components such as Long Short-Term Memory (LSTM) [10] and attention mechanisms (self-attention [11], spatial attention [12], and channel attention [8]). The successful combination of CNNs and other networks has shown that discriminative features within the image pairs can be better extracted and the detection accuracy can be greatly improved. However, limited by the architecture of CNNs, as the high-level features are only related to the shallower layers through larger receptive fields, the global and temporal information between the image pairs is still not sufficiently utilized.</p>
      <p>To address this issue, we design a Hopfield pooling block to interactively retrieve the high-level concepts of changes. This idea is inspired by the successful application of the modern Hopfield network for continuous pattern retrieval [13]. Our assumption is that the semantic information between the image pairs in deeper layers can be represented using a common matrix, i.e., a query, that can be learned during the training process. We use this query to retrieve related semantic features between the given images. These retrieved features reflect a common spatio-temporal context and are used by subsequent layers in our network. Concretely, we incorporate a Hopfield network block into a Siamese fully convolutional network (FCN), resulting in the design of our proposed deep feature retrieved network (FrNet) for bitemporal remote sensing change detection. It should be noted that, different from previous change detection models, both semantic and temporal information can be fully considered; this is our first attempt at using modern Hopfield networks in the remote sensing community.</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>Fig. 1. The Hopfield layer retrieves patterns for the queries, each being a linear combination of stored patterns lying in the convex hull of the simplex spanned by the stored patterns. (a) The Hopfield layer associates two sets R and Y to propagate sets of vectors. (b) The Hopfield pooling layer performs a pooling operation on the set Y via learned queries. (c) The Hopfield layer learns a new set of stored patterns based on the input R.</p>
      <p>The rest of this article is organized as follows. Section 2 briefly reviews continuous modern Hopfield networks. Section 3 describes the proposed method. Experiments are conducted and discussed in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Continuous Modern Hopfield Network</title>
      <p>Binary modern Hopfield networks are associative memories on binary data that can retrieve exponentially many stored patterns [14, 15], this being the key distinguishing feature to their classical binary counterparts [16, 17]. These binary modern Hopfield networks have been generalized to continuous modern Hopfield networks that, crucially, are differentiable and can thus be embedded in deep learning architectures trained by gradient descent [13, 18]. Moreover, continuous modern Hopfield networks retain the key ability to store exponentially many patterns, and they can furthermore retrieve patterns in only one update step.</p>
      <p>Given a matrix X of shape d × N formed of column vectors {x_1, ..., x_N} ⊂ ℝ^d, a query pattern ξ ∈ ℝ^d, also a column vector, seeks to retrieve the best pattern in the convex hull of the simplex spanned by the {x_1, ..., x_N}, such that the following energy function is minimized:</p>
      <p>E = −β^{−1} log(∑_{i=1}^{N} exp(β x_i^⊤ ξ)) + (1/2) ξ^⊤ ξ + β^{−1} log N + (1/2) M²,</p>
      <p>where M is the largest norm of the {x_1, ..., x_N} in ℝ^d. As shown in [13, 18], the retrieval is defined by the following update rule:</p>
      <p>ξ^new = f(ξ; X, β) = X softmax(β X^⊤ ξ), (1)</p>
      <p>which converges globally, almost always, to a local minimum of the energy function in essentially one update step. Moreover, equation (1) is closely related to the well-known transformer attention mechanism, showing that retrieval in modern Hopfield networks and transformer attention coincide [13, 18].</p>
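      <p>A minimal numerical sketch of the update rule (1), assuming the notation X for the stored patterns, ξ for the query, and β for the inverse temperature; the function names are our own, not from the paper:</p>

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_retrieve(X, xi, beta):
    """One step of the continuous modern Hopfield update rule (Eq. 1):
    xi_new = X softmax(beta * X^T xi).
    X has shape (d, N), N stored patterns as columns; xi has shape (d,)."""
    return X @ softmax(beta * (X.T @ xi))

# Toy example: store three patterns and retrieve from a noisy query.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # d = 8, N = 3 stored patterns
query = X[:, 0] + 0.1 * rng.standard_normal(8)
retrieved = hopfield_retrieve(X, query, beta=8.0)
# With a sufficiently large beta, the single update sharpens the query
# toward the nearest stored pattern (here, the first column of X).
```

      <p>Replacing the single query vector with a matrix of queries turns this retrieval into exactly the transformer-style attention mentioned above.</p>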
      <p>With changeable structures in deeper networks (as shown in Fig. 1), continuous modern Hopfield networks have great application prospects in deep learning. They have been successfully applied to solving large-scale multi-instance learning tasks [19], to few- and zero-shot chemical reaction template prediction [20], to creating new reinforcement learning algorithms [21, 22], to improving contrastive learning of joint image and text embedding representations [23], and to tabular data [24].</p>
      <p>Inspired by continuous modern Hopfield networks,
we design a Siamese Hopfield pooling layer and attempt
to capture deep feature diferences for remote sensing
bitemporal change detection.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Deep Feature Retrieved Network for Change Detection</title>
      <sec id="sec-4-1">
        <title>3.1. Overview</title>
        <p>As shown in Fig. 2, the proposed deep feature retrieved network (FrNet) is a Siamese network that contains three main parts.</p>
        <p>Fig. 2 legend: change map; backbone blocks; 1 × 1 convolution; upsampling; matrix difference; matrix multiplication; concatenation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Hopfield Pooling Block</title>
        <p>The Hopfield layer is proven to be capable of retrieving key features of the input through one update. For the proposed bitemporal change detection task, the question is: "how can we obtain the most typical information of the changes between the image pairs?"</p>
        <p>Let us assume two temporal VHR images are denoted by X_t ∈ ℝ^{3×h×w}, where t = {1, 2} represents the t-th time period and h and w are the height and width of the images, respectively. Features obtained by the backbone are denoted as F_t ∈ ℝ^{c×h̃×w̃}, where c, h̃, and w̃ represent the channel size, height, and width of the features, respectively. For the proposed VGG-16 feature extractor, the channel size c of F_t is 512, and the height and width of the features are 1/32 of those of the original image.</p>
        <p>In the Hopfield pooling block, the features F_t are first reshaped into matrices of row-wise vectors in ℝ^{c×h̃w̃}. Then, for the time-1 image, we introduce a trainable weight matrix Q ∈ ℝ^{n×h̃w̃} to retrieve the related deep features of F_1 related to the 2nd period. The output can be written as:</p>
        <p>Z_1 = softmax(Q F_1^⊤) F_2. (2)</p>
        <p>The number of rows n in Q is set to 2 in this paper, which represents the change/unchange semantic information we retrieve.</p>
        <p>Similarly, the common weight matrix Q is utilized to retrieve F_2 related to the 1st period:</p>
        <p>Z_2 = softmax(Q F_2^⊤) F_1. (3)</p>
        <p>It should be noted that the retrieved outputs Z_1 and Z_2 have the same size and contain both global and temporal information of the image pairs.</p>
        <p>We concatenate the retrieved outputs together, Z = [Z_1; Z_2], restore their spatial dimensions, and feed them into a 1 × 1 2D convolutional layer with 16 filters to generate a new feature map. After bilinear interpolation, the output features of the Hopfield pooling block are finally derived.</p>
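        <p>Under the matrix shapes reconstructed above (symbols Q, F_t, Z_t; all sizes are illustrative stand-ins, not the paper's exact configuration), the retrieval steps (2) and (3) can be sketched in NumPy as follows:</p>

```python
import numpy as np

def softmax_rows(z):
    # Row-wise numerically stable softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Backbone features F1, F2 reshaped to (c, hw); a shared trainable
# query matrix Q of shape (n, hw) with n = 2 change/unchange rows.
c, hw, n = 512, 64, 2            # hw = h~ * w~, e.g., an 8 x 8 feature map
rng = np.random.default_rng(1)
F1 = rng.standard_normal((c, hw))
F2 = rng.standard_normal((c, hw))
Q = rng.standard_normal((n, hw)) # learned by gradient descent in the model

Z1 = softmax_rows(Q @ F1.T) @ F2   # Eq. (2): query F1, read out from F2
Z2 = softmax_rows(Q @ F2.T) @ F1   # Eq. (3): query F2, read out from F1
Z = np.concatenate([Z1, Z2], axis=0)  # concatenated retrieval, shape (2n, hw)
```

        <p>Because the attention weights are computed against one period while the values come from the other, each Z_t mixes global context from both acquisition times, which is what the subsequent 1 × 1 convolution and upsampling consume.</p>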
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Data Set</title>
        <p>In the experimental part, the LEVIR-CD data set [26] is utilized to compare the change detection methods. The LEVIR-CD data set is composed of 637 VHR (0.5 m/pixel) Google Earth (GE) image pairs with a size of 1024 × 1024 pixels. These image pairs were captured in different periods of 5 to 14 years and cover a total of 31,333 individual buildings for the task of building growth assessment. With the ratio of 7:1:2, these image pairs are split into the training set, validation set, and testing set. Following the initial settings, we crop each image into 16 non-overlapping small patches with a size of 256 × 256 pixels. Thus, there are a total of 7120 image pairs for training, 1024 for validation, and 2048 for testing.</p>
        <p>Precision = TP / (TP + FP), (5)</p>
        <p>Recall = TP / (TP + FN), (6)</p>
        <p>F1 = 2 · Precision · Recall / (Precision + Recall), (7)</p>
        <p>OA = (TP + TN) / (TP + TN + FP + FN), (8)</p>
        <p>where TP (True Positive) represents the number of pixels of real changes that are correctly detected, FP (False Positive) represents the number of pixels of unchanged objects that are falsely detected as changed objects, TN (True Negative) denotes the number of pixels of unchanged objects that are correctly regarded as non-change, and FN (False Negative) denotes the number of changed pixels that are not detected as changed objects.</p>
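        <p>The four metrics (5)-(8) can be computed from a pair of binary change maps as in the following sketch; the function name and the toy data are our own, not from the paper:</p>

```python
import numpy as np

def change_metrics(pred, gt):
    """Pixel-wise precision, recall, F1, and OA (Eqs. 5-8) for binary
    change maps, where 1 marks changed pixels and 0 unchanged ones."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    tp = np.sum(pred & gt)     # changed pixels correctly detected
    fp = np.sum(pred & ~gt)    # unchanged pixels flagged as changed
    tn = np.sum(~pred & ~gt)   # unchanged pixels correctly rejected
    fn = np.sum(~pred & gt)    # changed pixels that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, oa

# Toy 2x4 change maps: 2 TPs, 1 FP, 1 FN, 4 TNs.
pred = [[1, 1, 1, 0], [0, 0, 0, 0]]
gt   = [[1, 1, 0, 1], [0, 0, 0, 0]]
p, r, f1, oa = change_metrics(pred, gt)
# p = 2/3, r = 2/3, f1 = 2/3, oa = 6/8
```

        <p>Note that OA is dominated by the abundant unchanged pixels, which is why F1 is the more discriminative score on LEVIR-CD.</p>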
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Comparative Method and Evaluation Metrics</title>
        <p>To verify the effectiveness of the proposed FrNet method, four representative deep-learning-based change detection networks are taken into consideration. The FC-EF [7] is an early-fusion method based on U-Net that concatenates the bitemporal image pairs as the input. Its extended versions, the FC-Siam-dif and FC-Siam-conc [7], use Siamese networks with shared weights to extract multi-level features and use feature difference and concatenation, respectively, to fuse bitemporal information. The bitemporal image transformer (BIT) network [12] designs a context-information-based enhancer to extract related concepts in the token-based space-time and projects the context-rich tokens back to the original features for prediction.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Experimental Results and Analysis</title>
        <p>In our experiments, the proposed FrNet is implemented on the PyTorch platform using a single NVIDIA A100 GPU (with 40-GB RAM). During the training stage, the Adam optimizer with a weight decay of 1e−5 was employed. The batch size is set to 32, and the learning rate is initially set to 1e−4 and linearly reduced to 0 over 50,000 iterations. The β of the Hopfield layer is set to 1/√d.</p>
        <p>The quantitative results for the precision, recall, F1 score, and OA of all models are summarized in Table 1. It can be found that FC-EF obtains the lowest F1 score (75.25%) and OA (96.78%) among all the models. The FC-Siam-conc and FC-Siam-dif perform slightly better than FC-EF, which indicates that the Siamese network and feature difference/concatenation have benefits for the preservation of useful information. The F1 score and OA of the BIT model are 83.22% and 98.06%, respectively, better than those of the other FC-based models. This demonstrates that the tokens in space-time can effectively capture the temporal changes and enhance the context information. The proposed FrNet achieves the highest F1 and OA among all the studied methods and performs better than our base model. The improvements prove that the Hopfield layer helps retrieve the deep features and that the shared query matrix can learn important information as part of the inputs for the decoder.</p>
        <p>Fig. 3 illustrates change detection maps obtained by different methods, where TPs, TNs, FPs, and FNs are represented in yellow, black, red, and green, respectively. We can observe that FrNet achieves the best results among all the models. Firstly, FrNet can better distinguish small-sized changed buildings that have relatively regular shapes by reducing false alarms compared with the other methods (e.g., the 1st, 2nd, and 3rd rows of Fig. 3). When the shapes of the buildings are complex, our model can also preserve the boundaries of the objects (e.g., the 4th, 5th, and 6th rows of Fig. 3).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>Inspired by the successful application of continuous modern Hopfield networks for pattern retrieval, we propose a deep feature retrieved network (FrNet) for bitemporal change detection. Our Hopfield pooling block introduces a trainable weight matrix that aims to retrieve the global changes of interest for high-level features and capture the discriminative representations of one period related to the other. To evaluate the effectiveness of the proposed model, experiments are conducted on the LEVIR-CD data set. Our empirical evidence confirms the superiority of the proposed FrNet in comparison with other state-of-the-art methods.</p>
      <p>Acknowledgments: The authors would like to thank the contributors of the LEVIR-CD data set for making it publicly available, and the authors of the FC-EF, FC-Siam-conc, FC-Siam-dif, and BIT methods for releasing their codes.</p>
      <p>2205.12258.
[22] M. Widrich, M. Hofmarcher, V. P. Patil, A. Bitto-Nemling, S. Hochreiter, Modern Hopfield networks for return decomposition for delayed rewards, in: Deep RL Workshop NeurIPS 2021, 2021. URL: https://openreview.net/forum?id=t0PQSDcqAiy.
[23] A. Fürst, E. Rumetshofer, J. Lehner, V. Tran, F. Tang, H. Ramsauer, D. Kreil, M. Kopp, G. Klambauer, A. Bitto-Nemling, S. Hochreiter, CLOOB: Modern Hopfield networks with InfoLOOB outperform CLIP, 2021. URL: https://arxiv.org/abs/2110.11316. doi:10.48550/ARXIV.2110.11316.
[24] B. Schäfl, L. Gruber, A. Bitto-Nemling, S. Hochreiter, Hopular: Modern Hopfield networks for tabular data, 2022. URL: https://openreview.net/forum?id=3zJVXU311-Q.
[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[26] H. Chen, Z. Shi, A spatial-temporal attention-based method and a new dataset for remote sensing image change detection, Remote Sensing 12 (2020) 1662.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>