<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Point-Based Weakly Supervised Deep Learning for Water Extraction from High-Resolution Remote Sensing Imagery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ming Lu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leyuan Fang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>the College of Computer Science, Sichuan University</institution>
          ,
          <addr-line>Chengdu 610065</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>the College of Electrical and Information Engineering, Hunan University</institution>
          ,
          <addr-line>Changsha 410082</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The use of deep learning for water extraction requires precise pixel-level labels. However, it is very difficult to label high-resolution remote sensing images at the pixel level. Therefore, we study how to utilize point labels to extract water bodies and propose a novel method called the neighbor feature aggregation network (NFANet). Compared with pixel-level labels, point labels are much easier to obtain, but they lose a lot of information. In this paper, we take advantage of the similarity between the adjacent pixels of a local water body and propose a neighbor sampler to resample remote sensing images. The sampled images are then sent to the network for feature aggregation. Our method uses neighboring features instead of global or local features to learn more representative features. The experimental results show that the proposed NFANet not only outperforms other weakly supervised approaches, but also obtains results similar to those of state-of-the-art ones.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep learning</kwd>
        <kwd>weak supervision</kwd>
        <kwd>semantic segmentation</kwd>
        <kwd>water extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Water-body extraction from high-resolution remote sensing images is an important research topic in the field of remote sensing. Although traditional algorithms have made some progress in water-body extraction, there are still problems such as low automation, cumbersome manual feature extraction, and insufficient extraction accuracy. In recent years, deep learning has become an emerging research hot spot in the field of artificial intelligence. The rapid development of deep learning technology and the improvement of computer hardware performance have made deep learning, especially CNN-based techniques, successful in many important tasks, such as image classification, target detection, and semantic segmentation, and their performance has surpassed many traditional algorithms. The work in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposes a method that combines a graph convolutional network (GCN) and a CNN to fuse different hyperspectral features and improve the performance of hyperspectral classification. Work in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] studies multi-modal models and proposes a variety of plug-and-play fusion modules to fuse the features of remote sensing images of different modalities. Work in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] discusses the importance of nonconvex modeling in interpretable AI models from multiple perspectives. Therefore, it is necessary to apply deep learning to extract water bodies [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ].</p>
      <p>Unfortunately, the success of deep learning for feature extraction is highly dependent on the availability of sufficient pixel-level labels for training. However, high-resolution remote sensing images are large in scale and data volume, which makes pixel-level labeling extremely laborious. Pixel-level annotation usually requires a lot of time and labor, as well as professional knowledge to accurately mark uncertain boundaries between different classes of interest, which hinders the extraction of informative features from high-resolution remote sensing images to a certain extent. Training models with weak labels has therefore received more and more attention in the field of computer vision. Compared with fully supervised semantic segmentation, weakly supervised learning does not require pixel-level labels and has the advantages of fast labeling and low cost. However, the use of weak annotations makes the supervision information seriously insufficient; key information such as shape, texture, and edges is usually lost, which makes it difficult to extract water from high-resolution remote sensing images with complex scenes.</p>
      <p>Some researchers try to combine traditional methods with deep learning to solve weak supervision problems. The work in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] combines super-pixels and a local map to obtain rough pseudo-labels to train a water extraction model. Work in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] combines super-pixel pooling with multi-scale feature fusion to detect buildings. Other researchers attempt to obtain better results by using the extraction capabilities of neural networks. Work in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] follows the principle of CAM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and extracts feature maps from UNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for hard-threshold processing to obtain segmentation predictions. These methods have achieved promising results in the field of weakly supervised learning, but they do not consider the characteristics of the image itself.</p>
      <p>Unlike other natural objects, water bodies are usually liquid, and the colors and textures of local water bodies are very similar. Therefore, there is a high degree of similarity between neighboring pixels in water bodies, which makes the inherent difference between the features of neighboring water-body pixels generally smaller than that of non-water-body pixels. We aim to map the neighboring pixels of the remote sensing image to the same location in space, extract neighbor features from multiple neighboring pixels, and use these neighbor features to jointly decide whether the pixel at this location belongs to a water body. Based on this motivation, we propose the neighbor feature aggregation network (NFANet) to make full use of this property. Specifically, we utilize a sampling method called the neighbor sampler to generate a set of neighbor images from a high-resolution remote sensing image. The neighboring pixels of the original image are separately allocated to each neighbor image, so that the pixel values of any two neighbor images at the same position are similar but not identical. On the whole, neighbor image groups have similar but different characteristics. Then, we use an end-to-end model to perform feature extraction on each image of the neighbor image group, and aggregate the features by using the feature aggregation module. Compared with other methods that only use the local or global information of an image, the neighbor feature aggregation effectively utilizes the neighbor information and, therefore, more representative features can be learned.</p>
    </sec>
    <sec id="sec-m">
      <title>2. Method</title>
      <p>Figure 1 illustrates the proposed weakly supervised water extraction framework. Figure 1.a shows the entire recursive training process, which we describe in Section 2.3. The acquisition of pseudo-labels is shown in Figure 1.b. We input neighbor images into the network and use point labels for supervision to obtain neighbor features. Then the feature aggregation module is used to aggregate the features extracted in the previous step. Finally, post-processing is performed to obtain pseudo-labels. We describe the details of each of the above steps in the following sections.</p>
      <sec id="sec-1-1">
        <title>First, we introduce a neighbor sampler to obtain a neigh</title>
        <p>bor images group (1 () , 2 () , . . . ,  ()) from a
single optical remote sensing image .  represents
the number of neighbor images. Figure 2 shows the
schematic diagram of generating a group of neighbor</p>
      </sec>
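        <p>To make the sampling rule concrete, the following minimal numpy sketch (our illustration under the assumptions above; the function name and shapes are not from the authors' released code) rearranges a tile into its k × k neighbor images:</p>
        <preformat>
# Minimal numpy sketch of the neighbor sampler (k = 2); names are illustrative.
import numpy as np

def neighbor_sampler(image, k=2):
    """Split an (H, W, C) image into k*k neighbor images.

    The n-th neighbor image collects, from every k x k cell, the pixel at
    the n-th within-cell position (row-major), so any two neighbor images
    hold nearby, similar pixels of the original image.
    """
    h, w, c = image.shape
    assert h % k == 0 and w % k == 0, "tile size must be divisible by k"
    # (H/k, k, W/k, k, C) -> (k, k, H/k, W/k, C) -> (k*k, H/k, W/k, C)
    cells = image.reshape(h // k, k, w // k, k, c)
    return cells.transpose(1, 3, 0, 2, 4).reshape(k * k, h // k, w // k, c)

# Example: one 492 x 492 RGB tile yields 4 neighbor images of 246 x 246.
tile = np.random.rand(492, 492, 3)
print(neighbor_sampler(tile).shape)  # (4, 246, 246, 3)
        </preformat>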
      <sec id="sec-1-2">
        <title>CMax pooling is adopted to reduce the number of</title>
        <p>channels of each neighbor feature to one. CMax
pooling is defined mathematically in detail as follows:
Given a three-dimensional feature maps tensor group
 = (1 () , 2 () , . . . ,   ()) ∈ R×  × × ,
The operation of CMax pooling is as follows:
,,() = (,,1,(), ,,2,(), . . . , ,,3,()), (1)</p>
        <p>= 1, 2, . . . , ,  = 1, 2, . . . , ,  = 1, 2, . . . , .</p>
        <p>As a result, the feature maps group  =
(1 () , 2 () , . . . ,  ()) ∈ R×  ×  is
obtained. Then, the OTSU algorithm is used to
binarize each feature in  to obtain the result
 = (1 () , 2 () , . . . ,  ()) ∈ R×  × . The
formula is as follows:
images using the neighbor sampler. Let us assume that
the width, height, and channel of the input image  are
bo,rsa, mp, lreerspect=ive(ly.1T,h2e,i.m. p.l,eme)nitsatdioenscorfibtehde anseifgohl--  =  () ,  = 1, 2, . . . , . (2)
lows: Finally, we vote for all binarized neighbor features of the
1. The image  is divided into  ×  cells, where neighbor feature group to obtain the aggregated result
the size of each cell is  ×  × .  is experimentally  .  is calculated using the following equation:
set to 2 and, therefore,  =  ×  = 4.</p>
        <p>2. For the  − ℎ row and  − ℎ column of the cell, the {︃1, ∑︀ 2
pixels in the adjacent positions of each cell are selected , = 0, ∑︀=1 ,, ≥  (3)
in the order from top to bottom and from left to right, =1 ,, &lt; 2
which are regarded as the (, ) − ℎ elements of  = To sum up, the mathematical definition of the feature
(1 () , 2 () , . . . ,  ()). When  is set to 2, the aggregation module is detailed as follows:
pixels at the upper left, upper right, lower left, and lower
right adjacent positions are selected, respectively.  =   ( (  ( ))) (4)
3. For all  ×  cells being divided in step 1, step
2 will be repeated until all the cells are resampled, and where  ∈ R×  × ×  represents the neighbor
feaa neighbor sampler  = (1, 2, . . . , ) is generated. tures group and  ∈ R×  is the output. Next, the
Given an optical remote sensing image , neighbor im- aggregated result  is input to the post-processing
modages group (1 () , 2 () , . . . ,  ()) is generated, ule. The specific operations include filling small holes in
where the size of each neighbor image is  ×  × . the closed area by using area filling and removing noise</p>
        <p>In this way, the neighbor image dataset can be gen- by using morphological operations. Then, we apply a
erated from the original dataset. Neighbor images are point-label constraint to the processed results. If the area
similar but not identical, because for any two neighbor in the result contains point labels, the entire area is
reimages, (, ) − ℎ pixel comes from the neighboring tained, otherwise it is not retained. The generated results
location of the original remote sensing image. are used as pseudo-labels and input into the recursive
training as supervision information.</p>
        <sec id="sec-1-2-1">
          <title>2.2. Neighbor feature aggregation and</title>
          <p>post-processing</p>
        </sec>
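        <p>A compact sketch of Eqs. (1)-(4) and the post-processing step is given below. It assumes numpy, scipy.ndimage, and skimage's Otsu thresholding; the function names and the default 3 × 3 structuring element are our choices, not the authors' code:</p>
        <preformat>
# Sketch of the feature aggregation module (Eqs. (1)-(4)) and post-processing.
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu

def aggregate(features):
    """features: (N, W', H', C') neighbor features -> (W', H') binary map."""
    m = features.max(axis=-1)                          # Eq. (1): CMax pooling
    b = np.stack([x >= threshold_otsu(x) for x in m])  # Eq. (2): OTSU per map
    n = b.shape[0]
    return (b.sum(axis=0) * 2 >= n).astype(np.uint8)   # Eq. (3)/(4): majority vote

def post_process(vote_map, point_labels):
    """Fill holes, denoise, and keep only regions containing a point label."""
    mask = ndimage.binary_fill_holes(vote_map)  # area filling of closed regions
    mask = ndimage.binary_opening(mask)         # morphological noise removal
    regions, num = ndimage.label(mask)          # connected components
    keep = np.zeros_like(mask)
    for r in range(1, num + 1):
        if point_labels[regions == r].any():    # point-label constraint
            keep[regions == r] = True
    return keep.astype(np.uint8)                # pseudo-label
        </preformat>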
        <sec id="sec-1-2-2">
          <title>2.3. Recursive training</title>
          <p>
            We input the neighbor images group to an Recursive training is a weakly supervised strategy. When
end-to-end network to extract features, and ob- applying the resulting model over the training set, the
nettain the corresponding neighbor feature group work outputs capture the shape of objects significantly
(1 () , 2 () , . . . ,   ()) ∈ R×  × × , where better than that of just pseudo-labels [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. We have
ob () ∈ R×  ×  represents the feature maps served through experiments that when the training set is
extracted from the  − ℎ image in the neighbor input to the network again, the obtained network output
iamsatghees fgeraotuupre. Wexetraucsteiotnheneentwcoodrekr.-deScpoedceificraslltyru,ctthuere will become smoother than the coarse-grained
pseudofeature maps are extracted from the penultimate label, which improves the accuracy of the prediction
convolutional layer. The network structure is shown result to a certain extent.
in Figure 1.b. It is worth noting that the network is We embed the neighbor sampler into the recursive
replaceable (in the experimental part, a variety of training so that the network can learn neighbor features
network structures are used for feature extraction). (the flowchart is shown in Figure 1.a). Recursive training
consists of three steps. First, the remote sensing image
is used to generate neighbor images group. We apply
the neighbor images group and point-label to train the
network to obtain pseudo-label. Second, the
pseudolabel is used to generate pseudo-labels groups. It is worth
noting that the  − ℎ image of the neighbor images
group and the  − ℎ image of the pseudo-labels group
are resampled in the same way. Third, input the  − ℎ
image into the network and utilize the  − ℎ label as
the supervision information for training. After training
the model with all training sets, the neighbor images
group are input again to obtain the results group. When
 = 2, the number of results is 4. We perform a weighted
average on the results group to obtain a new
pseudolabel.
          </p>
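        <p>The schedule can be summarized with the following sketch, which reuses the neighbor_sampler, aggregate, and post_process helpers above. The extract_features, predict, and train_step stubs stand in for the replaceable segmentation network, and the nearest-neighbor upsampling and uniform averaging weights are our simplifications; none of the names come from the authors' code:</p>
        <preformat>
# High-level sketch of recursive training with the neighbor sampler embedded.
import numpy as np

def extract_features(group):
    n, h, w = group.shape[:3]
    return np.random.rand(n, h, w, 16)  # stub: per-neighbor feature maps

def predict(group):
    n, h, w = group.shape[:3]
    return np.random.rand(n, h, w)      # stub: per-neighbor water scores

def train_step(image, label):
    pass                                # stub: one optimization step

def upsample(label, k):
    # Nearest-neighbor upsampling back to tile size (our simplification).
    return np.repeat(np.repeat(label, k, axis=0), k, axis=1)

def recursive_training(images, point_labels, rounds=3, k=2):
    pseudo = {}
    # Step 1: train on neighbor images with point labels (inside the stubs),
    # then fuse the neighbor features into an initial pseudo-label per image.
    for i, img in enumerate(images):
        group = neighbor_sampler(img, k)
        vote = aggregate(extract_features(group))
        points = point_labels[i][::k, ::k]  # 5x5 point blobs survive subsampling
        pseudo[i] = upsample(post_process(vote, points), k)
    for _ in range(rounds):
        for i, img in enumerate(images):
            group = neighbor_sampler(img, k)
            # Step 2: resample the pseudo-label with the same sampler, pairing
            # the n-th neighbor image with the n-th neighbor label.
            labels = neighbor_sampler(pseudo[i][..., None], k)[..., 0]
            # Step 3: train on each (neighbor image, neighbor label) pair.
            for n in range(k * k):
                train_step(group[n], labels[n])
            # Average the k*k outputs into a refined pseudo-label.
            scores = predict(group).mean(axis=0)
            pseudo[i] = upsample((scores >= 0.5).astype(np.uint8), k)
    return pseudo
        </preformat>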
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experimental results</title>
      <sec id="sec-2-1">
        <title>3.1. Datasets and evaluation</title>
        <sec id="sec-2-1-1">
          <title>To prove the efectiveness of the proposed method, we</title>
          <p>
            applied the method to high-resolution visible spectrum
images for water extraction. This water-body dataset
comes from the Gaofen Challenge [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], which contains
RGB pan-sharpened images with a resolution of 0.5 m
and does not contain infrared bands or digital elevation
models. All images are taken from Wuhan and Suzhou,
China, mainly in rural areas supplemented by urban
areas. The positive labels in the dataset include rivers,
reservoirs, rice fields, ditches, ponds, and lakes, while all
other non-water pixels are considered negative. The data
set is cropped into 1000 images with the size of 492×492
without any overlap. We re-annotated the dataset. The
rule is that each independent water body is randomly
labeled with a point label of the size 5×5.
          </p>
          <p>In the experiment, the weak supervision models use
point labels as the initial supervision information, while other weak supervision methods can predict the local
the full supervision models use pixel-level labels. Because area of the water body, there are errors in the detection of
the remote sensing image segmentation/classification the water body boundary, while our method is relatively
evaluation index of overall accuracy or Kappa coeficient more accurate. The studied weak supervised methods
cannot efectively describe the real structure of image cannot detect small objects appropriately while this issue
segmentation geometry, we choose to use fgIoU (fore- is solved to a great extent by the proposed method.
ground IoU), bgIoU (background IoU), mIoU (mean IoU),
fgDice (foreground Dice), bgDice (background Dice) and 3.4. Efectiveness of neighbor sampling
mDice (mean Dice) to comprehensively evaluate the
results. For each model, we performed five independent In the ablation experiment, the other settings are
unruns to calculate the aforementioned evaluation indica- changed, and only the value of  is changed. We set the
tors and standard deviations. neighbor sampling parameter k of our proposed network
from 1 to 4 and only use cross entropy and dice loss to
3.2. Comparison with Fully Supervised train the model. For diferent  values of NFANet, in
order to avoid interference from other modules, we only</p>
          <p>Approaches select UNet as the feature extraction network for
comIn Table 1, we report the water extraction performance parative experiments. In particular, when the value of
of our proposed approach and compare it with the fully  is set to 1, the neighbor image group degenerates into
supervised approaches. Figure 3 also provides the visual the input image. As shown in Figure 5, with the gradual
performance of all approaches. These approaches ran- increase of , mIoU first increases and then decreases.
domly use 70% of the samples as the training set, and With the increase of neighborhood sampling
paramethe remaining data as the test set. Experiments demon- ter , the number of adjacent pixels to be considered
strate that our method achieves the best score using the increase geometrically, resulting in information
redunNestedUNet-based model, and the visual performance dancy. The size of each reconstructed neighbor image is
shows that the prediction results obtained by our method gradually reduced, and the boundary of the water body
are very close to the ground truth. The mIoU of our will also become unclear. Therefore, we set  equal to 2,
method reached 75.22%, and the mDice reached 85.04%. because the neighbor features require less computation
Compared with the best fully-supervised model DeepLab and achieves better performance.</p>
          <p>V3+, the mIoU of our method is only reduced by 3.03%,
and mDice is only reduced by 2.23%. But the labeling 3.5. Efectiveness of feature aggregation
cost of our method is much less than that of the fully
supervised method. Nevertheless, it is dificult to achieve
fully supervised performance using only point labels.</p>
        </sec>
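          <p>For reference, the following minimal numpy sketch (our illustration) shows how the six scores can be computed from a predicted mask and a reference mask:</p>
          <preformat>
# Minimal numpy sketch of the reported metrics (water = 1, non-water = 0).
import numpy as np

def iou_and_dice(pred, gt):
    """Return (fgIoU, bgIoU, mIoU, fgDice, bgDice, mDice) for binary masks."""
    scores = []
    for cls in (1, 0):  # foreground first, then background
        p = (pred == cls)
        g = (gt == cls)
        inter = np.logical_and(p, g).sum()
        union = max(np.logical_or(p, g).sum(), 1)  # guard empty classes
        denom = max(p.sum() + g.sum(), 1)
        scores.append((inter / union, 2 * inter / denom))
    (fg_iou, fg_dice), (bg_iou, bg_dice) = scores
    return (fg_iou, bg_iou, (fg_iou + bg_iou) / 2,
            fg_dice, bg_dice, (fg_dice + bg_dice) / 2)
          </preformat>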
      </sec>
      <sec id="sec-2-2">
        <title>3.3. Comparison with Weakly Supervised</title>
      </sec>
      <sec id="sec-2-3">
        <title>Approaches</title>
        <sec id="sec-2-3-1">
          <title>We compare our method with several other weakly su</title>
          <p>pervised remote sensing approaches. The experimental
results are shown in Table 2. To be fair, all methods
are based on UNet. It can be seen that the mIoU of
our method is 8.84%, which is higher than that of the
U-CAM-based method with the mDice of 6.89%. In
addition, Figure 4 shows the prediction results of other
weakly supervised methods and our method. Although
As shown in Table 3, when  is set to 2 ( = 4),
assuming that the features of the  − ℎ neighboring image is ,
∑︀</p>
          <p>=1  means feature aggregation is used. Compared
with the best method that does not use feature
aggregation, the mIoU and mDice of our method are improved
by 4.8% and 3.7% respectively. To a certain extent, the
greater the number of neighboring images, the more
features are available, and these features can complement
each other. Therefore, after the feature aggregation, the
performance of the prediction results can be improved.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.6. Time consumption</title>
        <sec id="sec-2-4-1">
          <title>The hardware configurations for the experiments in this</title>
          <p>paper consisted of Intel Core i7-9700k 3.60 GHz CPU,
GeForce RTX 2080Ti GPU, and 16GB RAM. The results of
the GPU inference time are shown in Table 4. The results
in the table are the average GPU inference time of the
data set. After using recursive training to improve the
quality of pseudo-labels, we input the pseudo-labels and
original images into the models consistent with the
fullysupervised methods for training and inference. Therefore,
the GPU inference time of the proposed method is the
same as that of the fully-supervised methods. It can be
observed from the table that the inference time of DLinkNet
is the shortest. because DLinkNet compresses the feature
channel in the decoder to reduce the computational cost.
NestedUNet embeds U-Nets of diferent depths in its
architecture, which requires more convolution calculations,
thus increasing the consumption of inference time.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <sec id="sec-3-1">
        <title>In this paper, we proposed a network entitled NFANet.</title>
        <p>Unlike traditional convolutional neural networks that
only use global or local features for discrimination,
NFANet uses neighbor features, which allows more
representative features to be learned. We fuse these
neighbor features to obtain pseudo-labels, and improve the
label quality by recursive training. We tested it on water
data sets and compared it with advanced fully supervised
and weakly supervised methods. By using only point
labels, the proposed method obtains comparable results
with that of full supervision. As a possible future work,
we will conduct research on weakly supervised or
semisupervised methods of self-correction. Remote sensing
images collected from satellites are usually afected by
spectral variability.</p>
      <p>Work in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] uses an endmember dictionary and a spectral variability dictionary to model different kinds of spectral variability, and provides reasonable prior knowledge for the spectral variability dictionary. Our proposed method considers local neighboring pixels. Therefore, when encountering various degradations, noise influences, and other variability factors, it is necessary to analyze whether these factors cause greater interference between neighboring pixels. If the variability factors affect different local areas differently, the prediction results are very likely to miss water bodies. In future work, we will consider introducing cross-local features to improve the network’s ability to learn features across different local water bodies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chanussot</surname>
          </string-name>
          ,
          <article-title>Graph convolutional networks for hyperspectral image classification</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>59</volume>
          (
          <year>2021</year>
          )
          <fpage>5966</fpage>
          -
          <lpage>5978</lpage>
          . doi:10.1109/TGRS.2020.3015157.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yokoya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chanussot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>More diverse means better: Multimodal deep learning meets remote-sensing imagery classification</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>59</volume>
          (
          <year>2021</year>
          )
          <fpage>4340</fpage>
          -
          <lpage>4354</lpage>
          . doi:10.1109/TGRS.2020.3016820.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yokoya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chanussot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Interpretable hyperspectral artificial intelligence: When nonconvex modeling meets hyperspectral remote sensing</article-title>
          ,
          <source>IEEE Geoscience and Remote Sensing Magazine</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>52</fpage>
          -
          <lpage>87</lpage>
          . doi:10.1109/MGRS.2021.3064051.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Sea ice and open water classification of sar images using a deep learning model</article-title>
          ,
          <source>in: IGARSS 2020 - 2020 IEEE International Geoscience and Remote Sensing Symposium</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3051</fpage>
          -
          <lpage>3054</lpage>
          . doi:10.1109/IGARSS39084.2020.9323990.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Poliyapram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Imamoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <article-title>Deep learning model for water/ice/land classification using large-scale medium resolution satellite images</article-title>
          ,
          <source>in: IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3884</fpage>
          -
          <lpage>3887</lpage>
          . doi:10.1109/IGARSS.2019.8900323.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Optical remote sensing image waters extraction technology based on deep learning context-unet</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi:10.1109/ICSIDP47821.2019.9173433.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Diao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Wsf-net: Weakly supervised feature-fusion network for binary segmentation in remote sensing image</article-title>
          ,
          <source>Remote Sensing</source>
          <volume>10</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Spmf-net: Weakly supervised building segmentation by combining superpixel pooling and multi-scale feature fusion</article-title>
          ,
          <source>Remote Sensing</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>1049</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Azzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Lobell</surname>
          </string-name>
          ,
          <article-title>Weakly supervised deep learning for segmentation of remote sensing imagery</article-title>
          ,
          <source>Remote Sensing</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>207</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lapedriza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <article-title>Learning deep features for discriminative localization</article-title>
          ,
          <source>CVPR</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          ,
          <source>International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khoreva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Benenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hosang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>Simple does it: Weakly supervised instance and semantic segmentation</article-title>
          ,
          <source>in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Sea-land segmentation with res-unet and fully connected crf</article-title>
          ,
          <source>in: IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3840</fpage>
          -
          <lpage>3843</lpage>
          . doi:10.1109/IGARSS.2019.8900625.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siddiquee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tajbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Unet++: A nested u-net architecture for medical image segmentation</article-title>
          ,
          <source>4th Deep Learning in Medical Image Analysis (DLMIA) Workshop</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>D-linknet: Linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction</article-title>
          ,
          <source>in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>1924</lpage>
          . doi:10.1109/CVPRW.2018.00034.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papandreou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>Encoder-decoder with atrous separable convolution for semantic image segmentation</article-title>
          , Springer, Cham (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>2020 Gaofen challenge on automated high-resolution earth observation image interpretation</article-title>
          , http://en.sw.chreos.org/,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yokoya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chanussot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>An augmented linear mixing model to address spectral variability for hyperspectral unmixing</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>28</volume>
          (
          <year>2019</year>
          )
          <fpage>1923</fpage>
          -
          <lpage>1938</lpage>
          . doi:10.1109/TIP.2018.2878958.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>