=Paper=
{{Paper
|id=Vol-2491/paper120
|storemode=property
|title=Autoencoder-Learned Local Image Descriptor for Image Inpainting
|pdfUrl=https://ceur-ws.org/Vol-2491/paper120.pdf
|volume=Vol-2491
|authors=Nina Zizakic,Izumi Ito,Laurens Meeus,Aleksandra Pizurica
|dblpUrl=https://dblp.org/rec/conf/bnaic/ZizakicIMP19
}}
==Autoencoder-Learned Local Image Descriptor for Image Inpainting==
Nina Žižakić¹*, Izumi Ito², Laurens Meeus¹, and Aleksandra Pižurica¹

¹ Department of Telecommunications and Information Processing, TELIN – GAIM, Ghent University – imec, Ghent, Belgium
{nina.zizakic, laurens.meeus, aleksandra.pizurica}@ugent.be
² Information and Communications Engineering, Tokyo Institute of Technology, Tokyo, Japan
ito@ict.e.titech.ac.jp

* Corresponding author
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. In this paper, we propose an efficient method for learning local image descriptors suitable for use in image inpainting algorithms. We learn the descriptors using a convolutional autoencoder network designed so that patch descriptors can be extracted efficiently through an intermediate image representation. This approach saves computational memory and time compared to existing methods when used with algorithms that require patch search and matching within a single image. We demonstrate these benefits by integrating our descriptor into an inpainting algorithm and comparing it to an existing autoencoder-based descriptor. We also show results indicating that our descriptor improves robustness to missing areas of the patches.

Keywords. Local image descriptors, patch descriptors, autoencoders, unsupervised deep learning, inpainting.

1 Introduction

Local image descriptors are a crucial component of many image processing tasks: image denoising, inpainting, stitching, and object tracking, to name a few. In recent years, the approach to designing these patch descriptors has shifted from hand-crafted features to (deep) learning. Learned descriptors have been shown to outperform some hand-crafted ones on benchmarks [1, 7]. However, despite many advancements in the learning approach and its superior performance on benchmarks, hand-crafted descriptors still perform comparably or better than learned descriptors in practical contexts [16]. He et al. argue that descriptor learning should not be approached as a standalone problem, but rather as a component of a broader image processing task, which must also be taken into consideration [8].

In this paper, we present a descriptor whose learning process is specifically tailored for use in an image inpainting algorithm. The descriptor is based on our previous work [19] and has been extended so that it can be applied to image inpainting tasks. We show that our descriptor improves the inpainting result and makes the task computationally feasible in cases where it previously was not. The computational time of the inpainting is reduced by leveraging our specific autoencoder architecture, which allows for the calculation of an intermediate representation of the image [19]. From the intermediate representation, we can obtain the descriptors using a simple operation. We hypothesise that the improved inpainting results stem from (i) high robustness to missing parts of patches in comparison to the other descriptors (shown in Section 4), and (ii) the fine-tuning of the descriptor on data similar to what will be used for the inpainting. This fine-tuning is possible since our descriptor is trained in an unsupervised fashion.
To achieve such fine-tuning with other (supervised) descriptors, it would be necessary to have a labelled set for the type of images that need to be inpainted, which is not feasible in most cases. Furthermore, and in contrast to other local feature descriptors, our descriptor is intentionally designed to be sensitive to translation and rotation. This property is not always desirable, and many descriptors are designed to be translation invariant. However, translation-invariant descriptors do not achieve good results in inpainting, since they tend to retrieve patches that are rotated or scaled and introduce irregularities in the inpainted edges.

In the next section we discuss the current state of local image descriptors and, in particular, descriptor learning. We also include a brief introduction to autoencoders. We describe our method in Section 3. Section 4 contains the experimental results on robustness to missing data and the application to inpainting. We conclude our work in Section 5.

2 Related work

The classical approach uses hand-crafted features to design local image descriptors. The most well-known descriptors include SIFT [12], SURF [3], BRIEF [4], and ORB [14], which continue to be relevant to modern approaches. In recent years, the development of deep learning techniques has resulted in numerous learned descriptors [21, 17, 2, 8]. These descriptors are mostly learned in a supervised fashion, with labels on pairs of patches indicating their similarity or dissimilarity.

While a few learned descriptors show high performance on benchmarks [1, 7], established hand-crafted descriptors such as SIFT are consistently chosen over the learned ones in practical applications [16]. He et al. argue that this is because the descriptors were trained too generally, and that training a descriptor with a specific image processing task in mind can make it perform better than the general descriptors [8]. This effect is partially explained by the fact that different image processing tasks value different properties in descriptors. For example, in object tracking it is desirable that the descriptor is translation invariant, whereas this property is not desirable in descriptors for inpainting.

Since supervised methods depend on labelled data, it is often infeasible to create a supervised descriptor for a specific image processing task. Furthermore, since most labelled data for patch descriptors is designed for object tracking and hence favours translation invariance, there are very few labelled datasets suitable for inpainting.

On the other hand, unsupervised learning methods such as autoencoders do not suffer from this dependence on labelled data. Chen et al. were the first to propose applying autoencoders to the general problem of descriptor learning [5]. Their autoencoder-learned descriptor showed promising results; however, its computational time and memory usage make the method infeasible for high-resolution image processing tasks. Their fully-connected network has more parameters to train and requires longer training times than convolutional autoencoder designs. Moreover, their descriptor does not allow different input sizes, so a separate autoencoder needs to be trained for every patch size, which renders the framework unusable in practical scenarios.
In our previous work [19], we proposed an autoencoder-based patch descriptor designed for applications with many patch comparisons within a single image. Our specific network architecture yields a special image representation that we refer to as the intermediate representation (IR). The IR is a compact way of storing the descriptors of all the patches of an image, because the descriptors of overlapping patches themselves overlap. Extracting a descriptor from the IR is fast: it requires only a max-pooling operation. Moreover, the use of convolutional layers ensures a more efficient learning process and makes the descriptor usable for all patch sizes.

In this paper, we build on our previous work to show that a descriptor designed specifically for an image processing task can outperform general descriptors and can improve the performance of the task, both in terms of computational time and in terms of visual inspection. Furthermore, the unsupervised nature of our descriptor facilitates fine-tuning on the specific type of data to be used in the image processing task, further improving its performance.

Fig. 1. The proposed autoencoder architecture. Max-pooling layers after the first two convolutional layers have been omitted in order to obtain an intermediate representation (IR) of the image that preserves the spatial information.

Fig. 2. Exploiting the proposed intermediate representation of an image in algorithms that require many patch comparisons. The intermediate representation is calculated once from the original image through the convolutional layers of the encoder. In algorithms that need to compare patches (e.g. inpainting), the descriptors are extracted from the IR using the fast max-pooling operation, and then compared.

2.1 Autoencoders

An autoencoder is a type of artificial neural network used to learn efficient data representations in an unsupervised fashion. An autoencoder consists of two parts: the encoder, which maps the input into an efficient representation, and the decoder, which maps the efficient representation back to the output (where, ideally, the output matches the input). Autoencoders are trained by minimising the reconstruction error between the input and the output, while imposing some constraints on the representation layer. Formally, an autoencoder with encoder E and decoder D is trained to minimise the loss function

J(X, E, D) = \sum_{x \in X} L(x, D(E(x))) + \Omega(E(x)),

where x ∈ X is a data sample, L is some reconstruction loss metric, and Ω(E(x)) is an optional sparsity regularisation term imposed on the hidden (representation) layer. Autoencoders designed for working on image data are usually built by alternating convolutional and max-pooling layers.
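To make this generic objective concrete, the following is a minimal Keras sketch of a convolutional autoencoder trained to reconstruct 16 × 16 colour patches. It only illustrates the encoder/decoder/loss structure: the layer sizes, optimiser, and loss shown here are placeholder assumptions, not the network of Section 3 (which omits the early max-pooling layers and is trained with Adadelta and binary cross-entropy).

```python
# Minimal convolutional autoencoder sketch: the encoder E maps a patch to a code,
# the decoder D reconstructs it, and training minimises the reconstruction loss
# L(x, D(E(x))). Layer sizes, optimiser, and loss are illustrative placeholders.
from tensorflow.keras import layers, models

patch = layers.Input(shape=(16, 16, 3))                                # x
h = layers.Conv2D(32, 3, activation="relu", padding="same")(patch)
h = layers.MaxPooling2D(2)(h)
code = layers.Conv2D(8, 3, activation="relu", padding="same")(h)      # E(x)
h = layers.UpSampling2D(2)(code)
recon = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(h)  # D(E(x))

autoencoder = models.Model(patch, recon)
encoder = models.Model(patch, code)
# An optional sparsity penalty Omega(E(x)) could be added through an
# activity_regularizer on the code layer.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(patches, patches, ...)   # the patch is both input and target
```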
In the l-th convolutional layer, the output neuron at location (i, j) of the k-th channel is expressed as

x_{ij}^{(l,k)} = \sum_{c \in C} \sum_{u=0}^{f^{(l)}-1} \sum_{v=0}^{f^{(l)}-1} w_{uv}^{(l,k,c)} x_{(i+u)(j+v)}^{(l-1,c)} + b^{(l,k)},    (1)

where C is the set of input channel indices, w^{(l,k,c)} is the convolutional kernel for the l-th layer and k-th channel applied to the c-th input channel, b^{(l,k)} is the bias for the l-th layer and k-th channel, and f^{(l)} is the size of the convolutional kernel (filter) for the l-th layer.

3 Proposed method

The core contribution of this paper is a local image descriptor suitable for integration with image inpainting. This integration is possible due to (i) the specific autoencoder architecture that provides the intermediate representation (IR) of an image and (ii) the translation-sensitive design of our descriptor.

The introduction of the IR yields two main benefits: it is less memory-intensive than storing the descriptors of all patches within an image, and it allows the descriptor of a single patch to be extracted from the IR with minimal computation.

We accomplish this by proposing a novel convolutional neural network (CNN) architecture. While traditional CNNs consist of alternating convolutional and max-pooling layers, we introduce an architecture in which we remove all the max-pooling layers in the encoder except for the last one. This decision was encouraged by a recent study which showed that max-pooling layers are not necessary for a successful neural network [18]. Max-pooling is usually applied since it adds extra non-linearity and introduces dimensionality reduction. We omit max-pooling after the first two convolutions and instead employ non-linear activation functions. We leave only one max-pooling layer, with a large spatial extent, at the end of the encoder to reduce the dimensionality of the code layer.

Unlike the fully-connected architecture used in [5], the convolutional layers in our network are able to exploit the self-similarity property of natural images and reduce computational complexity and time, while achieving better results than fully-connected layers. Moreover, the use of convolutional layers is critical for the ability to extract patch descriptors from the IR, since convolutional layers preserve spatial information.

The reduction of computational time and memory usage follows from the IR in our network. The IR is obtained by propagating the complete image (containing the patches of interest) through the convolutional layers of the encoder, but not through the max-pooling layer. Figure 1 shows the architecture of our network and the IR, and Figure 2 shows how they can be used within an image processing task.

Mathematically, we define our intermediate representation as follows. Let x := x^{(0,:)} be the input image. The intermediate representation is obtained as

IR(x) = x^{(L,:)},  with  x^{(L,k)} = A(C_L(A(\dots(C_1(x^{(0,:)}))))),    (2)

where L is the number of convolutional layers in the encoder E, x^{(l,k)} is the k-th channel of the output of the l-th layer, x^{(l,:)} denotes all channels of the output of the l-th layer, A is some activation function, and C_l is the l-th convolutional layer. From the intermediate representation IR(x), we obtain the descriptor for a patch x^{(i,j)}, whose upper-left corner is positioned at (i, j), by performing max-pooling on the corresponding patch of the IR:

E(x^{(i,j)}) = MP(IR(x)^{(i,j)}).    (3)
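The following is a minimal numpy sketch of this extraction step, Eqs. (2)–(3): the IR is computed once for the whole image by the convolutional part of the encoder, after which the descriptor of any patch is obtained by max-pooling a window of the IR. The function and variable names are ours; for simplicity, the pooling is shown as a single global maximum per channel over the window, whereas the network's final pooling layer may use a different spatial extent.

```python
# Sketch of descriptor extraction from the intermediate representation (IR),
# following Eqs. (2)-(3). Assumes the convolutional part of the encoder preserves
# the spatial size, so the IR of an H x W image has shape (H, W, K).
import numpy as np

def descriptor_from_ir(ir, i, j, patch_size=16):
    """E(x^(i,j)) = MP(IR(x)^(i,j)): max-pool the window of the IR whose
    upper-left corner is at (i, j). Shown here as one global max per channel."""
    window = ir[i:i + patch_size, j:j + patch_size, :]   # (patch_size, patch_size, K)
    return window.max(axis=(0, 1))                       # length-K descriptor

# Usage: compute the IR once per image with a convolution-only encoder model
# (hypothetical name), then extract descriptors cheaply for any patch position.
# ir = conv_encoder.predict(image[None, ...])[0]
# d = descriptor_from_ir(ir, i=40, j=25)
```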
The activation function A from (2) is the rectified linear unit, A(x) = max(0, x). We trained the network with the Adadelta optimiser [22], using binary cross-entropy as the loss function, with the pixel values scaled to the range [0, 1]:

J(X, E, D) = \sum_{x \in X} L(x, \hat{x}) = -\sum_{x \in X} (x \log(\hat{x}) + (1 - x) \log(1 - \hat{x})),    (4)

where x̂ := D(E(x)) is the output of the autoencoder. We trained the autoencoder in two different ways, creating two versions of the descriptor, v32 and v128 (named after the dimensionality of the descriptor for 16 × 16 patches).

Image inpainting calls for sensitivity of the deployed patch descriptor to translation, rotation, and scaling. Although convolutional layers are known to be translation invariant, Kauderer-Abrams claimed that this invariance is due to the use of data augmentation, a large number of layers in the network, and the use of large filters in convolutional layers [9]. We have designed our autoencoder to avoid these sources of translation invariance.

Our autoencoder is implemented using the Keras library for neural networks. We initially trained our network on a total of 200k 16 × 16 patches cropped from images in the ImageNet [6], KonIQ [20], and VisualGenome [11] datasets.

For the case of inpainting, descriptors play a crucial role in the performance of the algorithm, since retrieving similar patches is the most prevalent operation in the inpainting. The unsupervised nature of our descriptor makes it possible to fine-tune the descriptor to the type of data that will be used in the inpainting. For the fine-tuning, we use around 50k patches cropped from the images of interest. In our case, the images of interest were macro-photographs of the paintings from the Ghent Altarpiece, painted by the Van Eyck brothers [10]. We describe the inpainting experiment in further detail in the following section.

In Figure 3 we compare the effective memory usage required by different descriptors with respect to the image size. The results indicate potential for a tremendous decrease in memory usage for applications on a single image. This decrease could make some algorithms that use many patch comparisons feasible on high-resolution images.

Fig. 3. A comparison of the memory requirements (expressed in the number of 32-bit floating-point values) as a function of image size in pixels, for the two versions of the proposed descriptor (v32 and v128) compared to Chen et al. [5].

Fig. 4. Comparison of the descriptors' robustness to missing data. A – proposed descriptor v32, B – proposed descriptor v128, C – Chen et al. [5], D – exhaustive search on pixel intensity values. The plots show the sum of squared differences (SSD) between the ground-truth query patches and the patches retrieved by the descriptors (A–C) or by exhaustive search (D), as a function of the percentage of missing area in a patch.

4 Experimental evaluation

The original motivation for implementing a local image descriptor was to improve the inpainting and to make it accessible for high-resolution images.
In the first part of this section, we assess our descriptor's robustness to missing areas in the patch, a property important for the inpainting application. In the second part, we describe the integration of our descriptor with the inpainting algorithm.

4.1 Robustness to missing areas in the patch

Robustness to missing regions is a desirable property of descriptors used for applications such as inpainting and scene reconstruction from multiview data. We compare our descriptor with the descriptor from [5] and with exhaustive search on pixel intensity values. We trained all the descriptor networks on the same set of colour patches.

The setup of the experiment is as follows. We select a set of query patches, which we edit in order to introduce missing areas. For each query patch, we retrieve the k most similar patches, either by comparing their descriptors or by using exhaustive search over the pixel values. The quality of the patch retrieval (and thus the robustness to missing data) is evaluated based on the sum of squared differences (SSD) between the complete (undamaged) query patches and the retrieved ones. The results are presented in Figure 4, comparing our two descriptor networks, the descriptor from [5], and the exhaustive search over pixel intensity values. Visual examples of retrieved patches are shown in Figure 5.

Fig. 5. Patch retrieval where the query has missing parts. For each query, the first row corresponds to the proposed v128 descriptor, the second row to the descriptors from [5], and the third row to exhaustive search. The missing parts of the query patches on the right are shown in cyan.

When the missing area in a patch is small, the exhaustive search retrieves the patches that are the most similar to the (original) query patch. However, as the missing area increases in size, our descriptor v128 begins to outperform the exhaustive search, showing more robustness to missing data.

Our descriptors also show superior performance compared to the existing descriptor learned with autoencoders. Our method v128 performs significantly better than the method of Chen et al. [5], while having the same patch descriptor dimensionality. Moreover, our method v32, which shows results similar to [5], has an order of magnitude lower descriptor dimensionality when encoding a single patch. The dimensionality comparison changes even more in favour of our method when encoding the whole image, due to the use of the IR (Figure 3).

4.2 Inpainting

We put our descriptor to a real-world test in the application of digital painting analysis, more specifically, digital inpainting of paint-loss areas. As a case study, we use images from the panels of the Ghent Altarpiece [10]. The paint-loss areas to be inpainted are determined by the detection algorithm from [13]. We have modified the inpainting algorithm from [15] to use patch descriptors instead of pixel values for comparisons. We fine-tuned the descriptor on patches extracted from the images of the paintings.

The inpainting results are presented in Figure 6, showing the inpainting of one panel of the altarpiece, the Prophet Zachary. On this particular panel, the paint-loss areas appear light brown. Figure 7 shows zoomed-in details of the panel.
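To illustrate the role the descriptor plays inside this integration, the sketch below shows one candidate-selection step in which source patches are ranked by distance in descriptor space rather than by pixel-wise SSD. It reuses descriptor_from_ir from the earlier sketch; the remaining names (best_source_patch, candidate_positions) are ours, and the sketch omits the context-aware Markov random field modelling of the actual inpainting algorithm [15].

```python
# Hedged sketch of the patch-matching step when descriptors replace raw pixels:
# candidate source patches are ranked by SSD in descriptor space, using the
# IR computed once for the whole image. Not the full algorithm of [15].
import numpy as np

def best_source_patch(ir, target_pos, candidate_positions, patch_size=16):
    """Return the candidate position whose descriptor is closest (in SSD)
    to the descriptor of the target patch."""
    target_desc = descriptor_from_ir(ir, *target_pos, patch_size)  # from the earlier sketch
    best_pos, best_dist = None, np.inf
    for (i, j) in candidate_positions:
        d = descriptor_from_ir(ir, i, j, patch_size)
        dist = np.sum((d - target_desc) ** 2)                      # SSD in descriptor space
        if dist < best_dist:
            best_pos, best_dist = (i, j), dist
    return best_pos
```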
We also ran the inpainting algorithm without the descriptors; however, in that setting we were not able to obtain inpainting results on the whole panel, due to a memory error on our computer. Moreover, we were not able to test the inpainting with the descriptor from [5], as it would require retraining their network to suit the needed patch size. The inpainting results are very promising and show that our descriptor was able not only to visually improve the inpainted images, but also to improve the computational aspect of the inpainting.

Fig. 6. Image inpainting results. Left: original; right: inpainted with a patch-based method using the proposed patch descriptors. Image copyright: Ghent, Kathedrale Kerkfabriek, Lukasweb; photo courtesy of KIK-IRPA, Brussels.

Fig. 7. Image inpainting results. Left column: original images containing paint loss (appearing light brown); middle column: inpainted with a patch-based method using the proposed patch descriptor; right column: inpainted without using the descriptors. Image copyright: Ghent, Kathedrale Kerkfabriek, Lukasweb; photo courtesy of KIK-IRPA, Brussels.

5 Conclusion

We propose a novel method for learning patch descriptors using autoencoders, for use in image inpainting. Our approach saves computational memory and time in comparison to existing methods when used with algorithms, such as those for inpainting, that require patch search and matching within a single image. The proposed descriptor shows higher robustness to missing data compared with an existing descriptor learned with autoencoders [5] and with exhaustive search over pixel intensity values. Furthermore, integrating our descriptor into an inpainting algorithm results in visual improvements in the inpainted images and in the ability of the inpainting algorithm to handle higher-resolution images.

References

1. Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5173–5182 (2017)
2. Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: Proceedings of the British Machine Vision Conference (BMVC). pp. 119.1–119.11 (2016), https://dx.doi.org/10.5244/C.30.119
3. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Computer Vision and Image Understanding 110(3), 346–359 (2008)
4. Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: Binary robust independent elementary features. In: European Conference on Computer Vision. pp. 778–792. Springer (2010)
5. Chen, L., Rottensteiner, F., Heipke, C.: Feature descriptor by convolution and pooling autoencoders. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences – ISPRS Archives 40(3W2), 31–38 (2015)
6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
7. Fischer, P., Dosovitskiy, A., Brox, T.: Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769 (2014)
8. He, K., Lu, Y., Sclaroff, S.: Local descriptors optimized for average precision. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
9. Kauderer-Abrams, E.: Quantifying translation-invariance in convolutional neural networks. arXiv preprint arXiv:1801.01450 (2017)
10. KIK/IRPA: Closer to Van Eyck: The Ghent Altarpiece. http://closertovaneyck.kikirpa.be/ghentaltarpiece/#home/ (2019)
11. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), 32–73 (2017), https://doi.org/10.1007/s11263-016-0981-7
12. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV. p. 1150. IEEE (1999)
13. Meeus, L., Huang, S., Devolder, B., Dubois, H., Pižurica, A.: Deep learning for paint loss detection with a multiscale, translation invariant network. In: Proceedings of the 11th International Symposium on Image and Signal Processing and Analysis (ISPA 2019). p. 5 (2019)
14. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.R.: ORB: An efficient alternative to SIFT or SURF. In: ICCV. pp. 2564–2571. IEEE (2011)
15. Ružić, T., Pižurica, A.: Context-aware patch-based image inpainting using Markov random field modeling. IEEE Transactions on Image Processing 24(1), 444–456 (2015)
16. Schonberger, J.L., Hardmeier, H., Sattler, T., Pollefeys, M.: Comparative evaluation of hand-crafted and learned local features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1482–1491 (2017)
17. Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 118–126 (2015)
18. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 (2014)
19. Žižakić, N., Ito, I., Pižurica, A.: Learning local image descriptors with autoencoders. In: Proc. IEICE Information and Communication Technology Forum (ICTF 2019) (2019)
20. Wiedemann, O., Hosu, V., Lin, H., Saupe, D.: Disregarding the big picture: Towards local image quality assessment. In: 10th International Conference on Quality of Multimedia Experience (QoMEX). IEEE (2018), http://database.mmsp-kn.de
21. Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4353–4361 (2015)
22. Zeiler, M.D.: ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)