=Paper=
{{Paper
|id=Vol-2485/paper30
|storemode=property
|title=Perceptually Motivated Method for Image Inpainting Comparison
|pdfUrl=https://ceur-ws.org/Vol-2485/paper30.pdf
|volume=Vol-2485
|authors=Ivan Molodetskikh,Mikhail Erofeev,Dmitriy Vatolin
}}
==Perceptually Motivated Method for Image Inpainting Comparison==
I.A. Molodetskikh, M.V. Erofeev, D.S. Vatolin (Lomonosov Moscow State University, Moscow, Russia)
ivan.molodetskikh@graphics.cs.msu.ru | merofeev@graphics.cs.msu.ru | dmitriy@graphics.cs.msu.ru

The field of automatic image inpainting has progressed rapidly in recent years, but no one has yet proposed a standard method of evaluating algorithms. This absence is due to the problem's challenging nature: image-inpainting algorithms strive for realism in the resulting images, but realism is a subjective concept intrinsic to human perception. Existing objective image-quality metrics provide a poor approximation of what humans consider more or less realistic. To improve the situation and to better organize both prior and future research in this field, we conducted a subjective comparison of nine state-of-the-art inpainting algorithms and propose objective quality metrics that exhibit high correlation with the results of our comparison.

Keywords: image inpainting, objective quality metric, quality perception, subjective evaluation, deep learning.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Image inpainting, or hole filling, is the task of filling in missing parts of an image. Given an incomplete image and a hole mask, an inpainting algorithm must generate the missing parts so that the result looks realistic. Inpainting is a widely researched topic. Many classical algorithms have been proposed [5, 26], but over the past few years most research has focused on using deep neural networks to solve this problem [12, 16, 17, 19, 23, 31, 32].

Because of the many avenues of research in this field, the need to evaluate algorithms emerges. The goal of an inpainting algorithm is to make the final image as realistic as possible, but image realism is a concept intrinsic to humans. Therefore, the most accurate way to evaluate an algorithm's performance is a subjective experiment in which many participants compare the outputs of different algorithms and choose the one they consider the most realistic.

Unfortunately, conducting a subjective experiment involves considerable time and resources, so many authors resort to evaluating their proposed methods using traditional objective image-similarity metrics such as PSNR, SSIM and mean l2 loss relative to the ground-truth image. This strategy, however, is inadequate. One reason is that evaluation by measuring similarity to the ground-truth image assumes that only a single, best inpainting result exists, which is a false assumption in most cases.

Thus, a perceptually motivated objective metric for inpainting-quality assessment is desirable. The objective metric should approximate the notion of image realism and yield results similar to those of a subjective study when comparing outputs from different algorithms.

We conducted a subjective evaluation of nine state-of-the-art classical and deep-learning-based approaches to image inpainting. Using the results, we examine different methods of objective inpainting-quality evaluation, including both full-reference methods (taking both the resulting image and the ground-truth image as input) and no-reference methods (taking only the resulting image as input).

2. Related work

Little work has been done on objective image-inpainting-quality evaluation or on inpainting detection in general. The somewhat related field of manipulated-image detection has seen moderate research, including both classical and deep-learning-based approaches. This field focuses on detecting altered image regions, usually involving a set of common manipulations: copy-move (copying an image fragment and pasting it elsewhere in the same image), splicing (pasting a fragment from another image), fragment removal (deleting an image fragment and then performing either a copy-move or inpainting to fill in the missing area), various effects such as Gaussian blur, and recompression. Among these manipulations, the most interesting for this work is fragment removal with inpainting.

The approaches to image-manipulation detection can be divided into classical [13, 20] and deep-learning-based approaches [2, 21, 34, 35]. These algorithms aim to locate the manipulated image regions by outputting a mask or a set of bounding boxes enclosing suspicious regions. Unfortunately, they are not directly applicable to inpainting-quality estimation because they have a different goal: whereas an objective quality-estimation metric should strive to accurately compare realistically inpainted images similar to the originals, a forgery-detection algorithm should strive to accurately tell one apart from the other.

3. Inpainting subjective evaluation

The gold standard for evaluating image-inpainting algorithms is human perception, since each algorithm strives to produce images that look the most realistic to humans. Thus, to obtain a baseline for creating an objective inpainting-quality metric, we conducted a subjective evaluation of multiple state-of-the-art algorithms, including both classical and deep-learning-based ones. To assess the overall quality and applicability of the current approaches and to see how they compare with manual photo editing, we also asked professional photo editors to fill in missing regions of the test photos.

3.1 Test data set

Since human photo editors were to perform inpainting, our data set could not include publicly available images. We therefore created our own private set of test images by taking photographs of various outdoor scenes, which are the most likely target for inpainting.
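Each test image in this set receives a centered square hole (the exact sizes appear below). As an illustrative aside, not part of the original study's code, such a hole mask can be generated in a few lines:

```python
import numpy as np

def make_center_hole_mask(image_size=512, hole_size=180):
    """Binary mask marking the square region to inpaint (1 = hole).

    Default sizes match the test set described in Section 3.1:
    512x512 images with a 180x180 hole in the middle.
    """
    mask = np.zeros((image_size, image_size), dtype=np.uint8)
    start = (image_size - hole_size) // 2
    mask[start:start + hole_size, start:start + hole_size] = 1
    return mask
```

An inpainting algorithm is then given the photograph with the masked region blanked out, together with this mask.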
Each test image was 512 × 512 pixels with a square hole in the middle measuring 180 × 180 pixels. We chose a square instead of a freeform shape because one algorithm in our comparison [30] lacks the ability to fill in freeform holes. The data set comprised 33 images in total. Fig. 1 shows examples.

Fig. 1. Images for the subjective inpainting comparison. The black square in the center is the area to be inpainted.

Fig. 2. Subjective-comparison results across three images inpainted by human artists.

Fig. 3. Subjective-comparison results for 33 images inpainted using automatic methods.

3.2 Inpainting methods

We evaluated three classical [1, 5, 7] and six deep-learning-based approaches [10, 16, 27, 29, 30, 32]. Additionally, we hired three professional photo-restoration and photo-retouching artists to manually inpaint three randomly selected images from our test data set.

Fig. 4. Comparison of inpainting results from Artist #1 and statistics of patch offsets [7] (preferred in the subjective comparison).

3.3 Test method
The subjective evaluation took place through the http://subjectify.us platform. Human observers were shown pairs of images and asked to pick from each pair the one they found more realistic. Each pair consisted of two different inpainting results for the same picture (the set also contained the original image). In total, 6945 valid pairwise judgements were collected from 215 participants. The judgements were then used to fit a Bradley-Terry model [3]. The resulting subjective scores maximize likelihood given the pairwise judgements.

3.4 Results of the subjective comparison

Fig. 2 shows the results for the three images inpainted by the human artists. The artists outperformed all automatic algorithms, and out of the deep-learning-based methods, only generative image inpainting [32] outperformed the classical inpainting methods.

The individual results for each of these three images appear in Fig. 5. In only one case did an algorithm beat an artist: statistics of patch offsets [7] scored higher than one artist on the "Urban Flowers" photo. Fig. 4 shows the respective results. Additionally, for the "Splashing Sea" photo, two artists actually "outperformed" the original image: their results turned out to be more realistic.

We additionally performed a subjective comparison of various inpainting algorithms among the entire 33-image test set, collecting 3969 valid pairwise judgements across 147 participants. The overall results appear in Fig. 3.
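For reference, Bradley-Terry fitting of the kind described above can be sketched with the standard maximum-likelihood (minorization-maximization) iteration. This is our illustrative implementation, not the code used in the study, and the win-count matrix below is hypothetical:

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = how often result i was preferred over result j.
    Uses the standard minorization-maximization update and assumes
    every result wins at least one comparison. Returns log-strengths
    (higher = judged more realistic).
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()  # strengths are defined only up to a common scale
    return np.log(p)

# Hypothetical counts for three inpainting results of one image;
# result 0 is preferred most often and receives the highest score.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
scores = fit_bradley_terry(wins)
```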
Fig. 5. Results of the subjective study on the "Urban Flowers", "Splashing Sea" and "Forest Trail" photos, comparing images inpainted by human artists with images inpainted by conventional and deep-learning-based methods.

They confirm our observations from the first comparison: among the deep-learning-based approaches we evaluated, generative image inpainting [32] seems to be the only one that can outperform the classical methods.

4. Objective inpainting-quality estimation

Using the results we obtained from the subjective comparison, we evaluated several approaches to objective inpainting-quality estimation. In particular, we used these objective metrics to estimate the inpainting quality of the images from our test set and then compared them with the subjective results. For each of the 33 images, we applied every tested metric to every inpainting result (as well as to the ground-truth image) and computed the Pearson and Spearman correlation coefficients with the subjective result. The final value was an average of the correlations over all 33 test images.

4.1 Full-reference metrics

To construct a full-reference metric that encourages semantic similarity rather than per-pixel similarity, as in [11], we evaluated metrics that compute the difference between the ground-truth and inpainted-image feature maps produced by an image-classification neural network. We selected five of the most popular architectures: VGG [22] (16- and 19-layer-deep variants), ResNet-V1-50 [8], Inception-V3 [25], Inception-ResNet-V2 [24] and Xception [4]. We used the models pretrained on the ImageNet [6] data set. The mean squared error between the feature maps was the metric result.

We additionally included the structural-similarity (SSIM) index [28] as a full-reference metric. SSIM is widely used to compare image quality, but it falls short when applied to inpainting-quality estimation.

4.2 No-reference metrics

We picked several popular image-classification neural-network architectures and trained them to differentiate images without any inpainting from partially inpainted images. The architectures included VGG [22] (16 and 19 layers deep), ResNet-V1-50 [8], ResNet-V2-50 [9], Inception-V3 [25], Inception-V4 [24] and PNASNet-Large [15]. For training, we used clean and inpainted images based on the COCO [14] data set. To create the inpainted images, we used five inpainting algorithms [5, 7, 10, 29, 32] in eight total configurations.

The network architectures take a square image as an input and output the score: a single number where 0 means the image contains inpainted regions and 1 means the image is "clean." The loss function was mean squared error. Some network architectures were additionally trained to output the predicted class using one-hot encoding (similar to binary classification); the loss function for this case was softmax cross-entropy.

The network architectures were identical to the ones used for image classification, with one difference: we altered the number of outputs from the last fully connected layer. This change allowed us to initialize the weights of all previous layers from the models pretrained on ImageNet, greatly improving the results compared with training from random initialization. For some experiments we tried using the RGB-noise features [34] and spectral weight normalization [18].

In addition to the typical validation on part of the data set, we also monitored correlation of network predictions with the subjective scores collected in Section 3. We used the networks to estimate the inpainting quality of the 33-image test set, then computed correlations with subjective results in the same way as the final comparison. The training of each network was stopped once the correlation of the network predictions with the subjective scores peaked and started to decrease (possibly because the networks were overfitting to the inpainting results of the algorithms we used to create the training data set).

4.3 Results

Fig. 6 shows the overall results. The no-reference methods achieve slightly weaker correlation with the subjective-evaluation responses than do the best full-reference methods. But the results of most no-reference methods are still considerably better than those of the full-reference SSIM. The best correlation among the no-reference methods came from the Inception-V4 model with spectral weight normalization.

Fig. 6. Mean Pearson and Spearman correlations between objective inpainting-quality metrics and subjective human comparisons. The error bars show the standard deviations.
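The per-image correlation protocol described in Section 4, whose averages Fig. 6 reports, can be sketched as follows. This is a scipy-based sketch under our own data-layout assumptions (dictionaries keyed by image id), not the original evaluation code:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def mean_correlations(metric_scores, subjective_scores):
    """Average per-image correlation between an objective metric and
    the subjective Bradley-Terry scores, as in Section 4.

    Each argument maps an image id to an array of scores, one entry
    per inpainting result (plus the ground-truth image) of that image.
    """
    pearsons, spearmans = [], []
    for image_id, metric_vals in metric_scores.items():
        subj_vals = subjective_scores[image_id]
        pearsons.append(pearsonr(metric_vals, subj_vals)[0])
        spearmans.append(spearmanr(metric_vals, subj_vals)[0])
    # Final value: correlations averaged over all test images.
    return float(np.mean(pearsons)), float(np.mean(spearmans))
```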
It is important to emphasize that we did not train the networks to maximize correlation with human responses. We trained them to distinguish "clean" images from inpainted images, yet their output showed good correlation with human responses. This confirms the observations made in [33] that deep features are good for modelling human perception.

5. Conclusion

We have proposed a number of perceptually motivated no-reference and full-reference objective metrics for image-inpainting quality. We evaluated the metrics by correlating them with human responses from a subjective comparison of state-of-the-art image-inpainting algorithms.

The results of the subjective comparison indicate that although a deep-learning-based approach to image inpainting holds the lead, classical algorithms remain among the best in the field.

We achieved good correlation with the subjective-comparison results without specifically training our proposed objective quality-evaluation metrics on the subjective-comparison response data set.

6. Acknowledgement

This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785 a.

7. References

[1] Adobe Content-Aware Fill, https://research.adobe.com/project/content-aware-fill/.
[2] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. S. Manjunath. Exploiting spatial structure for localizing manipulated image regions. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[3] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[4] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] A. Criminisi, P. Pérez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[7] K. He and J. Sun. Statistics of patch offsets for image completion. In European Conference on Computer Vision, pages 16–29. Springer, 2012.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645, 2016.
[10] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):107, 2017.
[11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[12] H. Li, G. Li, L. Lin, H. Yu, and Y. Yu. Context-aware semantic inpainting. IEEE Transactions on Cybernetics, 2018.
[13] H. Li, W. Luo, X. Qiu, and J. Huang. Image forgery localization via integrating tampering possibility maps. IEEE Transactions on Information Forensics and Security, 12(5):1240–1252, 2017.
[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
[15] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In The European Conference on Computer Vision (ECCV), September 2018.
[16] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In The European Conference on Computer Vision (ECCV), 2018.
[17] P. Liu, X. Qi, P. He, Y. Li, M. R. Lyu, and I. King. Semantically consistent image completion with fine-grained details. arXiv preprint arXiv:1711.09345, 2017.
[18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[19] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[20] C.-M. Pun, X.-C. Yuan, and X.-L. Bi. Image forgery detection using adaptive oversegmentation and feature point matching. IEEE Transactions on Information Forensics and Security, 10(8):1705–1716, 2015.
[21] R. Salloum, Y. Ren, and C.-C. J. Kuo. Image splicing localization using a multi-task fully convolutional network (MFCN). Journal of Visual Communication and Image Representation, 51:201–209, 2018.
[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[23] Y. Song, C. Yang, Z. Lin, H. Li, Q. Huang, and C. J. Kuo. Image inpainting using multi-scale feature image translation. arXiv preprint arXiv:1711.08590, 2017.
[24] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[26] A. Telea. An image inpainting technique based on the fast marching method. Journal of Graphics Tools, 9(1):23–34, 2004.
[27] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[28] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[29] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan. Shift-Net: Image inpainting via deep feature rearrangement. In The European Conference on Computer Vision (ECCV), September 2018.
[30] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[31] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.
[32] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[34] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Learning rich features for image manipulation detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[35] X. Zhu, Y. Qian, X. Zhao, B. Sun, and Y. Sun. A deep learning approach to patch-based image inpainting forensics. Signal Processing: Image Communication, 67:90–99, 2018.