Perceptually Motivated Method for Image Inpainting Comparison

I.A. Molodetskikh1, M.V. Erofeev1, D.S. Vatolin1
ivan.molodetskikh@graphics.cs.msu.ru | merofeev@graphics.cs.msu.ru | dmitriy@graphics.cs.msu.ru
1 Lomonosov Moscow State University, Moscow, Russia

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The field of automatic image inpainting has progressed rapidly in recent years, but no one has yet proposed a standard method of evaluating algorithms. This absence is due to the problem's challenging nature: image-inpainting algorithms strive for realism in the resulting images, but realism is a subjective concept intrinsic to human perception. Existing objective image-quality metrics provide a poor approximation of what humans consider more or less realistic. To improve the situation and to better organize both prior and future research in this field, we conducted a subjective comparison of nine state-of-the-art inpainting algorithms and propose objective quality metrics that exhibit high correlation with the results of our comparison.

Keywords: image inpainting, objective quality metric, quality perception, subjective evaluation, deep learning.

1. Introduction

Image inpainting, or hole filling, is the task of filling in missing parts of an image. Given an incomplete image and a hole mask, an inpainting algorithm must generate the missing parts so that the result looks realistic. Inpainting is a widely researched topic. Many classical algorithms have been proposed [5, 26], but over the past few years most research has focused on using deep neural networks to solve this problem [12, 16, 17, 19, 23, 31, 32].

Because of the many avenues of research in this field, the need to evaluate algorithms emerges. The goal of an inpainting algorithm is to make the final image as realistic as possible, but image realism is a concept intrinsic to humans. Therefore, the most accurate way to evaluate an algorithm's performance is a subjective experiment in which many participants compare the outcomes of different algorithms and choose the one they consider the most realistic.

Unfortunately, conducting a subjective experiment involves considerable time and resources, so many authors resort to evaluating their proposed methods using traditional objective image-similarity metrics such as PSNR, SSIM and mean l2 loss relative to the ground-truth image. This strategy, however, is inadequate. One reason is that evaluation by measuring similarity to the ground-truth image assumes that only a single, best inpainting result exists, which is a false assumption in most cases.

Thus, a perceptually motivated objective metric for inpainting-quality assessment is desirable. The objective metric should approximate the notion of image realism and yield results similar to those of a subjective study when comparing outputs from different algorithms.

We conducted a subjective evaluation of nine state-of-the-art classical and deep-learning-based approaches to image inpainting. Using the results, we examine different methods of objective inpainting-quality evaluation, including both full-reference methods (taking both the resulting image and the ground-truth image as an input) and no-reference methods (taking only the resulting image as an input).

2. Related work

Little work has been done on objective inpainting-quality evaluation or on inpainting detection in general. The somewhat related field of manipulated-image detection has seen moderate research, including both classical and deep-learning-based approaches. This field focuses on detecting altered image regions, usually involving a set of common manipulations: copy-move (copying an image fragment and pasting it elsewhere in the same image), splicing (pasting a fragment from another image), fragment removal (deleting an image fragment and then performing either a copy-move or inpainting to fill in the missing area), various effects such as Gaussian blur, and recompression. Among these manipulations, the most interesting for this work is fragment removal with inpainting.

The approaches to image-manipulation detection can be divided into classical [13, 20] and deep-learning-based [2, 21, 34, 35] ones. These algorithms aim to locate the manipulated image regions by outputting a mask or a set of bounding boxes enclosing suspicious regions. Unfortunately, they are not directly applicable to inpainting-quality estimation because they have a different goal: whereas an objective quality-estimation metric should strive to accurately compare realistically inpainted images similar to the originals, a forgery-detection algorithm should strive to accurately tell one apart from the other.
3. Inpainting subjective evaluation

The gold standard for evaluating image-inpainting algorithms is human perception, since each algorithm strives to produce images that look the most realistic to humans. Thus, to obtain a baseline for creating an objective inpainting-quality metric, we conducted a subjective evaluation of multiple state-of-the-art algorithms, including both classical and deep-learning-based ones. To assess the overall quality and applicability of the current approaches and to see how they compare with manual photo editing, we also asked professional photo editors to fill in missing regions of the test photos.

3.1 Test data set

Since human photo editors were to perform inpainting, our data set could not include publicly available images. We therefore created our own private set of test images by taking photographs of various outdoor scenes, which are the most likely target for inpainting.

Each test image was 512 × 512 pixels with a square hole in the middle measuring 180 × 180 pixels. We chose a square instead of a free-form shape because one algorithm in our comparison [30] lacks the ability to fill in free-form holes. The data set comprised 33 images in total. Fig. 1 shows examples.

Fig. 1. Images for the subjective inpainting comparison. The black square in the center is the area to be inpainted.
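To make the hole geometry concrete, the following is a minimal sketch of constructing the centered square hole mask described above. The sizes come from Section 3.1; the 0/1 mask convention and function name are illustrative assumptions rather than part of our evaluation code.

```python
import numpy as np

def center_hole_mask(image_size=512, hole_size=180):
    """Binary mask for a centered square hole (1 = pixel to be inpainted).

    The 512x512 image size and 180x180 hole follow Section 3.1;
    the 0/1 convention is an assumption made for illustration.
    """
    mask = np.zeros((image_size, image_size), dtype=np.uint8)
    start = (image_size - hole_size) // 2
    mask[start:start + hole_size, start:start + hole_size] = 1
    return mask

mask = center_hole_mask()
assert mask.sum() == 180 * 180
```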
3.2 Inpainting methods

We evaluated three classical [1, 5, 7] and six deep-learning-based approaches [10, 16, 27, 29, 30, 32]. Additionally, we hired three professional photo-restoration and photo-retouching artists to manually inpaint three randomly selected images from our test data set.

3.3 Test method

The subjective evaluation took place through the http://subjectify.us platform. Human observers were shown pairs of images and asked to pick from each pair the one they found more realistic. Each pair consisted of two different inpainting results for the same picture (the set also contained the original image). In total, 6945 valid pairwise judgements were collected from 215 participants. The judgements were then used to fit a Bradley-Terry model [3]. The resulting subjective scores maximize the likelihood of the observed pairwise judgements.
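For illustration, the sketch below fits Bradley-Terry scores to a matrix of pairwise preference counts using the standard minorization-maximization update. It shows the idea behind the scoring, not the exact fitting procedure used by the subjectify.us platform; the data layout and names are assumptions.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry scores from a pairwise win matrix.

    wins[i, j] = number of times option i was preferred over option j.
    Returns log-scores (higher = judged more realistic). This is the
    standard minorization-maximization update, shown for illustration.
    """
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T                      # total comparisons per pair
    p = np.ones(wins.shape[0])                 # strength parameters
    for _ in range(n_iter):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p /= p.sum()                           # fix the overall scale
    return np.log(p)

# Toy example: option 0 was preferred over option 1 in 7 of 10 pairs.
print(fit_bradley_terry([[0, 7], [3, 0]]))
```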
3.4 Results of the subjective comparison

Fig. 2 shows the results for the three images inpainted by the human artists. The artists outperformed all automatic algorithms, and among the deep-learning-based methods, only generative image inpainting [32] outperformed the classical inpainting methods.

Fig. 2. Subjective-comparison results across three images inpainted by human artists.

The individual results for each of these three images appear in Fig. 5. In only one case did an algorithm beat an artist: statistics of patch offsets [7] scored higher than one artist on the "Urban Flowers" photo. Fig. 4 shows the respective results. Additionally, for the "Splashing Sea" photo, two artists actually "outperformed" the original image: their results turned out to be more realistic.

Fig. 4. Comparison of inpainting results from Artist #1 and statistics of patch offsets [7] (preferred in the subjective comparison).

Fig. 5. Results of the subjective study comparing images inpainted by human artists with images inpainted by conventional and deep-learning-based methods.

We additionally performed a subjective comparison of the inpainting algorithms on the entire 33-image test set, collecting 3969 valid pairwise judgements from 147 participants. The overall results appear in Fig. 3. They confirm our observations from the first comparison: among the deep-learning-based approaches we evaluated, generative image inpainting [32] seems to be the only one that can outperform the classical methods.

Fig. 3. Subjective-comparison results for 33 images inpainted using automatic methods.

4. Objective inpainting-quality estimation

Using the results we obtained from the subjective comparison, we evaluated several approaches to objective inpainting-quality estimation. In particular, we used these objective metrics to estimate the inpainting quality of the images from our test set and then compared the metric values with the subjective results. For each of the 33 images, we applied every tested metric to every inpainting result (as well as to the ground-truth image) and computed the Pearson and Spearman correlation coefficients with the subjective results. The final value was an average of the correlations over all 33 test images.

4.1 Full-reference metrics

To construct a full-reference metric that encourages semantic rather than per-pixel similarity, as in [11], we evaluated metrics that compute the difference between the ground-truth and inpainted-image feature maps produced by an image-classification neural network. We selected five of the most popular architectures: VGG [22] (16- and 19-layer variants), ResNet-V1-50 [8], Inception-V3 [25], Inception-ResNet-V2 [24] and Xception [4]. We used models pretrained on the ImageNet [6] data set. The mean squared error between the feature maps was the metric result.

We additionally included the structural-similarity (SSIM) index [28] as a full-reference metric. SSIM is widely used to compare image quality, but it falls short when applied to inpainting-quality estimation.
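As a concrete illustration of this feature-map comparison, the sketch below scores an inpainted image against the ground truth as the mean squared error between VGG-16 feature maps. It uses the ImageNet-pretrained Keras model with the block5_conv3 layer (one of the layers appearing in Fig. 6); treat it as a sketch of the idea under these assumptions, not the exact code used in our experiments.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import vgg16

def feature_mse(reference, inpainted, layer="block5_conv3"):
    """Full-reference score: MSE between deep feature maps.

    reference, inpainted: HxWx3 uint8 RGB arrays of the same size.
    Lower values mean the inpainted image is closer to the original
    in the feature space of an ImageNet-pretrained classifier.
    """
    base = vgg16.VGG16(weights="imagenet", include_top=False)
    extractor = tf.keras.Model(base.input, base.get_layer(layer).output)
    batch = np.stack([reference, inpainted]).astype(np.float32)
    feats = extractor(vgg16.preprocess_input(batch)).numpy()
    return float(np.mean((feats[0] - feats[1]) ** 2))
```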
4.2 No-reference metrics

We picked several popular image-classification neural-network architectures and trained them to differentiate images without any inpainting from partially inpainted images. The architectures included VGG [22] (16- and 19-layer variants), ResNet-V1-50 [8], ResNet-V2-50 [9], Inception-V3 [25], Inception-V4 [24] and PNASNet-Large [15].

For training, we used clean and inpainted images based on the COCO [14] data set. To create the inpainted images, we used five inpainting algorithms [5, 7, 10, 29, 32] in eight total configurations.

The networks take a square image as an input and output a score: a single number where 0 means the image contains inpainted regions and 1 means the image is "clean." The loss function was mean squared error. Some networks were additionally trained to output the predicted class using one-hot encoding (similar to binary classification); the loss function in this case was softmax cross-entropy.

The network architectures were identical to the ones used for image classification, with one difference: we altered the number of outputs of the last fully connected layer. This change allowed us to initialize the weights of all previous layers from the models pretrained on ImageNet, which greatly improved the results compared with training from random initialization. For some experiments we also tried the RGB-noise features [34] and spectral weight normalization [18].

In addition to the typical validation on part of the data set, we also monitored the correlation of network predictions with the subjective scores collected in Section 3. We used the networks to estimate the inpainting quality of the 33-image test set, then computed correlations with the subjective results in the same way as in the final comparison. The training of each network was stopped once the correlation of the network predictions with the subjective scores peaked and started to decrease (possibly because the networks were overfitting to the inpainting results of the algorithms we used to create the training data set).
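To make this training setup concrete, here is a minimal Keras sketch: an ImageNet-pretrained backbone whose classification head is replaced by a single output trained with mean squared error to predict 1 for clean images and 0 for inpainted ones. The choice of Inception-V3 (one of the architectures listed above), the input size, the sigmoid output and the optimizer are illustrative assumptions, not our exact training configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

def build_noref_metric(input_size=299):
    """No-reference inpainting-quality network (sketch).

    ImageNet-pretrained backbone; the original classification head is
    replaced by a single output in [0, 1]: 1 = "clean" image,
    0 = image containing inpainted regions. Trained with MSE loss.
    """
    backbone = InceptionV3(weights="imagenet", include_top=False,
                           pooling="avg",
                           input_shape=(input_size, input_size, 3))
    score = layers.Dense(1, activation="sigmoid")(backbone.output)
    model = Model(backbone.input, score)
    model.compile(optimizer="adam", loss="mse")
    return model

# Training pairs clean crops (label 1.0) with automatically inpainted
# versions of the same crops (label 0.0), e.g.:
# model.fit(images, labels, validation_data=(val_images, val_labels))
```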
4.3 Results

Fig. 6 shows the overall results. The no-reference methods achieve slightly weaker correlation with the subjective-evaluation responses than the best full-reference methods do, but the results of most no-reference methods are still considerably better than those of the full-reference SSIM. The best correlation among the no-reference methods came from the Inception-V4 model with spectral weight normalization.

Fig. 6. Mean Pearson and Spearman correlations between objective inpainting-quality metrics and subjective human comparisons. The error bars show the standard deviations.

It is important to emphasize that we did not train the networks to maximize correlation with human responses. We trained them to distinguish "clean" images from inpainted images, yet their output showed good correlation with human responses. This confirms the observations made in [33] that deep features are good for modelling human perception.

5. Conclusion

We have proposed a number of perceptually motivated no-reference and full-reference objective metrics for image-inpainting quality. We evaluated the metrics by correlating them with human responses from a subjective comparison of state-of-the-art image-inpainting algorithms.

The results of the subjective comparison indicate that although a deep-learning-based approach to image inpainting holds the lead, classical algorithms remain among the best in the field.

We achieved good correlation with the subjective-comparison results without specifically training our proposed objective quality-evaluation metrics on the subjective-comparison response data set.

6. Acknowledgement

This work was partially supported by the Russian Foundation for Basic Research under Grant 190100785 a.

7. References

[1] https://research.adobe.com/project/content-aware-fill/.
[2] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. S. Manjunath. Exploiting spatial structure for localizing manipulated image regions. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[3] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[4] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] A. Criminisi, P. Pérez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[7] K. He and J. Sun. Statistics of patch offsets for image completion. In European Conference on Computer Vision, pages 16–29. Springer, 2012.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645, 2016.
[10] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):107, 2017.
[11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[12] H. Li, G. Li, L. Lin, H. Yu, and Y. Yu. Context-aware semantic inpainting. IEEE Transactions on Cybernetics, 2018.
[13] H. Li, W. Luo, X. Qiu, and J. Huang. Image forgery localization via integrating tampering possibility maps. IEEE Transactions on Information Forensics and Security, 12(5):1240–1252, 2017.
[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
[15] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In The European Conference on Computer Vision (ECCV), September 2018.
[16] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In The European Conference on Computer Vision (ECCV), 2018.
[17] P. Liu, X. Qi, P. He, Y. Li, M. R. Lyu, and I. King. Semantically consistent image completion with fine-grained details. arXiv preprint arXiv:1711.09345, 2017.
[18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[19] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[20] C.-M. Pun, X.-C. Yuan, and X.-L. Bi. Image forgery detection using adaptive oversegmentation and feature point matching. IEEE Transactions on Information Forensics and Security, 10(8):1705–1716, 2015.
[21] R. Salloum, Y. Ren, and C.-C. J. Kuo. Image splicing localization using a multi-task fully convolutional network (MFCN). Journal of Visual Communication and Image Representation, 51:201–209, 2018.
[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[23] Y. Song, C. Yang, Z. Lin, H. Li, Q. Huang, and C. J. Kuo. Image inpainting using multi-scale feature image translation. arXiv preprint arXiv:1711.08590, 2, 2017.
[24] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[26] A. Telea. An image inpainting technique based on the fast marching method. Journal of Graphics Tools, 9(1):23–34, 2004.
[27] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[28] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[29] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan. Shift-Net: Image inpainting via deep feature rearrangement. In The European Conference on Computer Vision (ECCV), September 2018.
[30] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[31] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.
[32] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[34] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Learning rich features for image manipulation detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[35] X. Zhu, Y. Qian, X. Zhao, B. Sun, and Y. Sun. A deep learning approach to patch-based image inpainting forensics. Signal Processing: Image Communication, 67:90–99, 2018.