Image Foreground Extraction and Its Application to Neural Style Transfer

Victor Kitov 1 [0000-0002-3198-5792] and Lubov Ponomareva 2

1 Plekhanov Russian University of Economics, 36 Stremyanny lane, Moscow, 115998, Russia
v.v.kitov@yandex.ru
2 Lomonosov Moscow State University, Leninskie gory, 1, GSP-1, Moscow, 119991, Russia
lponomareva98@yandex.ru

Abstract. Foreground extraction plays an important role in many computer vision applications: photo enhancement, image classification and understanding, style transfer improvement and others. A new image dataset with foreground/background annotation is proposed. Several recent neural segmentation models are trained on this dataset to extract the foreground automatically, and their performance is compared. The benefits of automatic foreground extraction are demonstrated on the style transfer task, a popular technique for automatically rendering a photo (the content image) in the style defined by a style image, for example a painting by a famous artist.

Keywords: foreground extraction, background removal, image segmentation, image generation, style transfer.

Proceedings of the 10th International Scientific and Practical Conference named after A. I. Kitov "Information Technologies and Mathematical Methods in Economics and Management (IT&MM-2020)", October 15-16, 2020, Moscow, Russia. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1 Introduction

Foreground extraction plays an important role in computer vision applications such as photo editing, photo enhancement, image classification, image and video understanding, surveillance systems, and style transfer. A common method of foreground extraction uses the GrabCut algorithm [1], but it requires human interaction to select part of the foreground area and to limit the foreground with a bounding box. Some articles, such as [2], propose automatic foreground extraction algorithms, which try to replace the human interaction in GrabCut with automatic extraction of salient regions on the image. However, such approaches require large training sets to work accurately. We propose a new image dataset with labeled foreground and background objects, which can be used to train and fine-tune automatic foreground extraction models. We propose to use segmentation algorithms for this purpose. A segmentation algorithm takes an image as input and produces an image of the same shape in which each pixel is assigned to a particular object class. We use binary classification with two classes: foreground and background. The performance of two recent segmentation models is compared. Finally, we demonstrate the benefit of automatic foreground extraction in the style transfer application.

Image style transfer is a popular task of rendering an input photo (called the content image) in an arbitrary style, represented by a style image, as shown in fig. 1. It may be applied to make creative and memorable advertisements, to improve the design of sites, groups and community pages in social networks, interiors, etc. It may be used to apply effects in movies, cartoons, music clips and virtual reality systems such as computer games. Various online services provide this functionality, such as alterdraw.com, depart.io, ostagram.me, as well as desktop applications, such as Deep Art Effects, and mobile applications, such as prisma, artiso and vinci, to mention just a few.
Adobe Photoshop has announced the inclusion of this technique in its photo filters in the 2021 version.

Fig. 1. Style transfer demo

Early approaches [3, 4] used algorithms with human-engineered features designed to impose particular styles. In 2016 Gatys et al. [5] proposed an algorithm for imposing an arbitrary style, taken from a user-defined style image, on an arbitrary content image by using image representations obtained with deep convolutional networks. Gatys et al. [6] extended this framework in many ways. In particular, a weighted stylization loss was proposed to apply masks and to mix different styles together.

Style transfer changes the original content image, which can make important parts of it, such as human faces, figures or gestures, unrecognizable. Schekalev et al. [7] proposed to solve this problem by using the weighted approach from [6] to control the spatial strength of stylization in different regions of the content image: the strength of stylization is decreased for important objects (to better preserve their structure) and increased on the rest of the image (to better impose the style). Important objects were selected by regular patches, by superpixels [8] and by segmentation results, of which segmentation showed the best result. This paper extends their work.

The new image dataset proposed in this work consists of over 6000 images with automatically extracted and manually verified foreground/background masks. This dataset makes it possible to train segmentation algorithms specifically to extract foreground objects on new images. Two neural segmentation models are trained on the dataset and their accuracies are compared. The better model is applied to an extended style transfer algorithm in which foreground objects are stylized less and background objects are stylized more. Qualitative analysis of the resulting images shows that this extension improves the quality of style transfer: it increases the recognizability of foreground objects and imposes the style more vividly on background objects, which is especially important for portrait stylization and advertisements, where the central object of the image (a person or an advertised good) needs to stand out.

The proposed image dataset with labeled foreground, and the neural model trained on it to extract foreground automatically on new images, may be helpful not only in style transfer but also in other tasks, such as photo editing (removing the background), photo enhancement (blurring the background), and image classification and understanding (removing unimportant background objects from consideration), which may provide benefits in design, marketing, image and video search and recommendations, as well as in automatic surveillance systems.

2 Proposed New Image Dataset with Labeled Foreground

To improve the quality of style transfer, and for general photo enhancement including background removal and background blurring, it is very important to extract the foreground objects of the input image. This can be done with modern image segmentation architectures, but such architectures require a training dataset to learn their parameters. Such a dataset was proposed in [9]; however, it contains only 715 images, which is not enough to train a modern deep neural network for accurate image segmentation. Moreover, some objects labeled as foreground in it do not actually represent foreground according to our stricter criteria, as shown in figure 2.
It also does not contain images without foreground, which appear quite frequently in practice.

Fig. 2. Examples of images from [9] with foreground objects that are not considered foreground according to our stricter criteria

We propose a new image dataset with labeled foreground objects: github.com/victorkitov/foreground_dataset. It contains 6073 images, 1057 of which do not contain foreground objects; the average fraction of the image occupied by foreground is 0.26, with a standard deviation of 0.16. To form this dataset, postprocessed subsets of images were taken from the following datasets: the Stanford Background dataset [9], MS COCO [10], INRIA [11], Clothing Co-Parsing [12] and SUN RGB-D [13]. Additionally, 320 publicly available images without foreground were added.

Foreground was labeled using the observation that most often the foreground consists of objects that:

• Occupy a certain share of the image (do not fill it entirely and are not too small);
• Are located approximately in its central part, not at the edges;
• Are significantly closer to the camera than the surrounding pixels;
• Belong to a class with limited spatial extent (thus classes such as road, sky, sea, forest, etc. are excluded).

For each source dataset, formal criteria were defined for selecting the foreground. All images pre-selected according to these formal criteria were then manually checked for compliance of the selected objects with the notion of foreground.

2.1 SUN RGB-D Dataset Processing

SUN RGB-D [13] contains 10335 images with semantic labeling of objects and corresponding depth maps (a depth map is a grayscale image of the same size in which each pixel value encodes the distance from the camera to the object located at that pixel). Objects were ordered by their proximity to the camera, and the most distant objects were excluded, as well as objects from the following excluded classes: wall, floor, door, window, picture, blinds, desk, curtain, mirror, clothes, ceiling, paper, whiteboard and toilet. Objects occupying less than 5% of the total image area, or less than 40% of the area occupied by all objects of their class, were also excluded. Finally, objects not intersecting the central part of the image (a rectangle with width and height equal to 80% of the width and height of the original image) were excluded. All remaining objects were combined to form the foreground of the image.

2.2 Microsoft COCO Dataset Preprocessing

The MS COCO object detection validation set (val2017) [10] consists of 5000 images with segmented objects. For each object, its area and the minimal bounding box containing the whole object are supplied. Objects with area below 8% of the total image area were discarded, as were objects whose center did not lie in the central region of the image (a rectangle with width and height equal to 80% of the width and height of the original image). The foreground was formed by the largest remaining object, augmented by smaller objects whose bounding boxes intersect the bounding box of the largest object (an illustrative sketch of these checks is given after Section 2.3).

2.3 INRIA and Clothing Co-Parsing Datasets Preprocessing

In the INRIA dataset [11], people and cars are annotated separately (420 images with people, 311 with cars). Our annotation was obtained from the original one by combining the masks of objects located in the center of the image and occupying more than 40% of its area.
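As a concrete illustration of the area and centrality checks described in Section 2.2 (and applied in similar form to the other source datasets), the sketch below shows how candidate object masks could be filtered and combined into a single foreground mask. It is a minimal sketch under stated assumptions: the function names, the boolean-mask input format and the default thresholds mirror the description above but are not the exact scripts used to build the dataset.

```python
import numpy as np


def in_central_region(mask: np.ndarray, frac: float = 0.8) -> bool:
    """Check that the object's center lies inside the central rectangle
    whose width and height are `frac` of the image size."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return False
    cy, cx = ys.mean(), xs.mean()
    y0, y1 = h * (1 - frac) / 2, h * (1 + frac) / 2
    x0, x1 = w * (1 - frac) / 2, w * (1 + frac) / 2
    return (y0 <= cy <= y1) and (x0 <= cx <= x1)


def select_foreground(masks: list[np.ndarray], min_area_frac: float = 0.08) -> np.ndarray:
    """Combine candidate boolean object masks into one foreground mask:
    keep objects that are large enough and roughly central, then take the
    largest one plus objects whose bounding boxes intersect its bounding box."""
    h, w = masks[0].shape  # assumes a non-empty list of same-sized masks
    keep = [m for m in masks
            if m.sum() / (h * w) >= min_area_frac and in_central_region(m)]
    if not keep:
        return np.zeros((h, w), dtype=bool)  # image without foreground
    keep.sort(key=lambda m: m.sum(), reverse=True)
    largest = keep[0]
    fg = largest.copy()
    ys, xs = np.nonzero(largest)
    y_min, y_max, x_min, x_max = ys.min(), ys.max(), xs.min(), xs.max()
    for m in keep[1:]:
        my, mx = np.nonzero(m)
        # keep smaller objects whose bounding boxes intersect the largest one
        if my.min() <= y_max and my.max() >= y_min and \
           mx.min() <= x_max and mx.max() >= x_min:
            fg |= m
    return fg
```

Images for which no candidate object survives these checks contribute examples without foreground, which, as noted above, are deliberately represented in the dataset.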
2.4 Summary Statistics

Our combined image dataset with annotated foreground consists of 6073 images, 1057 of which do not contain foreground objects. All segmentation results were manually checked for compliance with the notion of foreground. The number of images taken from each source dataset is summarized below.

Original set   Stanford     SUN     MS COCO     INRIA   Clothing      Other
               Background   RGB-D   (Val2017)           Co-Parsing
Total          714          10335   5000        731     1094          -
Included       429          1830    2060        340     1094          320

3 Automatic Foreground Segmentation

To perform foreground segmentation automatically, two segmentation models are trained: LW RefineNet [14] and Fast-SCNN [15]. LW RefineNet stands for Light-Weight RefineNet and is a more efficient implementation of the RefineNet model [16]. Fast-SCNN uses a two-path architecture: the image is encoded and passed through two paths, the outputs of both paths are summed, and the result is passed through a decoder. The first path contains a convolution, while the second path has multiple bottleneck convolution layers as well as spatial pooling and upsampling. The second path extracts high-level low-resolution features, whereas the first path retains low-level high-resolution features.

LW RefineNet encodes the image using pretrained convolution blocks of the ResNet-50 classification model and splits the image representations into multiple streams with different resolutions and levels of feature abstraction, which are later joined by bilinear upscaling and summation.

We used the Python implementations of LW RefineNet [17] and Fast-SCNN [18] in the PyTorch framework. Our foreground dataset was divided into a training set (4530 images), a validation set (1043 images) and a test set (500 images). Images were rescaled to equal size, and we applied the stochastic gradient descent algorithm until the cross-entropy loss stopped decreasing on the validation set (210 epochs). To compensate for class imbalance, the foreground class was weighted with 0.7 and the background class with 0.3. The performance of both models is compared in Table 1.

Table 1. Quality comparison between LW RefineNet and Fast-SCNN.

Model                        LW RefineNet   Fast-SCNN
Pixel Accuracy               0.85           0.62
Intersection over Union      0.56           0.42
Loss Function (cross-entropy)  0.08           0.47

LW RefineNet is more accurate than Fast-SCNN. This may be attributed to its structure: it uses pretrained convolution blocks from the ResNet-50 classification model and combines features of diverse resolutions and levels of abstraction. Qualitative analysis shows that the LW RefineNet model correctly selects foreground objects in most cases, while the Fast-SCNN model frequently also selects some of the background objects. The LW RefineNet mask also has smoother edges. On images without foreground, LW RefineNet works better than Fast-SCNN.

Fig. 3. Foreground extraction comparison between LW RefineNet and Fast-SCNN.

4 Foreground Extraction in Style Transfer

We use the style transfer method of Gatys et al. [5], modified to preserve foreground objects. We apply the modification proposed in [7], namely a spatially weighted multiplier of the content loss. This multiplier is initialized from the foreground predicted by the segmentation model trained on our image dataset with labeled foreground: higher values of the multiplier are set in the foreground area and lower values in the background area. Such weighting preserves foreground recognizability by stylizing it less, while imposing a vivid style on the background. Stylization results for the standard and the proposed method (with foreground preservation) are shown in figure 4. For illustrative purposes, the stylization strength is set to zero (no stylization) for foreground objects.
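To make the weighting scheme explicit, below is a minimal PyTorch sketch of a spatially weighted content loss in the spirit of [5, 7]. The feature tensors are assumed to come from a fixed pretrained encoder (e.g. a VGG layer, as in [5]); the function names and the default weights w_fg and w_bg are illustrative assumptions, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F


def spatial_content_weights(fg_mask: torch.Tensor, feat_hw: tuple[int, int],
                            w_fg: float = 1.0, w_bg: float = 0.2) -> torch.Tensor:
    """Build a per-pixel weight map at feature-map resolution:
    high weight on the predicted foreground (preserve content),
    low weight on the background (allow stronger stylization)."""
    # fg_mask: (1, 1, H, W) tensor with values in {0, 1} from the segmentation model
    mask = F.interpolate(fg_mask.float(), size=feat_hw, mode='nearest')
    return w_bg + (w_fg - w_bg) * mask  # shape (1, 1, h, w)


def weighted_content_loss(gen_feat: torch.Tensor, content_feat: torch.Tensor,
                          weights: torch.Tensor) -> torch.Tensor:
    """Spatially weighted squared-error content loss between the features of
    the generated image and of the content image."""
    # gen_feat, content_feat: (1, C, h, w); weights broadcast over channels
    return (weights * (gen_feat - content_feat) ** 2).mean()
```

During stylization this term replaces the unweighted content loss of [5] and is minimized together with the usual Gram-matrix style loss, so the generated image stays close to the content image on the predicted foreground while the background remains free to follow the style.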
It can be seen that the proposed image foreground dataset is sufficient to train an accurate foreground extraction model, which in turn can be used to improve style transfer: important foreground objects are stylized less and therefore better preserved, whereas the style is applied vividly to the background.

5 Discussion

A new image dataset with labeled foreground objects was proposed. It may be used for training automatic foreground extraction algorithms for a wide range of purposes, including photo editing (automatic background removal), photo enhancement (automatic background blurring), better image compression (with better preservation of important objects in the foreground), improvements in automatic image captioning and scene understanding, surveillance systems (tracking of foreground objects) and more. Two recent segmentation models, LW RefineNet and Fast-SCNN, were trained on the dataset and their accuracy compared. The former has better quality, which may be attributed to its more advanced structure, utilizing a ResNet-50 encoder with skip connections and combining multiple features with different levels of abstraction.

We demonstrated the benefit of automatic foreground extraction for improving neural style transfer. With spatially weighted style transfer it becomes possible to improve the stylization result by decreasing the stylization strength on foreground objects (preserving them better) and increasing it on the background (transferring the style more vividly). Such an approach has applications in advertisement generation, design, virtual reality and the entertainment industry in general.

6 Conclusion

This work proposed a new image dataset with labeled foreground objects, together with the methodology of foreground extraction and a discussion of the statistical properties of the obtained dataset. Two recent automatic segmentation models were trained on this dataset and their quality compared. Such models have many prospective applications in various computer vision tasks. In particular, it was shown how to improve image style transfer using such models by applying the style more weakly to the foreground and more strongly to the background of the image, which may have applications in design, marketing, virtual reality, entertainment and other industries.

Fig. 4. Comparison of standard and foreground-aware style transfer.

References

1. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 23(3), 309-314 (2004).
2. Tang, Z., Miao, Z., Wan, Y., Li, J.: Automatic foreground extraction for images and videos. In: 2010 IEEE International Conference on Image Processing, pp. 2993-2996 (2010).
3. Gooch, B., Gooch, A.: Non-photorealistic rendering. CRC Press, USA (2001).
4. Strothotte, T., Schlechtweg, S.: Non-photorealistic computer graphics: modeling, rendering, and animation. Morgan Kaufmann, USA (2002).
5. Gatys, L., Ecker, A., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414-2423 (2016).
6. Gatys, L., Ecker, A., Bethge, M., Hertzmann, A., Shechtman, E.: Controlling perceptual factors in neural style transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3985-3993 (2017).
7. Schekalev, A., Kitov, V.: Style transfer with adaptation to the central objects of the scene. In: International Conference on Neuroinformatics 2019, pp. 342-350. Springer, Cham (2019).
8. Superpixels introduction, https://medium.com/@darshita1405, last accessed 2020/10/30.
9. Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1-8 (2009).
10. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Zitnick, C.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, pp. 740-755 (2014).
11. Marszalek, M., Schmid, C.: Accurate object localization with shape masks. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8 (2007).
12. Yang, W., Luo, P., Lin, L.: Clothing co-parsing by joint image segmentation and labeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3182-3189 (2014).
13. Song, S., Lichtenberg, S. P., Xiao, J.: SUN RGB-D: A RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 567-576 (2015).
14. Nekrasov, V., Shen, C., Reid, I.: Light-weight RefineNet for real-time semantic segmentation. arXiv preprint arXiv:1810.03272 (2018).
15. Poudel, R. P., Liwicki, S., Cipolla, R.: Fast-SCNN: Fast semantic segmentation network. arXiv preprint arXiv:1902.04502 (2019).
16. Lin, G., Milan, A., Shen, C., Reid, I.: RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1925-1934 (2017).
17. LW RefineNet implementation, https://github.com/DrSleep/lightweight-refinenet, last accessed 2020/10/30.
18. Fast-SCNN implementation, https://github.com/Tramac/Fast-SCNN-pytorch, last accessed 2020/10/30.