
Staging E-Commerce Products for Online Advertising using Retrieval Assisted Image Generation

Yueh-Ning Ku, Mikhail Kuznetsov, Paloma de Juan

Snap Inc., USA; Yahoo Research, USA

Online ads showing e-commerce products typically rely on the product images in a catalog sent to the advertising platform by an e-commerce platform. In the broader ads industry such ads are called dynamic product ads (DPA). DPA catalogs commonly contain millions of images (matching the number of products that can be bought from the e-commerce platform). However, not all product images in the catalog are appealing when directly re-purposed as an ad image, and this may lead to lower click-through rates (CTRs). In particular, a product placed against a solid background may not be as enticing and realistic as a product staged in a natural environment. To address such shortcomings of DPA images at scale, we propose a generative adversarial network (GAN) based approach to generate staged backgrounds for un-staged product images. Generating the entire staged background is a challenging task susceptible to hallucinations. To get around this, we introduce a simpler approach called copy-paste staging using retrieval assisted GANs. In copy-paste staging, we first retrieve (from the catalog) staged products similar to the un-staged input product, and then copy-paste the background of the retrieved product into the input image. A GAN based in-painting model is used to fill the holes left after this copy-paste operation. We show the efficacy of our copy-paste staging method via offline metrics and human evaluation. In addition, we show how our staging approach can enable animations of moving products, leading to a video ad from a product image.

Keywords: image generation, generative adversarial networks, online advertising, e-commerce

1. Introduction

The choice of image for an online ad can have a significant impact on the online user exposed to the ad. If the ad image is enticing enough, it can not only create brand awareness among online users but also drive them to click the ad and make subsequent purchases (conversions) [1, 2]. However, if the ad image is not properly designed to capture the user's attention, it would lead to poor user interactions and adversely affect the advertising platform (by lowering revenue) and the advertiser (by lowering conversion rate). In this context, a common observation [3] is that ad images with products in a natural or real world setting (lifestyle images) tend to have better online performance. For example, an ad selling a chair is expected to perform better if the image shows a chair in a living room versus a chair against a solid (synthetic) background (as shown in Figure 1). However, such staging of products may be expensive and time consuming, especially when a vendor is selling multiple products at the same time.

Figure 1: Un-staged and staged version of a chair sold by an e-commerce vendor (staged version is likely to have better CTR).

In DPA offerings from ad platforms (e.g., Yahoo), the catalog images from an e-commerce vendor (e.g., Walmart, Amazon) are typically used directly as ad images. As described later in our data analysis (based on data from an ad platform), a major fraction of such images are not staged, and hence there is scope to enhance such images (e.g., by generating a suitable background for the product).

Image generation has been an actively studied topic for the past few years. Diffusion models [4] and GANs [5, 6] have been powering the state-of-the-art results in this area. Image in-painting [7] is a slightly easier version of the problem where only parts of the image need to be generated as opposed to the whole image. To the best of our knowledge, we are the first to study GAN based image generation approaches for enhancing product images to serve as ad images. Our main contributions can be summarized as follows.

1. We study three tasks (as outlined below): (i) vanilla staging, (ii) copy-paste staging, and (iii) image-to-parallax animation.
2. In task 1, we aim to generate the entire background for a product. We use pix2pix [6] to train a GAN model with pairs of images (input: segmented out product image, output: staged product image with ground truth background).
3. In task 2, our goal is to retrieve a similar product image (with staging) and copy-paste the background while filling in gaps (holes) created in the process of swapping products. We leverage GAN-based in-painting to fill in the gaps mentioned above. We also introduce a weighted boundary loss for in-painting to focus on the image generation quality at product boundaries. Through Frechet inception distance (FID) score and human evaluations, we show that copy-paste staging is significantly better than the vanilla staging baseline.
4. In task 3, we use GAN-based in-painting to create a sequence of images simulating the main product's movement against the staged background as in a parallax animation. The foreground and background both move, but at different speeds, creating the illusion of depth. This is to show how our approach can lead to video ads from product images.

Our retrieval based approach (second task above) shares the intuition common in text generation: retrieval augmented generation (RAG) has better context understanding and generation quality. The remainder of this paper is organized as follows: related work in Section 2, problem formulation in Section 3, relevant data in Section 4, and our proposed approaches in Section 5. We go over our experimental results in Section 6, and end with a discussion in Section 7.

AdKDD'23, Long Beach, CA. ∗Corresponding author. †Work done while at Yahoo Research.
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

2. Related work

Online advertising: In online advertising, the ad creative (text and image) plays an important role in influencing online users towards brand awareness, clicks and purchases [2, 8, 9]. Studying ad images and text using state-of-the-art deep learning models in computer vision and natural language processing (NLP) is an emerging area of research. In [10], ad image content was studied using computer vision models, and their dataset had manual annotations for: ad category, reasons to buy products advertised in the ad, and expected user response given the ad. Using this dataset, [11, 12] used ranking models to recommend themes for ad creative design using a brand's Wikipedia page. In [3], object tag recommendations for improving an ad image were studied using data from A/B tests. Although related to ads, the above methods are not applicable in our setup since none of them are image generation methods.

GANs for image generation: Generative Adversarial Networks (GANs) are a popular approach for image generation. While vanilla GANs [5] generate images from random noise, Conditional GANs [13] also allow extra information to be used in the generative process. The recently developed pix2pix method [6] and its successors [14] use Conditional GANs for image-to-image generation, optimizing the GAN objective together with the distance to a target image. While we focus on GANs in this paper, our copy-paste staging approach can be generalized to more recent diffusion models [4] (discussed in Section 7).

Saliency detection: Saliency detection in product images is needed to understand which parts of an image correspond to the main product being advertised (as opposed to the background). Salient object detection (SOD) aims to detect the most visually attractive objects with precise boundaries in images (i.e., it returns a boundary map which can be used to segment out objects from the image). With the introduction of convolutional neural networks (CNNs) in computer vision, SOD accuracy has witnessed remarkable improvements. Recently, U2-Net [15] achieved state-of-the-art results for saliency detection by using a nested version of a U-Net [16]-like architecture to capture richer local and global information. In our proposed approaches, we use saliency detection via U2-Net as a building block.

3. Problem formulation

Task 2: retrieval assisted copy-paste staging. Task 2 is a simpler version of task 1. Here, we are given a pool of existing product images, and we need to retrieve a similar product image with staging such that we can copy-paste the staged background from the retrieved image onto the input image as shown in Figure 2. Image generation (in-painting) is needed to fill in the gaps after copy-pasting (since the input product and the product in the retrieved image are not identical, gaps will be created when we swap products).

Task 3: image-to-parallax animation. In this task, the goal is to take an input image (as shown in Figure 2 for task 2), and create an animation (i.e., sequence of images), where the object in the input image (as in Figure 2) appears to be moving against a stationary but staged background. Such animations are expected to lead to higher user engagement [17].

4. Data

For our experiments, we sampled data from Yahoo Gemini DPA (spanning November-December 2020). With an impression threshold of 10,000, [...]

5. Product staging using GANs

In this section, we first explain the salient object detection model which is common to our proposed approaches for all the tasks outlined in Section 3. Next, we cover our proposed approaches for tasks 1, 2, and 3 (tasks explained in Section 3) in Sections 5.2, 5.3, and 5.4 respectively. Our major modeling contributions are in the approaches for tasks 2 and 3; for task 1, although we define the task, we use existing approaches (pix2pix [6]) to solve it. Task 2 is an easier version of task 1, and our proposed approach leads to more realistic staged product images compared to pix2pix for task 1.

5.1. Product segmentation via saliency

For tasks 1, 2 and 3, we use U2-Net [15] as our salient object detector. The salient object detector plays a crucial role in the first step of our approaches since it separates the main product(s), which will be replaced or copy-pasted, from the background in a product image. Once saliency probability maps are obtained from U2-Net, we set the threshold at 0.5 to generate binary masks and separate foreground pixels from the background pixels. For each task, we use product segmentation in a different manner as explained below.
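The thresholding step described above can be illustrated with a minimal sketch. It assumes the saliency map is already available as rows of probabilities (the toy values below stand in for a U2-Net output), and the helper names `binarize_saliency` and `split_by_mask` are hypothetical, not taken from the paper. Plain Python lists are used only to keep the example dependency-free.

```python
def binarize_saliency(saliency, threshold=0.5):
    """Turn a saliency probability map (rows of floats in [0, 1]) into a 0/1 mask."""
    # Assumption: a pixel counts as foreground when its probability is strictly
    # above the threshold; the text only states that the threshold is 0.5.
    return [[1 if p > threshold else 0 for p in row] for row in saliency]


def split_by_mask(image, mask, fill=0):
    """Return (foreground, background) copies of `image`, blanking the other part with `fill`."""
    foreground = [[px if m else fill for px, m in zip(img_row, m_row)]
                  for img_row, m_row in zip(image, mask)]
    background = [[fill if m else px for px, m in zip(img_row, m_row)]
                  for img_row, m_row in zip(image, mask)]
    return foreground, background


saliency = [[0.9, 0.2], [0.5, 0.7]]  # toy stand-in for a saliency probability map
image = [[10, 20], [30, 40]]         # toy single-channel image
mask = binarize_saliency(saliency)
foreground, background = split_by_mask(image, mask)
```

The same binary mask serves both directions: keeping the foreground gives the segmented product, while keeping the background leaves a hole where the product used to be.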

5.2. Vanilla staging

For vanilla staging, we use the pix2pix method to generate an image background for a product which is segmented by a saliency mask. The algorithm is a conditional GAN optimizing a loss that combines 1) the regular GAN objective, and 2) the ℓ1 distance between original and restored images. We use product segmentation to prepare pairs of images to train pix2pix for staging (background generation). In particular, given an image with a staged product, we remove the background via product segmentation and use this as the input image to pix2pix. Figure 4 shows an example for this approach (original image in the middle, segmented product on the left, and the version with generated staging on the right).

Figure 5: Proposed copy-paste staging algorithm.
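As a rough illustration of the combined objective used for vanilla staging in Section 5.2, the sketch below adds an adversarial term to an ℓ1 term for the generator. It is not the authors' training code: the function name `generator_loss`, the flat-list image representation, and the weight `l1_weight=100.0` (the value used in the original pix2pix paper; this paper does not state its setting) are all assumptions.

```python
import math


def generator_loss(disc_prob_fake, generated, target, l1_weight=100.0):
    """Sketch of a pix2pix-style generator loss: adversarial term + weighted l1 term.

    disc_prob_fake: discriminator probabilities that generated outputs are real.
    generated, target: flat lists of pixel values for the restored and original images.
    l1_weight: assumed trade-off weight (not given in the text).
    """
    eps = 1e-12  # avoids log(0)
    # Adversarial term: the generator wants the discriminator to output 1 on its images.
    adversarial = -sum(math.log(max(p, eps)) for p in disc_prob_fake) / len(disc_prob_fake)
    # l1 term: mean absolute difference between restored and original pixels.
    l1 = sum(abs(g - t) for g, t in zip(generated, target)) / len(target)
    return adversarial + l1_weight * l1


perfect = generator_loss([1.0], [0.2, 0.8], [0.2, 0.8])
off_by_one = generator_loss([1.0], [0.0, 0.0], [1.0, 1.0])
```

With a perfectly fooled discriminator and an exact reconstruction the loss is zero; any pixel-level deviation is penalised through the ℓ1 term.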

5.3. Retrieval assisted copy-paste staging

For copy-paste staging, we bypass the problem of generating the entire image background by using backgrounds from other relevant images. Our method consists of the following steps:

1. For a given segmented product (Figure 1, left), retrieve the top-k similar products from a training collection (Fig. 6a). Similarity is measured by the cosine distance between the Inception-V3 [18] embeddings of the corresponding product images.
2. For the top-k similar images, segment out the original products (Fig. 6b) and fill in the holes by inpainting using EdgeConnect [7] (a GAN based model) and a new loss function that we introduce in Section 5.3.1.
3. Copy-paste the original product image onto the inpainted top-k similar images, aligning the shape and center of mass of the corresponding product masks (Fig. 6c).
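The retrieval in step 1 can be sketched as a brute-force nearest-neighbour search. The sketch assumes the embeddings have already been computed (the paper uses Inception-V3 features; the two-dimensional vectors below are toy values), and the names `cosine_similarity` and `retrieve_top_k` are hypothetical. Ranking by highest cosine similarity is equivalent to ranking by smallest cosine distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, catalog, k=2):
    """Return the ids of the k catalog items most similar to the query.

    catalog: dict mapping a product id to its embedding (list of floats).
    """
    scored = [(cosine_similarity(query_embedding, emb), pid)
              for pid, emb in catalog.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:k]]


catalog = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [0.9, 0.5]}  # toy embeddings
top2 = retrieve_top_k([1.0, 0.0], catalog, k=2)
```

For a catalog with millions of images, an approximate nearest-neighbour index would likely replace this linear scan; the text does not say which retrieval backend was used.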

The above algorithm is illustrated with examples in Figure 5. We provide additional examples in Figure 6.

Figure 6: (a) Top-2 similar images. (b) Products masked out. (c) Copy-paste result after inpainting.

After completing steps 1-3, we obtain product images with various backgrounds, only small parts of which (holes around the product before/after) are generated by GANs; this makes the images look more realistic than those from vanilla staging. For better background generation, we introduce a new loss function, described in Section 5.3.1.
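Step 3 of the algorithm (pasting the input product onto an inpainted background) can be sketched as below. This is a simplified illustration: it aligns only the centers of mass of the two product masks by translation and omits the shape alignment mentioned in the text. The helper names `center_of_mass` and `paste_product` are hypothetical, and both masks are assumed to be non-empty.

```python
def center_of_mass(mask):
    """Mean (row, column) position of the non-zero pixels of a non-empty 0/1 mask."""
    sum_y = sum_x = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_y += y
                sum_x += x
                count += 1
    return sum_y / count, sum_x / count


def paste_product(background, product, product_mask, target_mask):
    """Paste masked product pixels onto a copy of `background`.

    The product is translated so that its mask's center of mass lands on the
    center of mass of `target_mask` (the mask of the product that was removed).
    """
    cy_t, cx_t = center_of_mass(target_mask)
    cy_p, cx_p = center_of_mass(product_mask)
    dy, dx = round(cy_t - cy_p), round(cx_t - cx_p)
    out = [row[:] for row in background]  # leave the input background untouched
    height, width = len(out), len(out[0])
    for y, row in enumerate(product_mask):
        for x, value in enumerate(row):
            ty, tx = y + dy, x + dx
            if value and 0 <= ty < height and 0 <= tx < width:
                out[ty][tx] = product[y][x]
    return out


size = 4
background = [[0] * size for _ in range(size)]    # toy inpainted background
product = [[0] * size for _ in range(size)]
product[0][0] = 9                                 # one-pixel toy product
product_mask = [[0] * size for _ in range(size)]
product_mask[0][0] = 1
target_mask = [[0] * size for _ in range(size)]
target_mask[2][3] = 1                             # where the retrieved product used to be
staged = paste_product(background, product, product_mask, target_mask)
```

Pixels of the old product's hole that the new product does not cover are the regions the in-painting model has to fill.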

5.3.1. Weighted Boundary Loss

Recent works [7, 19] explore coarse-to-fine inpainting approaches: since the structures of objects are complex and diverse, adding an intermediate step, such as edge maps or monochromic images, can help models learn progressively and eventually generate better final inpainted outputs. We propose a weighted boundary loss (WBL) to not only simplify the learning process (since the model needs to focus on a smaller area), but also mimic the end application use case. Following prior work [7], our total generator loss consists of a conventional adversarial loss and a feature-matching loss. In addition to these two losses, since our goal is to make the model learn better at the boundary of the masked area, we add the weighted boundary loss to amplify the loss penalty at the boundary area pixels.

The WBL (Equation 1) is computed from the ground truth edge map of the input images and the predicted edge map generated by the generator, together with a pixel-wise weight map that has the same size as the input masked images and the ground truth. More specifically, the weight map takes the value w for pixels around the boundary between the masked area and the unmasked area, and 1 − w for pixels away from the boundary; each pixel-wise weight multiplies the corresponding per-pixel term when we calculate the WBL. As Figure 7 illustrates, for each training sample, we create free-form dense masks using the method proposed in [20]. Then, we find the boundary area of the free-form mask and assign it w (white area in Figure 7c), with 1 − w assigned elsewhere (gray area in Figure 7c). For experiments, we fixed w = 0.9 and 1 − w = 0.1.
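To make the weighting scheme concrete, the following sketch builds a boundary weight map from a binary hole mask and applies it to a per-pixel loss between two edge maps. Several details are assumptions rather than the paper's definition: the boundary band is taken to be all pixels within `radius` of a masked/unmasked transition, the per-pixel term is an absolute difference, and the result is averaged over all pixels. The function names are hypothetical.

```python
def boundary_weight_map(mask, w=0.9, radius=1):
    """Weight map with `w` near the mask boundary and `1 - w` elsewhere.

    mask: rows of 0/1 values (1 = masked/hole pixel).
    radius: assumed half-width of the boundary band, in pixels.
    """
    height, width = len(mask), len(mask[0])
    weights = [[1.0 - w] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # A pixel is near the boundary if any neighbour within `radius`
            # has a different mask value.
            near_boundary = any(
                mask[ny][nx] != mask[y][x]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            if near_boundary:
                weights[y][x] = w
    return weights


def weighted_boundary_loss(edge_gt, edge_pred, weights):
    """Mean of per-pixel |ground truth - prediction| scaled by the weight map (assumed form)."""
    total = 0.0
    for gt_row, pred_row, w_row in zip(edge_gt, edge_pred, weights):
        for gt, pred, weight in zip(gt_row, pred_row, w_row):
            total += weight * abs(gt - pred)
    return total / (len(edge_gt) * len(edge_gt[0]))


hole_mask = [[1, 1, 0, 0] for _ in range(4)]       # toy mask: left half is the hole
weights = boundary_weight_map(hole_mask, w=0.9)
edge_gt = [[0.0] * 4 for _ in range(4)]
edge_pred = [[1.0] * 4 for _ in range(4)]
loss = weighted_boundary_loss(edge_gt, edge_pred, weights)
```

With w = 0.9, an error on a boundary pixel costs nine times as much as the same error far from the boundary, which is the amplification effect the section describes.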

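5.4. Image-to-animation

Parallax effect happens when the background pixels move slower than foreground objects in an animation, thereby creating an illusion of depth in a two-dimensional image. Generally, a parallax effect requires independent foreground and background images, and a proper technique to make backgrounds transparent. In our proposed approach, by leveraging the power of salient object detection and in-painting, a parallax effect animation can be generated from a 2D image. Practically, we run salient object detection to define the foreground pixels, then gradually move the foreground object around, creating empty gaps between the current position of the object and its original position. To fill the gaps, we then use our image in-painting model to in-paint those pixels and create a series of realistic images. We illustrate sample results of the above approach in Figure 8.

6. Results

We first go over some sample results for tasks 1-3, followed by offline metrics (retrieval performance, generation quality) and human perceptual study results.

6.1. Sample results for tasks 1, 2 and 3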

Task 1 (vanilla staging): sample results for task 1 (obtained via pix2pix) are shown in Figures 9 (b) and 10 (b). Compared to the original image with the background, the generated image has many artifacts and does not look realistic. As we discuss below, the copy-paste staging results look more realistic.

Task 2 (copy-paste staging): Figures 9 (e) and 10 (e) show sample results for task 2 using the proposed copy-paste staging approach. Overall, the copy-paste staging results look much more realistic compared to pix2pix results.

6.2. Offline metrics

Similar product image retrieval performance: a retrieved image is considered similar if it belongs to the same subcategory as the input product. For example, if the input is a queen bed of subcategory "Furniture > Bedroom > Headboards > Queen", [...]

Generation quality: we measure the performance of our copy-paste staging results by evaluating the Frechet inception distance (FID) [21]. FID is a popular metric for evaluating the quality of images created by GANs. The Wasserstein-2 distance in FID is calculated by comparing the feature distribution of in-painted images with the distribution of real images, where the features are generated by a pre-trained InceptionV3 model. The comparison results are shown in Table 2. Since the copy-paste method in-paints only small regions of the image around an object, it achieves a much better FID score than vanilla staging. WBL further improves the FID score in both methods.

6.3. Human evaluation

• 0% of pix2pix images were better than ground truth;
• 3% of copy-paste images were better than ground truth;
• 76% of copy-paste images were better than pix2pix.

The above results clearly demonstrate the superiority of copy-paste staging and are in line with offline FID scores.

7. Discussion

Our proposed approach provides low budget advertisers a way to stage products digitally without having to spend on the physical resources needed for staging. Staging a room can easily cost up to a few hundred dollars for an advertiser, and with image generation methods like the ones we have proposed, this would be basically free of cost (except for the legalities around copying backgrounds from other images). Leveraging the recent progress in prompt based image generation models, our approach can be further improved along the following lines: backgrounds from similar images could be used to generate prompts which then generate the background of the original product image. In addition, staged ads and parallax animations are expected to drive user engagement, and validating this hypothesis via an A/B test is one of our next steps.

References

[1] N. Bhamidipati, R. Kant, S. Mishra, A large scale prediction engine for app install clicks and conversions, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, 2017.
[2] Y. Zhou, S. Mishra, J. Gligorijevic, T. Bhatia, N. Bhamidipati, Understanding consumer journey using attention based recurrent neural networks, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, 2019.
[3] S. Mishra, M. Verma, Y. Zhou, K. Thadani, W. Wang, Learning to create better ads: Generation and ranking approaches for ad creative refinement, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, 2020.
[4] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-shot text-to-image generation, ICML 2021.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014).
[6] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
[7] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, M. Ebrahimi, Edgeconnect: Generative image inpainting with adversarial edge learning, 2019.
[8] S. Mishra, M. Kuznetsov, G. Srivastava, M. Sviridenko, Visualtextrank: Unsupervised graph-based content extraction for automating ad text to image search, KDD '21, 2021.
[9] M. Verma, S. Mishra, Recommendation systems for ad creation: A view from the trenches, RecSys '22, 2022.
[10] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, Automatic understanding of image and video advertisements, in: CVPR, 2017.
[11] S. Mishra, M. Verma, J. Gligorijevic, Guiding creative design in online advertising, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, 2019.
[12] Y. Zhou, S. Mishra, M. Verma, N. Bhamidipati, W. Wang, Recommending themes for ad creative design via visual-linguistic representations, in: Proceedings of The Web Conference 2020, WWW '20, 2020.
[13] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784 (2014).
[14] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional gans, in: IEEE CVPR, 2018, pp. 8798–8807.
[15] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. R. Zaiane, M. Jagersand, U2-net: Going deeper with nested u-structure for salient object detection, Pattern Recognition 106 (2020) 107404.
[16] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.
[17] New study by verizon media, magna, & ipg media lab finds interactive ad formats engage hard-to-convince audiences, https://www.verizonmedia.com/press/2021/04/12/new-study-by-verizon-media-magna-ipg-media-lab.
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: IEEE CVPR, 2016, pp. 2818–2826.
[19] T. Wang, H. Ouyang, Q. Chen, Image inpainting with external-internal learning and monochromic bottleneck, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5120–5129.
[20] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T. S. Huang, Free-form image inpainting with gated convolution, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4471–4480.
[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two time-scale update rule converge to a local nash equilibrium, Advances in neural information processing systems 30 (2017).