<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Staging E-Commerce Products for Online Advertising using Retrieval Assisted Image Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yueh-Ning Ku</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikhail Kuznetsov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paloma de Juan</string-name>
        </contrib>
        <aff id="aff2">
          <label>2</label>
          <institution>Snap Inc.</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Yahoo Research</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Online ads showing e-commerce products typically rely on the product images in a catalog sent to the advertising platform by an e-commerce platform. In the broader ads industry such ads are called dynamic product ads (DPA). It is common for DPA catalogs to be on the scale of millions (corresponding to the scale of products which can be bought from the e-commerce platform). However, not all product images in the catalog may be appealing when directly re-purposed as an ad image, and this may lead to lower click-through rates (CTRs). In particular, products just placed against a solid background may not be as enticing and realistic as a product staged in a natural environment. To address such shortcomings of DPA images at scale, we propose a generative adversarial network (GAN) based approach to generate staged backgrounds for un-staged product images. Generating the entire staged background is a challenging task susceptible to hallucinations. To get around this, we introduce a simpler approach called copy-paste staging using retrieval assisted GANs. In copy-paste staging, we first retrieve (from the catalog) staged products similar to the un-staged input product, and then copy-paste the background of the retrieved product into the input image. A GAN based in-painting model is used to fill the holes left after this copy-paste operation. We show the efficacy of our copy-paste staging method via offline metrics and human evaluation. In addition, we show how our staging approach can enable animations of moving products, leading to a video ad from a product image.</p>
      </abstract>
      <kwd-group>
        <kwd>image generation</kwd>
        <kwd>generative adversarial networks</kwd>
        <kwd>online advertising</kwd>
        <kwd>e-commerce</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The choice of image for an online ad can have a significant impact on the online user exposed to the ad. If the ad image is enticing enough, it can not only create brand awareness among online users but also drive them to click the ad and make subsequent purchases (conversions) [1, 2]. However, if the ad image is not properly designed to capture the user’s attention, it would lead to poor user interactions and adversely affect the advertising platform (by lowering revenue) and the advertiser (by lowering conversion rate). In this context, a common observation [3] is that ad images with products in a natural or real world setting (lifestyle images) tend to have better online performance. For example, an ad selling a chair is expected to perform better if the image shows a chair in a living room versus a chair against a solid (synthetic) background (as shown in Figure 1). However, such staging of products may be expensive and time consuming, especially when a vendor is selling multiple products at the same time.</p>
      <p>Figure 1: Un-staged and staged version of a chair sold by an e-commerce vendor (staged version is likely to have better CTR).</p>
      <p>In DPA offerings from ad platforms (e.g., Yahoo), the catalog images from an e-commerce vendor (e.g., Walmart, Amazon) are typically used directly as ad images. As described later in our data analysis (based on data from an ad platform), a major fraction of such images are not staged, and hence there is scope to enhance such images (e.g., by generating a suitable background for the product).</p>
      <p>Image generation has been an actively studied topic for the past few years. Diffusion models [4] and GANs [5, 6] have been powering the state-of-the-art results in this area. Image in-painting [7] is a slightly easier version of the problem where only parts of the image need to be generated as opposed to the whole image. To the best of our knowledge, we are the first to study GAN based image generation approaches for enhancing product images to serve as ad images. Our main contributions can be summarized as follows.</p>
      <list list-type="order">
        <list-item>
          <p>We study three tasks (as outlined below): (i) vanilla staging, (ii) copy-paste staging, and (iii) image-to-parallax animation.</p>
        </list-item>
        <list-item>
          <p>In task 1, we aim to generate the entire background for a product. We use pix2pix [6] to train a GAN model with pairs of images (input: segmented out product image, output: staged product image with ground truth background).</p>
        </list-item>
        <list-item>
          <p>In task 2, our goal is to retrieve a similar product image (with staging) and copy-paste the background while filling in gaps (holes) created in the process of swapping products. We leverage GAN-based in-painting to fill in the gaps mentioned above. We also introduce a weighted boundary loss for in-painting to focus on the image generation quality at product boundaries. Through the Fréchet inception distance (FID) score and human evaluations, we show that copy-paste staging is significantly better than the vanilla staging baseline.</p>
        </list-item>
        <list-item>
          <p>In task 3, we use GAN-based in-painting to create a sequence of images simulating the main product’s movement against the staged background as in a parallax animation. The foreground and background both move, but at different speeds, creating the illusion of depth. This is to show how our approach can lead to video ads from product images.</p>
        </list-item>
      </list>
      <p>Our retrieval based approach (second task above) shares the intuition common in text generation: retrieval augmented generation (RAG) has better context understanding and generation quality. The remainder of this paper is organized as follows: related work in Section 2, problem formulation in Section 3, relevant data in Section 4, and our proposed approaches in Section 5. We go over our experimental results in Section 6, and end with a discussion in Section 7.</p>
      <p>AdKDD’23, Long Beach, CA. ∗Corresponding author. †Work done while at Yahoo Research. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related work</title>
      <sec id="sec-1-2">
        <title>Online advertising</title>
        <p>In online advertising, the ad creative (text and image) plays an important role in influencing online users towards brand awareness, clicks and purchases [2, 8, 9]. Studying ad images and text using state-of-the-art deep learning models in computer vision and natural language processing (NLP) is an emerging area of research. In [10], ad image content was studied using computer vision models, and their dataset had manual annotations for: ad category, reasons to buy products advertised in the ad, and expected user response given the ad. Using this dataset, [11, 12] used ranking models to recommend themes for ad creative design using a brand’s Wikipedia page. In [3], object tag recommendations for improving an ad image were studied using data from A/B tests. Although related to ads, the above methods are not applicable in our setup since none of them are image generation methods.</p>
      </sec>
      <sec id="sec-1-3">
        <title>GANs for image generation</title>
        <p>Generative Adversarial Networks (GANs) are a popular approach for image generation. While vanilla GANs [5] generate images from random noise, conditional GANs [13] also allow extra information to be used in the generative process. The recently developed pix2pix method [6] and its successors [14] use conditional GANs for image-to-image generation, optimizing the GAN objective together with the distance to a target image. While we focus on GANs in this paper, our copy-paste staging approach can be generalized to more recent diffusion models [4] (discussed in Section 7).</p>
      </sec>
      <sec id="sec-1-4">
        <title>Saliency detection</title>
        <p>Saliency detection in product images is needed to understand which parts of an image correspond to the main product being advertised (as opposed to the background). Salient object detection (SOD) aims to detect the most visually attractive objects with precise boundaries in images (i.e., it returns a boundary map which can be used to segment out objects from the image). With the introduction of convolutional neural networks (CNNs) in computer vision, SOD accuracy has witnessed remarkable improvements. Recently, U2-Net [15] achieved state-of-the-art results for saliency detection by using a nested version of a U-Net [16] like architecture to capture richer local and global information. In our proposed approaches, we use saliency detection via U2-Net as a building block.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Problem formulation</title>
      <p>Task 2: retrieval assisted copy-paste staging. Task 2 is a simpler version of task 1. Here, we are given a pool of existing product images, and we need to retrieve a similar product image with staging such that we can copy-paste the staged background from the retrieved image onto the input image as shown in Figure 2. Image generation (in-painting) is needed to fill in the gaps after copy-pasting (since the input product and the product in the retrieved image are not identical, gaps will be created when we swap products).</p>
      <p>Task 3: image-to-parallax animation. In this task, the goal is to take an input image (as shown in Figure 2 for task 2), and create an animation (i.e., sequence of images), where the object in the input image (as in Figure 2) appears to be moving against a stationary but staged background. Such animations are expected to lead to higher user engagement [17].</p>
    </sec>
    <sec id="sec-2-1">
      <title>4. Data</title>
      <p>For our experiments, we sampled data from Yahoo Gemini DPA (spanning November-December 2020). With an impression threshold of 10,000,</p>
    </sec>
    <sec id="sec-3">
      <title>5. Product staging using GANs</title>
      <p>In this section, we first explain the salient object detection model which is common to our proposed approaches for all the tasks outlined in Section 3. Next, we cover our proposed approaches for tasks 1, 2, and 3 (tasks explained in Section 3) in Sections 5.2, 5.3, and 5.4 respectively. Our major modeling contributions are in the approaches for tasks 2 and 3; for task 1, although we define the task, we use existing approaches (pix2pix [6]) to solve it. Task 2 is an easier version of task 1, and our proposed approach leads to more realistic staged product images compared to pix2pix for task 1.</p>
      <sec id="sec-3-2">
        <title>5.1. Product segmentation via saliency</title>
        <p>For tasks 1, 2 and 3, we use U2-Net [15] as our salient object detector. The salient object detector plays a crucial role in the first step of our approaches since it separates the main product(s) that will be replaced or copy-pasted from the background in a product image. Once saliency probability maps are obtained from U2-Net, we set the threshold at 0.5 to generate binary masks and separate foreground pixels from the background pixels. For each task, we use product segmentation in a different manner as explained below.</p>
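        <p>As a minimal sketch of the thresholding step described above (not the authors' implementation), the following pure-Python snippet turns a saliency probability map into a binary mask at the 0.5 threshold and cuts out the product. The function names <monospace>binarize_saliency</monospace> and <monospace>cut_out_product</monospace>, the white fill value, and the strict greater-than comparison are illustrative assumptions; in practice the probability map would come from U2-Net and the images would be full-resolution arrays.</p>
        <preformat>
```python
def binarize_saliency(saliency, threshold=0.5):
    """Turn a saliency probability map into a binary foreground mask.

    saliency: list of rows, each a list of probabilities in [0, 1]
              (assumed to be the output of a salient object detector).
    Returns a mask of the same shape: 1 = product pixel, 0 = background.
    """
    return [[1 if p > threshold else 0 for p in row] for row in saliency]


def cut_out_product(image, mask, fill=255):
    """Keep product pixels and overwrite background pixels with `fill`."""
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 3x4 grayscale image with a bright "product" in the two middle columns.
saliency = [
    [0.1, 0.8, 0.9, 0.2],
    [0.0, 0.7, 0.6, 0.1],
    [0.2, 0.4, 0.3, 0.0],
]
image = [
    [10, 200, 210, 12],
    [11, 190, 180, 13],
    [14, 15, 16, 17],
]
mask = binarize_saliency(saliency)       # hypothetical helper, see lead-in
product = cut_out_product(image, mask)   # background replaced by white
```
        </preformat>
        <p>The same mask can be inverted to obtain the background-only image, which is the form needed when a product is removed from a retrieved image.</p>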
        <sec id="sec-3-2-1">
          <title>5.2. Vanilla staging</title>
          <p>For vanilla staging, we use the pix2pix method to generate an image background for a product which is segmented by a saliency mask. The algorithm is a conditional GAN optimizing a loss that combines (1) the regular GAN objective and (2) the ℓ1 distance between the original and restored images. We use product segmentation to prepare pairs of images to train pix2pix for staging (background generation). In particular, given an image with a staged product, we remove the background via product segmentation and use this as the input image to pix2pix. Figure 4 shows an example for this approach (original image in the middle, segmented product on the left, and the version with generated staging on the right).</p>
          <p>Figure 5: Proposed copy-paste staging algorithm.</p>
          <p>Figure 6 (a): Top-2 similar images.</p>
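          <p>The two-term objective used for vanilla staging can be sketched as follows. This is an illustrative pure-Python rendering of a conditional-GAN generator loss plus a weighted ℓ1 term, not the pix2pix code itself: the function names, the non-saturating form of the adversarial term, and the flat pixel lists are assumptions made for brevity. The default weight of 100 is the value reported in the pix2pix paper; the text above does not state which weight was used in the experiments.</p>
          <preformat>
```python
import math


def l1_distance(generated, target):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)


def generator_loss(disc_scores_on_fake, generated, target, l1_weight=100.0):
    """Adversarial term plus weighted L1 reconstruction term.

    disc_scores_on_fake: discriminator probabilities (0..1) that generated
                         images are real; the generator wants them near 1.
    l1_weight:           trade-off hyperparameter (assumed value, see lead-in).
    """
    eps = 1e-12  # avoids log(0)
    adversarial = -sum(math.log(s + eps) for s in disc_scores_on_fake)
    adversarial /= len(disc_scores_on_fake)
    return adversarial + l1_weight * l1_distance(generated, target)


# A generator that fools the discriminator and matches the target exactly.
perfect = generator_loss([1.0], [0.5, 0.5], [0.5, 0.5])
# A generator that is half-believed and misses one of two pixels by 1.0.
poor = generator_loss([0.5], [0.0, 1.0], [1.0, 1.0])
```
          </preformat>
          <p>With a large ℓ1 weight the reconstruction term dominates, which pushes the generated background towards the ground truth staging while the adversarial term encourages sharper detail.</p>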
        </sec>
        <sec id="sec-3-2-2">
          <title>5.3. Retrieval assisted copy-paste staging</title>
          <p>For copy-paste staging, we bypass the problem of generating the entire image background by using backgrounds from other relevant images. Our method consists of the following steps:</p>
          <list list-type="order">
            <list-item>
              <p>For a given segmented product (Figure 1, left), retrieve the top-<italic>k</italic> similar products from a training collection (Fig. 6a). The similarity measure is the cosine distance between embeddings of the corresponding product images provided by Inception-V3 [18].</p>
            </list-item>
            <list-item>
              <p>For the top-<italic>k</italic> similar images, segment out the original products (Fig. 6b) and fill in the holes by inpainting using EdgeConnect [7] (a GAN based model) and a new loss function that we introduce in Section 5.3.1.</p>
            </list-item>
            <list-item>
              <p>Copy-paste the original product image onto the inpainted top-<italic>k</italic> similar images, aligning the shape and center of mass of the corresponding product masks (Fig. 6c).</p>
            </list-item>
          </list>
          <p>The above algorithm is illustrated with examples in Figure 5. We provide additional examples in Figure 6. After completing steps 1-3, we generate <italic>k</italic> product images with various backgrounds, only small parts of which (holes around the product before/after) are generated by GANs, which makes the images look more realistic than those from vanilla staging. For better background generation we introduce a new loss function as described below.</p>
          <p>Figure 6 (b): Products masked out. Figure 6 (c): Copy-paste result after inpainting.</p>
        </sec>
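        <p>The retrieval and alignment parts of the copy-paste procedure can be sketched as below. This is a simplified illustration under stated assumptions: embeddings are short hand-written vectors rather than Inception-V3 features, the catalog ids are made up, ranking uses cosine similarity (equivalent to ranking by smallest cosine distance), and only the center of mass is aligned (the shape alignment and the in-painting step are omitted). All function names are hypothetical.</p>
        <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_top_k(query_emb, catalog, k=2):
    """Return the ids of the k catalog entries most similar to the query.

    catalog: dict mapping image id to embedding vector.
    """
    ranked = sorted(
        catalog,
        key=lambda image_id: cosine_similarity(query_emb, catalog[image_id]),
        reverse=True,
    )
    return ranked[:k]


def mask_centroid(mask):
    """Center of mass (row, col) of the 1-pixels in a non-empty binary mask."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def paste_offset(input_mask, retrieved_mask):
    """Integer shift moving the input product onto the retrieved product."""
    (r0, c0) = mask_centroid(input_mask)
    (r1, c1) = mask_centroid(retrieved_mask)
    return (round(r1 - r0), round(c1 - c0))


# Toy catalog with made-up ids and 2-d embeddings.
catalog = {"sofa_a": [0.9, 0.1], "lamp": [0.0, 1.0], "sofa_b": [0.7, 0.7]}
top2 = retrieve_top_k([1.0, 0.0], catalog, k=2)

input_mask = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
retrieved_mask = [[0, 0, 0], [0, 0, 0], [0, 1, 0]]
offset = paste_offset(input_mask, retrieved_mask)
```
        </preformat>
        <p>The offset tells where to paste the input product so that it covers as much as possible of the hole left by the removed product, which keeps the region that still needs in-painting small.</p>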
      </sec>
      <sec id="sec-3-5">
        <title>5.3.1. Weighted Boundary Loss</title>
        <p>Recent works [7] and [19] explore coarse-to-fine inpainting approaches: since the structures of objects are complex and diverse, adding an intermediate step, like edge maps or monochromic images, can help models learn progressively and eventually generate better final inpainted outputs. We propose a weighted boundary loss (WBL) to not only simplify the learning process (since the model needs to focus on a smaller area), but also mimic the end application use case. Following prior work [7], our total generator loss consists of a conventional adversarial loss and a feature-matching loss. In addition to these two losses, since our goal is to make the model learn better at the boundary of the masked area, we add a weighted boundary loss <italic>L</italic><sub>WBL</sub> to amplify the loss penalty at the boundary area pixels. WBL is:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[\mathcal{L}_{WBL} = W \ast \ell_{1}\text{-loss}(E_{gt}, E_{pred}),]]></tex-math>
        </disp-formula>
        <p>where <italic>E</italic><sub>gt</sub> is the ground truth edge map of the input images and <italic>E</italic><sub>pred</sub> is the predicted edge map generated by the generator. <italic>W</italic> is a pixel-wise weight map with the same size as the input masked images and the ground truth. To be more specific, <italic>W</italic> has the value <italic>w</italic><sub>boundary</sub> for pixels around the boundary between the masked area and the unmasked area, and <italic>w</italic><sub>non-boundary</sub> for pixels away from the boundary; the pixel-wise ℓ1-loss is multiplied by the corresponding weight when we calculate <italic>L</italic><sub>WBL</sub>. As Figure 7 illustrates, for each training sample, we create free-form dense masks by the method proposed by [20]. Then, we find the boundary area of the free-form mask and assign <italic>w</italic><sub>boundary</sub> (white area in Figure 7c) and <italic>w</italic><sub>non-boundary</sub> (gray area in Figure 7c). For experiments, we fixed <italic>w</italic><sub>boundary</sub> = 0.9 and <italic>w</italic><sub>non-boundary</sub> = 0.1.</p>
      </sec>
      <sec id="sec-3-6-1">
        <title>5.4. Image-to-animation</title>
        <p>A parallax effect happens when the background pixels move slower than foreground objects in an animation, thereby creating an illusion of depth in a two-dimensional image. Generally, a parallax effect requires independent foreground and background images, and a proper technique to make transparent backgrounds. In our proposed approach, by leveraging the power of salient object detection and in-painting, a parallax effect animation can be generated from a 2D image. Practically, we run salient object detection to define foreground pixels, then gradually move the foreground object around, creating empty gaps between the current position of the object and its original position. To fill the empty gap, we then use our image in-painting model to in-paint those pixels and create a series of realistic images. We illustrate sample results of the above approach in Figure 8.</p>
      </sec>
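      <p>The frame generation idea of Section 5.4 can be sketched as follows. This is a toy illustration on a grayscale image stored as nested lists, under several assumptions: the product moves horizontally only, the background is kept stationary, the hole is filled once rather than per frame, and the in-painting model is replaced by a crude placeholder that fills hole pixels with the row mean. The names <monospace>inpaint_placeholder</monospace> and <monospace>parallax_frames</monospace> are hypothetical.</p>
      <preformat>
```python
def inpaint_placeholder(image, hole_mask):
    """Stand-in for the GAN in-painting model (assumption, see lead-in).

    Fills every hole pixel with the integer mean of the known pixels in
    the same row, or 0 if the whole row is a hole.
    """
    filled = []
    for img_row, hole_row in zip(image, hole_mask):
        known = [p for p, h in zip(img_row, hole_row) if not h]
        fill = sum(known) // len(known) if known else 0
        filled.append([fill if h else p for p, h in zip(img_row, hole_row)])
    return filled


def parallax_frames(image, product_mask, shifts):
    """Create one frame per horizontal shift of the product.

    image:        grayscale image as a list of rows.
    product_mask: binary mask of the product (1 = product pixel).
    shifts:       horizontal displacements in pixels (positive = right).
    """
    # Remove the product from its original position and fill the gap.
    background = inpaint_placeholder(image, product_mask)
    frames = []
    for dx in shifts:
        frame = [row[:] for row in background]
        for r, mask_row in enumerate(product_mask):
            for c, is_product in enumerate(mask_row):
                # Paste the product pixel at its shifted position if it
                # still falls inside the image.
                if is_product and (c + dx) in range(len(mask_row)):
                    frame[r][c + dx] = image[r][c]
        frames.append(frame)
    return frames


# One-row toy image: the value 9 is the product, 5 is background.
image = [[5, 5, 9, 5, 5]]
product_mask = [[0, 0, 1, 0, 0]]
frames = parallax_frames(image, product_mask, shifts=[0, 1, 2])
```
      </preformat>
      <p>Playing the frames in order shows the product sliding to the right over a complete background; moving the background by a smaller shift per frame would give the two-speed parallax effect.</p>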
    </sec>
    <sec id="sec-4">
      <title>6. Results</title>
      <p>We first go over some sample results for tasks 1-3 followed by offline metrics (retrieval performance, generation quality) and human perceptual study results.</p>
      <sec id="sec-4-1">
        <title>6.1. Sample results for tasks 1, 2 and 3</title>
        <p>Task 1 (vanilla staging): sample results for task 1 (obtained via pix2pix) are shown in Figures 9 (b) and 10 (b). Compared to the original image with the background, the generated image has many artifacts and does not look realistic. As we discuss below, the copy-paste staging results look more realistic.</p>
        <p>Task 2 (copy-paste staging): Figures 9 (e) and 10 (e) show sample results for task 2 using the proposed copy-paste staging approach. Overall, the copy-paste staging results look much more realistic compared to pix2pix results.</p>
        <p>0.409 0.664 0.374 0.734</p>
      </sec>
      <sec id="sec-4-3">
        <title>6.2. Offline metrics</title>
        <p>Similar product image retrieval performance: a retrieved image is considered similar if it belongs to the same subcategory as the input product. For example, if the input is a queen bed of subcategory ”Furniture &gt; Bedroom &gt; Headboards &gt; Queen”,</p>
        <p>Generation quality: we measure the performance of our copy-paste staging results by evaluating the Fréchet inception distance (FID) [<xref ref-type="bibr" rid="ref1">21</xref>]. FID is a popular metric for evaluating the quality of images created by GANs. The Wasserstein-2 distance in FID is calculated by comparing the feature distribution of in-painted images with the distribution of real images, where the features are generated by a pre-trained InceptionV3 model. The comparison results are shown in Table 2. Since the copy-paste method in-paints only small regions of the image around an object, it achieves a much better FID score than vanilla staging. WBL further improves the FID score in both methods.</p>
      </sec>
      <sec id="sec-4-4">
        <title>6.3. Human evaluation</title>
        <list list-type="bullet">
          <list-item>
            <p>0% of pix2pix images were better than ground truth;</p>
          </list-item>
          <list-item>
            <p>3% of copy-paste images were better than ground truth;</p>
          </list-item>
          <list-item>
            <p>76% of copy-paste images were better than pix2pix.</p>
          </list-item>
        </list>
        <p>The above results clearly demonstrate the superiority of copy-paste staging and are in line with offline FID scores.</p>
        <p>Panel (e): Results of copy-paste staging.</p>
      </sec>
    </sec>
    <sec id="sec-4-4-1">
      <title>7. Discussion</title>
      <p>Our proposed approach provides low budget advertisers a way to stage products digitally without having to spend on the physical resources needed for staging. Staging a room can easily cost up to a few hundred dollars for an advertiser, and with image generation methods like the ones we have proposed, this would be basically free of cost (except for the legalities around copying backgrounds from other images). Leveraging the recent progress in prompt based image generation models, our approach can be further improved along the following lines: backgrounds from similar images could be used to generate prompts which then generate the background of the original product image. In addition, staged ads and parallax animations are expected to drive user engagement, and validating this hypothesis via an A/B test is one of our next steps.</p>
      <sec id="sec-4-4-2">
        <title>References</title>
        <ref-list>
          <ref><mixed-citation>[1] N. Bhamidipati, R. Kant, S. Mishra, A large scale prediction engine for app install clicks and conversions, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, 2017.</mixed-citation></ref>
          <ref><mixed-citation>[2] Y. Zhou, S. Mishra, J. Gligorijevic, T. Bhatia, N. Bhamidipati, Understanding consumer journey using attention based recurrent neural networks, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, KDD ’19, 2019.</mixed-citation></ref>
          <ref><mixed-citation>[3] S. Mishra, M. Verma, Y. Zhou, K. Thadani, W. Wang, Learning to create better ads: Generation and ranking approaches for ad creative refinement, in: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management, CIKM ’20, 2020.</mixed-citation></ref>
          <ref><mixed-citation>[4] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-shot text-to-image generation, ICML 2021.</mixed-citation></ref>
          <ref><mixed-citation>[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014).</mixed-citation></ref>
          <ref><mixed-citation>[6] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.</mixed-citation></ref>
          <ref><mixed-citation>[7] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, M. Ebrahimi, Edgeconnect: Generative image inpainting with adversarial edge learning, 2019.</mixed-citation></ref>
          <ref><mixed-citation>[8] S. Mishra, M. Kuznetsov, G. Srivastava, M. Sviridenko, Visualtextrank: Unsupervised graph-based content extraction for automating ad text to image search, KDD ’21, 2021.</mixed-citation></ref>
          <ref><mixed-citation>[9] M. Verma, S. Mishra, Recommendation systems for ad creation: A view from the trenches, RecSys ’22, 2022.</mixed-citation></ref>
          <ref><mixed-citation>[10] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, Automatic understanding of image and video advertisements, in: CVPR, 2017.</mixed-citation></ref>
          <ref><mixed-citation>[11] S. Mishra, M. Verma, J. Gligorijevic, Guiding creative design in online advertising, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, 2019.</mixed-citation></ref>
          <ref><mixed-citation>[12] Y. Zhou, S. Mishra, M. Verma, N. Bhamidipati, W. Wang, Recommending themes for ad creative design via visual-linguistic representations, in: Proceedings of The Web Conference 2020, WWW ’20, 2020.</mixed-citation></ref>
          <ref><mixed-citation>[13] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784 (2014).</mixed-citation></ref>
          <ref><mixed-citation>[14] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional gans, in: IEEE CVPR, 2018, pp. 8798–8807.</mixed-citation></ref>
          <ref><mixed-citation>[15] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. R. Zaiane, M. Jagersand, U2-net: Going deeper with nested u-structure for salient object detection, Pattern Recognition 106 (2020) 107404.</mixed-citation></ref>
          <ref><mixed-citation>[16] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.</mixed-citation></ref>
          <ref><mixed-citation>[17] New study by verizon media, magna, &amp; ipg media lab finds interactive ad formats engage hard-to-convince audiences, https://www.verizonmedia.com/press/2021/04/12/new-study-by-verizon-media-magna-ipg-media-lab.</mixed-citation></ref>
          <ref><mixed-citation>[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: IEEE CVPR, 2016, pp. 2818–2826.</mixed-citation></ref>
          <ref><mixed-citation>[19] T. Wang, H. Ouyang, Q. Chen, Image inpainting with external-internal learning and monochromic bottleneck, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5120–5129.</mixed-citation></ref>
          <ref><mixed-citation>[20] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T. S. Huang, Free-form image inpainting with gated convolution, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4471–4480.</mixed-citation></ref>
        </ref-list>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heusel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ramsauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <article-title>GANs trained by a two time-scale update rule converge to a local Nash equilibrium</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>