Fashion Outfit Generation for E-commerce

Elaine M. Bettaney, ASOS.com, London, UK (elaine.bettaney@asos.com)
Stephen R. Hardwick, ASOS.com, London, UK (stephen.hardwick@asos.com)
Odysseas Zisimopoulos, ASOS.com, London, UK (odysseas.zisimopoulos@asos.com)
Benjamin Paul Chamberlain, ASOS.com, London, UK (ben.chamberlain@asos.com)

ABSTRACT
Combining items of clothing into an outfit is a major task in fashion
retail. Recommending sets of items that are compatible with a
particular seed item is useful for providing users with guidance and
inspiration, but is currently a manual process that requires expert
stylists and is therefore not scalable or easy to personalise. We use a
multilayer neural network fed by visual and textual features to learn
embeddings of items in a latent style space such that compatible
items of different types are embedded close to one another. We
train our model using the ASOS outfits dataset, which consists
of a large number of outfits created by professional stylists and
which we release to the research community. Our model shows
strong performance in an offline outfit compatibility prediction
task. We use our model to generate outfits and for the first time
in this field perform an AB test, comparing our generated outfits
to those produced by a baseline model which matches appropriate
product types but uses no information on style. Users approved
of outfits generated by our model 21% and 34% more frequently
than those generated by the baseline model for womenswear and
menswear respectively.

KEYWORDS
Representation learning, fashion, multi-modal deep learning

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1    INTRODUCTION
User needs based around outfits include answering questions such as "What trousers will go with this shirt?", "What can I wear to a party?" or "Which items should I add to my wardrobe for summer?". The key to answering these questions is an understanding of style. Style encompasses a broad range of properties including, but not limited to, colour, shape, pattern and fabric. It may also incorporate current fashion trends, a user's style preferences and an awareness of the context in which the outfits will be worn. In the growing world of fashion e-commerce it is becoming increasingly important to be able to fulfill these needs in a way that is scalable, automated and ultimately personalised.

[Figure 1: An ASOS fashion product together with associated product data and styling products in a Buy the Look (BTL) carousel as shown on a Product Description Page (PDP).]

   This paper describes a system for Generating Outfit Recommendations from Deep Networks (GORDN) under development at ASOS.com. ASOS is a global e-commerce company focusing on fashion and beauty. With approximately 87,000 products on site at any one time, it is difficult for customers to perform an exhaustive search to find products that can be worn together. Each fashion product added to our catalogue is photographed on a model as part of an individually curated outfit of compatible products chosen by our stylists to create images for its Product Description Page (PDP). The products comprising the outfit are then displayed to the customer in a Buy the Look (BTL) carousel (Figure 1). This offering, however, is not scalable as it requires manual input for every outfit. We aim to learn from the information encoded in these outfits to automatically generate an unlimited number of outfits.


   A common way for people to compose outfits is to first pick a seed item, such as a patterned shirt, and then find other compatible items. We focus on this task: completing an outfit based on a seed item. This is useful in an e-commerce setting as outfit suggestions can be seeded with a particular product page or a user's past purchases. Our ASOS outfits dataset comprises a set of outfits originating from BTL carousels on PDPs. These contain a seed, or 'hero product', which can be bought from the PDP. All other items in the outfit we refer to as 'styling products'.
   There is an asymmetry between hero and styling products. Whilst all items are used as hero products (in an e-commerce setting), styling products are selected as the best matches for the hero product and this matching is directional. For example, when the hero product is a pair of Wellington boots, it may create an engaging outfit to style them with a dress. However, if the hero product is a dress, it is unlikely that a pair of Wellington boots would be the best choice of styling product to recommend. Hence, in general, styling products tend to be more conservative than hero products. Our approach takes this difference into account by explicitly including this information as a feature.

[Figure 2: Frequency of occurrence of each womenswear item in our ASOS outfits dataset (log-log plot of frequency of occurrence in Buy the Look outfits against item rank). Items are ranked by how frequently they occur as styling products. Each item appears at most once as a hero product (red), while there is a heavily skewed distribution in the frequency with which items appear as styling products (blue).]

   We formulate our training task as binary classification, where GORDN learns to tell the difference between BTL outfits and randomly generated negative outfits. We consider an outfit to be a set of fashion items and train a model that projects items into a single style space. Compatible items will appear close in style space, enabling good outfits to be constructed from nearby items. GORDN is a neural network which combines embeddings of multi-modal features for all items in an outfit and outputs a single score. When generating outfits, GORDN is used as a scorer to assess the validity of different combinations of items.
   In summary, our contributions are:
    (1) A novel model that uses multi-modal data to generate outfits and that can be trained on images in the wild, i.e. dressed people rather than individual item flat shots. Outfits generated by our model outperform a challenging baseline by 21% for womenswear and 34% for menswear.
    (2) A new research dataset consisting of 586,320 fashion outfits (images and textual descriptions) composed by ASOS stylists. This is the world's largest annotated outfit dataset and is the first to contain menswear items.

2    RELATED WORK
Our work follows an emerging body of related work on learning clothing style [11, 24], clothing compatibility [18, 20, 24] and outfit composition [4, 7, 10, 23]. Successful outfit composition encompasses an understanding of both style and compatibility.
   A popular approach is to embed items in a latent style or compatibility space, often using multi-modal features [10, 20, 21, 24]. A challenge with this approach is how to use item embeddings to measure overall outfit compatibility. This challenge is increased when considering outfits of multiple sizes. Song et al. [20] only consider outfits of size 2 made of top-bottom pairs. Veit et al. [24] use a Siamese CNN, a technique which allows only consideration of pairwise compatibilities. Li et al. [10] combine text and image embeddings to create multi-modal item embeddings which are then combined using pooling to create an overall outfit representation. Pooling allows them to consider outfits of variable size. Tangseng et al. [21] create item embeddings solely from images. They are able to use outfits of variable size by padding their set of item images to a fixed length with a 'mean image'. Our method is similar to these as we combine multi-modal item embeddings; however, we aim not to lose information by pooling or padding.
   Vasileva et al. [23] extend this concept by noting that compatibility is dependent on context, in this case the pair of clothing types being matched. They create learned type-aware projections from their style space to calculate compatibility between different types of clothing.

3    OUTFIT DATASETS
The ASOS outfits dataset consists of 586,320 outfits, each containing between 2 and 5 items (see Table 1). In total these outfits contain 591,725 unique items representing 18 different womenswear (WW) product types and 22 different menswear (MW) product types. As all of our outfits have been created by ASOS stylists, they are representative of a particular fashion style.
   Most previous outfit generators have used either co-purchase data from Amazon [12, 24] or user-created outfits taken from Polyvore [4, 5, 10, 14, 20, 21, 23], both of which represent a diverse range of styles and tastes. Co-purchase is not a strong signal of compatibility as co-purchased items are typically not bought with the intention of being worn together; instead it is more likely to reflect a user's style preference. Data collected from Polyvore gives a stronger signal of compatibility and furthermore provides complete outfits.
   The largest previously available outfits dataset was collected from Polyvore and contained 68,306 outfits and 365,054 items, entirely from WW [23]. Our dataset is the first to contain MW as well. Our WW dataset contains an order of magnitude more outfits than the Polyvore set, but has slightly fewer fashion items. This is a consequence of ASOS stylists choosing styling products from a subset of items held in our studios, meaning that styling products can appear in many outfits.

                                                Table 1: Statistics of the ASOS outfits dataset

      Department       Number of Outfits      Number of Items       Outfits of size 2      Outfits of size 3     Outfits of size 4     Outfits of size 5
      Womenswear             314,200                321,672                 155,083             109,308                42,028                  7,781
       Menswear              272,120                270,053                 100,395             102,666                58,544                 10,515


   For each item we have four images, a text title and descrip-
tion, a high-level product type and a product category. We process
both the images and the text title and description to obtain lower-
dimensional embeddings, which are included in this dataset along-
side the raw images and text to allow full reproducibility of our
work. The methods used to extract these embeddings are described
in Sections 4.3 and 4.4, respectively. Although we have four images
for each item, in these experiments we only use the first image as
it consistently shows the entire item, from the front, within the
context of an outfit, whilst the other images can focus on close
ups or different angles, and do not follow consistent rules between
product types.
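
To make the structure of these records concrete, the following is a minimal sketch of how a single dataset item and an outfit could be represented in Python. The class and field names are illustrative assumptions rather than the released file format; the embedding dimensions follow Section 4.1.

```python
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class OutfitItem:
    """One fashion item as described in Section 3 (field names are illustrative)."""
    item_id: str
    title: str                    # text title
    description: str              # text description
    product_type: str             # high-level product type
    product_category: str         # product category used for the GloVe embedding
    image_urls: List[str]         # four images per item; only the first is used in our experiments
    text_embedding: np.ndarray    # pre-computed title + description embedding (1024-d)
    visual_embedding: np.ndarray  # pre-computed visual embedding of the first image (512-d)


@dataclass
class Outfit:
    """A Buy the Look outfit: one hero product plus up to four styling products."""
    hero: OutfitItem
    styling: List[OutfitItem]
```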
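
The binary classification task described in the introduction also requires randomly generated negative outfits. The exact sampling procedure is not specified in this section, so the sketch below (reusing the record types above) shows one plausible scheme as an assumption: keep the hero product and replace each styling product with a random item of the same product type.

```python
import random
from typing import Dict, List


def sample_negative_outfit(outfit: Outfit,
                           items_by_type: Dict[str, List[OutfitItem]],
                           rng: random.Random) -> Outfit:
    """Build a negative outfit by swapping each styling product for a random item
    of the same product type. This is an assumed scheme for illustration, not
    necessarily the procedure used to train GORDN."""
    negative_styling = []
    for item in outfit.styling:
        candidates = items_by_type[item.product_type]
        negative_styling.append(rng.choice(candidates))
    return Outfit(hero=outfit.hero, styling=negative_styling)
```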

4     METHODOLOGY
Our approach uses a deep neural network. We acknowledge some
recent approaches that use LSTM neural networks [4, 14]. We have
not adopted this approach because fundamentally an outfit is a set of
fashion items and treating it as a sequence is an artificial construct.
LSTMs are also designed to progressively forget past items when
moving through a sequence which in this context would mean that
compatibility is not enforced between all outfit items.
    We consider an outfit to be a set of fashion items of arbitrary
length which match stylistically and can be worn together. In order
for the outfit to work, each item must be compatible with all other
items. Our aim is to model this by embedding each item into a
latent space such that, for two items (I_i, I_j), the dot product of their embeddings (z_i, z_j) reflects their compatibility. We aim for the embeddings of compatible items to have large dot products and the embeddings of items which are incompatible to have small dot products. We map the input data for each item I_i to its embedding z_i via a multi-layer neural network. As we treat hero products and styling products differently, we learn two embeddings in the same space for each item: one for when the item is the hero product, z_i^(h), and one for when the item is a styling product, z_i^(s). This is reminiscent of the context-specific representations used in language modelling [13, 15].

[Figure 3: Network architecture of GORDN's item embedder. For each item the embedder takes visual features, a textual embedding of the item's title and description, a pre-trained GloVe embedding of the item's product category and a binary flag indicating if the item is the outfit's hero product. Each set of features is passed through a dense layer and the outputs of these layers are concatenated along with the hero product flag before being passed through two further dense layers. The output is an embedding for the item in our style space. We train separate item embedders for womenswear and menswear items.]

4.1    Network Architecture
For each item, the inputs to our network are a textual title and description embedding (1024 dimensions), a visual embedding (512 dimensions), a pre-trained GloVe embedding [15] of the product category (50 dimensions) and a binary flag indicating whether the item is the hero product. First, each of the three input feature vectors is passed through its own fully connected ReLU layer. The outputs from these layers, as well as the hero product flag, are then concatenated and passed through two further fully connected ReLU layers to produce an item embedding with 256 dimensions (Figure 3). We use batch normalization after each fully connected layer and a dropout rate of 0.5 during training.

4.2    Outfit Scoring
We use the dot product of item embeddings to quantify pairwise compatibility. Outfit compatibility is then calculated as the sum over pairwise dot products for all pairs of items in the outfit (Figure 4).
   For an outfit S = {I_1, I_2, ..., I_N} consisting of N items, the overall outfit score is defined by

    y(S) = \sigma \left( \frac{1}{N(N-1)} \sum_{\substack{i,j=1 \\ i \neq j}}^{N} z_i \cdot z_j \right) ,    (1)
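
As an illustration of the architecture in Section 4.1 and Figure 3, here is a minimal PyTorch-style sketch of the item embedder. The input dimensions (1024, 512 and 50), the 256-dimensional output, the batch normalization and the 0.5 dropout rate follow the text; the widths of the intermediate layers are assumptions.

```python
import torch
import torch.nn as nn


class ItemEmbedder(nn.Module):
    """Sketch of GORDN's item embedder (Section 4.1 / Figure 3).
    Only the input sizes, the 256-d output, batch normalization and dropout
    are taken from the paper; branch and hidden widths are assumed."""

    def __init__(self, text_dim=1024, visual_dim=512, category_dim=50,
                 branch_dim=256, hidden_dim=256, embed_dim=256, dropout=0.5):
        super().__init__()
        # One fully connected ReLU layer per input modality.
        self.text_branch = self._fc_block(text_dim, branch_dim, dropout)
        self.visual_branch = self._fc_block(visual_dim, branch_dim, dropout)
        self.category_branch = self._fc_block(category_dim, branch_dim, dropout)
        # Two further fully connected ReLU layers after concatenating the three
        # branch outputs with the binary hero product flag.
        self.head = nn.Sequential(
            self._fc_block(3 * branch_dim + 1, hidden_dim, dropout),
            self._fc_block(hidden_dim, embed_dim, dropout),
        )

    @staticmethod
    def _fc_block(in_dim, out_dim, dropout):
        # Batch normalization after each fully connected layer, then ReLU and dropout.
        return nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, text_emb, visual_emb, category_emb, is_hero):
        # is_hero: float tensor of shape (batch, 1), 1.0 when the item is the hero product.
        features = torch.cat([
            self.text_branch(text_emb),
            self.visual_branch(visual_emb),
            self.category_branch(category_emb),
            is_hero,
        ], dim=1)
        return self.head(features)
```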
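
Equation (1) can be sketched directly as well: embed each item, average the dot products over all pairs of distinct items and pass the result through a sigmoid. This follows the reconstructed equation above and is an illustrative sketch, not a reference implementation.

```python
import torch


def outfit_score(embeddings: torch.Tensor) -> torch.Tensor:
    """Equation (1): mean pairwise dot product over all pairs of distinct items
    in one outfit, passed through a sigmoid.

    embeddings: tensor of shape (N, embed_dim) holding the N item embeddings
    (the hero product contributes its hero embedding, all other items their
    styling embeddings)."""
    n = embeddings.shape[0]
    gram = embeddings @ embeddings.T                     # all pairwise dot products z_i . z_j
    off_diagonal = gram.sum() - gram.diagonal().sum()    # exclude the i == j terms
    return torch.sigmoid(off_diagonal / (n * (n - 1)))
```

For the binary classification task described earlier, this score could then be fed to, for example, a binary cross-entropy loss with BTL outfits as positives and randomly generated outfits as negatives; the specific loss is an assumption, as the training details fall outside this section.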