Fashion Outfit Generation for E-commerce

Elaine M. Bettaney, ASOS.com, London, UK (elaine.bettaney@asos.com)
Stephen R. Hardwick, ASOS.com, London, UK (stephen.hardwick@asos.com)
Odysseas Zisimopoulos, ASOS.com, London, UK (odysseas.zisimopoulos@asos.com)
Benjamin Paul Chamberlain, ASOS.com, London, UK (ben.chamberlain@asos.com)

ABSTRACT
Combining items of clothing into an outfit is a major task in fashion
retail. Recommending sets of items that are compatible with a
particular seed item is useful for providing users with guidance and
inspiration, but is currently a manual process that requires expert
stylists and is therefore not scalable or easy to personalise. We use a
multilayer neural network fed by visual and textual features to learn
embeddings of items in a latent style space such that compatible
items of different types are embedded close to one another. We
train our model using the ASOS outfits dataset, which consists
of a large number of outfits created by professional stylists and
which we release to the research community. Our model shows
strong performance in an offline outfit compatibility prediction
task. We use our model to generate outfits and for the first time
in this field perform an AB test, comparing our generated outfits
to those produced by a baseline model which matches appropriate
product types but uses no information on style. Users approved
of outfits generated by our model 21% and 34% more frequently
than those generated by the baseline model for womenswear and
menswear respectively.

KEYWORDS
Representation learning, fashion, multi-modal deep learning

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1    INTRODUCTION
User needs based around outfits include answering questions such as "What trousers will go with this shirt?", "What can I wear to a party?" or "Which items should I add to my wardrobe for summer?". The key to answering these questions is an understanding of style. Style encompasses a broad range of properties including, but not limited to, colour, shape, pattern and fabric. It may also incorporate current fashion trends, a user's style preferences and an awareness of the context in which the outfits will be worn. In the growing world of fashion e-commerce it is becoming increasingly important to be able to fulfill these needs in a way that is scalable, automated and ultimately personalised.

[Figure 1: An ASOS fashion product together with associated product data and styling products in a Buy the Look (BTL) carousel as shown on a Product Description Page (PDP).]

   This paper describes a system for Generating Outfit Recommendations from Deep Networks (GORDN) under development at ASOS.com. ASOS is a global e-commerce company focusing on fashion and beauty. With approximately 87,000 products on site at any one time, it is difficult for customers to perform an exhaustive search to find products that can be worn together. Each fashion product added to our catalogue is photographed on a model as part of an individually curated outfit of compatible products chosen by our stylists to create images for its Product Description Page (PDP). The products comprising the outfit are then displayed to the customer in a Buy the Look (BTL) carousel (Figure 1). This offering, however, is not scalable as it requires manual input for every outfit. We aim to learn from the information encoded in these outfits to automatically generate an unlimited number of outfits.


   A common way for people to compose outfits is to first pick a seed item, such as a patterned shirt, and then find other compatible items. We focus on this task: completing an outfit based on a seed item. This is useful in an e-commerce setting as outfit suggestions can be seeded with a particular product page or a user's past purchases. Our ASOS outfits dataset comprises a set of outfits originating from BTL carousels on PDPs. These contain a seed, or 'hero product', which can be bought from the PDP. All other items in the outfit we refer to as 'styling products'.
   There is an asymmetry between hero and styling products. Whilst all items are used as hero products (in an e-commerce setting), styling products are selected as the best matches for the hero product and this matching is directional. For example, when the hero product is a pair of Wellington boots, it may create an engaging outfit to style them with a dress. However, if the hero product is a dress, it is unlikely that a pair of Wellington boots would be the best choice of styling product to recommend. Hence, in general, styling products tend to be more conservative than hero products. Our approach takes this difference into account by explicitly including this information as a feature.

[Figure 2: Frequency of occurrence of each womenswear item in our ASOS outfits dataset (log-log plot of frequency of occurrence in Buy the Look outfits against item rank). Items are ranked by how frequently they occur as styling products. Each item appears at most once as a hero product (red), while there is a heavily skewed distribution in the frequency with which items appear as styling products (blue).]

   We formulate our training task as binary classification, where GORDN learns to tell the difference between BTL outfits and randomly generated negative outfits. We consider an outfit to be a set of fashion items and train a model that projects items into a single style space. Compatible items will appear close in style space, enabling good outfits to be constructed from nearby items. GORDN is a neural network which combines embeddings of multi-modal features for all items in an outfit and outputs a single score. When generating outfits, GORDN is used as a scorer to assess the validity of different combinations of items.
   In summary, our contributions are:
    (1) A novel model that uses multi-modal data to generate outfits and that can be trained on images in the wild, i.e. dressed people rather than individual item flat shots. Outfits generated by our model outperform a challenging baseline by 21% for womenswear and 34% for menswear.
    (2) A new research dataset consisting of 586,320 fashion outfits (images and textual descriptions) composed by ASOS stylists. This is the world's largest annotated outfit dataset and is the first to contain menswear items.

2    RELATED WORK
Our work follows an emerging body of related work on learning clothing style [11, 24], clothing compatibility [18, 20, 24] and outfit composition [4, 7, 10, 23]. Successful outfit composition encompasses an understanding of both style and compatibility.
   A popular approach is to embed items in a latent style or compatibility space, often using multi-modal features [10, 20, 21, 24]. A challenge with this approach is how to use item embeddings to measure overall outfit compatibility. This challenge is increased when considering outfits of multiple sizes. Song et al. [20] only consider outfits of size 2 made of top-bottom pairs. Veit et al. [24] use a Siamese CNN, a technique which allows only consideration of pairwise compatibilities. Li et al. [10] combine text and image embeddings to create multi-modal item embeddings which are then combined using pooling to create an overall outfit representation. Pooling allows them to consider outfits of variable size. Tangseng et al. [21] create item embeddings solely from images. They are able to use outfits of variable size by padding their set of item images to a fixed length with a 'mean image'. Our method is similar to these as we combine multi-modal item embeddings; however, we aim not to lose information by pooling or padding.
   Vasileva et al. [23] extend this concept by noting that compatibility is dependent on context, in this case the pair of clothing types being matched. They create learned type-aware projections from their style space to calculate compatibility between different types of clothing.

3    OUTFIT DATASETS
The ASOS outfits dataset consists of 586,320 outfits, each containing between 2 and 5 items (see Table 1). In total these outfits contain 591,725 unique items representing 18 different womenswear (WW) product types and 22 different menswear (MW) product types. As all of our outfits have been created by ASOS stylists, they are representative of a particular fashion style.
   Most previous outfit generators have used either co-purchase data from Amazon [12, 24] or user-created outfits taken from Polyvore [4, 5, 10, 14, 20, 21, 23], both of which represent a diverse range of styles and tastes. Co-purchase is not a strong signal of compatibility as co-purchased items are typically not bought with the intention of being worn together; instead it is more likely to reflect a user's style preference. Data collected from Polyvore gives a stronger signal of compatibility and furthermore provides complete outfits.
   The largest previously available outfits dataset was collected from Polyvore and contained 68,306 outfits and 365,054 items, entirely from WW [23]. Our dataset is the first to contain MW as well. Our WW dataset contains an order of magnitude more outfits than the Polyvore set, but has slightly fewer fashion items. This is a consequence of ASOS stylists choosing styling products from a subset of items held in our studios, meaning that styling products can appear in many outfits.

                                                Table 1: Statistics of the ASOS outfits dataset

      Department       Number of Outfits      Number of Items       Outfits of size 2      Outfits of size 3     Outfits of size 4     Outfits of size 5
      Womenswear             314,200                321,672                 155,083             109,308                42,028                  7,781
       Menswear              272,120                270,053                 100,395             102,666                58,544                 10,515


   For each item we have four images, a text title and descrip-
tion, a high-level product type and a product category. We process
both the images and the text title and description to obtain lower-
dimensional embeddings, which are included in this dataset along-
side the raw images and text to allow full reproducibility of our
work. The methods used to extract these embeddings are described
in Sections 4.3 and 4.4, respectively. Although we have four images
for each item, in these experiments we only use the first image as
it consistently shows the entire item, from the front, within the
context of an outfit, whilst the other images can focus on close
ups or different angles, and do not follow consistent rules between
product types.
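
To make the structure of these records concrete, the following is a minimal sketch of how a single dataset item and an outfit could be represented in Python. The class and field names are illustrative assumptions rather than the released file format; the embedding dimensions follow Section 4.1.

```python
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class OutfitItem:
    """One fashion item as described in Section 3 (field names are illustrative)."""
    item_id: str
    title: str                    # text title
    description: str              # text description
    product_type: str             # high-level product type
    product_category: str         # product category used for the GloVe embedding
    image_urls: List[str]         # four images per item; only the first is used in our experiments
    text_embedding: np.ndarray    # pre-computed title + description embedding (1024-d)
    visual_embedding: np.ndarray  # pre-computed visual embedding of the first image (512-d)


@dataclass
class Outfit:
    """A Buy the Look outfit: one hero product plus up to four styling products."""
    hero: OutfitItem
    styling: List[OutfitItem]
```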
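
The binary classification task described in the introduction also requires randomly generated negative outfits. The exact sampling procedure is not specified in this section, so the sketch below (reusing the record types above) shows one plausible scheme as an assumption: keep the hero product and replace each styling product with a random item of the same product type.

```python
import random
from typing import Dict, List


def sample_negative_outfit(outfit: Outfit,
                           items_by_type: Dict[str, List[OutfitItem]],
                           rng: random.Random) -> Outfit:
    """Build a negative outfit by swapping each styling product for a random item
    of the same product type. This is an assumed scheme for illustration, not
    necessarily the procedure used to train GORDN."""
    negative_styling = []
    for item in outfit.styling:
        candidates = items_by_type[item.product_type]
        negative_styling.append(rng.choice(candidates))
    return Outfit(hero=outfit.hero, styling=negative_styling)
```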

4     METHODOLOGY
Our approach uses a deep neural network. We acknowledge some
recent approaches that use LSTM neural networks [4, 14]. We have
not adopted this approach because fundamentally an outfit is a set of
fashion items and treating it as a sequence is an artificial construct.
LSTMs are also designed to progressively forget past items when
moving through a sequence which in this context would mean that
compatibility is not enforced between all outfit items.
    We consider an outfit to be a set of fashion items of arbitrary
length which match stylistically and can be worn together. In order
for the outfit to work, each item must be compatible with all other
items. Our aim is to model this by embedding each item into a
latent space such that, for two items (I_i, I_j), the dot product of their embeddings (z_i, z_j) reflects their compatibility. We aim for the embeddings of compatible items to have large dot products and the embeddings of items which are incompatible to have small dot products. We map the input data for each item I_i to its embedding z_i via a multi-layer neural network. As we treat hero products and styling products differently, we learn two embeddings in the same space for each item: one for when the item is the hero product, z_i^(h), and one for when the item is a styling product, z_i^(s). This is reminiscent of the context-specific representations used in language modelling [13, 15].

[Figure 3: Network architecture of GORDN's item embedder. For each item the embedder takes visual features, a textual embedding of the item's title and description, a pre-trained GloVe embedding of the item's product category and a binary flag indicating if the item is the outfit's hero product. Each set of features is passed through a dense layer and the outputs of these layers are concatenated along with the hero product flag before being passed through two further dense layers. The output is an embedding for the item in our style space. We train separate item embedders for womenswear and menswear items.]

4.1    Network Architecture
For each item, the inputs to our network are a textual title and description embedding (1024 dimensions), a visual embedding (512 dimensions), a pre-trained GloVe embedding [15] of the product category (50 dimensions) and a binary flag indicating whether the item is the hero product. First, each of the three input feature vectors is passed through its own fully connected ReLU layer. The outputs from these layers, as well as the hero product flag, are then concatenated and passed through two further fully connected ReLU layers to produce an item embedding with 256 dimensions (Figure 3). We use batch normalization after each fully connected layer and a dropout rate of 0.5 during training.

4.2    Outfit Scoring
We use the dot product of item embeddings to quantify pairwise compatibility. Outfit compatibility is then calculated as the sum over pairwise dot products for all pairs of items in the outfit (Figure 4).
   For an outfit S = {I_1, I_2, ..., I_N} consisting of N items, the overall outfit score is defined by

    y(S) = \sigma \left( \frac{1}{N(N-1)} \sum_{\substack{i,j=1 \\ i \neq j}}^{N} z_i \cdot z_j \right) ,    (1)
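
As an illustration of the architecture in Section 4.1 and Figure 3, here is a minimal PyTorch-style sketch of the item embedder. The input dimensions (1024, 512 and 50), the 256-dimensional output, the batch normalization and the 0.5 dropout rate follow the text; the widths of the intermediate layers are assumptions.

```python
import torch
import torch.nn as nn


class ItemEmbedder(nn.Module):
    """Sketch of GORDN's item embedder (Section 4.1 / Figure 3).
    Only the input sizes, the 256-d output, batch normalization and dropout
    are taken from the paper; branch and hidden widths are assumed."""

    def __init__(self, text_dim=1024, visual_dim=512, category_dim=50,
                 branch_dim=256, hidden_dim=256, embed_dim=256, dropout=0.5):
        super().__init__()
        # One fully connected ReLU layer per input modality.
        self.text_branch = self._fc_block(text_dim, branch_dim, dropout)
        self.visual_branch = self._fc_block(visual_dim, branch_dim, dropout)
        self.category_branch = self._fc_block(category_dim, branch_dim, dropout)
        # Two further fully connected ReLU layers after concatenating the three
        # branch outputs with the binary hero product flag.
        self.head = nn.Sequential(
            self._fc_block(3 * branch_dim + 1, hidden_dim, dropout),
            self._fc_block(hidden_dim, embed_dim, dropout),
        )

    @staticmethod
    def _fc_block(in_dim, out_dim, dropout):
        # Batch normalization after each fully connected layer, then ReLU and dropout.
        return nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, text_emb, visual_emb, category_emb, is_hero):
        # is_hero: float tensor of shape (batch, 1), 1.0 when the item is the hero product.
        features = torch.cat([
            self.text_branch(text_emb),
            self.visual_branch(visual_emb),
            self.category_branch(category_emb),
            is_hero,
        ], dim=1)
        return self.head(features)
```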
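
Equation (1) can be sketched directly as well: embed each item, average the dot products over all pairs of distinct items and pass the result through a sigmoid. This follows the reconstructed equation above and is an illustrative sketch, not a reference implementation.

```python
import torch


def outfit_score(embeddings: torch.Tensor) -> torch.Tensor:
    """Equation (1): mean pairwise dot product over all pairs of distinct items
    in one outfit, passed through a sigmoid.

    embeddings: tensor of shape (N, embed_dim) holding the N item embeddings
    (the hero product contributes its hero embedding, all other items their
    styling embeddings)."""
    n = embeddings.shape[0]
    gram = embeddings @ embeddings.T                     # all pairwise dot products z_i . z_j
    off_diagonal = gram.sum() - gram.diagonal().sum()    # exclude the i == j terms
    return torch.sigmoid(off_diagonal / (n * (n - 1)))
```

For the binary classification task described earlier, this score could then be fed to, for example, a binary cross-entropy loss with BTL outfits as positives and randomly generated outfits as negatives; the specific loss is an assumption, as the training details fall outside this section.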