Fashion Outfit Generation for E-commerce
                             Elaine M. Bettaney                                                        Stephen R. Hardwick
                                  ASOS.com                                                                    ASOS.com
                                  London, UK                                                                 London, UK
                          elaine.bettaney@asos.com                                                   stephen.hardwick@asos.com

                         Odysseas Zisimopoulos                                                    Benjamin Paul Chamberlain
                                ASOS.com                                                                    ASOS.com
                                London, UK                                                                  London, UK
                     odysseas.zisimopoulos@asos.com                                                  ben.chamberlain@asos.com

ABSTRACT
Combining items of clothing into an outfit is a major task in fashion
retail. Recommending sets of items that are compatible with a
particular seed item is useful for providing users with guidance and
inspiration, but is currently a manual process that requires expert
stylists and is therefore not scalable or easy to personalise. We use a
multilayer neural network fed by visual and textual features to learn
embeddings of items in a latent style space such that compatible
items of different types are embedded close to one another. We
train our model using the ASOS outfits dataset, which consists
of a large number of outfits created by professional stylists and
which we release to the research community. Our model shows
strong performance in an offline outfit compatibility prediction
task. We use our model to generate outfits and for the first time
in this field perform an AB test, comparing our generated outfits
to those produced by a baseline model which matches appropriate
product types but uses no information on style. Users approved
of outfits generated by our model 21% and 34% more frequently
than those generated by the baseline model for womenswear and
menswear respectively.

KEYWORDS
Representation learning, fashion, multi-modal deep learning

1    INTRODUCTION
User needs based around outfits include answering questions such
as "What trousers will go with this shirt?", "What can I wear to a                    Figure 1: An ASOS fashion product together with associated
party?" or "Which items should I add to my wardrobe for summer?".                     product data and styling products in a Buy the Look (BTL)
The key to answering these questions requires an understanding                        carousel as shown on a Product Description Page (PDP).
of style. Style encompasses a broad range of properties including                        This paper describes a system for Generating Outfit Recom-
but not limited to, colour, shape, pattern and fabric. It may also                    mendations from Deep Networks (GORDN) under development
incorporate current fashion trends, user’s style preferences and an                   at ASOS.com. ASOS is a global e-commerce company focusing on
awareness of the context in which the outfits will be worn. In the                    fashion and beauty. With approximately 87,000 products on site at
growing world of fashion e-commerce it is becoming increasingly                       any one time, it is difficult for customers to perform an exhaustive
important to be able to fulfill these needs in a way that is scalable,                search to find products that can be worn together. Each fashion
automated and ultimately personalised.                                                product added to our catalogue is photographed on a model as part
                                                                                      of an individually curated outfit of compatible products chosen
Copyright © 2019 by the paper’s authors. Copying permitted for private and academic   by our stylists to create images for its Product Description Page
purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.):                        (PDP). The products comprising the outfit are then displayed to the
Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at   customer in a Buy the Look (BTL) carousel (Figure 1). This offering
http://ceur-ws.org                                                                    however is not scalable as it requires manual input for every outfit.
                                                                                      We aim to learn from the information encoded in these outfits to
                                                                                      automatically generate an unlimited number of outfits.
SIGIR 2019 eCom, July 2019, Paris, France                                                                                                  E. M. Bettaney et al.


   A common way for people to compose outfits is to first pick                                        104
                                                                                                                                            As hero product
a seed item, such as a patterned shirt, and then find other com-                                                                            As styling product
                                                                                                      103


                                                                           Frequency of occurrence
patible items. We focus on this task: completing an outfit based


                                                                            in Buy the Look outfits
on a seed item. This is useful in an e-commerce setting as outfit
suggestions can be seeded with a particular product page or a user’s
                                                                                                      102
past purchases. Our ASOS outfits dataset comprises a set of outfits
originating from BTL carousels on PDPs. These contain a seed, or
‘hero product’, which can be bought from the PDP. All other items                                     101
in the outfit we refer to as ‘styling products’.
   There is an asymmetry between hero and styling products. Whilst                                    100
all items are used as hero products (in an e-commerce setting),
                                                                                                            100   101   102          103         104
styling products are selected as the best matches for the hero prod-                                                          Rank
uct and this matching is directional. For example when the hero
product is a pair of Wellington boots it may create an engaging
                                                                          Figure 2: Frequency of occurrence of each womenswear item
outfit to style them with a dress. However if the hero product is a
                                                                          in our ASOS outfits dataset. Items are ranked by how fre-
dress then it is unlikely a pair of Wellington boots would be the best
                                                                          quently they occur as styling products. Each item appears
choice of styling product to recommend. Hence in general styling
                                                                          once at most as a hero product (red), while there is a heav-
products tend to be more conservative than hero products. Our
                                                                          ily skewed distribution in the frequency with which items
approach takes this difference into account by explicitly including
                                                                          appear as styling products (blue).
this information as a feature.
   We formulate our training task as binary classification, where
                                                                          Pooling allows them to consider outfits of variable size. Tangseng
GORDN learns to tell the difference between BTL and randomly
                                                                          et al. [21] create item embeddings solely from images. They are able
generated negative outfits. We consider an outfit to be a set of
                                                                          to use outfits of variable size by padding their set of item images to
fashion items and train a model that projects items into a single
                                                                          a fixed length with a ‘mean image’. Our method is similar to these
style space. Compatible items will appear close in style space en-
                                                                          as we combine multi-modal item embeddings, however we aim not
abling good outfits to be constructed from nearby items. GORDN
                                                                          to lose information by pooling or padding.
is a neural network which combines embeddings of multi-modal
                                                                             Vasileva et al. [23] extend this concept by noting that compat-
features for all items in an outfit and outputs a single score. When
                                                                          ibility is dependent on context - in this case the pair of clothing
generating outfits, GORDN is used as a scorer to assess the validity
                                                                          types being matched. They create learned type-aware projections
of different combinations of items.
                                                                          from their style space to calculate compatibility between different
   In summary, our contributions are:
                                                                          types of clothing.
    (1) A novel model that uses multi-modal data to generate outfits
        that can be trained on images in the wild i.e. dressed people     3                 OUTFIT DATASETS
        rather than individual item flat shots. Outfits generated by
                                                                          The ASOS outfits dataset consists of 586,520 outfits, each containing
        our model outperform a challenging baseline by 21% for
                                                                          between 2 and 5 items (see Table 1). In total these outfits contain
        womenswear and 34% for menswear.
                                                                          591,725 unique items representing 18 different womenswear (WW)
    (2) A new research dataset consisting of 586,320 fashion outfits
                                                                          product types and 22 different menswear (MW) product types. As
        (images and textual descriptions) composed by ASOS stylists.
                                                                          all of our outfits have been created by ASOS stylists, they are rep-
        This is the world’s largest annotated outfit dataset and is the
                                                                          resentative of a particular fashion style.
        first to contain Menswear items.
                                                                             Most previous outfit generators have used either co-purchase
                                                                          data from Amazon [12, 24] or user created outfits taken from
2    RELATED WORK                                                         Polyvore [4, 5, 10, 14, 20, 21, 23], both of which represent a di-
Our work follows an emerging body of related work on learning             verse range of styles and tastes. Co-purchase is not a strong signal
clothing style [11, 24], clothing compatibility [18, 20, 24] and outfit   of compatibility as co-purchased items are typically not bought
composition [4, 7, 10, 23]. Successful outfit composition encom-          with the intention of being worn together. Instead it is more likely
passes an understanding of both style and compatibility.                  to reflect a user’s style preference. Data collected from Polyvore
   A popular approach is to embed items in a latent style or com-         gives a stronger signal of compatibility and furthermore provide
patibility space often using multi-modal features [10, 20, 21, 24].       complete outfits.
A challenge with this approach is how to use item embeddings to              The largest previously available outfits dataset was collected
measure the overall outfit compatibility. This challenge is increased     from Polyvore and contained 68,306 outfits and 365,054 items en-
when considering outfits of multiple sizes. Song et al. [20] only         tirely from WW [23]. Our dataset is the first to contain MW as
consider outfits of size 2 made of top-bottom pairs. Veit et al. [24]     well. Our WW dataset contains an order of magnitude more outfits
use a Siamese CNN, a technique which allows only consideration            than the Polyvore set, but has slightly fewer fashion items. This is
of pairwise compatibilities. Li et al. [10] combine text and image        a consequence of ASOS stylists choosing styling products from a
embeddings to create multi-modal item embeddings which are then           subset of items held in our studios meaning that styling products
combined using pooling to create an overall outfit representation.        can appear in many outfits.
Fashion Outfit Generation for E-commerce                                                                     SIGIR 2019 eCom, July 2019, Paris, France

                                                Table 1: Statistics of the ASOS outfits dataset

      Department       Number of Outfits      Number of Items       Outfits of size 2      Outfits of size 3     Outfits of size 4     Outfits of size 5
      Womenswear             314,200                321,672                 155,083             109,308                42,028                  7,781
       Menswear              272,120                270,053                 100,395             102,666                58,544                 10,515


   For each item we have four images, a text title and descrip-
tion, a high-level product type and a product category. We process
both the images and the text title and description to obtain lower-
dimensional embeddings, which are included in this dataset along-
side the raw images and text to allow full reproducibility of our
work. The methods used to extract these embeddings are described
in Sections 4.3 and 4.4, respectively. Although we have four images
for each item, in these experiments we only use the first image as
it consistently shows the entire item, from the front, within the
context of an outfit, whilst the other images can focus on close
ups or different angles, and do not follow consistent rules between
product types.

4     METHODOLOGY
Our approach uses a deep neural network. We acknowledge some
recent approaches that use LSTM neural networks [4, 14]. We have
not adopted this approach because fundamentally an outfit is a set of
fashion items and treating it as a sequence is an artificial construct.
LSTMs are also designed to progressively forget past items when
moving through a sequence which in this context would mean that
compatibility is not enforced between all outfit items.
    We consider an outfit to be a set of fashion items of arbitrary
length which match stylistically and can be worn together. In order
for the outfit to work, each item must be compatible with all other
items. Our aim is to model this by embedding each item into a
latent space such that for two items (Ii , I j ) the dot product of their
embeddings (zi , zj ) reflects their compatibility. We aim for the               Figure 3: Network architecture of GORDN’s item embedder.
embeddings of compatible items to have large dot products and                    For each item the embedder takes visual features, a textual
the embeddings of items which are incompatible to have small dot                 embedding of the item’s title and description, a pre-trained
products. We map input data for each item Ii to its embedding zi via             GloVe embedding of the item’s product category and a bi-
a multi-layer neural network. As we are treating hero products and               nary flag indicating if the item is the outfit’s hero product.
styling products differently, we learn two embeddings in the same                Each set of features is passed through a dense layer and the
space for each item; one for when the item is the hero product,                  outputs of these layers are concatenated along with the hero
 (h)                                                          (s)                product flag before being passed through two further dense
zi and one for when the item is a styling product, zi ; which
is reminiscent of the context specific representations in language               layers. The output is an embedding for the item in our style
modelling [13, 15].                                                              space. We train separate item embedders for womenswear
                                                                                 and menswear items.
4.1    Network Architecture                                                      4.2     Outfit Scoring
For each item, the inputs to our network are a textual title and                 We use the dot product of item embeddings to quantify pairwise
description embedding (1024 dimensions), a visual embedding (512                 compatibility. Outfit compatibility is then calculated as the sum
dimensions), a pre-trained GloVe embedding [15] for each product                 over pairwise dot products for all pairs of items in the outfit (Figure
category (50 dimensions) and a binary flag indicating the hero                   4).
product. First, each of the three input feature vectors is passed                    For an outfit S = {I 1 , I 2 , ..., I N } consisting of N items, the overall
through their own fully connected ReLU layer. The outputs from                   outfit score is defined by
these layers, as well as the hero product flag, are then concatenated
and passed through two further fully connected ReLU layers to
                                                                                                                            N
produce an item embedding with 256 dimensions (Figure 3). We use                                                   1
                                                                                                             ©            Õ             ª
                                                                                                    y(S) = σ ­                  zi · zj ® ,                  (1)
                                                                                                             ­                          ®
batch normalization after each fully connected layer and a dropout                                           ­ N (N − 1)                ®
                                                                                                                         i, j=1
rate of 0.5 during training.                                                                                 «            i <j          ¬
SIGIR 2019 eCom, July 2019, Paris, France                                                                              E. M. Bettaney et al.


                                                                        classification. The CAM is calculated as a linear combination of the
                                                                        feature maps weighted by the corresponding class weights.
                                                                           Before using the CAM model to extract image features, we fine-
                                                                        tune it on our dataset. Similar to [25] our model architecture com-
                                                                        bines VGG with a Global Average Pooling (GAP) layer and an output
                                                                        classification layer. We initialize VGG with weights pre-trained on
                                                                        ImageNet and fine-tune it towards product type classification (e.g.
                                                                        Jeans, Dresses, etc.). After training we pass each image to the VGG
                                                                        and obtain the feature maps.
                                                                           To produce localised image embeddings, we use the CAM to
                                                                        spatially re-weight the feature maps. Similar to Jimenez et al. [8],
                                                                        we perform the re-weighting by a simple spatial element-wise mul-
                                                                        tiplication of the feature maps with the CAM. Our pipeline is shown
                                                                        in Figure 5. This re-weighting can be seen as a form of attention
                                                                        mechanism on the area of interest in the image. The final image
                                                                        embedding is a 512-dimensional vector. The same figure illustrates
                                                                        the effect of the re-weighting mechanism on the feature maps.

                                                                        4.4    Title and Description Embeddings
                                                                        Product titles typically contain important information, such as the
                                                                        brand and colour. Similarly, our text descriptions contain details
                                                                        such as the item’s fit, design and material. We use pre-trained text
                                                                        embeddings of our item’s title and description. These embeddings
                                                                        are learned as part of an existing ASOS production system that
                                                                        predicts product attributes [3]. Vector representations for each
                                                                        word are passed through a simple 1D convolutional layer, followed
Figure 4: GORDN’s outfit scorer takes the learnt embeddings             by a max-over-time pooling layer and finally a dense layer, resulting
of each item in an outfit and produces a score in the range             in 1024 dimensional embeddings.
[0,1], with a high value representing a compatible outfit. The
scorer takes the sum of the compatibility scores (dot prod-             4.5    Training
uct) for each pair of items within the outfit, normalised by
                                                                        We train GORDN in a supervised manner using a binary cross-
the number of pairs of items. This is then passed through a
                                                                        entropy loss. Our training data consists of positive outfit samples
sigmoid function.
                                                                        taken from the ASOS outfits dataset and randomly generated nega-
                                                                        tive outfit samples. We generate negative samples for our training
where σ is the sigmoid function. The normalisation factor of N (N −     and test sets by randomly replacing the styling products in each
1), proportional to the number of pairs of items in the outfit is       outfit with another item of the same type. For example, for an out-
required to deal with outfits containing varying numbers of items.      fit with a top as the hero product and jeans and shoes as styling
The sigmoid function is used to ensure the output is in the range       products, we would create a negative sample by replacing the jeans
[0,1].                                                                  and shoes with randomly sampled jeans and shoes. We ensure that
                                                                        styling products appear with the same frequency in the positive
4.3    Visual Feature Extraction                                        and negative samples by sampling styling products from their dis-
As described in Section 1 and illustrated in Figure 1, items are        tribution in the positive samples. This is important as the frequency
photographed as part of an outfit and therefore our item images         distribution of styling products is heavily skewed (Figure 2) and
frequently contain the other items from the BTL outfit. Feeding         without preserving this GORDN could memorise frequently occur-
the whole image to the network would result in features capturing       ring items and predict outfit compatibility based on their presence.
information for the entire input leaking information to GORDN. It       By matching the distribution GORDN must instead learn the charac-
was therefore necessary to localise the target item within the image.   teristics of items which lead to compatibility. Although some of the
To extract visual features from the images in our dataset we use        negative outfits generated in this way may be good quality outfits,
VGG [19]. Feeding the whole image to the network would result in        we assume that the majority of these randomly generated outfits
features capturing information for the entire input, that is both the   will contain incompatible item combinations. Randomly selecting
hero and the styling products. To extract features focused on the       negative samples in this way is common practice in metric learning
most relevant areas of the image, we adopt an approach based on         and ranking problems (e.g. [6, 16]). In both training and testing, we
Class Activation Mapping (CAM) [25]. Weakly-supervised object           generate one negative outfit sample for each positive outfit sample.
localisation is performed by calculating a heatmap (CAM) from               To assess the relative importance of each set of input features,
the feature maps of the last convolutional layer of a CNN, which        we conduct an ablation study. We separately train five different
highlights the discriminative regions in the input used for image       versions of GORDN using only the textual title and description
Fashion Outfit Generation for E-commerce                                                                 SIGIR 2019 eCom, July 2019, Paris, France


                                        VGG                      CAM                                           GAP


                                                                                                                      ...
                                                    ..


                                                                                                    ..
                                                     .


                                                                                                    .
                                                               ...


                                                                          ...


                                                                                          ...
Figure 5: Pipeline for extracting image embeddings. An image is passed to VGG and the final convolutional feature maps are
used to calculate the class activation map (CAM). The CAM is then used to spatially re-weight the feature maps by element-
wise multiplication. Finally, a global average pooling (GAP) layer averages each re-weighted feature map to calculate a single
value (shown in same colours) and outputs a 512-dimensional image embedding. During training, re-weighting is ignored
and the output of the model is passed into a fully-connected layer for product type classification. We can see the effect of
feature map re-weighting in the brackets for the case of an item with trousers as the hero product and top and shoes as styling
products. Activations in the feature maps that correspond to the relevant region of interest in the input image (trousers) are
refined by re-weighting (i.e. second and third row) whereas irrelevant activations are ignored (first row).


embeddings as input (text), only the visual embeddings (vis), both              Table 2: The number of outfits and items in our training and
text and visual embeddings (text + vis), text, visual and category              test partitions after applying the Louvain community detec-
embeddings (text + vis + cat), and finally the full set of inputs (text         tion method to the full ASOS outfits dataset. There are no
+ vis + cat + hero). For each of these configurations we trained 20             items which appear in both the training and test set.
models from scratch using Adam [9] for 30 epochs.
                                                                                    Department     Dataset    Number of Outfits      Number of Items
4.6    Outfit Generation Method                                                  Womenswear         Train            237,478              239,818
Once trained, GORDN can generate novel outfits of any length by                                     Test              76,722               81,854
sequentially adding items and re-scoring the new outfit. Each outfit
                                                                                    Menswear        Train            201,844              198,947
starts with a hero product from our catalogue. We then define an
                                                                                                    Test              70,276               71,106
outfit template P = {T (h) ,T1 , ...,TN −1 } as a set of product types
including the hero product type T (h) and N − 1 other compatible
styling product types. Our aim is to find the set of items of the
                                                                                speed up in outfit generation whilst still maintaining a precision@5
appropriate product types that maximises the outfit score y.
                                                                                of over 80%.
   An exhaustive search over every possible combination of styling
                                                                                   The choice of template P depends on the use case. Templates for
products cannot be computed within a reasonable e-commerce la-
                                                                                each hero product type can be found from our ASOS outfits dataset.
tency budget. Instead, we map the maximum inner product search
                                                                                The distribution of templates can be used to introduce variety into
in Equation 1 to a Euclidean nearest neighbour problem that is
                                                                                the generated outfits. For the purposes of our AB test, we picked
solved approximately and combine this with a beam search (illus-
                                                                                the most frequently occurring template for each hero product type.
trated in Figure 6). The approximate nearest neighbours algorithm
uses a PCA-tree that has been adapted for recommendations prob-
lems [1]. We use a beam width of three because it returned the
                                                                                5     EVALUATION
optimal outfit 77.5% of the time. The beam search algorithm is re-              We evaluate the performance of GORDN on two tasks. The first task
peated for all (N − 1)! permutations of the styling product types in            is binary classification of genuine and randomly generated outfits,
the template as different outfits may be generated depending on the             using a held out test set. The second task is user evaluation of outfits
order in which product types are added. The outfit returned is the              generated by GORDN in comparison to randomly generated outfits
one that has the maximal score across all template permutations.                from a simple baseline model.
For each step of the beam search, we calculate the resultant vector
of the partial outfit and find the w approximate nearest neighbours             5.1    Train/test split
from the product catalogue (where w is the beam width). With each               We split the ASOS outfits dataset first into WW and MW and each
step searching through 2000-5000 products we achieved a ten times               of these into a training and test set ensuring that no items appeared
SIGIR 2019 eCom, July 2019, Paris, France                                                                              E. M. Bettaney et al.

Skirts WW         Tops WW         Shoes WW          Bags WW            Table 3: Comparison of GORDN when using different fea-
                                                                       tures on the binary classification task. Scores are the mean
                                                                       over 20 runs of the test set AUC after 30 epochs of training.

                       0.99              0.96                0.97
                                                                                      Features                   AUC
                                                                                                                WW MW
                                                                                      vis                       0.66    0.55
                                         0.95                0.94
                                                                                      text                      0.80    0.66
                                                                                      text + vis                0.82    0.66
                                                                                      text + vis + cat          0.82    0.67
                                         0.93                0.93                     text + vis + cat + hero   0.83    0.67


                                                                       split ratio as far as possible and secondly to ensure items from
                                                                       each season are proportionally split between the train and test sets.
                       0.97              0.98                0.99      This resulted in 76:24 and 74:26 train-test splits in terms of outfits
                                                                       for WW and MW respectively. The use of disjoint train and test
                                                                       sets provides a sterner test for GORDN as it is unable to simply
                                         0.98                0.96
                                                                       memorise which items frequently co-occur in outfits in the training
                                                                       set. Instead, the embeddings GORDN learns must represent product
                                                                       attributes that contribute to fashion compatibility.

                                         0.95                0.92      5.2    Outfit Classification Results
                                                                       The test set contains BTL outfits and an equal number of negative
                                                                       samples. We use GORDN to predict compatibility scores for the
                                                                       test set outfits and then calculate the AUC of the ROC curve. We
                                                                       found that training separate versions of GORDN for WW and MW
                       0.96              0.99                0.98
                                                                       produced better results and so we report the performance of these
                                                                       here.
                                                                          Table 3 shows the AUC scores achieved for different combina-
                                         0.97                0.95      tions of features. As we add features to GORDN we increase its
                                                                       performance, with the best performing model including text, visual,
                                                                       category and hero item features. The majority of the performance
                                                                       benefit came from the text embeddings with visual embeddings
                                         0.92                0.91
                                                                       adding a small improvement. We expected our visual embeddings
                                                                       to be of poorer quality than those for Polyvore datasets as our
                                                                       images show whole outfits on people as opposed to a photograph
Figure 6: Beam search in the context of outfit generation.             of the fashion item in isolation. In contrast the success of our text
Starting with an outfit template of product types and a hero           embeddings could be due to the attribution task on which they
product (highlighted in yellow) each product type in the               were trained [3]. A total of 34 attributes were predicted, including
template is filled sequentially by finding the products from           many attributes that are directly applicable for outfit composition
the catalogue which when added to the outfit give the high-            e.g. ‘pattern’, ‘neckline’, ‘dress type’ and ‘shirt style’.
est outfit score. After each step the number of outfits re-               For all feature combinations the WW model greatly outperforms
tained is reduced to the beam width (set to 3). The retained           the MW one. This could be due to fashion items being more in-
outfits after each step are highlighted in green with the high-        terchangeable in MW than in WW hence having more similar
est scoring outfit in dark green.                                      embeddings making the training task harder. For example the mean
                                                                       correlations between the text embeddings for the most prevalent
                                                                       product type in the WW and MW training sets are 0.041 (dresses)
in both sets. To achieve this we first represented the ASOS outfits    and 0.077 (T-shirts) respectively. More simply, there are many com-
dataset as a graph where the nodes are items and edge weights are      binations of MW T-shirts and jeans that make equally acceptable
defined by the number of outfits pairs of items are found together     outfits whereas there are far fewer for WW dresses and shoes.
in. We then used the Louvain community detection method [2] to            Using GORDN to predict compatibility scores for our test set
split the graph into communities which maximise the modularity.        is equivalent to the outfit compatibility task used by [4] and [23].
This resulted in many small communities which could then be com-       As noted by Vasileva et al., Han’s negative samples contain outfits
bined together to create the train and test sets. When re-combining    that are incompatible due to multiple occurrences of product types
communities care was taken firstly to respect the desired train-test   e.g. multiple pairs of shoes in the same outfit. Since our negative
Fashion Outfit Generation for E-commerce                                                                                                                         SIGIR 2019 eCom, July 2019, Paris, France

                                                                                                                                              GORDN                                          Randomly Generated
ASOS Outﬁt Generator Feedback

Tops                                           Jeans                              Shoes
New Look stripe button through top in yellow   Levi's 501 High Rise Skinny Jean   New Look Patent Heeled Ankle Boot


                                                                                                                          Rolla's Logo        Lee Scarlett     Vans Authentic        Rolla's Logo      ASOS DESIGN          Faith Fringe
                                                                                                                             T-Shirt          Skinny Jeans     Classic Black            T-Shirt         Rivington high      Chain Loafer
                                                                                                                                                               Mono Lace Up                            waisted jeggings
                                                                                                                                                                  Trainers                             in smokey grey
                                                                                                                                                                                                             wash


                                                                                                                        Mads Norgaard         Esprit Bardot     ALDO T Bar         Mads Norgaard       ASOS DESIGN         Trufﬂe Collection
                                                                                                                        Gingham Skater         Stripe Tie        Sandal with       Gingham Skater       blouse with          Spike Stud
                                                                                                                             Skirt             Sleeve Top      Diamante Gems            Skirt           frill shoulder        Heel Boot


Does this outﬁt work?
     YES          NO


                Figure 7: Screenshot of Outfit Evaluation App                                                           AX Paris ruched           Public Desire Winona             AX Paris ruched           Nike Running Epic
                                                                                                                        velvet mini dress          Embellished Block               velvet mini dress       React Trainers In Black
                                                                                                                                                    Heeled Sandals

samples were generated using templates respecting product type
our data does not have this characteristic and hence we compare
only to results in [23]. Our WW model achieves an AUC score
just slightly less than Vasileva et al.’s compatibility AUC on their
disjoint Polyvore outfits dataset.
                                                                                                                         ASOS DESIGN         Moss London        ASOS Loafers        ASOS DESIGN          COLLUSION          PS Paul Smith
                                                                                                                         wedding regular     Skinny Smart      In Black Suede       wedding regular     skater ﬁt check     Kirk dino print
                                                                                                                       ﬁt shirt in super ﬁne  Trouser In      With Metal Snafﬂe   ﬁt shirt in super ﬁne trousers with       canvas hi-tops
5.3           Generated Outfit Evaluation                                                                               cotton in off white     Check                              cotton in off white    side stripe          in white


We perform an AB test to evaluate the quality of outfits generated
by GORDN. We select six popular outfit templates to test, three
each for WW and MW (shown in Table 4), and generate 100 WW
and 100 MW outfits split evenly across the templates. We use a
                                                                                                                       Reclaimed Vintage Carhartt WIP          Vans x Mickey      Reclaimed Vintage Bershka Skinny         ASOS DESIGN
large pool of in stock products from which we randomly select hero                                                     inspired oversized Smith jean in
                                                                                                                       t-shirt with rainbow navy rigid
                                                                                                                                                               Mouse SK8-Hi
                                                                                                                                                               trainers in navy
                                                                                                                                                                                  inspired oversized Jeans In Washed
                                                                                                                                                                                  t-shirt with rainbow    Black
                                                                                                                                                                                                                             trainers in
                                                                                                                                                                                                                             navy block
products of the required product types. The remaining items in the                                                       face illustration                                          face illustration

outfits were generated using the beam search method described
in Section 4.6 and illustrated in Figure 6. These outfits constitute
our test group. For a control group we take the same hero products
and templates and generate outfits by randomly selecting items of
                                                                                                                          ASOS Skinny       Gym King Muscle    Puma Trimm            ASOS Skinny       ASOS DESIGN         Red Tape Tassel
the correct type from the same pool of products. By using outfit                                                           Short With        T-Shirt In Navy     Quick MU             Short With       muscle ﬁt t-shirt   Loafers In Brown
                                                                                                                           Camo Print         With Contrast  Trainers In Green        Camo Print       with dictionary         Leather
templates we ensure that none of the outfits contain incompatible                                                                               Sleeves                                                 slogan print

product type combinations, such as by pairing a dress and a skirt,
or by placing two pairs of shoes in one outfit. Instead, the quality                                                  Figure 8: Example outfits generated by GORDN (left column)
of the outfits depends solely on style compatibility between items.                                                   and by random selection of items (right column) that were
   To run the AB test we developed an internal app which we                                                           used in our AB test. In each row the same hero product is
exposed to ASOS employees. A screenshot of the app is shown                                                           used (red box) and each model is given the same template of
in Figure 7. The app displayed an outfit to the user asking them                                                      product types.
to decide if the items in the outfit work stylistically. The outfits
were shown one at a time to each user with the order of outfits
randomised for each user. WW and MW outfits were only shown                                                              The results are shown in Figure 4. We analyse the results for
to female and male users respectively and each user rated all 200                                                     WW and MW separately as the WW and MW models were trained
outfits from their corresponding gender.                                                                              separately. We collected 1,200 observations per group for WW and
   The data collected from the app comprised a binary score for                                                       900 for MW. We found the relative difference between the test and
each user-outfit pair. The data exhibit two way correlation — all                                                     control groups to be 21.28% and 34.16% for WW and MW respec-
scores from the same user are correlated due to the inherent user                                                     tively. Testing at the 1% level these differences were significant. We
preferences and all scores on the same outfit are also correlated. We                                                 were able to further break down our results to find that GORDN
therefore used a two-way random effects model as described in [17]                                                    outperformed the control significantly for all templates.
to calculate the variance of the sample mean. We could then use a                                                        Examples of outfits generated for our AB test are shown in
t-test for the difference between means to calculate if the difference                                                Figure 8. For each hero product we show the outfit produced by
between the test and control groups was significant.                                                                  GORDN alongside the randomly generated outfit. Many of the
SIGIR 2019 eCom, July 2019, Paris, France                                                                                                            E. M. Bettaney et al.

      Table 4: Relative differences between the test and control group user scores. All results are significant at the 1% level.

                                                                                              Ctrl score      Test score       Rel. diff. (%)      p-value
                  WW       all                                                                    0.49            0.60              21.28           < 0.01
                           Dress | Shoes                                                          0.54            0.78              46.12           < 0.01
                           Tops | Jeans | Shoes                                                   0.61            0.64               4.53           < 0.01
                           Skirts | Tops | Shoes                                                  0.33            0.36              10.77           < 0.01
                  MW       all                                                                    0.49            0.66              34.16           < 0.01
                           T-Shirts | Jeans | Shoes, Boots & Trainers                             0.63            0.76              19.07           < 0.01
                           Shirts | Trousers & Chinos | Shoes, Boots & Trainers                   0.42            0.60              44.35           < 0.01
                           Shorts | T-Shirts | Shoes, Boots & Trainers                            0.43            0.63              47.24           < 0.01


random examples appear to be reasonable outfits. Although the                             [6] Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In
random model is simple, the use of outfit templates, combined with                            Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
                                                                                              Intelligence and Lecture Notes in Bioinformatics).
selecting only products that were in stock in the ASOS catalogue                          [7] Yang Hu, Xi Yi, and Larry S. Davis. 2015. Collaborative Fashion Recommendation:
on the same day makes this a challenging baseline.                                            A Functional Tensor Factorization Approach. In Proceedings of the 23rd ACM
                                                                                              international conference on Multimedia. Brisbane, Australia, 129–138.
                                                                                          [8] Albert Jimenez, Jose M. Alvarez, and Xavier Giro-i Nieto. 2017. Class-Weighted
5.4     Style space                                                                           Convolutional Features for Visual Instance Search. In 28th British Machine Vision
                                                                                              Conference (BMVC).
We visualise our style space using a t-Distributed Stochastic Neigh-                      [9] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Opti-
bour Embedding (t-SNE) [22] plot in two dimensions (Figure 9a).                               mization. In Internationl Aconference for Learning Representations. San Diego.
                                                                                         [10] Yuncheng Li, Liangliang Cao, Jiang Zhu, and Jiebo Luo. 2017. Mining fashion
While similar items have similar embeddings, we can also see that                             outfit composition using an end-to-end deep learning approach on set data. IEEE
compatible items of different product types have similar embed-                               Transactions on Multimedia 19, 8 (2017), 1946–1955.
                                                                                         [11] Yihui Ma, Jia Jia, Suping Zhou, Jingtian Fu, Yejun Liu, and Zijian Tong. 2017.
dings. Rather than dresses and shoes being completely separate                                Towards better understanding the clothing fashion styles: A multimodal deep
in style space, these product types overlap, with casual dresses                              learning approach. AAAI Conference on Artificial Intelligence (2017), 38–44.
having similar embeddings to casual shoes and occasion dresses                           [12] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hen-
                                                                                              gel. 2015. Image-based Recommendations on Styles and Substitutes. In SIGIR
having similar embeddings to occasion shoes. We built an app for                              Converence on Research and Development in Information Retrieval. 43–52.
internal use that uses t-SNE to visualise our style space and allows                     [13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed
us to easily explore compatible item combinations, as predicted by                            Representations of Words and Phrases and their Compositionality. In Neural
                                                                                              Information Processing Systems. 3111–3119.
GORDN (Figure 9b).                                                                       [14] Takuma Nakamura and Ryosuke Goto. 2018. Outfit Generation and Style Extrac-
                                                                                              tion via Bidirectional LSTM and Autoencoder. In The third international workshop
                                                                                              on fashion and KDD.
6     CONCLUSION                                                                         [15] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove:
                                                                                              Global Vectors for Word Representation. In Proceedings of the 2014 Conference on
We have described GORDN, a multi-modal neural network for                                     Empirical Methods in Natural Language Processing (EMNLP).
generating outfits of fashion items, currently under development at                      [16] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme.
ASOS. GORDN learns to represent items in a latent style space, such                           2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Conference
                                                                                              on Uncertainty in Artificial Intelligence, Vol. 1120. 452–461.
that compatible items of different types have similar embeddings.                        [17] Flâvio Ribeiro, Dinei Florêncio, Cha Zhang, and Michael Seltzer. 2011. CROWD-
GORDN is trained on the ASOS outfits dataset, a new resource                                  MOS: An approach for crowdsourcing mean opinion score studies. In ICASSP,
for the research community which contains over 500,000 outfits                                IEEE International Conference on Acoustics, Speech and Signal Processing - Proceed-
                                                                                              ings. 2416–2419.
curated by professional stylists. The results of an AB test show that                    [18] Yong-Siang Shih, Kai-Yueh Chang, Hsuan-Tien Lin, and Min Sun. 2018. Com-
users approve of outfits generated by GORDN 21% and 34% more                                  patibility Family Learning for Item Recommendation and Generation. In AAAI
                                                                                              Conference on Artificial Intelligence. 2403–2410.
frequently than those generated by a simple baseline model for                           [19] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for
womenswear and menswear, respectively.                                                        Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
                                                                                         [20] Xuemeng Song, Fuli Feng, Xianjing Han, Xin Yang, Wei Liu, and Liqiang Nie.
                                                                                              2018. Neural Compatibility Modeling with Attentive Knowledge Distillation. In
REFERENCES                                                                                    SIGIR Conference on Research & Development in Information Retrieval. 5–14.
 [1] Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam          [21] Pongsate Tangseng, Kota Yamaguchi, and Takayuki Okatani. 2018. Recom-
     Koenigstein, Nir Nice, and Ulrich Paquet. 2014. Speeding up the xbox recom-              mending Outfits from Personal Closet. In Proceedings of the IEEE Conference on
     mender system using a euclidean transformation for inner-product spaces. In              Computer Vision and Pattern Recognition. 269–277.
     Proceedings of the 8th ACM Conference on Recommender systems. ACM, 257–264.         [22] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using
 [2] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefeb-            t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
     vre. 2008. Fast unfolding of communities in large networks. J. Stat. Mech (2008),   [23] Mariya I. Vasileva, Bryan A. Plummer, Krishna Dusad, Shreya Rajpal, Ranjitha
     1–12.                                                                                    Kumar, and David Forsyth. 2018. Learning Type-Aware Embeddings for Fashion
 [3] Ângelo Cardoso, Fabio Daolio, and Saúl Vargas. 2018. Product Characterisation            Compatibility. In European Conference on Cumputer Vision. 405–421.
     towards Personalisation. In Proceedings of the 24th ACM SIGKDD International        [24] Andreas Veit, Balazs Kovacs, Sean Bell, Julian Mcauley, Kavita Bala, and Serge
     Conference on Knowledge Discovery & Data Mining - KDD ’18. 80–89.                        Belongie. 2015. Learning Visual Clothing Style with Heterogeneous Dyadic
 [4] Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S. Davis. 2017. Learning                Co-occurrences. In IEEE International Conference on Computer Vision. 4642–4650.
     Fashion Compatibility with Bidirectional LSTMs. Proceedings of the 2017 ACM on      [25] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba.
     Multimedia Conference - MM ’17 1 (2017), 1078–1086.                                      2016. Learning Deep Features for Discriminative Localization. In Computer Vision
 [5] Tong He and Yang Hu. 2018. FashionNet: Personalized Outfit Recommendation                and Pattern Recognition.
     with Deep Neural Network. (2018), 1–9.
Fashion Outfit Generation for E-commerce                                                 SIGIR 2019 eCom, July 2019, Paris, France


                                                               (a)
             ASOS Style Space Explorer
             Select product types to include:


              AX Paris Scuba
               Pephem Midi
              Dress With Lace
                    Top


            Most similar products in style
                       space

             Filter results by product type:


                                                               (b)

Figure 9: a) A section of a t-Distributed Stochastic Neighbour Embedding (t-SNE) visualisation of the embeddings learnt by
GORDN for womenswear dresses and shoes. Similar items have similar embeddings, but so do compatible items of different
types. The two highlighted areas illustrate that casual dresses are embedded close to casual shoes (red), while occasion dresses
are embedded close to occasion shoes (blue). b) Screenshot of an internal app, developed to allow exploration of the learnt style
space. Users can create t-SNE plots using different product types and then select individual items to view the most similar items
in style space of different types.