Learning Embeddings for Product Size Recommendations
                Kallirroi Dogani∗                                       Matteo Tomassetti∗                                 Sofie De Cnudde
                     ASOS.com                                                ASOS.com                                          ASOS.com
                    London, UK                                              London, UK                                        London, UK
            kallirroi.dogani@asos.com                               matteo.tomassetti@asos.com                         sofiede.cnudde@asos.com

                                                  Saúl Vargas                                  Ben Chamberlain
                                                ASOS.com                                         ASOS.com
                                                London, UK                                       London, UK
                                      saul.vargassandoval@asos.com                        ben.chamberlain@asos.com

ABSTRACT
Despite significant recent growth in online fashion retail, choosing product sizes remains a major problem for customers. We tackle the problem of size recommendation in fashion e-commerce with the goal of improving customer experience and reducing financial and environmental costs from returned items. We propose a novel size recommendation system that learns a latent space for product sizes using only past purchases and brand information. Key to the success of our model is the application of transfer learning from a brand to a product level. We develop a neural collaborative filtering model that is applicable to every product, without requiring specific customer or product measurements or explicit customer feedback on the purchased sizes, which are not available for most customers or products. Offline experiments using data from a major retailer show improvements of 4-40% over the matrix factorisation baseline.

KEYWORDS
Recommender Systems, Representation Learning, Transfer Learning, E-Commerce

ACM Reference Format:
Kallirroi Dogani, Matteo Tomassetti, Sofie De Cnudde, Saúl Vargas, and Ben Chamberlain. 2019. Learning Embeddings for Product Size Recommendations. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 9 pages.

∗Both authors contributed equally to this research.

Copyright © 2019 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION
Providing customers with accurate size guidance is one of the main challenges in the online fashion industry. Since customers cannot try garments before purchasing them, e-commerce platforms often adopt free return policies to motivate customers to purchase items regardless of concerns about size. This effectively turns homes into fitting rooms and encourages customers to order multiple sizes of the same product and return the items that do not fit. According to a recent estimate [2], 15-40% of online purchases are returned, with an even higher average return rate of 30-40% for fashion products. It is desirable to minimise returns as the process incurs high operational and environmental costs.

The size problem cannot be solved by simply mapping between different sizing schemes, such as mapping a EUR shoe size 45 to a UK size 11. There are two reasons for this: (1) sizes are inconsistent, for example a men’s US size 8 shoe is 10 inches for a Nike trainer [3] while an Adidas trainer measures 10.2 inches [1]; (2) simple sizes mask the complexity of the underlying products. For instance, a t-shirt will be sold as small, medium or large, but the size is at least seven dimensional∗ and there is no standardisation of these dimensions, even for a given brand.

∗ neck circumference, arm circumference, arm length, height, chest circumference, waist circumference, shoulder width

Personalised size recommendations provide a general solution to the size and fit problem. However, the development of a size recommendation system is accompanied by a number of challenges, which we address in our model. Firstly, physical measurements of customers and products are generally not available. Secondly, data indicating that a return was due to incorrect sizing is often missing or unreliable, as it is optionally collected from customers without verification. Thirdly, the presence of an additional size variable makes the data sparser than would be expected in the equivalent product recommendations problem. Finally, the existence of different sizing schemes (e.g. EU, UK, US) introduces heterogeneous data, which must be compared in some way.

We propose the Product Size Embedding (PSE) model, which is a neural collaborative filtering approach that learns a latent representation for all the possible size variations of products and customers’ sizing preferences using solely purchase data. By doing so we handle problems with missing physical measurements or return reasons. We map all sizes into a common continuous latent space, which neatly overcomes heterogeneity in sizing schemes and addresses the inconsistency in sizes that would be hard to address with a discrete combinatorial representation†. To deal with sparsity, we first solve the problem at a brand level by accepting the loose assumption that sizing within the same brand is consistent. Then, we transfer this knowledge onto a product level, where sizes of products within the same brand now have separate representations.

† such as mapping products to a discrete platonic size scale

Our main contributions are:

    • A novel size recommendation system that maps sizes into a single latent space without requiring customer or product
SIGIR 2019 eCom, July 2019, Paris, France                                                        K. Dogani, M. Tomassetti, S. De Cnudde et al.


      physical measurements or explicit customer feedback on returned items (e.g. too big/small). Our model leads to an improvement of 4-40% when compared to the matrix factorisation baseline.
    • We show that transferring knowledge learned from a higher level (brands) leads to improved and generalised solutions at a lower level (products).
    • We introduce a method to filter out multiple personas from our dataset. Our solution is independent of fixed thresholds or empirically-tuned hyperparameters.

The rest of the paper is structured as follows: Section 2 presents previous related work, Section 3 introduces our proposed model and Section 4 describes how we handle accounts used by multiple personas. Finally, in Section 5 we discuss our experiments and the performance of our model.

2 RELATED WORK
The size recommendation problem has been previously studied in [4, 8, 13, 18–20]. Specifically, [18] models the size prediction task as an ordinal regression problem, where the customer and product true sizes are learned by taking their differences and feeding them into a linear model. [19] extends the work of [18] with a Bayesian logit and probit regression model with ordinal categories. The posterior distribution over customer and product true sizes is based on mean-field variational inference with Pólya-Gamma augmentation. The Bayesian approach allows the use of priors for handling data sparsity and the computation of confidence intervals for dealing with noisy data. Both [18] and [19] generate ordinal categorical variables based on explicit customer feedback on returned items (e.g. too small, too big or no return). [8] proposes a Bayesian model that learns the joint probability of a customer purchasing a given product size and the resulting return status being either too small, too big or no return. The probability distribution over sizes is conditioned on the return status and the probability over return statuses is modelled as the empirical distribution over the three possible return events along with a Dirichlet prior based on the counts at the brand and category level. [13] learns a latent space for customers and products by applying ordinal regression. A fitness score is computed for each purchase and size ordering is enforced based on the customer’s feedback on the purchased size (i.e. too small, too big or a good fit). In order to handle class imbalances, metric learning techniques are applied to transform data into a space where purchases of the same class are closer and purchases of different classes are separated by a margin.

There are two additional studies [4, 20] that tackle the size and fit problem. [4] learns latent product features using Word2Vec [12] and feeds them into a Gradient Boosting classifier along with additional product features (e.g. physical measurements, colour, etc.). However, additional product features are often difficult to obtain [6]. Finally, [20] extends [4] to the specific case of footwear size recommendations and also proposes a probabilistic graphical approach that exploits brand similarities.

In literature covering the size recommendation problem, multiple approaches have been employed to reduce noise by identifying multiple personas. The approaches vary from using empirically determined thresholds on the range of purchased sizes to more complex statistical models. [4] filters out users where the mean and standard deviation of the purchased sizes exceed a category-level threshold. [18] uses a hierarchical clustering method where clusters are iteratively merged as long as the standard deviation of the cluster does not exceed an empirically determined threshold. Each persona is then treated as a separate customer in the subsequent prediction problem. An improvement to the latter work is made in [19], where a persona distribution is drawn from a Dirichlet distribution. Latent variables related to the specific persona are then appended to each purchase transaction. Finally, [8] follows a Gaussian kernel density estimation approach which is further refined to a Gaussian mixture model. Two assumptions are made here: (i) the maximum number of personas is fixed at four, and (ii) the case where only one persona is active is deemed more likely. Each identified persona is subsequently retained in the dataset. A similar problem is tackled in literature focused on identifying active household members in online rental services [5]. Contextual variables such as day of week or time of day are used to identify which member is responsible for which actions and which member is active at a certain point in time.

3 THE PRODUCT SIZE EMBEDDING MODEL
The Product Size Embedding (PSE) model follows a neural collaborative filtering approach to learn embeddings for each product-size combination. The main advantage of the PSE over related latent variable models (e.g. [13]) is that it does not rely on noisy and sparse customer feedback on the returned items (i.e. customers optionally reporting that the item was too big / small). Instead, only implicit signals are used: the products that are purchased and the subset that are returned.

Collaborative filtering [9, 17] uses customer-product interactions and is based on the assumption that customers buying similar products have similar tastes. This principle naturally translates into the size and fit domain as "customers with similar body shapes tend to buy clothes in similar sizes". Matrix factorisation approaches, such as that of Hu et al. [9], have been proposed to capture the latent taste/preference/style space as reflected by the interactions between customers and products. Matrix factorisation decomposes customer-product interaction matrices into low-rank user and item matrices that represent, respectively, customers and products as vectors in a latent space that captures preferences and styles. Our proposed PSE model similarly represents customers and product sizes in a vector space. However, there are two important differences between our approach and most matrix factorisation approaches. Firstly, we learn a latent space at a product size level instead of at a product level, i.e. we have a different vector for every possible size of a product. Secondly, we adopt an asymmetric framework [15] so that users are not represented explicitly, but as the aggregate of the product vectors with which they have interacted. Accordingly, we train different models for each product category (tops, bottoms or shoes), so all trained embeddings belong to the same category and the learned latent space represents the same body part. The asymmetric approach eliminates learning an embedding layer for customers, which greatly reduces the number of parameters. For example, the symmetric approach for menswear


Figure 1: The architecture of the Product Size Embedding model, which is trained independently for each product category (tops, bottoms or shoes) by maximising the dot product between the user vector $V_u$ and the product size vector $V_{p_s}$ of a purchased size $p_s$. The softmax is computed for each product over all of its possible sizes (i.e. the purchased size $p_s$ and the non-purchased sizes $p_{\neg s}$).
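
The forward pass summarised in the Figure 1 caption could be sketched as follows. This is a minimal NumPy illustration rather than the authors' Keras implementation: the toy catalogue, the random embedding matrix and the helper names (`user_vector`, `size_probabilities`, `cross_entropy`) are all assumptions made for the example. It shows a user vector built as the mean of kept product-size vectors, dot-product scores, a softmax restricted to the sizes of one product, and the resulting cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy catalogue: each row of `embeddings` is one product-size vector.
# Product "p1" is sold in three sizes (rows 0-2), product "p2" in two sizes (rows 3-4).
EMBEDDING_DIM = 4
embeddings = rng.normal(size=(5, EMBEDDING_DIM))
sizes_of_product = {"p1": [0, 1, 2], "p2": [3, 4]}


def user_vector(history):
    """Mean of the product-size vectors the customer purchased and kept."""
    return embeddings[history].mean(axis=0)


def size_probabilities(history, product):
    """Softmax of the dot-product scores over the sizes of one product only."""
    candidates = sizes_of_product[product]
    scores = embeddings[candidates] @ user_vector(history)
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


def cross_entropy(history, product, purchased_row):
    """Negative log-probability of the size that was actually purchased."""
    probs = size_probabilities(history, product)
    target = sizes_of_product[product].index(purchased_row)
    return -np.log(probs[target])


history = [3, 4]  # rows of previously purchased product-sizes
probs = size_probabilities(history, "p1")
loss = cross_entropy(history, "p1", purchased_row=1)
```

Restricting the softmax to the candidate rows of a single product is what separates this from a standard multi-class classifier, in which every class would compete in the normalisation.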


shoes requires ∼780K product size and ∼3M customer parameters, therefore the asymmetric model is approximately five times smaller. Another advantage of the asymmetric approach is that the model does not require retraining for new customers since their representations can be inferred from their purchase history. The architecture for the PSE model is shown in Figure 1.

We model size recommendation as a multi-class classification task. Given a user $u$ and a product $p$, the task is to predict the customer’s size in that product, $p_{s^*}$. This differs from standard multi-class classification as each product is only available in a small subset of all possible size classes (t-shirts don’t come in shoe sizes, etc.).

The input to the model is a set of user purchase histories, $H_u$. For every customer we create a sequence of previously purchased (and not returned) product sizes $\{ps_1, ps_2, \ldots, ps_n\}$. For a sequence of length $n$, the $n$-th product-size is the target and the previous $n-1$ products are used to construct a customer vector. Each product-size in the history indexes into an embedding matrix using a neural network embedding layer to produce a product-size vector $V_{p_s} \in \mathbb{R}^k$. User vectors $V_u \in \mathbb{R}^k$ are constructed by taking the first $n-1$ product-sizes in $H_u$, retrieving the associated product-size vectors and taking the mean

$$V_u = \frac{1}{n-1} \sum_{p_s \in H_{u \setminus n}} V_{p_s}, \qquad (1)$$

where $H_{u \setminus n}$ is the history minus the target product-size. In practice, to increase the amount of training data, for each $H_u$ we will create user and product vectors from all contiguous subsequences of length $k$ where the first $(k-1)$ elements form a customer vector and the $k$-th is the target product-size. The similarity $\tau$ between customers and product-sizes is given by the dot product between the user and product vectors

$$\tau_{u,p_s} = V_u^{T} V_{p_s}, \qquad (2)$$

and product size probabilities are computed as the softmax of the similarity scores normalised over all sizes of the given product

$$f(\tau)_{u,p_i} = P(s = i \mid u, p) = \frac{e^{\tau_{u,p_i}}}{\sum_j e^{\tau_{u,p_j}}}, \qquad (3)$$

where the index $j$ runs over all possible sizes of product $p$. To evaluate this softmax we require the product-size vectors for $p_s \; \forall s$, which are stored in a key-value store keyed on the product id.

The PSE is trained in Keras using the Adam optimiser [10] with parameters $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and the categorical cross-entropy loss

$$L = -\sum_{D} \sum_{j} t_j \log\bigl(f(\tau)_{u,p_j}\bigr), \qquad t_j = \begin{cases} 1 & \text{if } j = s \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $D$ is the extended set of purchase histories and $s$ is the purchased size.

3.1 Transfer from Brands to Products
As we model product-size combinations instead of just products, our product-size interaction matrix is roughly ten times sparser (e.g.


      Figure 2: The size embeddings learned at a brand level are used to initialise the size embeddings at a product level.
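
The initialisation step described in the Figure 2 caption could look roughly like the sketch below. It assumes the brand-level vectors have already been trained and are available as a dictionary; the dictionary layout, the second product id `id50001` and the function name `init_product_size_vectors` are hypothetical and serve only to illustrate the idea of copying each brand-size vector to every product-size of that brand before fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBEDDING_DIM = 4

# Hypothetical result of brand-level training: one vector per (brand, size) pair.
brand_size_vectors = {
    ("Levis", "W34inL32in"): rng.normal(size=EMBEDDING_DIM),
    ("Levis", "W32inL32in"): rng.normal(size=EMBEDDING_DIM),
}

# Hypothetical catalogue: product id -> (brand, sizes the product is sold in).
catalogue = {
    "id43498": ("Levis", ["W34inL32in", "W32inL32in"]),
    "id50001": ("Levis", ["W34inL32in"]),
}


def init_product_size_vectors(catalogue, brand_size_vectors):
    """Start every product-size vector from the vector of its brand-size."""
    initial = {}
    for product_id, (brand, sizes) in catalogue.items():
        for size in sizes:
            # Copy, so fine-tuning one product later does not move its siblings.
            initial[(product_id, size)] = brand_size_vectors[(brand, size)].copy()
    return initial


product_size_vectors = init_product_size_vectors(catalogue, brand_size_vectors)
```

After this step, products of the same brand share identical starting points per size, and product-level training is then free to move them apart.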


from $\sim 3 \times 10^{-4}$ to $\sim 4 \times 10^{-5}$ for menswear shoes) than the data used for product recommendations. As a result, learning representations for all possible product-size combinations is challenging. Transfer learning is a popular technique to generalise from small datasets to larger ones [14]. We assume that each brand has consistent sizes and we learn latent representations $V_{b_s}$ for every combination of brand $b = \{p\}$ and size $s$. Then, we transfer this knowledge to a product level by initialising

$$V_{p_s} = V_{b_s}, \quad \forall p_s \in b_s. \qquad (5)$$

As shown in Figure 2, we train the model at a brand size level, then we initialise the product size vectors $V_{p_s}$ with the trained brand size vectors $V_{b_s}$ and finally we train the model at a product size level to fine-tune the product size vectors. Applying the pre-trained brand size vectors at a product level improves generalisation, boosts performance and leads to faster convergence. In Section 5.3, we demonstrate the improvements transfer learning offers over random initialisation of latent vectors.

4 DETECTING MULTIPLE PERSONAS
A major challenge in the design of recommender systems is identifying accounts that are shared across multiple users. Some services, such as Netflix [7], solve this problem by creating explicit user profiles for each persona. In our work, user profiles are not viable and so we detect multiple personas as a preprocessing step.

To detect multiple personas we employ a Gaussian Mixture Model (GMM) [11] that predicts the number of individuals using an account and identifies each persona’s purchases. Our proposed method is independent of assumption-based thresholds or empirically-tuned hyperparameters. When we detect an account with multiple personas, we subsequently remove it from both training and test sets.

Our GMM approach is based on the assumption that the purchases of every persona are centred around a core size. Customers with at least two purchases and with a size difference‡ larger than one are potential candidates for the multiple persona detection process. The output of the GMM consists of a mixture of components, each representing a different persona in the purchase history. Each component (or persona) is represented by a Gaussian distribution, whose mean $\mu$ corresponds to the persona’s core size.

‡ We have ordered each sizing scheme from the smallest to the largest size found in our dataset and defined a set of sizing indexes. For example, the sizing index for the sizing scheme CAT ranges from 0 (3XS) to 25 (8XL). When referring to the size difference between two sizes, we mean their difference when mapped to the sizing index.

Since the number of personas $\lambda$ using an account is unknown, we employ the silhouette score $s_\lambda$ [16] to find the optimal number of mixture components $\lambda_{opt}$ (see Algorithm 1). The silhouette score is a cluster evaluation metric that measures how well each purchased size is clustered with similar purchased sizes. An $s_\lambda \approx 1$ implies non-overlapping clusters with high density, while $s_\lambda = 0$ points to overlapping clusters.

Algorithm 1 Algorithm for multiple persona detection
Input: purchase history $H_u$
Output: $\lambda_{opt}$ personas
  $\lambda \leftarrow 2$
  $s_{\lambda-1} \leftarrow 0$
  $s_\lambda \leftarrow$ getSilhouetteScore(GMM($H_u$, $\lambda$))
  while $s_\lambda > s_{\lambda-1}$ and $\min_{i,j = 1, \ldots, \lambda;\ i \neq j} |\mu_i - \mu_j| > 1$ do
    $\lambda \leftarrow \lambda + 1$
    $s_\lambda \leftarrow$ getSilhouetteScore(GMM($H_u$, $\lambda$))
  end while
  $\lambda_{opt} \leftarrow \lambda - 1$

The process of identifying multiple personas consists of running the GMM to detect $\lambda$ personas within $H_u$ and calculating the silhouette score $s_\lambda$ associated with that mixture. The parameter $\lambda$

Table 1: Example of the output of the multiple persona detection process for womenswear shoes.

   Purchase history Hu                                  Detection
   UK3, UK3, UK3.5, UK4, UK4                            1 persona
   {UK2, UK2}, {UK5, UK5, UK6}                          2 personas
   {UK2, UK3, UK3, UK3, UK4, UK4}, {UK6, UK6}, {UK9}    3 personas
   UK2, UK3, UK4, UK5, UK6, UK6, UK7, UK8, UK9          reseller

Table 2: Size range for all sizing schemes.

   Sizing Scheme     Size Range
   UK                UK2, UK4, ..., UK34
   EU                EU30, EU32, ..., EU50
   CAT               3XS, ..., 8XL
   JNS               W22in L26in, ..., W44in L34in
   WST               W22in, ..., W44in
   CST               Chest 32in, ..., Chest 56in
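
The detection loop of Algorithm 1, which yields labels like those in Table 1, could be sketched as below. This is an illustration only: it assumes scikit-learn's `GaussianMixture` and `silhouette_score` as stand-ins for the GMM and getSilhouetteScore steps, and the helper names, the `reg_covar` and `random_state` settings and the fallback score of -1 for degenerate fits are choices made for this example, not details from the paper. Inputs are expected to be sizing indexes; reseller filtering and size conversion are left out.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def fit_personas(size_indexes, n_personas):
    """Fit a 1-D GMM and return (silhouette score, component means)."""
    x = np.asarray(size_indexes, dtype=float).reshape(-1, 1)
    if n_personas > len(x) - 1:
        return -1.0, np.array([])  # too few purchases to score this many clusters
    gmm = GaussianMixture(n_components=n_personas, reg_covar=1e-3, random_state=0)
    labels = gmm.fit_predict(x)
    if len(set(labels)) < 2:
        return -1.0, gmm.means_.ravel()  # silhouette is undefined for one cluster
    return silhouette_score(x, labels), gmm.means_.ravel()


def min_core_size_gap(means):
    """Smallest distance between any two component means (core sizes)."""
    ordered = np.sort(means)
    return np.diff(ordered).min() if len(ordered) > 1 else 0.0


def detect_personas(size_indexes):
    """Grow the number of components while both conditions of Algorithm 1 hold."""
    lam = 2
    previous_score = 0.0
    score, means = fit_personas(size_indexes, lam)
    while score > previous_score and min_core_size_gap(means) > 1:
        lam += 1
        previous_score = score
        score, means = fit_personas(size_indexes, lam)
    return lam - 1  # last component count for which both conditions held
```

A call such as `detect_personas([2, 2, 5, 5, 6])` returns the estimated number of personas for one purchase history; a result above 1 would mark the account for removal.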


is iteratively increased as long as (i) $s_\lambda$ is higher than $s_{\lambda-1}$, and (ii) the core size of each mixture component differs by at least 1 size unit. When the iterative process is finished, $\lambda_{opt}$ is set to $\lambda - 1$, the last value for which both conditions held (see Algorithm 1), and if $\lambda_{opt} > 1$ that customer is identified as buying for multiple personas.

While dealing with the multiple persona problem, two additional issues arise: (i) the problem of resellers, and (ii) the issue of purchases in multiple sizing schemes. Resellers are customers who purchase products with the intention of reselling them, so it is likely that their purchases cover a wider range of sizes. In that case, a Gaussian mixture model is not suitable for detecting them, as their purchases are not centred around a core size, but instead have a uniform distribution. Therefore, prior to performing multiple persona detection, we eliminate all customers with a uniform purchase history.

To apply the GMM, we first need to convert all sizes into a single sizing scheme. Since most existing conversion tables are incomplete and inaccurate, we have used the data to approximate size conversions. Specifically, we build a co-purchase matrix per product category between two sizing schemes and we convert sizes according to the highest co-purchase frequency. Note that this conversion is only an approximation for data cleaning purposes and is not used in the final size prediction model.

Table 1 lists examples of purchase histories that are flagged as either multiple personas or resellers.

The evaluation of the detected multiple personas is similar to evaluating clusters in unsupervised clustering techniques. During the detection process, we calculate the silhouette score, and thus have a built-in evaluation metric that guides the clustering. Figure 3 demonstrates that as the size difference of the purchases increases, the probability of detecting a multiple persona account steadily increases, but it then flattens out and decreases for very large size differences, which indicate a higher probability of detecting a reseller.

Figure 3: Percentage of multiple persona accounts (red line), reseller accounts (blue line) and no multiple persona accounts (green line) as a function of the size difference of the purchases for menswear bottoms.

5 EXPERIMENTS AND RESULTS
In this section, we first describe the experimental setup, then detail the baselines for comparison and finally present our results. Our experiments are based on data from a major online retailer collected over one year. We have grouped all products into three categories (Tops, Bottoms and Shoes), two genders (menswear (MW) and womenswear (WW)), and six sizing schemes (see Table 2).

The size recommendation problem is solved independently for each product category-gender combination, e.g. menswear-tops. Table 3 shows example product types that comprise each product category as well as the supported sizing schemes and high-level statistics. Products originate from a large and diverse network of international suppliers, with thousands of new items added weekly, and so in general physical measurements of products are not available.

5.1 Experimental Setup
Since we solve the size prediction problem separately for each product category, the purchase history $H_u$ has been computed using all previous purchases of customer $u$ from the same product category (i.e. we do not use past purchases of shoes to predict sizes for tops). We exclude any returned products from the purchase history as there is no data specifying whether items are returned due to poor fit or for other reasons.

Table 4 shows examples of the same purchase history computed at different levels. In this case, applying transfer learning from the brand level to the product level means that we initialise the product size vector id43498_W34inL32in with the brand size vector Levis_W34inL32in.

We divide the dataset for each product category into a training and a test set using an 80:20 split.

5.2 Comparison Methods
We compare the performance of the following personalised methods:
SIGIR 2019 eCom, July 2019, Paris, France                                                        K. Dogani, M. Tomassetti, S. De Cnudde et. al

Table 3: Properties and high level statistics of the product categories. WW and MW refer to womenswear and menswear,
respectively.

            Product Category     Product Types             Sizing Schemes     #users    #products    #brands     % MP     % Resellers
            TopsWW               crop tops, hoodies, ...   UK, CAT, EU        3.4M      105.6K       800         9.5%     0.3%
            BottomsWW            jeans, leggings, ...      JNS, CAT           1.3M      24.7K        609         4.9%     0.1%
            ShoesWW              boots, trainers, ...      UK, EU             1.2M      17.0K        206         3.0%     0.6%
            TopsMW               shirts, t-shirts, ...     CAT, CST           1.3M      66.5K        430         5.3%     0.9%
            BottomsMW            jeans, chinos, ...        JNS, CAT, WST      840.6K    21.0K        362         3.6%     0.4%
            ShoesMW              boots, trainers, ...      UK                 391.5K    12.2K        182         2.3%     1.1%

                                  Table 4: The same purchase history generated at different levels.

                      Level Applied                    Purchase History Hu
                      Brand Level                      Adidas_L, Levis_W34inL32in
                      Brand & Product Type Level       Adidas_Shorts_L, Levis_Jeans_W34inL32in
                      Product Level                    Adidas_Shorts_id3223_L, Levis_Jeans_id43498_W34inL32in
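
As an illustration of Table 4 and of the transfer-learning initialisation from the brand level to the product level, the following is a minimal sketch. The `Purchase` type, the function names and the random fallback for product sizes without a brand vector are assumptions made for this example; they are not taken from the paper.

```python
import random
from typing import Dict, List, NamedTuple


class Purchase(NamedTuple):
    brand: str
    product_type: str
    product_id: str
    size: str


def token(p: Purchase, level: str) -> str:
    """Build the purchase-history token at one of the three levels of Table 4."""
    if level == "brand":
        return f"{p.brand}_{p.size}"
    if level == "brand_product_type":
        return f"{p.brand}_{p.product_type}_{p.size}"
    if level == "product":
        return f"{p.brand}_{p.product_type}_{p.product_id}_{p.size}"
    raise ValueError(f"unknown level: {level}")


def init_product_vectors(purchases: List[Purchase],
                         brand_vectors: Dict[str, List[float]],
                         k: int,
                         seed: int = 0) -> Dict[str, List[float]]:
    """Initialise product-level size vectors from brand-level size vectors.

    Assumption: product sizes whose brand size has no learned vector fall back
    to a small random initialisation.
    """
    rng = random.Random(seed)
    vectors: Dict[str, List[float]] = {}
    for p in purchases:
        brand_key = token(p, "brand")
        if brand_key in brand_vectors:
            vectors[token(p, "product")] = list(brand_vectors[brand_key])  # copy
        else:
            vectors[token(p, "product")] = [rng.gauss(0.0, 0.1) for _ in range(k)]
    return vectors


history = [
    Purchase("Adidas", "Shorts", "id3223", "L"),
    Purchase("Levis", "Jeans", "id43498", "W34inL32in"),
]
brand_level = [token(p, "brand") for p in history]
product_vectors = init_product_vectors(history, {"Levis_W34inL32in": [0.1, 0.2]}, k=2)
```

Copying the brand vector gives every product size of that brand the same starting point, which training at the product level can then refine.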


     • MCS-SS. This method predicts the user’s most common size
       (MCS) given the sizing scheme (SS) of product p. For instance,
       if Hu = (id1432_UK8, id1564_UK8, id1055_UK9, id1453_EU36)
       is the purchase history of user u, this method predicts UK8
       for products available in UK sizes and EU36 for products
       available in EU sizes. If there is a tie, MCS-SS predicts the
       most recently purchased size.
     • ALS. This is a symmetric matrix factorisation model opti-
       mised through alternating least squares [9].
     • LR. This is a multi-class Logistic Regression classifier that
       takes as input the normalised counts of the purchased sizes
       and one-hot encoded features for the product type, brand
       and sizing scheme.
     • PSE-B. Version of the PSE model where the size embeddings
       are learned at a brand level.
     • PSE-BPT. Version of the PSE model where the size embed-
       dings are learned at a brand and product type level.
     • PSE. The size embeddings are learned at a product level.
     • tPSE-BPT. The size embeddings are learned at a brand and
       product type level and the embedding layer is initialised with
       the latent space learned from PSE-B.
     • tPSE. This is our proposed PSE model. The size embeddings
       are learned at a product level and the embedding layer is
       initialised with the latent space learned from PSE-B.
   We cannot compare our model against other recently published size recommendation
algorithms, as they require extra data sources that are not always available (i.e. the
return reason). Our model is more generic and could be applied to any fashion dataset.
   All PSE experiments have been run with a fixed latent space dimension k = 10. We have
explored the dependency of our results on this parameter and found no statistically
significant difference when adopting a higher k (see Fig. 4).

Figure 4: PSE-B accuracy as a function of the latent space dimension, k, for each category.
The results are independent of k when k ≥ 10.

5.3    Results
The results of our experiments are summarised in Table 5. All variations of the PSE model
outperform the baselines, except for the product-level PSE model without transfer learning,
which falls below the strongest baseline for womenswear tops, menswear bottoms and menswear
shoes. We observe that the accuracy increases when the size embeddings are learned at a
brand and product type level (PSE-BPT) as opposed to the brand level (PSE-B). However, when
latent representations are learned at a product level (PSE), the accuracy drops for some
product categories. If we consider the case of menswear shoes, the number of latent vectors
we need to train increases from 1.4K (PSE-B) to 77.9K (PSE), therefore the latent space
becomes sparser which makes the model prone to overfitting (Figure 5). To overcome this
issue, we have used latent representations learned from PSE-B to initialise the embedding
layer in tPSE-BPT and tPSE. The results show that transfer learning improves generalisation
and leads to more accurate predictions.
   Table 7 shows examples where the tPSE model successfully predicts sizes that are not
included in the purchase history, illustrating the benefits of learning latent size
representations.
   To better understand how tPSE performs in different scenarios, we have evaluated the
model on purchase histories of different

Table 5: Accuracy of each tested model for all product categories. The improvement in accuracy for the tPSE model is statisti-
cally significant (**α = 0.01). WW and MW used in the product categories refer to womenswear and menswear, respectively.

                Product Category      MCS-SS     ALS        LR            PSE-B     PSE-BPT     PSE        tPSE-BPT     tPSE
                TopsWW                38.917%    60.760%    60.361%       61.175%   61.302%     60.654%    61.294%      62.286%**
                BottomsWW             30.129%    56.440%    57.456%       58.287%   58.446%     58.574%    58.500%      60.083%**
                ShoesWW               63.098%    60.672%    68.354%       69.263%   69.276%     69.518%    69.289%      70.498%**
                TopsMW                64.009%    62.496%    68.689%       69.796%   70.135%     69.542%    70.134%      70.962%**
                BottomsMW             31.893%    52.789%    59.498%       59.964%   60.255%     57.910%    60.290%      61.992%**
                ShoesMW               64.467%    49.160%    68.209%       68.319%   68.612%     65.644%    68.691%      69.344%**
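
For reference, the MCS-SS baseline reported in the first results column of Table 5 (most common size within the product's sizing scheme, ties broken by the most recently purchased size) can be sketched as follows. The representation of the history as chronological `(sizing_scheme, size)` pairs and the `None` return value for users with no purchase in the scheme are assumptions for this example.

```python
from collections import Counter
from typing import List, Optional, Tuple


def mcs_ss(history: List[Tuple[str, str]], scheme: str) -> Optional[str]:
    """Most common size of a user within one sizing scheme.

    history -- chronological list of (sizing_scheme, size), oldest first
    scheme  -- sizing scheme of the product to predict for
    """
    sizes = [size for s, size in history if s == scheme]
    if not sizes:
        return None  # assumption: no prediction without purchases in this scheme
    counts = Counter(sizes)
    top = max(counts.values())
    tied = {size for size, c in counts.items() if c == top}
    # Ties are broken by the most recently purchased size.
    for size in reversed(sizes):
        if size in tied:
            return size
    return None


# Purchase history from the MCS-SS description: UK8, UK8, UK9, EU36
h = [("UK", "8"), ("UK", "8"), ("UK", "9"), ("EU", "36")]
uk_prediction = mcs_ss(h, "UK")
eu_prediction = mcs_ss(h, "EU")
```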


                  Table 6: Hitrate@K for tPSE.

           Product Category      Hitrate@2      Hitrate@3
           TopsWW                88.711%        96.939%
           BottomsWW             84.909%        93.835%
           ShoesWW               87.529%        94.668%
           TopsMW                92.373%        98.315%
           BottomsMW             82.485%        90.455%
           ShoesMW               86.259%        93.793%
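
The Hitrate@K values in Table 6 count how often the purchased size is among the K highest-ranked sizes. A minimal sketch of this metric is given below. It assumes that the similarity score between the user vector and each product size vector is a plain dot product, and that each test example carries its own set of candidate size vectors; both are assumptions for illustration, as are all function names.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_sizes(user_vec: Sequence[float],
                size_vecs: Dict[str, Sequence[float]],
                k: int) -> List[str]:
    """Rank candidate sizes by similarity to the user vector and keep the top k."""
    ranked = sorted(size_vecs,
                    key=lambda s: dot(user_vec, size_vecs[s]),
                    reverse=True)
    return ranked[:k]


Example = Tuple[Sequence[float], Dict[str, Sequence[float]], str]


def hitrate_at_k(examples: List[Example], k: int) -> float:
    """Fraction of examples whose purchased size is within the top k predictions."""
    if not examples:
        raise ValueError("at least one example is required")
    hits = sum(1 for user_vec, size_vecs, purchased in examples
               if purchased in top_k_sizes(user_vec, size_vecs, k))
    return hits / len(examples)


user = [1.0, 0.0]
candidates = {"S": [0.9, 0.0], "M": [0.8, 0.1], "L": [0.1, 0.9]}
examples = [(user, candidates, "M"), (user, candidates, "L")]
```

In this toy example the customer who bought M is a miss at K = 1 but a hit at K = 2, which mirrors the case of customers who sit between two sizes.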


Table 7: Examples of tPSE successfully predicting a size that
has not been purchased before.

 Purchase History Hu                               True Predicted Size
 id3455_UK6.5, id5637_UK6, id4112_UK6.5            id9652_UK7
 id6563_UK6, id1463_UK8, id3004_UK6                id8102_EU34

Figure 5: Training (blue lines) and test (orange lines) accuracy as a function of the
number of epochs for PSE (solid lines) and PSE-B (dashed lines) in menswear shoes. The
model trained at a product level (PSE) starts overfitting after the third epoch, while the
model trained at a brand level (PSE-B) is more stable. Similar trends have been observed
for the other product categories.

lengths. Figure 6a shows that the accuracy for menswear shoes increases as more items are
present in the purchase history. We observe that the accuracy of the model for purchase
histories with six or more items is more than 75%. However, this occurs for less than 10%
of the data (Figure 6b). The same figure shows that more than 50% of the customers only
have one item in their purchase history, which is not sufficient to accurately learn the
customer’s true size. We observe similar trends for all other product categories.
    To confirm that our model does not deviate significantly from the purchased size, we
have also evaluated the Hitrate@K, defined as the fraction of times the correct size is
within the top K predictions. To retrieve the top K recommended sizes, we rank the
predictions based on the similarity scores between the user vector Vu and the product size
vectors Vps. Hitrate@2 ranges between 82% and 92% for all product categories (Table 6) and
can explain cases where customers may be in between two sizes. For instance, both sizes S
and M could fit well, but the customer has to pick just one when completing a purchase.

5.4    Analysis on the Latent Space
Figures 7 and 8 show instances of the latent representations mapped onto a 3D space using
the t-SNE technique for dimensionality reduction [21]. Specifically, Figure 7 shows the
menswear shoes graph constructed by retrieving the closest vectors to redtape_UK8. The
area around redtape_UK8 contains brands of size UK8. The neighbourhood in the upper-left
corner consists of UK7 sizes, while the area in the bottom-right corner is constructed
mainly with UK9 sizes. In the gap between these three big clusters, we observe the half
sizes UK7.5 and UK8.5, which show the transitions from the UK8 cluster to the UK7 and UK9
neighbourhoods, respectively. In a similar context, Figure 8 shows the latent space of
sizes for womenswear tops. The size representations are sorted in ascending order,
starting with XS sizes in the upper-right corner and ending with the cluster of XL sizes in
the bottom-right corner. Additionally, we observe that same or similar sizes from different
sizing schemes (e.g. XS and UK6) are mapped into the same neighbourhoods of the latent
space. Both figures confirm the assumption that similar purchased sizes correspond to
customers with similar body measurements. Based on this assumption, we can use
customer-product interactions to learn a latent space for size representations.

6     CONCLUSION
We introduced the Product Size Embedding (PSE) model, a novel approach to solve the size
recommendation problem in fashion e-commerce. The PSE model requires only customer-product
interactions and brand information without needing explicit customer


           (a) Accuracy of tPSE as a function of the number of items in the purchase history.
           (b) Distribution of the length of the purchase history Hu. The dataset is dominated
           by customers with only one purchased item.

Figure 6: Relation between accuracy and the number of purchases in the purchase history Hu for menswear shoes. Similar
trends have been observed for the other product categories.


Figure 7: 3D t-SNE projection of the latent space of menswear shoes centred around
redtape_UK8. Purple points are closer to redtape_UK8 and represent UK8 or UK8.5 sizes,
while orange points are more distant and represent UK7, UK7.5 or UK9 sizes.

Figure 8: 3D t-SNE projection of the latent space of womenswear tops. The size
representations are sorted in ascending order, starting with XS sizes in the upper-right
corner and ending in XL sizes in the bottom-right corner. Similar sizes of different sizing
schemes are clustered together.


feedback on the returned items (i.e. the item was too big or too small). Our offline
evaluation on a large-scale e-commerce dataset shows that mapping product sizes into a
single latent space leads to more accurate size predictions than a range of different
baselines. In addition, we have demonstrated the advantages of transfer learning and how
knowledge learned at a brand level boosts the performance of the model at a product level.
Finally, we have proposed a technique to identify multiple personas in the purchase history
and applied it to reduce the noise in our data.


REFERENCES
 [1] 2019. Adidas Size Chart for Men’s Shoes | adidas UK.
     https://www.adidas.co.uk/help/size_charts. Accessed: 2019-01-20.
 [2] 2019. Finding a Fix for Retail’s Trillion-Dollar Problem: Returns.
     https://www.cnbc.com/2019/01/10/growing-online-sales-means-more-returns-and-trash-for-landfills.html.
     Accessed: 2019-01-20.
 [3] 2019. Nike.com Size Fit Guide - Men’s Shoes.
     https://www.nike.com/us/en_us/c/size-fit-guide/mens-shoe-sizing-chart. Accessed: 2019-01-20.
 [4] G. Mohammed Abdulla and Sumit Borar. 2017. Size Recommendation System for Fashion
     E-Commerce. In KDD Workshop on Machine Learning Meets Fashion.
 [5] Pedro G. Campos, Alejandro Bellogin, Fernando Díez, and Iván Cantador. 2012. Time
     Feature Selection for Identifying Active Household Members. In Proceedings of the 21st
     International Conference on Information and Knowledge Management (CIKM ’12). ACM,
     pp. 2311–2314.
 [6] Ângelo Cardoso, Fabio Daolio, and Saúl Vargas. 2018. Product Characterisation towards
     Personalisation: Learning Attributes from Unstructured Data to Recommend Fashion
     Products. In Proceedings of the 24th International Conference on Knowledge Discovery &
     Data Mining (KDD ’18). ACM, pp. 80–89.
 [7] Carlos A. Gomez-Uribe and Neil Hunt. 2016. The Netflix Recommender System: Algorithms,
     Business Value, and Innovation. ACM Transactions on Management Information Systems
     (TMIS) 6, 4 (2016), pp. 13.
 [8] Romain Guigourès, Yuen King Ho, Evgenii Koriagin, Abdul-Saboor Sheikh, Urs Bergmann,
     and Reza Shirvany. 2018. A Hierarchical Bayesian Model for Size Recommendation in
     Fashion. In Proceedings of the 12th Conference on Recommender Systems (RecSys ’18).
     ACM, pp. 392–396.
 [9] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit
     Feedback Datasets. In Proceedings of the 8th International Conference on Data Mining
     (ICDM ’08). IEEE, pp. 263–272.
[10] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization.
     arXiv preprint arXiv:1412.6980 (2014).
[11] Bruce G. Lindsay. 1995. Mixture Models: Theory, Geometry and Applications. Institute of
     Mathematical Statistics.
[12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of
     Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[13] Rishabh Misra, Mengting Wan, and Julian McAuley. 2018. Decomposing Fit Semantics for
     Product Size Recommendation in Metric Spaces. In Proceedings of the 12th Conference on
     Recommender Systems (RecSys ’18). ACM, pp. 422–426.
[14] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Transactions
     on Knowledge and Data Engineering 22 (2010), 1345–1359.
[15] Arkadiusz Paterek. 2007. Improving Regularised Singular Value Decomposition for
     Collaborative Filtering. In Proceedings of KDD Cup and Workshop. ACM, pp. 5–8.
[16] Peter J. Rousseeuw. 1987. Silhouettes: A Graphical Aid to the Interpretation and
     Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20, 1
     (1987), pp. 53–65.
[17] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-Based
     Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th
     International Conference on World Wide Web (WWW ’01). ACM, pp. 285–295.
[18] Vivek Sembium, Rajeev Rastogi, Atul Saroop, and Srujana Merugu. 2017. Recommending
     Product Sizes to Customers. In Proceedings of the 11th Conference on Recommender Systems
     (RecSys ’17). ACM, pp. 243–250.
[19] Vivek Sembium, Rajeev Rastogi, Lavanya Tekumalla, and Atul Saroop. 2018. Bayesian Models
     for Product Size Recommendations. In Proceedings of the 27th World Wide Web Conference
     (WWW ’18). ACM, pp. 679–687.
[20] Shreya Singh, G. Mohammed Abdulla, Sumit Borar, and Sagar Arora. 2018. Footwear Size
     Recommendation System. arXiv preprint arXiv:1806.11423 (2018).
[21] L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing High-Dimensional Data Using
     t-SNE. (2008).