An LSTM-Based Dynamic Customer Model for Fashion Recommendation (Short Paper)

Sebastian Heinz, Christian Bracher, Roland Vollgraf
Zalando Research, Germany
sebastian.heinz@zalando.de | christian.bracher@zalando.de | roland.vollgraf@zalando.de

ABSTRACT
Online fashion sales present a challenging use case for personalized recommendation: Stores offer a huge variety of items in multiple sizes. Small stocks, high return rates, seasonality, and changing trends cause continuous turnover of articles for sale on all time scales. Customers tend to shop rarely, but often buy multiple items at once. We report on backtest experiments with sales data of 100k frequent shoppers at Zalando, Europe's leading online fashion platform. To model changing customer and store environments, our recommendation method employs a pair of neural networks: To overcome the cold-start problem, a feedforward network generates article embeddings in "fashion space," which serve as input to a recurrent neural network that predicts a style vector in this space for each client, based on their past purchase sequence. We compare our results with a static collaborative filtering approach, and a popularity ranking baseline.

CCS CONCEPTS
• Information systems → Recommender systems; Content analysis and feature selection; • Human-centered computing → Collaborative filtering; • Computing methodologies → Neural networks;

KEYWORDS
Recommendation, collaborative filtering, recurrent neural network

ACM Reference Format:
Sebastian Heinz, Christian Bracher, and Roland Vollgraf. 2017. An LSTM-Based Dynamic Customer Model for Fashion Recommendation. In Proceedings of Workshop on Temporal Reasoning in Recommender Systems, Como, Italy, 31st August 2017 (Temporal Reasoning), 5 pages.

* Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.

1 INTRODUCTION
The recommendation task in the setting of online fashion sales presents unique challenges. Consumer tastes and body shapes are idiosyncratic, so a huge selection of items in different sizes must be kept on offer. On a typical day, Zalando, Europe's leading online fashion platform with ∼20M active customers, offers ∼200k product choices for sale. Being physical goods rather than digital information, fashion articles must be stocked in warehouses; as most of them are rarely ordered, items are generally available in small, fluctuating numbers. In addition, shoppers commonly return articles. The result is a rapid turnover of the inventory, with many items going in and out of stock daily. Superimposed on these short-scale variations are periodic alterations associated with the seasonal cycle, and secular changes caused by fashion trends. Regarding consumer behavior, a noteworthy difference to, e.g., streaming media services is customers' propensity to buy rarely (a few sales annually), but then multiple items at once. Hence, their purchase histories are sparse, only partially ordered sequences.

We previously introduced a recommendation algorithm for fashion items that combines article images, tags, and other catalog information with customer response, tethering curated content to collaborative filtering by minimizing the cross-entropy loss of a deep neural network for the sales record across a large selection of customers [1]. Like logistic matrix factorization methods [7, 9], our technique yields low-dimensional embeddings for articles ("Fashion DNA") and customers ("style vectors"), but has the advantage of circumventing the cold-start problem that plagues collaborative methods by injecting catalog information for newly added articles. Our model proves capable of recognizing individual style preferences from a modest number of purchases; as cumulative sales events extend over a multi-year period, however, it creates only a static style "fingerprint" of a customer.

In this contribution, we start from the static model, but extend it by including time-of-sale information. To contend with the ever-varying article stock, we use the static model to generate Fashion DNA from curated article data, and employ it as a fixed item descriptor.
This allows us to focus on the temporal sequence of sales events for individual customers, which we feed into a neural network to estimate their style vectors. As these are updated with every purchase, the approach models the evolution of our customers' tastes, and we may employ the style vectors at a given date to create a personalized preference ranking of the articles then in store, in a way fully analogous to the static model. Recurrent neural networks (RNN) are specifically designed to handle sequential data (see Chapter 10 in Ref. [3] for an overview). Our network, introduced in Section 2, employs long short-term memory (LSTM) cells [6] to learn temporal correlations between sales. As the model shares network weights between customers, it has comparatively few parameters, and easily scales to millions of clients during inference.

Recently, evaluations have appeared in the literature [2, 8, 10] that indicate the superiority of RNN-based recommender systems over static models on standard data sets (LastFM, Netflix). Comparing the dynamic customer style model with predictions from the static counterpart [1], and a baseline model built on global customer preferences, we confirm that fashion recommendation benefits from temporal information (Section 3). However, we also find that peculiarities innate to the fashion context, like the prevalence of partially ordered purchase sequences and the variability of in-store content, are prone to impact recommendation quality; care must be taken in designing RNN architecture, training, and evaluation schemes to accommodate them. Further avenues for research are discussed in Section 4.

2 A DYNAMIC RECOMMENDER SYSTEM
We now lay out the elements of our proposed model: the data used for training and validation, the static network learning the article embeddings (Fashion DNA), the recurrent network responsible for predicting the customer response, and the training scheme.

2.1 Data overview
This study is based on article and sales data from Zalando's online fashion store, collected from its start in 2008, up to a cutoff date of July 1, 2015. The data set contains information about ∼1M fashion items and millions of individual sales events (excluding customer returns). Merchandise is characterized by a thumbnail image of each item (size 108×156), categorical data (brand, color, gender, etc.) that has been rolled out into ∼7k one-hot encoded "tags," and, as numerical data, the logarithm of the manufacturer-suggested retail price and, for garments only, the fabric composition across ∼50 fibers as percentages. Each sales record contains a unique, anonymized customer ID, the article bought (disregarding size information), and the time of sale, with one-minute granularity. Customer data is limited to sales; in particular, article ratings were not available.
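For concreteness, the sketch below shows one plausible in-memory layout for the catalog and sales records just described; all field names are our own illustration, not an actual Zalando schema.

```python
# Hypothetical record layout mirroring the data described in Section 2.1.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import numpy as np

@dataclass
class Article:
    article_id: str
    image: np.ndarray             # RGB thumbnail, 108 x 156 pixels
    tags: np.ndarray              # ~7k one-hot encoded categorical "tags"
    log_price: float              # log of manufacturer-suggested retail price
    fabric: Optional[np.ndarray]  # percentages across ~50 fibers (garments only)

@dataclass
class SaleEvent:
    customer_id: str              # unique, anonymized
    article_id: str               # size information disregarded
    timestamp: datetime           # one-minute granularity
```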
2.2 Fashion DNA
Our first task is to encode the properties of the articles in a dense numerical representation. As the curated data has multiple formats and carries diverse information, a natural vehicle for this transformation is a deep neural network that learns suitable combinations of features on its own. We discussed such a model at length in an earlier paper [1], and we will only give an overview here.

The representation of an article ν, its "Fashion DNA" vector fν, is obtained as the activation in a low-dimensional "bottleneck" layer near the top of the network. At its base, the network receives the catalog information as its input: RGB image data is first processed with a pretrained residual neural network [4], whose output is concatenated with the categorical and numerical article data and further transformed with a stack of fully connected layers, resulting in Fashion DNA. As we are ultimately interested in customer preferences, it is sensible to train the model on the sales record: Disregarding the timestamp information, we arrange the sales information for a large number of frequent customers (∼100k) into a sparse binary purchase matrix Π whose elements Πνk ∈ {0, 1} indicate whether customer k has bought item ν. The network is then trained to minimize the average cross-entropy loss per article over these customers. In effect, the network learns both an optimal representation of the article fν across the customer base, and a logistic regression from Fashion DNA to the sales record for each customer k, with weight vectors sk and bias βk that encode their style preferences and purchase propensity, respectively. The model architecture is sketched in Figure 1.

[Figure 1 diagram: a DNN fDNA (weights Θ) maps article data to Fashion DNA fν; the scalar product with the customer style sk, a sigmoid, and the cross-entropy loss against the purchase indicator Πνk yield the forecast pνk.]
Figure 1: Training the Fashion DNA network. Backpropagation of the loss (blue arrows) simultaneously improves the static customer style vectors sk, and the network weights Θ.

The result is a low-rank logistic factorization of the purchase matrix akin to collaborative filtering [7, 9],

    Πνk ≈ pνk = σ(fν · sk + βk),    (1)

(where σ(·) denotes the logistic function), except that the Fashion DNA fν is now clamped to the catalog data via the encoding neural network. This is a decisive advantage for our setting, where we are faced with a continuously changing inventory of goods, as the Fashion DNA for new articles is obtained from their curated data by a simple forward pass through the neural network.

Ranking the purchase probabilities pνk in Eq. (1) naturally induces recommendations [1], a model we use for comparison in Section 3.2. We emphasize that the lack of time-of-sale information enforces static customer styles. Hence, to invoke dynamically evolving customer tastes, we have to modify the style vectors sk.
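As a minimal sketch (ours, not the production implementation), the scoring rule of Eq. (1) can be evaluated for all articles at once, assuming the Fashion DNA matrix and a customer's style vector and bias are given; ranking then amounts to sorting these scores.

```python
# Minimal NumPy sketch of the static model's scoring rule, Eq. (1).
# f: (num_articles, d) Fashion DNA matrix; s_k: (d,) style vector; beta_k: scalar bias.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def purchase_probabilities(f: np.ndarray, s_k: np.ndarray, beta_k: float) -> np.ndarray:
    """p_{nu,k} = sigma(f_nu . s_k + beta_k), computed for all articles at once."""
    return sigmoid(f @ s_k + beta_k)

def static_ranking(f: np.ndarray, s_k: np.ndarray, beta_k: float) -> np.ndarray:
    """Article indices sorted by decreasing purchase probability."""
    return np.argsort(-purchase_probabilities(f, s_k, beta_k))
```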
2.3 LSTM network for purchase sequences
Fashion DNA provides a compact encoding of all available content information on an item, and largely solves the cold-start problem for new articles entering the store. For these reasons, we use the Fashion DNA of the static model as article representation in the dynamic model. We also want to preserve the association between customer-item affinity and the scalar product of Fashion DNA and customer style, akin to Eq. (1). Hence, we make our model dynamic by allowing the customer style to change over time t. To distinguish between static and dynamic customer styles, we denote the latter dk(t).

While we could add time as a dimension to the static model, and attempt to factorize the resulting three-dimensional purchase data tensor (as is done, for example, in [11]), we chose to follow a different approach featuring LSTM cells. We also reverse the roles of articles and customers: While our implementation of the static model used batches of articles as input, and learned the response of all customers simultaneously, the input to the LSTM network is customer-based. Batches now contain Fashion DNA sequences of the form (fk,1, ..., fk,Nk), representing the purchase history νk,1, ..., νk,Nk of customer k. When customers buy multiple items at once, the purchase sequence is ambiguous. To prevent the LSTM from interpreting these non-sequential parts as time series, we put purchases with the same time stamp in random order. Beyond the order sequence, the absolute times of purchase tk,1, ..., tk,Nk carry important context information for our problem. For example, the model may use temporal data to infer the in-store availability of an article, and the season. We thus additionally supply the time stamp of each purchase to the network.

A single pass of the LSTM network processing customer purchase histories is illustrated in Figure 2. For a fixed customer k and purchase number i, the LSTM takes as input the concatenation of the time stamp tk,i−1 and Fashion DNA fk,i−1 of the previous purchase, and the time stamp tk,i of the current purchase. In addition, the LSTM accesses the content of its own memory, mk,i−1, which stores information on the purchase history of customer k it has seen so far. The output of the LSTM is projected by a fully connected layer, which results in the current customer style dk,i. Note that the first purchase of the sequence (i = 1) is treated specially: Since there is no previous purchase, we flush fk,0, tk,0, and mk,0 with zero entries. Consequently, the customer style dk,1 depends only on the time stamp tk,1, and favors the most popular items at that time.

[Figure 2 diagram: the LSTM (weights Ψ), followed by a fully connected layer FC (weights Ω), maps the input (tk,i−1, fk,i−1, tk,i) and memory mk,i−1 to the style dk,i, which is scored against fk,i and the negatives f̃k,i,1, ..., f̃k,i,n in the loss.]
Figure 2: Training the dynamical model. The shown time instance of the LSTM communicates with earlier instances via the memory cells mk,i−1 and mk,i. They trigger backpropagation through time (blue arrows).
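The sketch below is our reading of this input construction, with hypothetical helper names: same-timestamp purchases are shuffled, the first step is zero-flushed, and step i concatenates (tk,i−1, fk,i−1, tk,i); timestamps are assumed to be pre-converted to floats (e.g., epoch minutes).

```python
# Illustrative per-customer input construction for the LSTM (Section 2.3).
# `history` is a chronologically sorted list of (timestamp, fashion_dna) pairs.
import random
import numpy as np

def build_lstm_inputs(history, dna_dim):
    # Group purchases sharing a time stamp and shuffle within each group,
    # so the LSTM cannot mistake the arbitrary within-order sequence for time.
    groups, current = [], []
    for t, f in history:
        if current and t != current[0][0]:
            groups.append(current)
            current = []
        current.append((t, f))
    if current:
        groups.append(current)
    for g in groups:
        random.shuffle(g)
    events = [e for g in groups for e in g]

    # Input at step i is (t_{i-1}, f_{i-1}, t_i); step i = 1 is flushed with zeros.
    # The training target for step i is the current purchase, events[i].
    inputs = []
    prev_t, prev_f = 0.0, np.zeros(dna_dim)
    for t, f in events:
        inputs.append(np.concatenate([[prev_t], prev_f, [t]]))
        prev_t, prev_f = t, f
    return np.stack(inputs), events
```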
2.4 Training scheme
For recommendation, we aim to predict customer style vectors dk,i that maximize the affinity fk,i · dk,i to the next-bought article, while minimizing the affinity to all other items in store at that time. Because it is expensive to compute the customer affinities for every article, we only pick a small sample of "negative" examples among the articles not bought. We denote their corresponding Fashion DNA vectors by f̃k,i,1, ..., f̃k,i,n. The number of negative examples n > 0 is a hyperparameter of the model.

We tested three choices of loss function for training the network: sigmoid cross-entropy loss Lσ (as in the static model), softmax loss Lsmax, and sigmoid-rank loss Lrank [12], and varied the number n of negative examples. The loss functions are given by:

    Lσ = −log σ(fk,i · dk,i) − Σj=1..n log σ(−f̃k,i,j · dk,i),

    Lsmax = −log [ exp(fk,i · dk,i) / ( exp(fk,i · dk,i) + Σj=1..n exp(f̃k,i,j · dk,i) ) ],    (2)

    Lrank = (1/n) Σj=1..n σ( f̃k,i,j · dk,i − fk,i · dk,i ).

Only Lsmax permits a probabilistic interpretation of the dynamical model (when n reaches the number of all available articles).
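A direct transcription of Eq. (2) into PyTorch might look as follows; this is a sketch under our reading, assuming the positive affinity fk,i · dk,i and the n negative affinities per step are precomputed.

```python
# Sketch of the three candidate losses of Eq. (2) in PyTorch.
# `pos`: tensor (batch,) with affinities f_{k,i} . d_{k,i};
# `neg`: tensor (batch, n) with affinities f~_{k,i,j} . d_{k,i}.
# Each function returns one loss value per sequence step; in practice
# these would be averaged over the batch before backpropagation.
import torch
import torch.nn.functional as F

def sigmoid_cross_entropy_loss(pos, neg):
    # L_sigma = -log sigma(pos) - sum_j log sigma(-neg_j)
    return -F.logsigmoid(pos) - F.logsigmoid(-neg).sum(dim=1)

def softmax_loss(pos, neg):
    # L_smax = -log [ exp(pos) / (exp(pos) + sum_j exp(neg_j)) ]
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    return -F.log_softmax(logits, dim=1)[:, 0]

def sigmoid_rank_loss(pos, neg):
    # L_rank = (1/n) sum_j sigma(neg_j - pos)
    return torch.sigmoid(neg - pos.unsqueeze(1)).mean(dim=1)
```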
The minimization landscape for Lσ and Lsmax depends on the number of negative examples, as their contribution to the loss increases with n. Our experiments show that recommendation quality improves when we use more negative examples; yet, no significant additional benefit is observed once n exceeds 50. In contrast, n has no effect on the minimization landscape for the sigmoid-rank loss. Still, for larger n, fewer training epochs are needed to adjust the network parameters. We find that n = 20 is a good tradeoff between faster convergence of the weights, and the computational cost caused by using more negative examples.

A subtle yet important aspect of the recommendation problem is that we try to predict items in the next order of the customer, rather than inferring articles within a single order. As items that are bought together tend to be related (consider, e.g., a swimwear top and bottom), an LSTM network trained on full purchase sequences quickly focuses on multiple orders and overfits. To circumvent the problem, we let only the first article in the purchase sequence contribute to the loss when a multiple order is encountered. (Because purchases with the same time stamp are always shuffled before feeding, the LSTM receives a variety of article sequences during training.)

2.5 Inference and ranking
For each customer k, we now define an "intent-of-purchase" ipν,k(t) for all articles ν in store at time t, akin to Eq. (1):

    ipν,k(t) = fν · dk(t).    (3)

Here, dk(t) is the dynamic style vector emitted by the LSTM network after feeding in all sales to customer k that occurred before the time t (with randomly assigned sequence for items purchased together); for the final sale, we replace the time stamp of the next purchase by the evaluation time t. We note that ipν,k(t), unlike pνk (1), cannot be interpreted as a likelihood of sale.

3 COMPARISON OF MODELS
To evaluate our dynamic customer model, we assembled sales data from the online fashion store for an eight-day period immediately following training, July 1–8, 2015. We identified customers with orders during this test interval, representing ∼10^5 individual sales, among ∼190k items that were available for purchase in at least one size, for at least one day in this period. For comparison, we also score the static recommendation model (Section 2.2), and a simple empirical baseline that disregards customer specifics.

3.1 Empirical baseline
Fashion articles in the Zalando catalog vary greatly in popularity, with few articles representing most of the sales. This skewed distribution enables a simple, non-personalized baseline recommender that projects the recent popularity of items into the future. In detail, we accumulated article sales for the week immediately preceding the evaluation interval (June 23–30, 2015), and defined a popularity score for each article by its sales count if it was still available after July 1. For those articles (re-)entering inventory during the evaluation period, we assigned the average number of sales among all articles as a preliminary score. The empirical baseline model then ranks the articles by descending popularity score.

3.2 Static Fashion DNA model
The Fashion DNA network (Section 2.2) provides the basis for a more sophisticated, personalized recommender system, based on the static customer style vectors sk and the predicted probability of purchase pνk (1), as detailed in Ref. [1]. Indeed, pνk proves to be an unbiased estimate of the probability of purchase over the lifetime of customer and article. These assumptions are not met here, because the evaluation interval is outside the training period, and lasts only eight days. Still, we may assume that the inner products fν · sk underlying Eq. (1) are a measure of the affinity of an individual customer k to the in-store items {ν}(t) during the time of evaluation, and sort them by decreasing value to create a static article ranking.

3.3 Dynamic recommender system
For the dynamic customer model, we rank the in-store articles for each customer k according to their intent-of-purchase ipν,k(tk), see (3), evaluated at the time of first sale tk during the evaluation period. We experimented with the three loss models detailed in Section 2.4, and found comparable results for the sigmoid cross-entropy loss Lσ and the sigmoid-rank loss Lrank, while the softmax loss Lsmax performed significantly worse. The following results are based on a pretrained 128-float Fashion DNA and an LSTM implementation with 256 cells, sigmoid-rank loss, and n = 20 negative examples. Note that 1 − Lrank provides a smooth approximation to the area under the ROC curve [5], used for model evaluation below.

3.4 Results
To compare model performance, we compile recommendation rankings of the z ≈ 190k items in store for each customer (for the baseline, the ranking is shared among customers), and identify the positions rνk of the articles {ν}(k) purchased by customer k during evaluation. We then determine the cumulative distribution of ranks,

    Rj = Σk Σν∈{ν}(k) H(j − rνk),    (4)

where H(·) denotes the Heaviside step function. The normalized cumulative rank Rj/Rz interpolates among customers and serves as a collective receiver operating characteristic (ROC) of the recommender schemes (Figure 3). The inset displays a double-logarithmic detail of the origin region, representing high-quality recommendations.

[Figure 3 plot: cumulative distribution of rankings, fraction of purchases vs. position in ranking, with a log-log detail of the origin region as an inset.]
Figure 3: ROC curves for the dynamic (blue), static (green), and empirical baseline (red) recommender schemes.
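Given the pooled ranking positions of purchased articles, the cumulative rank distribution of Eq. (4) and the resulting AUC can be computed as in this sketch (our illustration; `positions` and `z` are assumed inputs).

```python
# Sketch: cumulative rank distribution R_j of Eq. (4) and the AUC of the
# collective ROC curve. `positions` holds the 1-based ranking positions
# r_{nu,k} of all purchased articles, pooled over customers; `z` is the
# number of items in store.
import numpy as np

def cumulative_rank_distribution(positions, z):
    """R_j: number of purchases ranked at position <= j, for j = 0 .. z."""
    counts = np.bincount(np.asarray(positions), minlength=z + 1)
    return np.cumsum(counts)

def auc_from_ranks(positions, z):
    """Riemann-sum approximation of the area under the curve R_j / R_z."""
    R = cumulative_rank_distribution(positions, z)
    roc = R / R[-1]   # normalized cumulative rank, R_j / R_z
    return roc.mean()
```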
Table 1 lists the area under the curves (AUC) as a global performance measure, together with quantiles of the distributions Rj. We find that our dynamic model outperforms the static model throughout, and both models are superior to the baseline popularity model, except for the leading ∼10 recommendations, representing less than 0.5% of the purchases (inset in Figure 3). The table also lists the number of model parameters. Weights are shared among customers for the LSTM network, but not for the static model, resulting in a reduction of complexity by orders of magnitude.

Table 1: Model comparison. AUC and required number of recommendations to cover 10% (50%, 90%) of purchases.

    model      AUC      10%      50%       90%        #params
    baseline   80.2%    1,200    19,500    105,000    –
    static     85.2%    600      13,500    80,000     ∼10^8
    dynamic    88.5%    400      9,300     63,000     <10^6

More than 3% of the purchased articles from the test interval had not been sold before and, hence, were completely ignored during training. For those new articles, the cold-start problem applies, and the AUC of the baseline, static, and dynamic models decreases to 64.4%, 83.3%, and 87.7%, respectively. Compared to the numbers displayed in Table 1, the baseline shows a drastic performance drop, as would also be expected from any other recommender system solely based on collaborative filtering. The static and dynamic models, however, circumvent this problem thanks to Fashion DNA.

4 OUTLOOK
We find that a personalized recommendation model, based on a recurrent network, outperforms a static customer model in the fashion context. By encoding temporal awareness into the LSTM memory of the network, the dynamic model can infer the seasonality of items, and also record when certain articles are trending: a distinct advantage over the static model, which is limited to learning only long-term customer style preferences.

An important element currently missing in the recommendation model is short-term customer intent. In the fashion setting, goods for sale belong to varied classes (clothes, shoes, accessories, etc.), and shoppers, irrespective of their style profile, often have a particular category in mind during a session. These implicit interests strongly influence item preference but, due to their transient nature, are hard to infer from the purchase record. Complementary data sources like search queries, or the sequence of items viewed online, will pick up the relevant signals instead. Models that successfully integrate long-term style evolution and short-term customer intent promise to greatly enhance recommendation quality and relevance, and we plan to investigate them in future studies.
REFERENCES
[1] C. Bracher, S. Heinz, and R. Vollgraf. Fashion DNA: Merging content and sales data for recommendation and article mapping. In Workshop Machine Learning Meets Fashion, KDD, 2016.
[2] R. Devooght and H. Bersini. Long and short-term recommendations with recurrent neural networks. Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization (2017), pp. 13–21.
[3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press (Cambridge, Mass., USA), 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015).
[5] A. Herschtal and B. Raskutti. Optimising area under the ROC curve using gradient descent. ICML: Conference Proceedings (2004), p. 49.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput. 9 (1997), pp. 1735–1780.
[7] C. Johnson. Logistic matrix factorization for implicit feedback data. In NIPS Workshop on Distributed Matrix Computations, 2014.
[8] Y.-J. Ko, L. Maystre, and M. Grossglauser. Collaborative recurrent neural networks for dynamic recommender systems. JMLR: Workshop and Conference Proceedings 63 (2016), pp. 366–381.
[9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer 42 (2009), pp. 30–37.
[10] H. Wang, X. Shi, and D. Yeung. Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks. Advances in Neural Information Processing Systems 29 (2016), pp. 415–423.
[11] L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. Proceedings of the 2010 SIAM International Conference on Data Mining (2010), pp. 211–222.
[12] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing classifier performance via approximation to the Wilcoxon–Mann–Whitney statistic. ICML: Conference Proceedings (2003), pp. 848–855.