An LSTM-Based Dynamic Customer Model for Fashion Recommendation (Short Paper)

Sebastian Heinz, Christian Bracher, Roland Vollgraf
Zalando Research, Germany
sebastian.heinz@zalando.de | christian.bracher@zalando.de | roland.vollgraf@zalando.de

ABSTRACT
Online fashion sales present a challenging use case for personalized recommendation: Stores offer a huge variety of items in multiple sizes. Small stocks, high return rates, seasonality, and changing trends cause continuous turnover of articles for sale on all time scales. Customers tend to shop rarely, but often buy multiple items at once. We report on backtest experiments with sales data of 100k frequent shoppers at Zalando, Europe's leading online fashion platform. To model changing customer and store environments, our recommendation method employs a pair of neural networks: To overcome the cold-start problem, a feedforward network generates article embeddings in "fashion space," which serve as input to a recurrent neural network that predicts a style vector in this space for each client, based on their past purchase sequence. We compare our results with a static collaborative filtering approach, and a popularity ranking baseline.

CCS CONCEPTS
• Information systems → Recommender systems; Content analysis and feature selection; • Human-centered computing → Collaborative filtering; • Computing methodologies → Neural networks;

KEYWORDS
Recommendation, collaborative filtering, recurrent neural network

ACM Reference Format:
Sebastian Heinz, Christian Bracher, and Roland Vollgraf. 2017. An LSTM-Based Dynamic Customer Model for Fashion Recommendation. In Proceedings of Workshop on Temporal Reasoning in Recommender Systems, Como, Italy, 31st August 2017 (Temporal Reasoning), 5 pages.

* Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.

1 INTRODUCTION
The recommendation task in the setting of online fashion sales presents unique challenges. Consumer tastes and body shapes are idiosyncratic, so a huge selection of items in different sizes must be kept on offer. On a typical day, Zalando, Europe's leading online fashion platform with ∼20M active customers, offers ∼200k product choices for sale. Being physical goods rather than digital information, fashion articles must be stocked in warehouses; as most of them are rarely ordered, items are generally available in small, fluctuating numbers. In addition, shoppers commonly return articles. The result is a rapid turnover of the inventory, with many items going in and out of stock daily. Superimposed on these short-scale variations are periodic alterations associated with the seasonal cycle, and secular changes caused by fashion trends. Regarding consumer behavior, a noteworthy difference to, e.g., streaming media services is customers' propensity to buy rarely (a few sales annually), but then multiple items at once. Hence, their purchase histories are sparse, only partially ordered sequences.

We previously introduced a recommendation algorithm for fashion items that combines article images, tags, and other catalog information with customer response, tethering curated content to collaborative filtering by minimizing the cross-entropy loss of a deep neural network for the sales record across a large selection of customers [1]. Like logistic matrix factorization methods [7, 9], our technique yields low-dimensional embeddings for articles ("Fashion DNA") and customers ("style vectors"), but has the advantage of circumventing the cold-start problem that plagues collaborative methods by injecting catalog information for newly added articles. Our model proves capable of recognizing individual style preferences from a modest number of purchases; as cumulative sales events extend over a multi-year period, however, it creates only a static style "fingerprint" of a customer.

In this contribution, we start from the static model, but extend it by including time-of-sale information. To contend with the ever-varying article stock, we use the static model to generate Fashion DNA from curated article data, and employ it as a fixed item descriptor.
This allows us to focus on the temporal sequence of sales events for individual customers, which we feed into a neural network to estimate their style vectors. As these are updated with every purchase, the approach models the evolution of our customers' tastes, and we may employ the style vectors at a given date to create a personalized preference ranking of the articles then in store, in a way fully analogous to the static model. Recurrent neural networks (RNN) are specifically designed to handle sequential data (see Chapter 10 in Ref. [3] for an overview). Our network, introduced in Section 2, employs long short-term memory (LSTM) cells [6] to learn temporal correlations between sales. As the model shares network weights between customers, it has comparatively few parameters, and easily scales to millions of clients during inference.

Recently, evaluations have appeared in the literature [2, 8, 10] that indicate the superiority of RNN-based recommender systems over static models on standard data sets (LastFM, Netflix). Comparing the dynamic customer style model with predictions from the static counterpart [1], and a baseline model built on global customer preferences, we confirm that fashion recommendation benefits from temporal information (Section 3). However, we also find that peculiarities innate to the fashion context, like the prevalence of partially ordered purchase sequences and the variability of in-store content, are prone to impact recommendation quality; care must be taken in designing RNN architecture, training, and evaluation schemes to accommodate them. Further avenues for research are discussed in Section 4.

2 A DYNAMIC RECOMMENDER SYSTEM
We now lay out the elements of our proposed model: the data used for training and validation, the static network learning the article embeddings (Fashion DNA), the recurrent network responsible for predicting the customer response, and the training scheme.

2.1 Data overview
This study is based on article and sales data from Zalando's online fashion store, collected from its start in 2008, up to a cutoff date of July 1, 2015. The data set contains information about ∼1M fashion items and millions of individual sales events (excluding customer returns). Merchandise is characterized by a thumbnail image of each item (size 108×156), categorical data (brand, color, gender, etc.) that has been rolled out into ∼7k one-hot encoded "tags," and, as numerical data, the logarithm of the manufacturer-suggested retail price and, for garments only, the fabric composition across ∼50 fibers as percentages. Each sales record contains a unique, anonymized customer ID, the article bought (disregarding size information), and the time of sale, with one-minute granularity. Customer data is limited to sales; in particular, article ratings were not available.
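For concreteness, the sketch below shows one plausible in-memory layout for the catalog and sales records just described; all field names are our own illustration, not an actual Zalando schema.

```python
# Hypothetical record layout mirroring the data described in Section 2.1.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import numpy as np

@dataclass
class Article:
    article_id: str
    image: np.ndarray             # RGB thumbnail, 108 x 156 pixels
    tags: np.ndarray              # ~7k one-hot encoded categorical "tags"
    log_price: float              # log of manufacturer-suggested retail price
    fabric: Optional[np.ndarray]  # percentages across ~50 fibers (garments only)

@dataclass
class SaleEvent:
    customer_id: str              # unique, anonymized
    article_id: str               # size information disregarded
    timestamp: datetime           # one-minute granularity
```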
2.2 Fashion DNA
Our first task is to encode the properties of the articles in a dense numerical representation. As the curated data has multiple formats and carries diverse information, a natural vehicle for this transformation is a deep neural network that learns suitable combinations of features on its own. We discussed such a model at length in an earlier paper [1], and we will only give an overview here.

The representation of an article ν, its "Fashion DNA" vector fν, is obtained as the activation in a low-dimensional "bottleneck" layer near the top of the network. At its base, the network receives the catalog information as its input: RGB image data is first processed with a pretrained residual neural network [4], whose output is concatenated with the categorical and numerical article data and further transformed with a stack of fully connected layers, resulting in Fashion DNA. As we are ultimately interested in customer preferences, it is sensible to train the model on the sales record: Disregarding the timestamp information, we arrange the sales information for a large number of frequent customers (∼100k) into a sparse binary purchase matrix Π whose elements Πνk ∈ {0, 1} indicate whether customer k has bought item ν. The network is then trained to minimize the average cross-entropy loss per article over these customers. In effect, the network learns both an optimal representation of the article fν across the customer base, and a logistic regression from Fashion DNA to the sales record for each customer k, with weight vectors sk and bias βk that encode their style preferences and purchase propensity, respectively. The model architecture is sketched in Figure 1.

[Figure 1 diagram: a DNN fDNA (weights Θ) maps article data to Fashion DNA fν; the scalar product with the customer style sk, a sigmoid, and the cross-entropy loss against the purchase indicator Πνk yield the forecast pνk.]
Figure 1: Training the Fashion DNA network. Backpropagation of the loss (blue arrows) simultaneously improves the static customer style vectors sk, and the network weights Θ.

The result is a low-rank logistic factorization of the purchase matrix akin to collaborative filtering [7, 9],

    Πνk ≈ pνk = σ(fν · sk + βk),    (1)

(where σ(·) denotes the logistic function), except that the Fashion DNA fν is now clamped to the catalog data via the encoding neural network. This is a decisive advantage for our setting, where we are faced with a continuously changing inventory of goods, as the Fashion DNA for new articles is obtained from their curated data by a simple forward pass through the neural network.

Ranking the purchase probabilities pνk in Eq. (1) naturally induces recommendations [1], a model we use for comparison in Section 3.2. We emphasize that the lack of time-of-sale information enforces static customer styles. Hence, to invoke dynamically evolving customer tastes, we have to modify the style vectors sk.
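As a minimal sketch (ours, not the production implementation), the scoring rule of Eq. (1) can be evaluated for all articles at once, assuming the Fashion DNA matrix and a customer's style vector and bias are given; ranking then amounts to sorting these scores.

```python
# Minimal NumPy sketch of the static model's scoring rule, Eq. (1).
# f: (num_articles, d) Fashion DNA matrix; s_k: (d,) style vector; beta_k: scalar bias.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def purchase_probabilities(f: np.ndarray, s_k: np.ndarray, beta_k: float) -> np.ndarray:
    """p_{nu,k} = sigma(f_nu . s_k + beta_k), computed for all articles at once."""
    return sigmoid(f @ s_k + beta_k)

def static_ranking(f: np.ndarray, s_k: np.ndarray, beta_k: float) -> np.ndarray:
    """Article indices sorted by decreasing purchase probability."""
    return np.argsort(-purchase_probabilities(f, s_k, beta_k))
```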
2.3 LSTM network for purchase sequences
Fashion DNA provides a compact encoding of all available content information on an item, and largely solves the cold-start problem for new articles entering the store. For these reasons, we use the Fashion DNA of the static model as article representation in the dynamic model. We also want to preserve the association between customer-item affinity and the scalar product of Fashion DNA and customer style, akin to Eq. (1). Hence, we make our model dynamic by allowing the customer style to change over time t. To distinguish between static and dynamic customer styles, we denote the latter dk(t).

While we could add time as a dimension to the static model, and attempt to factorize the resulting three-dimensional purchase data tensor (as is done, for example, in [11]), we chose to follow a different approach featuring LSTM cells. We also reverse the roles of articles and customers: While our implementation of the static model used batches of articles as input, and learned the response of all customers simultaneously, the input to the LSTM network is customer-based. Batches now contain Fashion DNA sequences of the form (fk,1, ..., fk,Nk), representing the purchase history νk,1, ..., νk,Nk of customer k. When customers buy multiple items at once, the purchase sequence is ambiguous. To prevent the LSTM from interpreting these non-sequential parts as time series, we put purchases with the same time stamp in random order. Beyond the order sequence, the absolute times of purchase tk,1, ..., tk,Nk carry important context information for our problem. For example, the model may use temporal data to infer the in-store availability of an article, and the season. We thus additionally supply the time stamp of each purchase to the network.

A single pass of the LSTM network processing customer purchase histories is illustrated in Figure 2. For a fixed customer k and purchase number i, the LSTM takes as input the concatenation of the time stamp tk,i−1 and Fashion DNA fk,i−1 of the previous purchase, and the time stamp tk,i of the current purchase. In addition, the LSTM accesses the content of its own memory, mk,i−1, which stores information on the purchase history of customer k it has seen so far. The output of the LSTM is projected by a fully connected layer, which results in the current customer style dk,i. Note that the first purchase of the sequence (i = 1) is treated specially: Since there is no previous purchase, we flush fk,0, tk,0, and mk,0 with zero entries. Consequently, the customer style dk,1 depends only on the time stamp tk,1, and favors the most popular items at that time.

[Figure 2 diagram: the LSTM (weights Ψ), followed by a fully connected layer FC (weights Ω), maps the input (tk,i−1, fk,i−1, tk,i) and memory mk,i−1 to the style dk,i, which is scored against fk,i and the negatives f̃k,i,1, ..., f̃k,i,n in the loss.]
Figure 2: Training the dynamical model. The shown time instance of the LSTM communicates with earlier instances via the memory cells mk,i−1 and mk,i. They trigger backpropagation through time (blue arrows).
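The sketch below is our reading of this input construction, with hypothetical helper names: same-timestamp purchases are shuffled, the first step is zero-flushed, and step i concatenates (tk,i−1, fk,i−1, tk,i); timestamps are assumed to be pre-converted to floats (e.g., epoch minutes).

```python
# Illustrative per-customer input construction for the LSTM (Section 2.3).
# `history` is a chronologically sorted list of (timestamp, fashion_dna) pairs.
import random
import numpy as np

def build_lstm_inputs(history, dna_dim):
    # Group purchases sharing a time stamp and shuffle within each group,
    # so the LSTM cannot mistake the arbitrary within-order sequence for time.
    groups, current = [], []
    for t, f in history:
        if current and t != current[0][0]:
            groups.append(current)
            current = []
        current.append((t, f))
    if current:
        groups.append(current)
    for g in groups:
        random.shuffle(g)
    events = [e for g in groups for e in g]

    # Input at step i is (t_{i-1}, f_{i-1}, t_i); step i = 1 is flushed with zeros.
    # The training target for step i is the current purchase, events[i].
    inputs = []
    prev_t, prev_f = 0.0, np.zeros(dna_dim)
    for t, f in events:
        inputs.append(np.concatenate([[prev_t], prev_f, [t]]))
        prev_t, prev_f = t, f
    return np.stack(inputs), events
```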
2.4 Training scheme
For recommendation, we aim to predict customer style vectors dk,i that maximize the affinity fk,i · dk,i to the next-bought article, while minimizing the affinity to all other items in store at that time. Because it is expensive to compute the customer affinities for every article, we only pick a small sample of "negative" examples among the articles not bought. We denote their corresponding Fashion DNA vectors by f̃k,i,1, ..., f̃k,i,n. The number of negative examples n > 0 is a hyperparameter of the model.

We tested three choices of loss function for training the network: sigmoid cross-entropy loss Lσ (as in the static model), softmax loss Lsmax, and sigmoid-rank loss Lrank [12], and varied the number n of negative examples. The loss functions are given by:

    Lσ = −log σ(fk,i · dk,i) − Σj=1..n log σ(−f̃k,i,j · dk,i),

    Lsmax = −log [ exp(fk,i · dk,i) / ( exp(fk,i · dk,i) + Σj=1..n exp(f̃k,i,j · dk,i) ) ],    (2)

    Lrank = (1/n) Σj=1..n σ( f̃k,i,j · dk,i − fk,i · dk,i ).

Only Lsmax permits a probabilistic interpretation of the dynamical model (when n reaches the number of all available articles).
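A direct transcription of Eq. (2) into PyTorch might look as follows; this is a sketch under our reading, assuming the positive affinity fk,i · dk,i and the n negative affinities per step are precomputed.

```python
# Sketch of the three candidate losses of Eq. (2) in PyTorch.
# `pos`: tensor (batch,) with affinities f_{k,i} . d_{k,i};
# `neg`: tensor (batch, n) with affinities f~_{k,i,j} . d_{k,i}.
# Each function returns one loss value per sequence step; in practice
# these would be averaged over the batch before backpropagation.
import torch
import torch.nn.functional as F

def sigmoid_cross_entropy_loss(pos, neg):
    # L_sigma = -log sigma(pos) - sum_j log sigma(-neg_j)
    return -F.logsigmoid(pos) - F.logsigmoid(-neg).sum(dim=1)

def softmax_loss(pos, neg):
    # L_smax = -log [ exp(pos) / (exp(pos) + sum_j exp(neg_j)) ]
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    return -F.log_softmax(logits, dim=1)[:, 0]

def sigmoid_rank_loss(pos, neg):
    # L_rank = (1/n) sum_j sigma(neg_j - pos)
    return torch.sigmoid(neg - pos.unsqueeze(1)).mean(dim=1)
```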
The minimization landscape for Lσ and Lsmax depends on the number of negative examples, as their contribution to the loss increases with n. Our experiments show that recommendation quality improves when we use more negative examples; yet, no significant additional benefit is observed once n exceeds 50. In contrast, n has no effect on the minimization landscape for the sigmoid-rank loss. Still, for larger n, fewer training epochs are needed to adjust the network parameters. We find that n = 20 is a good tradeoff between faster convergence of the weights, and the computational cost caused by using more negative examples.

A subtle yet important aspect of the recommendation problem is that we try to predict items in the next order of the customer, rather than inferring articles within a single order. As items that are bought together tend to be related (consider, e.g., a swimwear top and bottom), an LSTM network trained on full purchase sequences quickly focuses on multiple orders and overfits. To circumvent the problem, we let only the first article in the purchase sequence contribute to the loss when a multiple order is encountered. (Because purchases with the same time stamp are always shuffled before feeding, the LSTM receives a variety of article sequences during training.)

2.5 Inference and ranking
For each customer k, we now define an "intent-of-purchase" ipν,k(t) for all articles ν in store at time t, akin to Eq. (1):

    ipν,k(t) = fν · dk(t).    (3)

Here, dk(t) is the dynamic style vector emitted by the LSTM network after feeding in all sales to customer k that occurred before the time t (with randomly assigned sequence for items purchased together); for the final sale, we replace the time stamp of the next purchase by the evaluation time t. We note that ipν,k(t), unlike pνk (1), cannot be interpreted as a likelihood of sale.

3 COMPARISON OF MODELS
To evaluate our dynamic customer model, we assembled sales data from the online fashion store for an eight-day period immediately following training, July 1–8, 2015. We identified customers with orders during this test interval, representing ∼10^5 individual sales, among ∼190k items that were available for purchase in at least one size, for at least one day in this period. For comparison, we also score the static recommendation model (Section 2.2), and a simple empirical baseline that disregards customer specifics.

3.1 Empirical baseline
Fashion articles in the Zalando catalog vary greatly in popularity, with few articles representing most of the sales. This skewed distribution enables a simple, non-personalized baseline recommender that projects the recent popularity of items into the future. In detail, we accumulated article sales for the week immediately preceding the evaluation interval (June 23–30, 2015), and defined a popularity score for each article by its sales count if it was still available after July 1. For those articles (re-)entering inventory during the evaluation period, we assigned the average number of sales among all articles as a preliminary score. The empirical baseline model then ranks the articles by descending popularity score.

3.2 Static Fashion DNA model
The Fashion DNA network (Section 2.2) provides the basis for a more sophisticated, personalized recommender system, based on the static customer style vectors sk and the predicted probability of purchase pνk (1), as detailed in Ref. [1]. Indeed, pνk proves to be an unbiased estimate of the probability of purchase over the lifetime of customer and article. These assumptions are not met here, because the evaluation interval is outside the training period, and lasts only eight days. Still, we may assume that the inner products fν · sk underlying Eq. (1) are a measure of the affinity of an individual customer k to the in-store items {ν}(t) during the time of evaluation, and sort them by decreasing value to create a static article ranking.

3.3 Dynamic recommender system
For the dynamic customer model, we rank the in-store articles for each customer k according to their intent-of-purchase ipν,k(tk), see (3), evaluated at the time of first sale tk during the evaluation period. We experimented with the three loss models detailed in Section 2.4, and found comparable results for the sigmoid cross-entropy loss Lσ and the sigmoid-rank loss Lrank, while the softmax loss Lsmax performed significantly worse. The following results are based on a pretrained 128-float Fashion DNA and an LSTM implementation with 256 cells, sigmoid-rank loss, and n = 20 negative examples. Note that 1 − Lrank provides a smooth approximation to the area under the ROC curve [5], used for model evaluation below.

3.4 Results
To compare model performance, we compile recommendation rankings of the z ≈ 190k items in store for each customer (for the baseline, the ranking is shared among customers), and identify the positions rνk of the articles {ν}(k) purchased by customer k during evaluation. We then determine the cumulative distribution of ranks,

    Rj = Σk Σν∈{ν}(k) H(j − rνk),    (4)

where H(·) denotes the Heaviside step function. The normalized cumulative rank Rj/Rz interpolates among customers and serves as a collective receiver operating characteristic (ROC) of the recommender schemes (Figure 3). The inset displays a double-logarithmic detail of the origin region, representing high-quality recommendations.

[Figure 3 plot: cumulative distribution of rankings, fraction of purchases vs. position in ranking, with a log-log detail of the origin region as an inset.]
Figure 3: ROC curves for the dynamic (blue), static (green), and empirical baseline (red) recommender schemes.
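Given the pooled ranking positions of purchased articles, the cumulative rank distribution of Eq. (4) and the resulting AUC can be computed as in this sketch (our illustration; `positions` and `z` are assumed inputs).

```python
# Sketch: cumulative rank distribution R_j of Eq. (4) and the AUC of the
# collective ROC curve. `positions` holds the 1-based ranking positions
# r_{nu,k} of all purchased articles, pooled over customers; `z` is the
# number of items in store.
import numpy as np

def cumulative_rank_distribution(positions, z):
    """R_j: number of purchases ranked at position <= j, for j = 0 .. z."""
    counts = np.bincount(np.asarray(positions), minlength=z + 1)
    return np.cumsum(counts)

def auc_from_ranks(positions, z):
    """Riemann-sum approximation of the area under the curve R_j / R_z."""
    R = cumulative_rank_distribution(positions, z)
    roc = R / R[-1]   # normalized cumulative rank, R_j / R_z
    return roc.mean()
```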
Table 1 lists the area under the curves (AUC) as a global performance measure, together with quantiles of the distributions Rj. We find that our dynamic model outperforms the static model throughout, and both models are superior to the baseline popularity model, except for the leading ∼10 recommendations, representing less than 0.5% of the purchases (inset in Figure 3). The table also lists the number of model parameters. Weights are shared among customers for the LSTM network, but not for the static model, resulting in a reduction of complexity by orders of magnitude.

Table 1: Model comparison. AUC and required number of recommendations to cover 10% (50%, 90%) of purchases.

    model      AUC      10%      50%       90%        #params
    baseline   80.2%    1,200    19,500    105,000    –
    static     85.2%    600      13,500    80,000     ∼10^8
    dynamic    88.5%    400      9,300     63,000     <10^6

More than 3% of the purchased articles from the test interval had not been sold before and, hence, were completely ignored during training. For those new articles, the cold-start problem applies, and the AUC of the baseline, static, and dynamic models decreases to 64.4%, 83.3%, and 87.7%, respectively. Compared to the numbers displayed in Table 1, the baseline shows a drastic performance drop, as would also be expected from any other recommender system solely based on collaborative filtering. The static and dynamic models, however, circumvent this problem thanks to Fashion DNA.

4 OUTLOOK
We find that a personalized recommendation model, based on a recurrent network, outperforms a static customer model in the fashion context. By encoding temporal awareness into the LSTM memory of the network, the dynamic model can infer the seasonality of items, and also record when certain articles are trending: a distinct advantage over the static model, which is limited to learning only long-term customer style preferences.

An important element currently missing in the recommendation model is short-term customer intent. In the fashion setting, goods for sale belong to varied classes (clothes, shoes, accessories, etc.), and shoppers, irrespective of their style profile, often have a particular category in mind during a session. These implicit interests strongly influence item preference but, due to their transient nature, are hard to infer from the purchase record. Complementary data sources like search queries, or the sequence of items viewed online, will pick up the relevant signals instead. Models that successfully integrate long-term style evolution and short-term customer intent promise to greatly enhance recommendation quality and relevance, and we plan to investigate them in future studies.
REFERENCES
[1] C. Bracher, S. Heinz, and R. Vollgraf. Fashion DNA: Merging content and sales data for recommendation and article mapping. In Workshop Machine Learning Meets Fashion, KDD, 2016.
[2] R. Devooght and H. Bersini. Long and short-term recommendations with recurrent neural networks. Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization (2017), pp. 13–21.
[3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press (Cambridge, Mass., USA), 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015).
[5] A. Herschtal and B. Raskutti. Optimising area under the ROC curve using gradient descent. ICML: Conference Proceedings (2004), p. 49.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput. 9 (1997), pp. 1735–1780.
[7] C. Johnson. Logistic matrix factorization for implicit feedback data. In NIPS Workshop on Distributed Matrix Computations, 2014.
[8] Y.-J. Ko, L. Maystre, and M. Grossglauser. Collaborative recurrent neural networks for dynamic recommender systems. JMLR: Workshop and Conference Proceedings 63 (2016), pp. 366–381.
[9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer 42 (2009), pp. 30–37.
[10] H. Wang, X. Shi, and D. Yeung. Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks. Advances in Neural Information Processing Systems 29 (2016), pp. 415–423.
[11] L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. Proceedings of the 2010 SIAM International Conference on Data Mining (2010), pp. 211–222.
[12] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing classifier performance via approximation to the Wilcoxon–Mann–Whitney statistic. ICML: Conference Proceedings (2003), pp. 848–855.