Predicting Shopping Behavior with Mixture of RNNs

Arthur Toth
Rakuten Institute of Technology
Boston, Massachusetts 02110
arthur.toth@rakuten.com

Louis Tan
Rakuten Institute of Technology
Boston, Massachusetts 02110
ts-louis.tan@rakuten.com

Giuseppe Di Fabbrizio
Rakuten Institute of Technology
Boston, Massachusetts 02110
difabbrizio@gmail.com

Ankur Datta
Rakuten Institute of Technology
Boston, Massachusetts 02110
ankur.datta@rakuten.com

ABSTRACT
We compare two machine learning approaches for early prediction of shoppers' behaviors, leveraging features from clickstream data generated during live shopping sessions. Our baseline is a mixture of Markov models to predict three outcomes: purchase, abandoned shopping cart, and browsing-only. We then experiment with a mixture of Recurrent Neural Networks. When sequences are truncated to 75% of their length, a relatively small feature set predicts purchase with an F-measure of 0.80 and browsing-only with an F-measure of 0.98. We also investigate an entropy-based decision procedure.

CCS CONCEPTS
• Computing methodologies → Neural networks; • Applied computing → Online shopping;

KEYWORDS
mining/modeling search activity, click models, behavioral analysis

ACM Reference format:
Arthur Toth, Louis Tan, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Predicting Shopping Behavior with Mixture of RNNs. In Proceedings of SIGIR 2017 eCom, Tokyo, Japan, August 2017, 5 pages.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1 INTRODUCTION

Recent e-commerce forecast analysis estimates that more than 1.77 billion users will shop online by the end of 2017 [11]. Although this is impressive growth, conversion rates for online shoppers are substantially lower than rates for traditional brick-and-mortar stores.

Consumers shopping on e-commerce web sites are influenced by numerous factors and may decide to stop the current session, leaving products in their shopping carts. Once a user has interacted with a shopping cart, abandonment rates range between 25% and 88%, significantly reducing merchants' selling opportunities [15]. Several potential purchase inhibitors have been analyzed in the online shopping literature [5, 8]. Main causes include concerns about costs, perceived decision difficulty, and selection conflicts due to a large number of similar choices.

Early detection of shopping abandonment may allow mitigation of purchase inhibitors by enabling injection of new content into live web browsing sessions. For instance, it could trigger a discount offer or change the product search strategy to retrieve more diverse options and simplify the consumer's decision process.

This paper considers real web interactions from a US e-commerce subsidiary of Rakuten (楽天株式会社) to predict three possible outcomes: purchase, abandoned shopping cart, and browsing-only. To detect outcomes early, we consider clues left behind by shoppers that are encoded into clickstream data and logged during each session. The clickstream data is used to experiment with mixtures of high-order Markov Chain Models (MCMs) and mixtures of Recurrent Neural Networks (RNNs) that use the Long Short-Term Memory (LSTM) architecture. We compare and contrast each model on sequences truncated at different lengths and report precision, recall, and F-measures. We also show F-measures from an entropy-based decision procedure that is usable in a live system.

2 RELATED WORK

We treat predicting user behavior from clickstream data as sequence classification, which is broadly surveyed by Xing et al. [14], who divide it into feature-based, sequence distance-based, and model-based methods. Previous feature-based work on clickstream classification includes the random forest used by Awalkar et al. [2], the deep belief networks and stacked denoising auto-encoders by Vieira [12], and the recurrent neural networks by Wu et al. [13]. Previous distance-based work includes the large margin nearest neighbor approach by Pai et al. [10]. Previous model-based work by Bertsimas et al. [4] used a mixture of Markov chains.

Our baseline approach is based on the latter work, whereas our new approach uses a mixture of RNNs. Although Wu et al. [13] used RNNs, their approach is not applicable to our scenario, since its bi-directional RNN uses entire clickstream sequences, while our goal is to classify incomplete sequences. Also, their model is not a mixture.

Finally, our approach differs from most others in its use of a ternary classification scheme. We classify clickstreams as being a purchase, abandon, or browsing-only session instead of just purchase and non-purchase.

3 CLICKSTREAM DATA

We consider clickstream data collected over two weeks and consisting of n_0 = 1,560,830 sessions. A session S_i, i = 1, 2, ..., n_0, is a chronological sequence of recorded page views, or "clicks." Let V_{i,j} be the jth page view of session i, so that S_i = (V_{i,1}, V_{i,2}, ..., V_{i,m_i}), where m_i is defined as the length of session i. We exclude session i from our experiments if m_i < 4, after which n = 198,936 sessions remain. Sessions with purchases are truncated before the first purchase confirmation. Table 1 summarizes the n sessions.

Table 1: Clickstream data distribution.

  Clickstream outcome   Count     %       Average Length (m̄_i)   Median Length
  ABANDON               29,371    14.7%   8.2                     5
  BROWSING-ONLY         128,450   64.6%   16.6                    6
  PURCHASE              41,115    20.7%   8.4                     5
  Total                 198,936   100%

Our experiments use both the page type and dwell time of V_{i,j}. The page type of V_{i,j}, denoted P_{i,j}, belongs to one of eight categories, including search pages, product view pages, login pages, etc. The dwell time D_{i,j} of V_{i,j} is the amount of time the user spends viewing the page, and it is not available until the (j+1)th page view. After the jth page view, the clickstream data gathered for session i is given by S_i|_j = ((P_{i,1}, D_{i,0}), (P_{i,2}, D_{i,1}), ..., (P_{i,j}, D_{i,j-1})), where D_{i,0} is undefined, i.e., D_{i,0} = ∅. To reduce sparsity, dwell times were placed in 8 bins, evenly spaced by percentiles.
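For concreteness, the percentile binning can be sketched as follows. This is a minimal numpy illustration of the idea, not the paper's code; the function name and example values are ours.

```python
import numpy as np

def bin_dwell_times(dwell_times, n_bins=8):
    """Bin dwell times into n_bins bins evenly spaced by percentiles.

    `dwell_times` is a flat array of observed dwell times D_{i,j}
    (the undefined D_{i,0} entries are excluded before calling this).
    Returns the bin edges and a bin index for each dwell time.
    """
    # Percentile-based edges: 8 bins need 7 interior cut points.
    percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.percentile(dwell_times, percentiles)
    # np.digitize maps each value to a bin id in [0, n_bins - 1].
    bins = np.digitize(dwell_times, edges)
    return edges, bins

# Hypothetical usage: dwell times (in seconds) pooled across sessions.
example = np.array([1.2, 3.4, 0.5, 120.0, 45.3, 7.7, 2.2, 60.1, 9.9, 15.0])
edges, bins = bin_dwell_times(example)
```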
4 MODELING APPROACHES

Our goal is to classify customer behavior into final decision categories. In particular, clickstream sequences receive one of the following labels: PURCHASE, if the sequence leads to an item purchase; ABANDON, if an item was left in the shopping cart but there was no purchase; and BROWSING-ONLY, when the shopping cart was not used. The final two categories can be combined to investigate PURCHASE vs. NON_PURCHASE behavior. In preliminary studies, the ABANDON sequences were much more similar to the PURCHASE sequences than to the BROWSING-ONLY sequences, so having three categories helped account for some of the confusability of the data. Our eventual goal of applying our models in a live system adds a constraint: the classifier must work for incomplete sequences, without using data from the "future."

4.1 Mixture of High-Order Markov Chains

Bertsimas et al. [4] modeled a similar problem with a mixture of Markov chains, using maximum a posteriori estimation to predict sequence labels. In this approach, the training data is partitioned by class and separate Markov chain models are trained for each one. The resultant models can estimate class likelihoods from sequences. Let C_i be a random variable representing the outcome of session S_i, where C_i ∈ Ω and Ω is the set of possible clickstream outcomes. The models are used to estimate the likelihoods P(S_i | C_i = ω) for all ω ∈ Ω. Using Bayes' theorem and the law of total probability, the class posteriors for each of the three classes can be estimated by the equation

  P(C_i = \omega \mid S_i) = \frac{P(S_i \mid C_i = \omega)\, P(C_i = \omega)}{\sum_{\omega' \in \Omega} P(S_i \mid C_i = \omega')\, P(C_i = \omega')}

with the prior, P(C_i = ω), estimated from counts.

This model fits the problem's constraints, because likelihoods can be produced from subsequences without using "future" clicks. Although each chain is trained only on click data, separating the data by class implicitly conditions the chains on class.

Taking inspiration from the Automatic Speech Recognition (ASR) community and similarities to language modeling, we adapted some of its more recent techniques to our problem. In preliminary experiments, 5-grams performed better than shorter chains, so we used them. Longer chains cause greater sparsity, so we addressed this with Kneser-Ney smoothing, which performed best in a study of language modeling techniques [6]. We used the MITLM toolkit to train the Markov chains [7].
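To make the decision rule concrete, the following is a minimal sketch of MAP classification over a session prefix, assuming each class's 5-gram chain (e.g., trained with MITLM) already provides a prefix log-likelihood. The helper name and example scores are ours; the priors are taken from Table 1.

```python
import numpy as np

def classify_prefix(log_likelihoods, log_priors):
    """MAP classification of a (possibly incomplete) session.

    log_likelihoods: dict mapping class label -> log P(S_i | C_i = w),
        scored by that class's Markov chain on the observed prefix only.
    log_priors: dict mapping class label -> log P(C_i = w),
        estimated from class counts in the training data.
    Returns (best_label, dict of posteriors).
    """
    labels = sorted(log_likelihoods)
    # Unnormalized log posterior: log P(S|w) + log P(w).
    scores = np.array([log_likelihoods[w] + log_priors[w] for w in labels])
    # Normalize with the log-sum-exp trick (the Bayes denominator).
    scores -= scores.max()
    post = np.exp(scores)
    post /= post.sum()
    posteriors = dict(zip(labels, post))
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical chain scores for one truncated session; priors from Table 1.
ll = {"PURCHASE": -41.2, "ABANDON": -43.8, "BROWSING-ONLY": -39.5}
lp = {"PURCHASE": np.log(0.207), "ABANDON": np.log(0.147),
      "BROWSING-ONLY": np.log(0.646)}
label, post = classify_prefix(ll, lp)
```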
4.2 Recurrent Neural Networks

Taking further inspiration from the ASR community, we replaced the Markov chains in our mixtures with RNNs. Earlier language modeling work used feed-forward artificial neural networks [3], but RNNs have performed better recently, both in direct likelihood metrics and in overall ASR task metrics [9, 16]. Clickstream data differs from ASR text, and our mixture model differed from the typical ASR approach, so it was unclear whether RNNs would help in our scenario.

[Figure 1: Simple RNN to Predict Following Token]

Earlier RNN-based language models used the "simple recurrent neural network" architecture [9]. The underlying idea, depicted in Figure 1, is that an input sequence, represented by {X_0, ..., X_{t-1}}, is connected to a recurrent layer, represented by {A_0, ..., A_t}, which is also connected to itself at the next point in time. The recurrent layer is also connected to an output layer. In our models, each RNN tries to predict the next input, X_{n+1}, after each input, X_n.
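Because such a model only ever conditions on past clicks, the likelihood of an observed prefix is simply the chained probabilities of the tokens that actually occurred. A minimal sketch, assuming the per-step softmax outputs are available as arrays (function name and values are ours):

```python
import numpy as np

def sequence_log_prob(step_softmaxes, token_ids):
    """Log-probability of an observed click sequence under a next-token model.

    step_softmaxes[t] is the model's output distribution after consuming
    tokens 0..t (i.e., its prediction for token t+1); token_ids are the
    integer ids of the observed clicks. Only past inputs are used, so the
    same code scores truncated sessions without touching "future" clicks.
    """
    log_p = 0.0
    for t in range(len(token_ids) - 1):
        log_p += np.log(step_softmaxes[t][token_ids[t + 1]])
    return log_p

# Toy 3-token vocabulary and a 3-click session (values are illustrative).
softmaxes = [np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.1, 0.8])]
print(sequence_log_prob(softmaxes, [1, 1, 2]))  # log(0.5) + log(0.8)
```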
"Simple" RNN-based language models for ASR were outperformed by RNNs using the Long Short-Term Memory (LSTM) configuration and "drop-out" [16]. LSTM addressed the "vanishing gradient" and "exploding gradient" problems. "Drop-out" addressed overfitting by probabilistically ignoring the contributions of non-recurrent nodes during training.

In LSTM RNNs, some nodes function as "memory cells." Some connections retrieve information from them, and others cause them to forget. The LSTM equations are [16]:

  \mathrm{LSTM}: h_t^{l-1}, h_{t-1}^{l}, c_{t-1}^{l} \rightarrow h_t^{l}, c_t^{l}

  \begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} =
  \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
  T_{2n,4n} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}

  c_t^{l} = f \odot c_{t-1}^{l} + i \odot g

  h_t^{l} = o \odot \tanh(c_t^{l})

where h and c represent hidden states and memory cells, subscripts refer to time, superscripts refer to layers, T_{n,m} is an affine transform from R^n to R^m, \odot denotes element-wise multiplication, and sigm and tanh are applied to each element.

Since LSTM RNNs with drop-out worked much better than Markov chains for ASR, we replaced the Markov chains with them in our clickstream mixture models. Additional work was necessary, since our scenario did not exactly match language modeling. During training, input tokens were still used to predict following tokens. During testing, however, our goal was the sequence probabilities. These were calculated from the token probabilities present in intermediate softmax layers of each LSTM model. Due to the network architecture, "future" events were not used for this. We used TensorFlow to implement our LSTM RNNs [1].
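As a sanity check on the equations above, here is a direct numpy transcription of one LSTM step. This is an illustration, not our TensorFlow implementation: the single weight matrix W and bias b stand in for the affine map T_{2n,4n}, and the sizes and weights are placeholders.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step following the equations above.

    x, h_prev: vectors in R^n (input from the layer below and the
    previous hidden state); W (4n x 2n) and b (4n) implement T_{2n,4n}.
    Returns the new hidden state h and memory cell c.
    """
    z = W @ np.concatenate([x, h_prev]) + b   # T_{2n,4n} applied to (x; h_prev)
    n = h_prev.shape[0]
    i = sigm(z[:n])            # input gate
    f = sigm(z[n:2 * n])       # forget gate
    o = sigm(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:])     # candidate update
    c = f * c_prev + i * g     # c_t = f (.) c_{t-1} + i (.) g
    h = o * np.tanh(c)         # h_t = o (.) tanh(c_t)
    return h, c

# Hypothetical sizes and random weights, for illustration only.
n = 4
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * n, 2 * n)), np.zeros(4 * n)
h = c = np.zeros(n)
h, c = lstm_step(rng.normal(size=n), h, c, W, b)
```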
5 EXPERIMENTS

We experimented on the dataset described in Section 3 and summarized in Table 1. It was partitioned into an 80% training / 20% testing split, using stratified sampling to retain class proportions.

All RNNs were trained for 10 epochs, using batches of 20 sequences. We tested 16 combinations of the remaining parameters: the number of recurrent layers was 1 or 2, the keep probability was 1 or 0.5, and the hidden state size was 10, 20, 40, or 80. For a particular mixture model, all the RNNs used the same parameter values.

5.1 Results

For each model, we evaluated prediction performance by truncating the page view sequences at different lengths. Table 2 shows the results for the mixture of Markov models and for one of the mixture-of-RNNs trials. Although we tested 16 different RNN parameter combinations, the results were so similar that we report only one of them.

Table 2: Precision (P), Recall (R), and F1-measure (F1) for MCM and RNN mixtures by sequence length.

  MCM mixture          25%               50%               75%               100%
  Final State       P     R     F1    P     R     F1    P     R     F1    P     R     F1    Random
  ABANDON           0.44  0.02  0.04  0.45  0.09  0.14  0.72  0.52  0.61  0.70  0.64  0.67  0.21
  BROWSING-ONLY     0.66  0.99  0.79  0.69  0.98  0.81  0.89  0.97  0.92  0.99  0.95  0.97  0.64
  PURCHASE          0.37  0.02  0.04  0.42  0.11  0.17  0.72  0.67  0.69  0.71  0.86  0.78  0.15

  RNN mixture          25%               50%               75%               100%
  Final State       P     R     F1    P     R     F1    P     R     F1    P     R     F1    Random
  ABANDON           0.75  0.39  0.52  0.73  0.62  0.67  0.74  0.72  0.73  1.00  0.84  0.91  0.21
  BROWSING-ONLY     0.84  0.99  0.91  0.92  0.99  0.96  0.97  0.99  0.98  1.00  0.98  0.99  0.64
  PURCHASE          0.80  0.64  0.71  0.82  0.71  0.76  0.82  0.78  0.80  0.86  1.00  0.92  0.15

Table 2 reports precision, recall, and F1-measure for each sequence outcome when considering 25%, 50%, 75%, and 100% of the total length of the sequence. For instance, when splitting at 50%, the Markov chain model predicts a PURCHASE with 0.42 precision and 0.11 recall, resulting in an overall F1-measure of 0.17. Under the same conditions, the RNN-based model reaches a precision of 0.82 with 0.71 recall and an F1-measure of 0.76. We also report the accuracy obtained by randomly selecting a class based on the prior distribution of the clickstream corpus. RNN mixture components substantially outperform Markov chain components.

This is particularly evident from Figure 2, which shows the F1-measure by sequence length for the mixtures of MCMs (dotted lines) and of RNNs (solid lines). Both models monotonically increase performance as they observe more data from the 25% splits to 100%, but the mixture of RNNs has an immediate edge even with the short 25% sequences. The MCMs present similar F1-measures for the majority class (i.e., BROWSING-ONLY), but they are penalized by the lack of data for the less represented sequences (i.e., 14.7% for ABANDON and 20.7% for PURCHASE). The RNNs instead generalize better due to the memory component, which can model long-distance dependencies.

[Figure 2: F1-measure by sequence length for mixtures of Markov Chain Models (MCM) and RNNs for each label: ABN=abandoned; BRO=browsing-only; PRC=purchase. X-axis: 25%, 50%, 75%, 100% of sequence length; Y-axis: F1-measure.]

Similarly to Bertsimas et al. [4], in Figure 3 we plot 100 log-probability trajectories, with lengths from 2 through 20 page views, estimated by the MCM mixture along the page view sequences for each class. This plot demonstrates how probabilities evolve during interactions and how confident each model is compared to the others. The legend in Figure 3 corresponds to the true final-state labels of the test examples: dashed red lines for ABANDON sequences, dashed-and-dotted green lines for BROWSING-ONLY sequences, and solid blue lines for PURCHASE sequences. Ideally, the model would score the PURCHASE sequences (solid blue lines) high and the other sequences low, and the earlier the distinction can be made, the better. Looking at this figure, there does appear to be some level of discrimination between the categories. In general, the BROWSING-ONLY sequences seem more separable from the PURCHASE sequences than the ABANDON sequences, as expected.

[Figure 3: Log-probability trajectories from the MCM mixture, progressing along page view sequences for each class (ABANDON, BROWSING-ONLY, PURCHASE).]

Although Table 2 can be used for comparison with other work [10], it depends on sequence lengths. A useful live system must predict final actions before sequences are complete and needs a decision process for accepting the predicted label. We experimented with taking the highest scoring label once the entropy of the label distribution fell below a threshold, as sketched below.
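A minimal sketch of this decision rule, assuming class posteriors are available at each page view (the function and example values are ours):

```python
import numpy as np

def entropy_decision(posteriors, threshold_fraction):
    """Accept the top label once posterior entropy is low enough.

    posteriors: array of class probabilities at the current page view.
    threshold_fraction: proportion of the maximum possible entropy
    (log K for K classes) below which a decision is emitted.
    Returns the predicted class index, or None to keep waiting.
    """
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p))
    max_entropy = np.log(len(p))   # entropy of the uniform distribution
    if entropy < threshold_fraction * max_entropy:
        return int(np.argmax(p))
    return None

# Example: fairly peaked posterior, threshold at 0.55 of maximum entropy.
print(entropy_decision(np.array([0.85, 0.10, 0.05]), 0.55))  # -> 0
```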
Figure 4 shows the F1-measures at different thresholds, which are expressed as proportions of the maximum possible entropy. Since the entropy might not drop below the threshold, it is important to consider how many sequences receive predicted labels. When we calculated F1-measures, sequences without predictions were counted as misses (see the sketch at the end of this section).

[Figure 4: F1-measures at Entropy Thresholds for each class. X-axis: Entropy Threshold (Proportion of Maximum Entropy), from 0.99 down to 0.11; Y-axis: F1-measure; series: Abandon, No_Cart_Interaction, Purchase.]

Figure 5 shows the number of sequences where predictions were made at each entropy threshold. Again, the horizontal axis represents the proportion of the maximum possible entropy value, and the vertical axis represents the number of sequences where a decision can be made based on threshold crossing. As an example, a threshold of 0.55 led to reasonable F1-measures while producing predictions for 99% of the sequences before they were complete. Choosing higher entropy thresholds allows decisions to be made for more sequences, but performance can suffer, since decisions can be made while class probabilities are more uniform and confidence is lower. Choosing lower entropy thresholds forces the class probabilities to be more distinct, which leads to more confident decisions, but performance starts to suffer when fewer sequences receive decisions. In practice, the threshold would be tuned on held-out data.

[Figure 5: Sequences Crossing Entropy Thresholds. X-axis: Entropy Threshold (Proportion of Maximum Entropy); Y-axis: Number of Sequences (0–45,000).]
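Counting undecided sequences as misses affects only recall, since an abstention is never a positive prediction. A minimal sketch of the per-class F1 computation we describe, with abstentions represented as None (helper and example values are ours):

```python
def f1_with_abstentions(y_true, y_pred, positive):
    """F1 for one class when some sequences receive no prediction (None).

    Abstentions can never be true positives, so they lower recall,
    matching the evaluation described above.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One abstention on a true PURCHASE drops recall to 0.5, giving F1 = 0.67.
y_true = ["PURCHASE", "ABANDON", "PURCHASE", "BROWSING-ONLY"]
y_pred = ["PURCHASE", None, None, "BROWSING-ONLY"]
print(f1_with_abstentions(y_true, y_pred, "PURCHASE"))
```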
6 CONCLUSION AND FUTURE WORK

We presented two models for the real-time, early prediction of shopping session outcomes on an e-commerce platform. We demonstrated that LSTM RNNs generalize better and with less data than the high-order Markov chain models used in previous work. Our approach, in addition to distinguishing browsing-only and cart-interaction sessions, can also accurately discriminate between cart abandonment and purchase sessions. Future work will focus on features, single RNN architectures, and decision strategies.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/ Software available from tensorflow.org.
[2] Aditya Awalkar, Ibrahim Ahmed, and Tejas Nevrekar. 2016. Prediction of User's Purchases using Clickstream Data. International Journal of Engineering Science and Computing (2016).
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (March 2003), 1137–1155.
[4] Dimitris Bertsimas, Adam J. Mersereau, and Nitin R. Patel. 2003. Dynamic Classification of Online Customers. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003. 107–118.
[5] Robert Florentin. 2016. Online shopping abandonment rate, a new perspective: the role of choice conflicts as a factor of online shopping abandonment. Master's thesis. University of Twente.
[6] Joshua T. Goodman. 2001. A Bit of Progress in Language Modeling. Comput. Speech Lang. 15, 4 (Oct. 2001), 403–434.
[7] Bo-June Paul Hsu and James R. Glass. 2008. Iterative language model estimation: efficient data structure & algorithms. In INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008. 841–844.
[8] Monika Kukar-Kinney and Angeline G. Close. 2010. The determinants of consumers' online shopping cart abandonment. Journal of the Academy of Marketing Science 38, 2 (2010), 240–250.
[9] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In The 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010).
[10] Deepak Pai, Abhijit Sharang, Meghanath Macha Yadagiri, and Shradha Agrawal. 2014. Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach. Springer International Publishing, 135–145.
[11] The Statistics Portal. 2014. Digital buyer penetration worldwide from 2014 to 2019. http://www.statista.com/statistics/261676/digital-buyer-penetration-worldwide/. Accessed: 2016-09-10.
[12] Armando Vieira. 2015. Predicting online user behaviour using deep learning algorithms. The Computing Research Repository (CoRR) abs/1511.06247 (2015).
[13] Zhenzhou Wu, Bao Hong Tan, Rubing Duan, Yong Liu, and Rick Siow Mong Goh. 2015. Neural Modeling of Buying Behaviour for E-Commerce from Clicking Patterns. In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge). ACM, New York, NY, USA, Article 12, 4 pages.
[14] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A Brief Survey on Sequence Classification. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 40–48.
[15] Yin Xu and Jin-Song Huang. 2015. Factors Influencing Cart Abandonment in the Online Shopping Process. Social Behavior and Personality: an international journal 43, 10 (Nov. 2015).
[16] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. The Computing Research Repository (CoRR) abs/1409.2329. http://arxiv.org/abs/1409.2329