Predicting Shopping Behavior with Mixture of RNNs

Arthur Toth, Rakuten Institute of Technology, Boston, Massachusetts 02110, arthur.toth@rakuten.com
Louis Tan, Rakuten Institute of Technology, Boston, Massachusetts 02110, ts-louis.tan@rakuten.com
Giuseppe Di Fabbrizio, Rakuten Institute of Technology, Boston, Massachusetts 02110, difabbrizio@gmail.com
Ankur Datta, Rakuten Institute of Technology, Boston, Massachusetts 02110, ankur.datta@rakuten.com

ABSTRACT
We compare two machine learning approaches for early prediction of shoppers' behaviors, leveraging features from clickstream data generated during live shopping sessions. Our baseline is a mixture of Markov models to predict three outcomes: purchase, abandoned shopping cart, and browsing-only. We then experiment with a mixture of Recurrent Neural Networks. When sequences are truncated to 75% of their length, a relatively small feature set predicts purchase with an F-measure of 0.80 and browsing-only with an F-measure of 0.98. We also investigate an entropy-based decision procedure.

CCS CONCEPTS
• Computing methodologies → Neural networks; • Applied computing → Online shopping;

KEYWORDS
mining/modeling search activity, click models, behavioral analysis

ACM Reference format:
Arthur Toth, Louis Tan, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Predicting Shopping Behavior with Mixture of RNNs. In Proceedings of SIGIR 2017 eCom, Tokyo, Japan, August 2017, 5 pages.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1 INTRODUCTION
Recent e-commerce forecast analysis estimates that more than 1.77 billion users will shop online by the end of 2017 [11]. Although this is impressive growth, conversion rates for online shoppers are substantially lower than rates for traditional brick-and-mortar stores.
   Consumers shopping on e-commerce web sites are influenced by numerous factors and may decide to stop the current session, leaving products in their shopping carts. Once a user has interacted with a shopping cart, abandonment rates range between 25% and 88%, significantly reducing merchants' selling opportunities [15]. Several potential purchase inhibitors have been analyzed in the online shopping literature [5, 8]. Main causes include concerns about costs, perceived decision difficulty, and selection conflicts due to a large number of similar choices.
   Early detection of shopping abandonment may allow mitigation of purchase inhibitors by enabling the injection of new content into live web browsing sessions. For instance, it could trigger a discount offer or change the product search strategy to retrieve more diverse options and simplify the consumer's decision process.
   This paper considers real web interactions from a US e-commerce subsidiary of Rakuten (楽天株式会社) to predict three possible outcomes: purchase, abandoned shopping cart, and browsing-only. To detect outcomes early, we consider clues left behind by shoppers that are encoded in clickstream data and logged during each session. The clickstream data is used to experiment with mixtures of high-order Markov Chain Models (MCMs) and mixtures of Recurrent Neural Networks (RNNs) that use the Long Short-Term Memory (LSTM) architecture. We compare and contrast the models on sequences truncated at different lengths and report precision, recall, and F-measures. We also show F-measures from using an entropy-based decision procedure that is usable in a live system.

2 RELATED WORK
We treat predicting user behavior from clickstream data as sequence classification, which is broadly surveyed by Xing et al. [14], who divide it into feature-based, sequence distance-based, and model-based methods. Previous feature-based work on clickstream classification includes the random forest used by Awalkar et al. [2], the deep belief networks and stacked denoising auto-encoders by Vieira [12], and the recurrent neural networks by Wu et al. [13]. Previous distance-based work includes the large margin nearest neighbor approach by Pai et al. [10]. Previous model-based work by Bertsimas et al. [4] used a mixture of Markov chains.
   Our baseline approach is based on the latter work, whereas our new approach uses a mixture of RNNs. Although Wu et al. [13] used RNNs, their approach is not applicable to our scenario, since its bi-directional RNN uses entire clickstream sequences, while our goal is to classify incomplete sequences. Also, their model is not a mixture.
   Finally, our approach differs from most others in its use of a ternary classification scheme: we classify clickstreams as purchase, abandon, or browsing-only sessions instead of just purchase and non-purchase.

3 CLICKSTREAM DATA
We consider clickstream data collected over two weeks and consisting of n_0 = 1,560,830 sessions. A session S_i, i = 1, 2, ..., n_0, is a chronological sequence of recorded page views, or "clicks." Let V_{i,j} be the jth page view of session i, so that S_i = (V_{i,1}, V_{i,2}, ..., V_{i,m_i}), where m_i is defined as the length of session i. We exclude session i from our experiments if m_i < 4, after which n = 198,936 sessions remain. Sessions with purchases are truncated before the first purchase confirmation. Table 1 summarizes the n sessions.

Table 1: Clickstream data distribution.

  Clickstream outcome      Count       %   Average length m̄_i   Median length
  ABANDON                 29,371   14.7%                  8.2               5
  BROWSING-ONLY          128,450   64.6%                 16.6               6
  PURCHASE                41,115   20.7%                  8.4               5
  Total                  198,936    100%

Our experiments use both the page type and the dwell time of V_{i,j}. The page type of V_{i,j}, denoted P_{i,j}, belongs to one of eight categories, including search pages, product view pages, login pages, etc. The dwell time D_{i,j} of V_{i,j} is the amount of time the user spends viewing the page, and it is not available until the (j+1)th page view. After the jth page view, the clickstream data gathered for session i is given by S_{i|j} = ((P_{i,1}, D_{i,0}), (P_{i,2}, D_{i,1}), ..., (P_{i,j}, D_{i,j-1})), where D_{i,0} is undefined, i.e., D_{i,0} = ∅. To reduce sparsity, dwell times were placed in 8 bins, evenly spaced by percentiles.
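As an illustration, a minimal numpy sketch of this percentile binning follows; the function name and the stand-in training data are ours, not part of the pipeline described here:

```python
import numpy as np

def bin_dwell_times(train_dwell, dwell, n_bins=8):
    """Map raw dwell times (seconds) to one of n_bins percentile bins.

    Bin edges are evenly spaced percentiles estimated on training data,
    so each bin holds roughly the same number of training observations.
    """
    # Interior edges at the 12.5th, 25th, ..., 87.5th percentiles.
    edges = np.percentile(train_dwell, np.linspace(0, 100, n_bins + 1)[1:-1])
    # np.digitize returns bin indices in [0, n_bins - 1].
    return np.digitize(dwell, edges)

# Example: bin the dwell times of one session (stand-in data).
train = np.random.lognormal(mean=3.0, sigma=1.0, size=10_000)
session_dwell = [4.2, 31.0, 120.5]
print(bin_dwell_times(train, session_dwell))
```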

4 MODELING APPROACHES
Our goal is to classify customer behavior into final decision categories. In particular, clickstream sequences receive one of the following labels: PURCHASE, if the sequence leads to an item purchase; ABANDON, if an item was left in the shopping cart but there was no purchase; and BROWSING-ONLY, when the shopping cart was not used. The final two categories can be combined to investigate PURCHASE vs. NON_PURCHASE behavior. In preliminary studies, the ABANDON sequences were much more similar to the PURCHASE sequences than to the BROWSING-ONLY sequences, so having three categories helped account for some of the confusability of the data. Our eventual goal of applying our models in a live system adds a constraint: the classifier must work for incomplete sequences, without using data from the "future".

4.1 Mixture of High-Order Markov Chains
Bertsimas et al. [4] modeled a similar problem with a mixture of Markov chains, using maximum a posteriori estimation to predict sequence labels. In this approach, the training data is partitioned by class and a separate Markov chain model is trained for each class. The resultant models can estimate class likelihoods from sequences. Let C_i be a random variable representing the outcome of session S_i, where C_i ∈ Ω and Ω is the set of possible clickstream outcomes. The models are used to estimate the likelihoods P(S_i | C_i = ω) for all ω ∈ Ω. Using Bayes' theorem and the law of total probability, the class posterior for each of the three classes can be estimated by

\[
P(C_i = \omega \mid S_i) = \frac{P(S_i \mid C_i = \omega)\, P(C_i = \omega)}{\sum_{\omega' \in \Omega} P(S_i \mid C_i = \omega')\, P(C_i = \omega')}
\]

with the prior, P(C_i = ω), estimated from counts.
   This model fits the problem's constraints, because likelihoods can be produced from subsequences without using "future" clicks. Although each chain is trained only on click data, separating the data by class implicitly conditions the chains on class.
   Taking inspiration from the Automatic Speech Recognition (ASR) community and the similarities to language modeling, we adapted some of its more recent techniques to our problem. In preliminary experiments, 5-grams performed better than shorter chains, so we used them. Longer chains cause greater sparsity, so we addressed this with Kneser-Ney smoothing, which performed best in a study of language modeling techniques [6]. We used the MITLM toolkit to train the Markov chains [7].
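To make the decision rule concrete, the following is a minimal, self-contained Python sketch of a mixture of class-conditional n-gram models with the MAP rule above. For brevity it uses add-k smoothing; as described, our actual chains are 5-grams with Kneser-Ney smoothing trained with MITLM, so treat the class and function names here as illustrative.

```python
import math
from collections import Counter

class NGramModel:
    """Class-conditional n-gram model with add-k smoothing (for
    illustration only; the paper uses Kneser-Ney via MITLM)."""

    def __init__(self, order=5, k=0.1, vocab_size=64):
        # vocab_size=64 reflects, e.g., 8 page types x 8 dwell-time bins.
        self.order, self.k, self.v = order, k, vocab_size
        self.ngrams, self.contexts = Counter(), Counter()

    def train(self, sequences):
        for seq in sequences:
            padded = ["<s>"] * (self.order - 1) + list(seq)
            for j in range(self.order - 1, len(padded)):
                ctx = tuple(padded[j - self.order + 1 : j])
                self.ngrams[ctx + (padded[j],)] += 1
                self.contexts[ctx] += 1

    def log_likelihood(self, seq):
        """log P(S | C = w), computable on any prefix of a session."""
        padded = ["<s>"] * (self.order - 1) + list(seq)
        ll = 0.0
        for j in range(self.order - 1, len(padded)):
            ctx = tuple(padded[j - self.order + 1 : j])
            num = self.ngrams[ctx + (padded[j],)] + self.k
            den = self.contexts[ctx] + self.k * self.v
            ll += math.log(num / den)
        return ll

def map_classify(prefix, models, log_priors):
    """argmax_w  log P(S | C = w) + log P(C = w), i.e., the MAP rule."""
    return max(log_priors,
               key=lambda w: models[w].log_likelihood(prefix) + log_priors[w])

# Usage: models = {w: NGramModel() for w in
#                  ("PURCHASE", "ABANDON", "BROWSING-ONLY")}
```

Scoring a growing prefix with `log_likelihood` never touches future clicks, which is what makes the mixture usable mid-session.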
4.2 Recurrent Neural Networks
Taking further inspiration from the ASR community, we replaced the Markov chains in our mixtures with RNNs. Earlier language modeling work used feed-forward artificial neural networks [3], but RNNs have performed better recently, both in direct likelihood metrics and in overall ASR task metrics [9, 16]. Clickstream data differs from ASR text, and our mixture model differed from the typical ASR approach, so it was unclear whether RNNs would help in our scenario.

[Figure 1: Simple RNN to Predict Following Token]

   Earlier RNN-based language models used the "simple recurrent neural network" architecture [9]. The underlying idea, depicted in Figure 1, is that an input sequence, represented by {X_0, ..., X_{t-1}}, is connected to a recurrent layer, represented by {A_0, ..., A_t}, which is also connected to itself at the next point in time. The recurrent layer is also connected to an output layer. In our models, each RNN tries to predict the next input, X_{n+1}, after each input, X_n.
   "Simple" RNN-based language models for ASR were outperformed by RNNs using the Long Short-Term Memory (LSTM) configuration and "drop-out" [16]. LSTM addressed the "vanishing gradient" and "exploding gradient" problems. "Drop-out" addressed overfitting by probabilistically ignoring the contributions of non-recurrent nodes during training.
   In LSTM RNNs, some nodes function as "memory cells". Some connections retrieve information from them, and others cause them to forget. The LSTM equations are [16]:

\[
\mathrm{LSTM}: h_t^{l-1},\, h_{t-1}^{l},\, c_{t-1}^{l} \;\rightarrow\; h_t^{l},\, c_t^{l}
\]
\[
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
T_{2n,4n}
\begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}
\]
\[
c_t^{l} = f \odot c_{t-1}^{l} + i \odot g
\qquad
h_t^{l} = o \odot \tanh(c_t^{l})
\]

where h and c represent hidden states and memory cells, subscripts refer to time, superscripts refer to layers, T_{n,m} is an affine transform from R^n to R^m, ⊙ denotes element-wise multiplication, and sigm and tanh are applied element-wise.
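A single LSTM step is a direct transcription of these equations; the following minimal numpy sketch mirrors them (the random weights are placeholders, since our actual models are trained in TensorFlow, as noted below):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, T):
    """One LSTM step following the equations above.

    x       -- h_t^{l-1}: input from the layer below at time t (size n)
    h_prev  -- h_{t-1}^l: this layer's hidden state at time t-1 (size n)
    c_prev  -- c_{t-1}^l: this layer's memory cell at time t-1 (size n)
    T       -- dict with 'W' (4n x 2n) and 'b' (4n): the map T_{2n,4n}
    """
    z = T["W"] @ np.concatenate([x, h_prev]) + T["b"]
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigm(i), sigm(f), sigm(o), np.tanh(g)
    c = f * c_prev + i * g      # c_t^l = f (.) c_{t-1}^l + i (.) g
    h = o * np.tanh(c)          # h_t^l = o (.) tanh(c_t^l)
    return h, c

# Placeholder weights, just to exercise the step.
n = 4
rng = np.random.default_rng(0)
T = {"W": rng.normal(size=(4 * n, 2 * n)), "b": np.zeros(4 * n)}
h, c = lstm_step(rng.normal(size=n), np.zeros(n), np.zeros(n), T)
```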
   Since LSTM RNNs with drop-out worked much better than Markov chains for ASR, we replaced the Markov chains with them in our clickstream mixture models. Additional work was necessary, since our scenario did not exactly match language modeling. During training, input tokens were still used to predict following tokens. During testing, however, our goal was the sequence probabilities. These were calculated from the token probabilities present in the intermediate softmax layers of each LSTM model. Due to the network architecture, "future" events were not used for this. We used TensorFlow to implement our LSTM RNNs [1].
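Concretely, the scoring at test time reduces to summing, over the observed prefix, the log of the softmax probability each class-conditional LSTM assigned to the token that actually came next. A hedged numpy sketch follows (array and function names are ours, not from our TensorFlow code):

```python
import numpy as np

def sequence_log_prob(step_probs, tokens):
    """Log-probability of a (possibly incomplete) token sequence.

    step_probs -- array of shape (len(tokens), vocab_size); row j is the
                  softmax distribution the RNN emits before seeing
                  tokens[j], i.e., its prediction of the next token.
    tokens     -- integer token ids for the observed clicks.

    Because the RNN conditions only on past inputs, this score never
    uses "future" events, which is what the live setting requires.
    """
    rows = np.arange(len(tokens))
    return float(np.sum(np.log(step_probs[rows, tokens])))
```

The resulting per-class log-probabilities plug into the same MAP rule used for the Markov chain mixture.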

5 EXPERIMENTS
We experimented on the dataset described in Section 3 and summarized in Table 1. It was partitioned into an 80% training / 20% testing split, using stratified sampling to retain the class proportions.
   All RNNs were trained for 10 epochs, using batches of 20 sequences. We tested 16 combinations of the remaining parameters: the number of recurrent layers was 1 or 2, the keep probability was 1 or 0.5, and the hidden state size was 10, 20, 40, or 80. For a particular mixture model, all the RNNs used the same parameter values.
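For reference, the 16 configurations correspond to the grid below (a sketch; the dictionary keys are illustrative, not our exact TensorFlow variable names):

```python
from itertools import product

# 2 layer settings x 2 keep probabilities x 4 hidden sizes = 16 runs,
# each with the fixed training settings described above.
grid = [
    {"epochs": 10, "batch_size": 20, "layers": l, "keep_prob": p, "hidden": h}
    for l, p, h in product([1, 2], [1.0, 0.5], [10, 20, 40, 80])
]
assert len(grid) == 16
```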

5.1 Results
For each model, we evaluated the prediction performance by truncating the page view sequences at different lengths. Table 2 shows the results for the mixture of Markov models and for one of the mixture-of-RNNs trials. Although we tested 16 different RNN parameter combinations, the results were so similar that we report only one of them.
   Table 2 reports precision, recall, and F1-measure for each sequence outcome when considering 25%, 50%, 75%, and 100% of the total length of the sequence. For instance, when splitting at 50%, the Markov chain model predicts a PURCHASE with 0.42 precision and 0.11 recall, resulting in an overall F1-measure of 0.17. Under the same conditions, the RNN-based model reaches a precision of 0.82 with 0.71 recall and an F1-measure of 0.76. We also report the accuracy obtained when randomly selecting a class based on the prior distribution of the clickstream corpus. RNN mixture components substantially outperform Markov chain components. This is particularly evident in Figure 2, which shows the F1-measure by sequence length for the mixtures of MCMs (dotted lines) and of RNNs (solid lines). Both models monotonically increase performance as they observe more data, from 25% splits to 100%, but the mixture of RNNs has an immediate edge even with the short 25% sequences. The MCMs present similar F1-measures for the majority class (i.e., BROWSING-ONLY), but they are penalized by the lack of data for the less represented sequences (i.e., 14.7% for ABANDON and 20.7% for PURCHASE). The RNNs instead generalize better, due to the memory component that can model long-distance dependencies.

[Figure 2: F1-measure by sequence length for mixtures of Markov Chain Models (MCM) and RNNs for each label: ABN=abandoned; BRO=browsing-only; PRC=purchase.]

   Similarly to Bertsimas et al. [4], in Figure 3 we plot 100 log-probability trajectories, with lengths from 2 through 20 page views, estimated by the MCM mixture along the page view sequences for each class. This plot demonstrates how probabilities evolve during interactions and how confident each model is compared to the others. The legend in Figure 3 corresponds to the true final-state labels of the test examples: dashed red lines for ABANDON sequences, dashed-and-dotted green lines for BROWSING-ONLY sequences, and solid blue lines for PURCHASE sequences.

[Figure 3: Log-probability trajectories from the MCM mixture, progressing along page view sequences for each class.]

Table 2: Precision (P), Recall (R), and F1-measure (F1) for MCM and RNN mixtures by sequence length.

                                   25%              50%              75%             100%
  Mixture  Final State       P     R    F1     P     R    F1     P     R    F1     P     R    F1   Random
  MCM      ABANDON        0.44  0.02  0.04  0.45  0.09  0.14  0.72  0.52  0.61  0.70  0.64  0.67     0.21
           BROWSING-ONLY  0.66  0.99  0.79  0.69  0.98  0.81  0.89  0.97  0.92  0.99  0.95  0.97     0.64
           PURCHASE       0.37  0.02  0.04  0.42  0.11  0.17  0.72  0.67  0.69  0.71  0.86  0.78     0.15
  RNN      ABANDON        0.75  0.39  0.52  0.73  0.62  0.67  0.74  0.72  0.73  1.00  0.84  0.91     0.21
           BROWSING-ONLY  0.84  0.99  0.91  0.92  0.99  0.96  0.97  0.99  0.98  1.00  0.98  0.99     0.64
           PURCHASE       0.80  0.64  0.71  0.82  0.71  0.76  0.82  0.78  0.80  0.86  1.00  0.92     0.15


Ideally, the model would score the PURCHASE sequences (solid blue lines) high and the other sequences low, and the earlier the distinction can be made, the better. Looking at this figure, there does appear to be some level of discrimination between the categories. In general, the BROWSING-ONLY sequences seem more separable from the PURCHASE sequences than the ABANDON sequences are, as expected.
   Although Table 2 can be used for comparisons with other work [10], it depends on sequence lengths. A useful live system must predict final actions before sequences are complete and needs a decision process for accepting the predicted label. We experimented with taking the highest-scoring label once the entropy of the class distribution fell below a threshold. Figure 4 shows the F1-measures at different thresholds, expressed as proportions of the maximum possible entropy. Since the entropy might never drop below the threshold, it is important to consider how many sequences receive predicted labels. When we calculated the F1-measures, sequences without predictions were counted as misses.

[Figure 4: F1-measures at Entropy Thresholds for each class (Abandon, No_Cart_Interaction, Purchase).]

   Figure 5 shows the number of sequences for which predictions were made at each entropy threshold. Again, the horizontal axis represents the proportion of the maximum possible entropy value, and the vertical axis represents the number of sequences for which a decision can be made based on threshold crossing. As an example, a threshold of 0.55 led to reasonable F1-measures while producing predictions for 99% of the sequences before they were complete. Choosing higher entropy thresholds allows decisions to be made for more sequences, but performance can suffer, since decisions can be made while the class probabilities are more uniform and confidence is lower. Choosing lower entropy thresholds forces the class probabilities to be more distinct, which leads to more confident decisions, but performance starts to suffer when fewer sequences receive decisions. In practice, the threshold would be tuned on held-out data.

[Figure 5: Sequences Crossing Entropy Thresholds.]
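A minimal sketch of this entropy-gated decision rule, assuming class posteriors from one of the mixtures are available after each click (names and the example posteriors are illustrative):

```python
import numpy as np

def decide(posteriors, threshold_frac):
    """Entropy-gated early decision over class posteriors.

    posteriors     -- length-3 array: P(class | clicks so far)
    threshold_frac -- fraction of the maximum entropy, log(3)

    Returns the argmax class once the posterior entropy falls below
    threshold_frac * log(3); otherwise None (withhold the prediction,
    which the evaluation above counts as a miss).
    """
    p = np.asarray(posteriors, dtype=float)
    entropy = -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))
    if entropy < threshold_frac * np.log(len(p)):
        return int(np.argmax(p))
    return None

# E.g., with the 0.55 threshold discussed above:
print(decide([0.10, 0.85, 0.05], 0.55))  # confident enough -> class 1
print(decide([0.30, 0.40, 0.30], 0.55))  # too uniform -> None
```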

6 CONCLUSION AND FUTURE WORK
We presented two models for the real-time, early prediction of shopping session outcomes on an e-commerce platform. We demonstrated that LSTM RNNs generalize better, and with less data, than the high-order Markov chain models used in previous work. Our approach, in addition to distinguishing browsing-only and cart-interaction sessions, can also accurately discriminate between cart-abandonment and purchase sessions. Future work will focus on features, single RNN architectures, and decision strategies.

REFERENCES
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
 [2] Aditya Awalkar, Ibrahim Ahmed, and Tejas Nevrekar. 2016. Prediction of User’s
     Purchases using Clickstream Data. International Journal of Engineering Science
     and Computing (2016).
 [3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003.
     A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (March 2003),
     1137–1155.
 [4] Dimitris Bertsimas, Adam J. Mersereau, and Nitin R. Patel. 2003. Dynamic
     Classification of Online Customers. In Proceedings of the Third SIAM International
     Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003. 107–118.
 [5] Robert Florentin. 2016. Online shopping abandonment rate a new perspective: the
     role of choice conflicts as a factor of online shopping abandonment. Master’s thesis.
     University of Twente.
 [6] Joshua T. Goodman. 2001. A Bit of Progress in Language Modeling. Comput.
     Speech Lang. 15, 4 (Oct. 2001), 403–434.
 [7] Bo-June Paul Hsu and James R. Glass. 2008. Iterative language model estimation: efficient data structure & algorithms. In INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008. 841–844.
 [8] Monika Kukar-Kinney and Angeline G. Close. 2010. The determinants of con-
     sumers’ online shopping cart abandonment. Journal of the Academy of Marketing
     Science 38, 2 (2010), 240–250.
 [9] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In The 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010).
[10] Deepak Pai, Abhijit Sharang, Meghanath Macha Yadagiri, and Shradha Agrawal.
     2014. Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach.
     Springer International Publishing, 135–145.
[11] The Statistics Portal. 2014. Digital buyer penetration worldwide from 2014 to 2019. http://www.statista.com/statistics/261676/digital-buyer-penetration-worldwide/. (2014). Accessed: 2016-09-10.
[12] Armando Vieira. 2015. Predicting online user behaviour using deep learning
     algorithms. The Computing Research Repository (CoRR) abs/1511.06247 (2015).
[13] Zhenzhou Wu, Bao Hong Tan, Rubing Duan, Yong Liu, and Rick Siow Mong Goh.
     2015. Neural Modeling of Buying Behaviour for E-Commerce from Clicking
     Patterns. In Proceedings of the 2015 International ACM Recommender Systems
     Challenge (RecSys ’15 Challenge). ACM, New York, NY, USA, Article 12, 4 pages.
[14] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A Brief Survey on Sequence
     Classification. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 40–48.
[15] Yin Xu and Jin-Song Huang. 2015. Factors Influencing Cart Abandonment in
     the Online Shopping Process. Social Behavior and Personality: an international
     journal 43, 10 (Nov. 2015).
[16] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Net-
     work Regularization. The Computing Research Repository (CoRR) abs/1409.2329
     (2014). http://arxiv.org/abs/1409.2329