Predicting Shopping Behavior with Mixture of RNNs

Arthur Toth
Rakuten Institute of Technology
Boston, Massachusetts 02110
arthur.toth@rakuten.com

Louis Tan
Rakuten Institute of Technology
Boston, Massachusetts 02110
ts-louis.tan@rakuten.com

Giuseppe Di Fabbrizio
Rakuten Institute of Technology
Boston, Massachusetts 02110
difabbrizio@gmail.com

Ankur Datta
Rakuten Institute of Technology
Boston, Massachusetts 02110
ankur.datta@rakuten.com

ABSTRACT
We compare two machine learning approaches for early prediction of shoppers' behaviors, leveraging features from clickstream data generated during live shopping sessions. Our baseline is a mixture of Markov models to predict three outcomes: purchase, abandoned shopping cart, and browsing-only. We then experiment with a mixture of Recurrent Neural Networks. When sequences are truncated to 75% of their length, a relatively small feature set predicts purchase with an F-measure of 0.80 and browsing-only with an F-measure of 0.98. We also investigate an entropy-based decision procedure.

CCS CONCEPTS
• Computing methodologies → Neural networks; • Applied computing → Online shopping;

KEYWORDS
mining/modeling search activity, click models, behavioral analysis

ACM Reference format:
Arthur Toth, Louis Tan, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Predicting Shopping Behavior with Mixture of RNNs. In Proceedings of SIGIR 2017 eCom, Tokyo, Japan, August 2017, 5 pages.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1 INTRODUCTION

Recent e-commerce forecast analysis estimates that more than 1.77 billion users will shop online by the end of 2017 [11]. Although this is impressive growth, conversion rates for online shoppers are substantially lower than rates for traditional brick-and-mortar stores.

Consumers shopping on e-commerce web sites are influenced by numerous factors and may decide to stop the current session, leaving products in their shopping carts. Once a user has interacted with a shopping cart, abandonment rates range between 25% and 88%, significantly reducing merchants' selling opportunities [15]. Several potential purchase inhibitors have been analyzed in the online shopping literature [5, 8]. Main causes include concerns about costs, perceived decision difficulty, and selection conflicts due to a large number of similar choices.

Early detection of shopping abandonment may allow mitigation of purchase inhibitors by enabling injection of new content into live web browsing sessions. For instance, it could trigger a discount offer or change the product search strategy to retrieve more diverse options and simplify the consumer's decision process.

This paper considers real web interactions from a US e-commerce subsidiary of Rakuten (楽天株式会社) to predict three possible outcomes: purchase, abandoned shopping cart, and browsing-only. To detect outcomes early, we consider clues left behind by shoppers that are encoded into clickstream data and logged during each session. The clickstream data is used to experiment with mixtures of high-order Markov Chain Models (MCMs) and mixtures of Recurrent Neural Networks (RNNs) that use the Long Short-Term Memory (LSTM) architecture. We compare and contrast each model on sequences truncated at different lengths and report precision, recall, and F-measures. We also show F-measures from an entropy-based decision procedure that is usable in a live system.

2 RELATED WORK

We treat predicting user behavior from clickstream data as sequence classification, which is broadly surveyed by Xing et al. [14], who divide it into feature-based, sequence distance-based, and model-based methods. Previous feature-based work on clickstream classification includes the random forest used by Awalkar et al. [2], the deep belief networks and stacked denoising auto-encoders by Vieira [12], and the recurrent neural networks by Wu et al. [13]. Previous distance-based work includes the large margin nearest neighbor approach by Pai et al. [10]. Previous model-based work by Bertsimas et al. [4] used a mixture of Markov chains.

Our baseline approach is based on the latter work, whereas our new approach uses a mixture of RNNs. Although Wu et al. [13] used RNNs, their approach is not applicable to our scenario, since its bi-directional RNN uses entire clickstream sequences, while our goal is to classify incomplete sequences. Also, their model is not a mixture.

Finally, our approach differs from most others in its use of a ternary classification scheme. We classify clickstreams as being a purchase, abandon, or browsing-only session instead of just purchase and non-purchase.

3 CLICKSTREAM DATA

We consider clickstream data collected over two weeks and consisting of n_0 = 1,560,830 sessions. A session S_i, i = 1, 2, ..., n_0, is a chronological sequence of recorded page views, or "clicks." Let V_{i,j} be the jth page view of session i, so that S_i = (V_{i,1}, V_{i,2}, ..., V_{i,m_i}), where m_i is defined as the length of session i. We exclude session i from our experiments if m_i < 4, after which n = 198,936 sessions remain. Sessions with purchases are truncated before the first purchase confirmation. Table 1 summarizes the n sessions.

Table 1: Clickstream data distribution.

  Clickstream outcome   Count     %       Average Length (m̄_i)   Median Length
  ABANDON               29,371    14.7%   8.2                     5
  BROWSING-ONLY         128,450   64.6%   16.6                    6
  PURCHASE              41,115    20.7%   8.4                     5
  Total                 198,936   100%

Our experiments use both the page type and dwell time of V_{i,j}. The page type of V_{i,j}, denoted P_{i,j}, belongs to one of eight categories, including search pages, product view pages, login pages, etc. The dwell time D_{i,j} of V_{i,j} is the amount of time the user spends viewing the page, and it is not available until the (j+1)th page view. After the jth page view, the clickstream data gathered for session i is given by S_i|_j = ((P_{i,1}, D_{i,0}), (P_{i,2}, D_{i,1}), ..., (P_{i,j}, D_{i,j-1})), where D_{i,0} is undefined, i.e., D_{i,0} = ∅. To reduce sparsity, dwell times were placed in 8 bins, evenly spaced by percentiles.
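For concreteness, the percentile binning can be sketched as follows. This is a minimal numpy illustration of the idea, not the paper's code; the function name and example values are ours.

```python
import numpy as np

def bin_dwell_times(dwell_times, n_bins=8):
    """Bin dwell times into n_bins bins evenly spaced by percentiles.

    `dwell_times` is a flat array of observed dwell times D_{i,j}
    (the undefined D_{i,0} entries are excluded before calling this).
    Returns the bin edges and a bin index for each dwell time.
    """
    # Percentile-based edges: 8 bins need 7 interior cut points.
    percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.percentile(dwell_times, percentiles)
    # np.digitize maps each value to a bin id in [0, n_bins - 1].
    bins = np.digitize(dwell_times, edges)
    return edges, bins

# Hypothetical usage: dwell times (in seconds) pooled across sessions.
example = np.array([1.2, 3.4, 0.5, 120.0, 45.3, 7.7, 2.2, 60.1, 9.9, 15.0])
edges, bins = bin_dwell_times(example)
```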
4 MODELING APPROACHES

Our goal is to classify customer behavior into final decision categories. In particular, clickstream sequences receive one of the following labels: PURCHASE, if the sequence leads to an item purchase; ABANDON, if an item was left in the shopping cart but there was no purchase; and BROWSING-ONLY, when the shopping cart was not used. The final two categories can be combined to investigate PURCHASE vs. NON_PURCHASE behavior. In preliminary studies, the ABANDON sequences were much more similar to the PURCHASE sequences than to the BROWSING-ONLY sequences, so having three categories helped account for some of the confusability of the data. Our eventual goal of applying our models in a live system adds a constraint: the classifier must work for incomplete sequences, without using data from the "future."

4.1 Mixture of High-Order Markov Chains

Bertsimas et al. [4] modeled a similar problem with a mixture of Markov chains, using maximum a posteriori estimation to predict sequence labels. In this approach, the training data is partitioned by class and separate Markov chain models are trained for each one. The resultant models can estimate class likelihoods from sequences. Let C_i be a random variable representing the outcome of session S_i, where C_i ∈ Ω and Ω is the set of possible clickstream outcomes. The models are used to estimate the likelihoods P(S_i | C_i = ω) for all ω ∈ Ω. Using Bayes' theorem and the law of total probability, the class posteriors for each of the three classes can be estimated by the equation

  P(C_i = \omega \mid S_i) = \frac{P(S_i \mid C_i = \omega)\, P(C_i = \omega)}{\sum_{\omega' \in \Omega} P(S_i \mid C_i = \omega')\, P(C_i = \omega')}

with the prior, P(C_i = ω), estimated from counts.

This model fits the problem's constraints, because likelihoods can be produced from subsequences without using "future" clicks. Although each chain is trained only on click data, separating the data by class implicitly conditions the chains on class.

Taking inspiration from the Automatic Speech Recognition (ASR) community and similarities to language modeling, we adapted some of its more recent techniques to our problem. In preliminary experiments, 5-grams performed better than shorter chains, so we used them. Longer chains cause greater sparsity, so we addressed this with Kneser-Ney smoothing, which performed best in a study of language modeling techniques [6]. We used the MITLM toolkit to train the Markov chains [7].
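To make the decision rule concrete, the following is a minimal sketch of MAP classification over a session prefix, assuming each class's 5-gram chain (e.g., trained with MITLM) already provides a prefix log-likelihood. The helper name and example scores are ours; the priors are taken from Table 1.

```python
import numpy as np

def classify_prefix(log_likelihoods, log_priors):
    """MAP classification of a (possibly incomplete) session.

    log_likelihoods: dict mapping class label -> log P(S_i | C_i = w),
        scored by that class's Markov chain on the observed prefix only.
    log_priors: dict mapping class label -> log P(C_i = w),
        estimated from class counts in the training data.
    Returns (best_label, dict of posteriors).
    """
    labels = sorted(log_likelihoods)
    # Unnormalized log posterior: log P(S|w) + log P(w).
    scores = np.array([log_likelihoods[w] + log_priors[w] for w in labels])
    # Normalize with the log-sum-exp trick (the Bayes denominator).
    scores -= scores.max()
    post = np.exp(scores)
    post /= post.sum()
    posteriors = dict(zip(labels, post))
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical chain scores for one truncated session; priors from Table 1.
ll = {"PURCHASE": -41.2, "ABANDON": -43.8, "BROWSING-ONLY": -39.5}
lp = {"PURCHASE": np.log(0.207), "ABANDON": np.log(0.147),
      "BROWSING-ONLY": np.log(0.646)}
label, post = classify_prefix(ll, lp)
```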
4.2 Recurrent Neural Networks

Taking further inspiration from the ASR community, we replaced the Markov chains in our mixtures with RNNs. Earlier language modeling work used feed-forward artificial neural networks [3], but RNNs have performed better recently, both in direct likelihood metrics and in overall ASR task metrics [9, 16]. Clickstream data differs from ASR text, and our mixture model differed from the typical ASR approach, so it was unclear whether RNNs would help in our scenario.

[Figure 1: Simple RNN to Predict Following Token]

Earlier RNN-based language models used the "simple recurrent neural network" architecture [9]. The underlying idea, depicted in Figure 1, is that an input sequence, represented by {X_0, ..., X_{t-1}}, is connected to a recurrent layer, represented by {A_0, ..., A_t}, which is also connected to itself at the next point in time. The recurrent layer is also connected to an output layer. In our models, each RNN tries to predict the next input, X_{n+1}, after each input, X_n.
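Because such a model only ever conditions on past clicks, the likelihood of an observed prefix is simply the chained probabilities of the tokens that actually occurred. A minimal sketch, assuming the per-step softmax outputs are available as arrays (function name and values are ours):

```python
import numpy as np

def sequence_log_prob(step_softmaxes, token_ids):
    """Log-probability of an observed click sequence under a next-token model.

    step_softmaxes[t] is the model's output distribution after consuming
    tokens 0..t (i.e., its prediction for token t+1); token_ids are the
    integer ids of the observed clicks. Only past inputs are used, so the
    same code scores truncated sessions without touching "future" clicks.
    """
    log_p = 0.0
    for t in range(len(token_ids) - 1):
        log_p += np.log(step_softmaxes[t][token_ids[t + 1]])
    return log_p

# Toy 3-token vocabulary and a 3-click session (values are illustrative).
softmaxes = [np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.1, 0.8])]
print(sequence_log_prob(softmaxes, [1, 1, 2]))  # log(0.5) + log(0.8)
```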
"Simple" RNN-based language models for ASR were outperformed by RNNs using the Long Short-Term Memory (LSTM) configuration and "drop-out" [16]. LSTM addressed the "vanishing gradient" and "exploding gradient" problems. "Drop-out" addressed overfitting by probabilistically ignoring the contributions of non-recurrent nodes during training.

In LSTM RNNs, some nodes function as "memory cells." Some connections retrieve information from them, and others cause them to forget. The LSTM equations are [16]:

  \mathrm{LSTM}: h_t^{l-1}, h_{t-1}^{l}, c_{t-1}^{l} \rightarrow h_t^{l}, c_t^{l}

  \begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} =
  \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
  T_{2n,4n} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}

  c_t^{l} = f \odot c_{t-1}^{l} + i \odot g

  h_t^{l} = o \odot \tanh(c_t^{l})

where h and c represent hidden states and memory cells, subscripts refer to time, superscripts refer to layers, T_{n,m} is an affine transform from R^n to R^m, \odot denotes element-wise multiplication, and sigm and tanh are applied to each element.

Since LSTM RNNs with drop-out worked much better than Markov chains for ASR, we replaced the Markov chains with them in our clickstream mixture models. Additional work was necessary, since our scenario did not exactly match language modeling. During training, input tokens were still used to predict following tokens. During testing, however, our goal was the sequence probabilities. These were calculated from the token probabilities present in intermediate softmax layers of each LSTM model. Due to the network architecture, "future" events were not used for this. We used TensorFlow to implement our LSTM RNNs [1].
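As a sanity check on the equations above, here is a direct numpy transcription of one LSTM step. This is an illustration, not our TensorFlow implementation: the single weight matrix W and bias b stand in for the affine map T_{2n,4n}, and the sizes and weights are placeholders.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step following the equations above.

    x, h_prev: vectors in R^n (input from the layer below and the
    previous hidden state); W (4n x 2n) and b (4n) implement T_{2n,4n}.
    Returns the new hidden state h and memory cell c.
    """
    z = W @ np.concatenate([x, h_prev]) + b   # T_{2n,4n} applied to (x; h_prev)
    n = h_prev.shape[0]
    i = sigm(z[:n])            # input gate
    f = sigm(z[n:2 * n])       # forget gate
    o = sigm(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:])     # candidate update
    c = f * c_prev + i * g     # c_t = f (.) c_{t-1} + i (.) g
    h = o * np.tanh(c)         # h_t = o (.) tanh(c_t)
    return h, c

# Hypothetical sizes and random weights, for illustration only.
n = 4
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * n, 2 * n)), np.zeros(4 * n)
h = c = np.zeros(n)
h, c = lstm_step(rng.normal(size=n), h, c, W, b)
```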
5 EXPERIMENTS

We experimented on the dataset described in Section 3 and summarized in Table 1. It was partitioned into an 80% training / 20% testing split, using stratified sampling to retain class proportions.

All RNNs were trained for 10 epochs, using batches of 20 sequences. We tested 16 combinations of the remaining parameters: the number of recurrent layers was 1 or 2, the keep probability was 1 or 0.5, and the hidden state size was 10, 20, 40, or 80. For a particular mixture model, all the RNNs used the same parameter values.

5.1 Results

For each model, we evaluated prediction performance by truncating the page view sequences at different lengths. Table 2 shows the results for the mixture of Markov models and for one of the mixture-of-RNNs trials. Although we tested 16 different RNN parameter combinations, the results were so similar that we report only one of them.

Table 2: Precision (P), Recall (R), and F1-measure (F1) for MCM and RNN mixtures by sequence length.

  MCM mixture          25%               50%               75%               100%
  Final State       P     R     F1    P     R     F1    P     R     F1    P     R     F1    Random
  ABANDON           0.44  0.02  0.04  0.45  0.09  0.14  0.72  0.52  0.61  0.70  0.64  0.67  0.21
  BROWSING-ONLY     0.66  0.99  0.79  0.69  0.98  0.81  0.89  0.97  0.92  0.99  0.95  0.97  0.64
  PURCHASE          0.37  0.02  0.04  0.42  0.11  0.17  0.72  0.67  0.69  0.71  0.86  0.78  0.15

  RNN mixture          25%               50%               75%               100%
  Final State       P     R     F1    P     R     F1    P     R     F1    P     R     F1    Random
  ABANDON           0.75  0.39  0.52  0.73  0.62  0.67  0.74  0.72  0.73  1.00  0.84  0.91  0.21
  BROWSING-ONLY     0.84  0.99  0.91  0.92  0.99  0.96  0.97  0.99  0.98  1.00  0.98  0.99  0.64
  PURCHASE          0.80  0.64  0.71  0.82  0.71  0.76  0.82  0.78  0.80  0.86  1.00  0.92  0.15

Table 2 reports precision, recall, and F1-measure for each sequence outcome when considering 25%, 50%, 75%, and 100% of the total length of the sequence. For instance, when splitting at 50%, the Markov chain model predicts a PURCHASE with 0.42 precision and 0.11 recall, resulting in an overall F1-measure of 0.17. Under the same conditions, the RNN-based model reaches a precision of 0.82 with 0.71 recall and an F1-measure of 0.76. We also report the accuracy obtained by randomly selecting a class based on the prior distribution of the clickstream corpus. RNN mixture components substantially outperform Markov chain components.

This is particularly evident from Figure 2, which shows the F1-measure by sequence length for the mixtures of MCMs (dotted lines) and of RNNs (solid lines). Both models monotonically increase performance as they observe more data from the 25% splits to 100%, but the mixture of RNNs has an immediate edge even with the short 25% sequences. The MCMs present similar F1-measures for the majority class (i.e., BROWSING-ONLY), but they are penalized by the lack of data for the less represented sequences (i.e., 14.7% for ABANDON and 20.7% for PURCHASE). The RNNs instead generalize better due to the memory component, which can model long-distance dependencies.

[Figure 2: F1-measure by sequence length for mixtures of Markov Chain Models (MCM) and RNNs for each label: ABN=abandoned; BRO=browsing-only; PRC=purchase. X-axis: 25%, 50%, 75%, 100% of sequence length; Y-axis: F1-measure.]

Similarly to Bertsimas et al. [4], in Figure 3 we plot 100 log-probability trajectories, with lengths from 2 through 20 page views, estimated by the MCM mixture along the page view sequences for each class. This plot demonstrates how probabilities evolve during interactions and how confident each model is compared to the others. The legend in Figure 3 corresponds to the true final-state labels of the test examples: dashed red lines for ABANDON sequences, dashed-and-dotted green lines for BROWSING-ONLY sequences, and solid blue lines for PURCHASE sequences. Ideally, the model would score the PURCHASE sequences (solid blue lines) high and the other sequences low, and the earlier the distinction can be made, the better. Looking at this figure, there does appear to be some level of discrimination between the categories. In general, the BROWSING-ONLY sequences seem more separable from the PURCHASE sequences than the ABANDON sequences, as expected.

[Figure 3: Log-probability trajectories from the MCM mixture, progressing along page view sequences for each class (ABANDON, BROWSING-ONLY, PURCHASE).]

Although Table 2 can be used for comparison with other work [10], it depends on sequence lengths. A useful live system must predict final actions before sequences are complete and needs a decision process for accepting the predicted label. We experimented with taking the highest scoring label once the entropy of the label distribution fell below a threshold, as sketched below.
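A minimal sketch of this decision rule, assuming class posteriors are available at each page view (the function and example values are ours):

```python
import numpy as np

def entropy_decision(posteriors, threshold_fraction):
    """Accept the top label once posterior entropy is low enough.

    posteriors: array of class probabilities at the current page view.
    threshold_fraction: proportion of the maximum possible entropy
    (log K for K classes) below which a decision is emitted.
    Returns the predicted class index, or None to keep waiting.
    """
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p))
    max_entropy = np.log(len(p))   # entropy of the uniform distribution
    if entropy < threshold_fraction * max_entropy:
        return int(np.argmax(p))
    return None

# Example: fairly peaked posterior, threshold at 0.55 of maximum entropy.
print(entropy_decision(np.array([0.85, 0.10, 0.05]), 0.55))  # -> 0
```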
Figure 4 shows the F1-measures at different thresholds, which are expressed as proportions of the maximum possible entropy. Since the entropy might not drop below the threshold, it is important to consider how many sequences receive predicted labels. When we calculated F1-measures, sequences without predictions were counted as misses (see the sketch at the end of this section).

[Figure 4: F1-measures at Entropy Thresholds for each class. X-axis: Entropy Threshold (Proportion of Maximum Entropy), from 0.99 down to 0.11; Y-axis: F1-measure; series: Abandon, No_Cart_Interaction, Purchase.]

Figure 5 shows the number of sequences where predictions were made at each entropy threshold. Again, the horizontal axis represents the proportion of the maximum possible entropy value, and the vertical axis represents the number of sequences where a decision can be made based on threshold crossing. As an example, a threshold of 0.55 led to reasonable F1-measures while producing predictions for 99% of the sequences before they were complete. Choosing higher entropy thresholds allows decisions to be made for more sequences, but performance can suffer, since decisions can be made while class probabilities are more uniform and confidence is lower. Choosing lower entropy thresholds forces the class probabilities to be more distinct, which leads to more confident decisions, but performance starts to suffer when fewer sequences receive decisions. In practice, the threshold would be tuned on held-out data.

[Figure 5: Sequences Crossing Entropy Thresholds. X-axis: Entropy Threshold (Proportion of Maximum Entropy); Y-axis: Number of Sequences (0–45,000).]
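Counting undecided sequences as misses affects only recall, since an abstention is never a positive prediction. A minimal sketch of the per-class F1 computation we describe, with abstentions represented as None (helper and example values are ours):

```python
def f1_with_abstentions(y_true, y_pred, positive):
    """F1 for one class when some sequences receive no prediction (None).

    Abstentions can never be true positives, so they lower recall,
    matching the evaluation described above.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One abstention on a true PURCHASE drops recall to 0.5, giving F1 = 0.67.
y_true = ["PURCHASE", "ABANDON", "PURCHASE", "BROWSING-ONLY"]
y_pred = ["PURCHASE", None, None, "BROWSING-ONLY"]
print(f1_with_abstentions(y_true, y_pred, "PURCHASE"))
```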
6 CONCLUSION AND FUTURE WORK

We presented two models for the real-time, early prediction of shopping session outcomes on an e-commerce platform. We demonstrated that LSTM RNNs generalize better and with less data than the high-order Markov chain models used in previous work. Our approach, in addition to distinguishing browsing-only and cart-interaction sessions, can also accurately discriminate between cart abandonment and purchase sessions. Future work will focus on features, single RNN architectures, and decision strategies.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/ Software available from tensorflow.org.
[2] Aditya Awalkar, Ibrahim Ahmed, and Tejas Nevrekar. 2016. Prediction of User's Purchases using Clickstream Data. International Journal of Engineering Science and Computing (2016).
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (March 2003), 1137–1155.
[4] Dimitris Bertsimas, Adam J. Mersereau, and Nitin R. Patel. 2003. Dynamic Classification of Online Customers. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003. 107–118.
[5] Robert Florentin. 2016. Online shopping abandonment rate, a new perspective: the role of choice conflicts as a factor of online shopping abandonment. Master's thesis. University of Twente.
[6] Joshua T. Goodman. 2001. A Bit of Progress in Language Modeling. Comput. Speech Lang. 15, 4 (Oct. 2001), 403–434.
[7] Bo-June Paul Hsu and James R. Glass. 2008. Iterative language model estimation: efficient data structure & algorithms. In INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008. 841–844.
[8] Monika Kukar-Kinney and Angeline G. Close. 2010. The determinants of consumers' online shopping cart abandonment. Journal of the Academy of Marketing Science 38, 2 (2010), 240–250.
[9] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In The 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010).
[10] Deepak Pai, Abhijit Sharang, Meghanath Macha Yadagiri, and Shradha Agrawal. 2014. Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach. Springer International Publishing, 135–145.
[11] The Statistics Portal. 2014. Digital buyer penetration worldwide from 2014 to 2019. http://www.statista.com/statistics/261676/digital-buyer-penetration-worldwide/. Accessed: 2016-09-10.
[12] Armando Vieira. 2015. Predicting online user behaviour using deep learning algorithms. The Computing Research Repository (CoRR) abs/1511.06247 (2015).
[13] Zhenzhou Wu, Bao Hong Tan, Rubing Duan, Yong Liu, and Rick Siow Mong Goh. 2015. Neural Modeling of Buying Behaviour for E-Commerce from Clicking Patterns. In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge). ACM, New York, NY, USA, Article 12, 4 pages.
[14] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A Brief Survey on Sequence Classification. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 40–48.
[15] Yin Xu and Jin-Song Huang. 2015. Factors Influencing Cart Abandonment in the Online Shopping Process. Social Behavior and Personality: an international journal 43, 10 (Nov. 2015).
[16] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. The Computing Research Repository (CoRR) abs/1409.2329. http://arxiv.org/abs/1409.2329