Predicting purchasing intent: Automatic Feature Learning using Recurrent Neural Networks

Humphrey Sheil, Cardiff University, Cardiff, Wales - sheilh@cardiff.ac.uk
Omer Rana, Cardiff University, Cardiff, Wales - RanaOF@cardiff.ac.uk
Ronan Reilly, Maynooth University, Maynooth, Ireland - Ronan.Reilly@mu.ie

ABSTRACT
We present a neural network for predicting purchasing intent in an Ecommerce setting. Our main contribution is to address the significant investment in feature engineering that is usually associated with state-of-the-art methods such as Gradient Boosted Machines. We use trainable vector spaces to model varied, semi-structured input data comprising categoricals, quantities and unique instances. Multi-layer recurrent neural networks capture both session-local and dataset-global event dependencies and relationships for user sessions of any length. An exploration of model design decisions, including parameter sharing and skip connections, further increases model accuracy. Results on benchmark datasets deliver classification accuracy within 98% of state-of-the-art on one dataset and exceed state-of-the-art on the second, without the need for any domain- or dataset-specific feature engineering, on both short and long event sequences.

KEYWORDS
Ecommerce, Deep Learning, Recurrent Neural Networks, Long Short Term Memory (LSTM), Embedding, Vector Space Models

ACM Reference Format:
Humphrey Sheil, Omer Rana, and Ronan Reilly. 2018. Predicting purchasing intent: Automatic Feature Learning using Recurrent Neural Networks. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2018 eCom). ACM, New York, NY, USA, 9 pages.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

1 INTRODUCTION
In the Ecommerce domain, merchants can increase their sales volume and profit margin by acquiring better answers to two questions:
  • Which users are most likely to purchase (predict purchasing intent)?
  • Which elements of the product catalogue do users prefer (rank content)?

By how much can merchants realistically increase profits? Table 1 illustrates that merchants can improve profit by between 2% and 11% depending on the contributing variable. In the fluid and highly competitive world of online retailing, these margins are significant, and understanding a user's shopping intent can positively influence three out of four major variables that affect profit. In addition, merchants increasingly rely on (and pay advertising fees to) much larger third-party portals (for example eBay, Google, Bing, Taobao, Amazon) to achieve their distribution, so any direct measure the merchant group can use to increase profit is sorely needed.

Table 1: Effect of improving different variables on operating profit, from [22]. In three out of four categories, knowing more about a user's shopping intent can be used to improve merchant profit.

                      McKinsey   A.T. Kearney   Affected by shopping intent
  Price management    11.1%      8.2%           Yes
  Variable cost       7.8%       5.1%           Yes
  Sales volume        3.3%       3.0%           Yes
  Fixed cost          2.3%       2.0%           No

Virtually all Ecommerce systems can be thought of as generators of clickstream data - a log of {item, userid, action} tuples which captures user interactions with the system. A chronological grouping of these tuples by user ID is commonly known as a session.

Predicting a user's intent to purchase is more difficult than ranking content for the following reasons [29]: clickers (users who only click and never purchase within a session) and buyers (users who click and also purchase at least one item within a single session) can appear very similar, right up until a purchase action occurs. Additionally, the ratio between clickers and buyers is always heavily imbalanced - it can be 20:1 in favour of clickers or higher. An uninterested user will often click on an item during browsing since there is no cost to doing so, but an uninterested user will not purchase an item. In our opinion, this user behaviour is in stark contrast to other settings, such as predicting whether a user will "like" or "pin" a piece of content hosted on a social media platform after viewing it, where there is no monetary amount at stake for the user. As noted in [31], shoppers behave differently when visiting online vs physical stores, and online conversion rates are substantially lower, for a variety of reasons.

When a merchant has increased confidence that a subset of users are more likely to purchase, they can use this information in the form of proactive actions to maximize conversion and yield. The merchant may offer a time-limited discount, spend more on targeted (and relevant) advertising to re-engage these users, create bundles of complementary products to push the user to complete their purchase, or even offer a lower-priced own-brand alternative if the product is deemed to be fungible.


However, there are counterweights to the desire to create ever more accurate models of online user behaviour - namely user privacy and ease of implementation. Users are increasingly reluctant to share personal information with online services, while complex Machine Learning models are difficult to implement and maintain in a production setting [28].
We surveyed existing work in this area and found that well-performing approaches have a number of factors in common:
  • Heavy investment in dataset-specific feature engineering was necessary, regardless of the model implementation chosen.
  • Model choices favour techniques such as Gradient Boosted Machines (GBM) [7] and Field-aware Factorization Machines (FFM) [24], which are well suited to creating representations of semi-structured clickstream data once good features have been developed [26], [36], [34].

In [29], an important feature class employed the notion of item similarity, modelled as a learned vector space generated by word2vec [18] and calculated using a standard pairwise cosine metric between item vectors. In an Ecommerce context, items are more similar if they co-occur frequently over all user sessions in the corpus and dissimilar if they co-occur infrequently. The items themselves may be physically dissimilar (for example, headphones and batteries), but they are often browsed and purchased together.
However, in common with other work, [29] still requires a heavy investment in feature engineering. The drawback of specific features is how tied they are to a domain, a dataset, or both. The ability of Deep Learning to discover good representations without explicit feature engineering is well known [8]. In addition, artificial neural networks (ANNs) perform well with distributed representations such as embeddings, and ANNs with a recurrence capability to model events over time - Recurrent Neural Networks (RNNs) - are well suited to sequence processing and labelling tasks [16].
Our motivation then is to build a good model of user intent prediction which does not rely on private user data and is also straightforward to implement in a real-world environment. What performance can RNNs with an appropriate input representation and end-to-end training regime achieve on the purchasing intent prediction task? Can this performance be achieved within the constraint of only processing anonymous session data while remaining straightforward to implement on other Ecommerce datasets?

2 RELATED WORK
The problem of user intent or session classification in an online setting has been heavily studied, with a variety of classic Machine Learning and Deep Learning modelling techniques employed. [26] was the original winner of the competition using one of the datasets considered here, using a commercial implementation of GBM with extensive feature engineering, and is still to our knowledge the State of the Art (SOTA) implementation for this dataset. However, the paper authors did make their model predictions freely available and we use these in the Experiments section to compare our model performance to theirs.
[10] uses RNNs on a subset of the same dataset to predict the next session click (regardless of user intent), so it removed 1-click sessions and merged clickers and buyers, whereas this work remains focused on the user intent classification problem. [17] compares [10] to a variety of classical Machine Learning algorithms on multiple datasets and finds that performance varies considerably by dataset. [37] extends [10] with a variant of LSTM to capture variations in dwelltime between user actions. User dwelltime is considered an important factor in multiple implementations and has been addressed in multiple ways. For shopping behaviour prediction, [31] uses a mixture of Recurrent Neural Networks and treats the problem as a sequence-to-sequence translation problem, effectively combining two models (prediction and recommendation) into one. However, only sessions of length 4 or greater are considered - removing the bulk of sessions from consideration. From [12], we know that short sessions are very common in Ecommerce datasets; moreover, a user's most recent actions are often more important in deciphering their intent than older actions. Therefore we argue that all session lengths should be included. [23] adopts a tangential approach - still focused on predicting purchases, but using textual product metadata to correlate words and terms that suit a particular geographic market better than others. Broadening our focus to include the general use of RNNs in the Ecommerce domain, Recurrent Recommender Networks are used in [35] to incorporate temporal features with user preferences to improve recommendations and to predict future behavioural directions, but not purchase intent. [30] further extends [10] by focusing on data augmentation and compensating for shifts in the underlying distribution of the data.
In [15], the authors augment a more classical Machine Learning approach (Singular Value Decomposition or SVD) to better capture temporal information to predict user behaviour - an alternative to the unrolling methodology used in this paper.
Using embeddings as a learned representation is a common technique. In [1], embeddings are used to model items in a low-dimensional space to calculate a similarity metric; however, temporal ordering is discarded. Learnable embeddings are also used in [9] to model items, with purchase confirmation emails used as a high-quality signal of user intent. Unrolling events that exceed an arbitrary threshold to create a better input representation of user dwelltime or interest is addressed in [3]. In [6], Convolutional Neural Networks (CNNs) are used as the model implementation and micro-blogging content is analyzed rather than an Ecommerce clickstream.

3 OUR APPROACH
Classical Machine Learning approaches such as GBM work well and are widely used on Ecommerce data, at least in part because the data is structured. GBM is an efficient model as it enables an additive expansion in a set of basis functions or weak learners to continually minimize a residual error. One weakness of GBM is a propensity for overly deep or wide decision trees to over-fit the training data and thus record poor performance on the validation and test sets due to high variance [33], [34], although this can be controlled using hyperparameters (namely tree depth, learning rate, minimum weight to split a tree node (min_child_weight) and data sub-sampling). GBM also requires significant feature engineering effort and does not naturally process the sequence in order; rather, it consumes a compressed version of it (although it is possible to provide a one-hot vector representation of the input sequence as a feature).

Our approach is dual in nature - first we construct an input representation for clickstream / session data that eliminates the need for feature engineering; second, we design a model which can consume this input representation and predict user purchase intent in an end-to-end, sequence-to-prediction manner.

3.1 Embeddings as item / word representations
Natural Language Processing (NLP) tasks, such as information retrieval, part-of-speech tagging and chunking, operate by assigning a probability value to a sequence of words. To this end, language models have been developed, defining a mathematical model to capture statistical properties of words and the dependencies among them.
Learning good representations of input data is a central task in designing a machine learning model that can perform well. An embedding is a vector space model where words are converted to a low-dimensional vector, and semantically similar words are mapped to nearby points. Popular generators of word-to-vector mappings such as [18] and [21] operate in an unsupervised manner - predicting similarity or minimizing a perplexity metric using word co-occurrence counts over a target corpus. We decided to employ embeddings as our target representation since:
  • We can train the embeddings layer at the same time as training the model itself - promoting simplicity.
  • Ecommerce data is straightforward to model as a dictionary of words.
  • Embedding size can be increased or decreased based on dictionary size and word complexity during the architecture tuning / hyperparameter search phase.

Unlike [18] and [21], we chose not to pre-train the embeddings to minimize a perplexity error measure. Instead we allow the model to modify the embedding weights at training time by back-propagating the loss from a binary classification criterion.
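To make this concrete, the listing below is a minimal sketch (our illustration, assuming a PyTorch-style implementation rather than the exact code used for the experiments) of one trainable lookup table per field, with vocabulary sizes and widths following Table 5; because the tables are ordinary model parameters, the binary classification loss updates them during training. The price-variance field of Figure 2 is omitted for brevity.

# Sketch only: jointly-trained embeddings for the clickstream fields (PyTorch assumed).
import torch
import torch.nn as nn

class SessionEmbedder(nn.Module):
    def __init__(self, n_items, n_cats, n_times, n_prices, n_qtys):
        super().__init__()
        # One trainable lookup table per field; widths follow Table 5.
        self.item = nn.Embedding(n_items, 100)
        self.cat = nn.Embedding(n_cats, 10)
        self.time = nn.Embedding(n_times, 10)
        self.price = nn.Embedding(n_prices, 10)
        self.qty = nn.Embedding(n_qtys, 10)
        for emb in (self.item, self.cat, self.time, self.price, self.qty):
            nn.init.uniform_(emb.weight, -0.075, 0.075)  # init range used in the paper

    def forward(self, item_ids, cat_ids, time_bins, price_ids, qty_ids):
        # Each input is a LongTensor of shape (batch, seq_len); the concatenated
        # output (batch, seq_len, 140) is what the recurrent layers consume.
        return torch.cat([self.item(item_ids), self.cat(cat_ids), self.time(time_bins),
                          self.price(price_ids), self.qty(qty_ids)], dim=-1)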

3.2 Recurrent Neural Networks
Recurrent neural networks (RNNs) [27] are a specialized class of neural networks for processing sequential data. A recurrent network is deep in time rather than space and arranges hidden state vectors $h_t^l$ in a two-dimensional grid, where $t = 1 \ldots T$ is thought of as time and $l = 1 \ldots L$ is the depth. All intermediate vectors $h_t^l$ are computed as a function of $h_{t-1}^l$ and $h_t^{l-1}$. Through these hidden vectors, each output $y$ at some particular time step $t$ becomes an approximating function of all input vectors up to that time, $x_1, \ldots, x_t$ [13].

3.2.1 LSTM and GRU. Long Short-Term Memory (LSTM) [11] is an extension to colloquial or vanilla RNNs designed to address the twin problems of vanishing and exploding gradients during training [19]. Vanishing gradients make learning difficult as the correct (downward) trajectory of the gradient is difficult to discern, while exploding gradients make training unstable - both are undesirable outcomes. Long-term dependencies in the input data, which cause a deep computational graph that must iterate over the data, are the root cause of vanishing / exploding gradients. [8] explain this phenomenon succinctly. Like all deep learning models, RNNs require multiplication by a matrix $W$. After $t$ steps, this equates to multiplying by $W^t$. Therefore:

    $W^t = (V \mathrm{diag}(\lambda) V^{-1})^t = V \mathrm{diag}(\lambda)^t V^{-1}$    (1)

Eigenvalues ($\lambda$) that are not more or less equal to 1 will either explode if they are > 1, or vanish if they are < 1. Gradients will then be scaled by $\mathrm{diag}(\lambda)^t$.
LSTM solves this problem by possessing an internal recurrence which stabilizes the gradient flow, even over long sequences. However, this comes at a price of complexity. For each element in the input sequence, each layer computes the following function:

    $i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})$
    $f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})$
    $g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})$
    $o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})$
    $c_t = f_t \ast c_{t-1} + i_t \ast g_t$
    $h_t = o_t \ast \tanh(c_t)$    (2)

where:
$h_t$ is the hidden state at time $t$,
$c_t$ is the cell state at time $t$,
$x_t$ is the hidden state of the previous layer at time $t$ (or the input, for the first layer),
$i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell and output gates respectively,
$\sigma$ is the sigmoid function and $\ast$ denotes element-wise multiplication.

Figure 1: A single LSTM cell, depicting the hidden and cell states, as well as the three gates controlling memory (input, forget and output).

Gated Recurrent Units, or GRU [5], are a simplification of LSTM, with one less gate and the hidden state and cell state vectors combined. In practice, LSTM and GRU are used interchangeably and the performance difference between the two cell types is often minimal and / or dataset-specific.
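For reference, a single LSTM step following equation (2) can be written out directly; the sketch below is illustrative NumPy (the dictionary keys for W and b are our own naming), not the library cell used in practice.

# Minimal NumPy sketch of one LSTM step, mirroring equation (2).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W holds the eight weight matrices and b the eight bias vectors, keyed by gate.
    i_t = sigmoid(W["ii"] @ x_t + b["ii"] + W["hi"] @ h_prev + b["hi"])
    f_t = sigmoid(W["if"] @ x_t + b["if"] + W["hf"] @ h_prev + b["hf"])
    g_t = np.tanh(W["ig"] @ x_t + b["ig"] + W["hg"] @ h_prev + b["hg"])
    o_t = sigmoid(W["io"] @ x_t + b["io"] + W["ho"] @ h_prev + b["ho"])
    c_t = f_t * c_prev + i_t * g_t          # new cell state
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t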


4 IMPLEMENTATION

4.1 Datasets used
The RecSys 2015 Challenge [2] and the Retail Rocket Kaggle [25] datasets provide anonymous Ecommerce clickstream data well suited to testing purchase prediction models. Both datasets are reasonable in size - consisting of 9.2 million and 1.4 million user sessions respectively. These sessions are anonymous and consist of a chronological sequence of time-stamped events describing user interactions (clicks) with content while browsing and shopping online. The logic used to mark the start and end of a user session is dataset-specific - the RecSys 2015 dataset contains more sessions with a small item catalogue, while the Retail Rocket dataset contains fewer sessions with an item catalogue 5x larger than the RecSys 2015 dataset. Both datasets contain a very high proportion of short sessions (<= 3 events), making this problem setting quite difficult for RNNs to solve. The Retail Rocket dataset contains much longer sessions when measured by duration - the RecSys 2015 user sessions are much shorter in duration. In summary, the datasets differ in important respects, and provide a good test of model generalization ability.
For both datasets, no sessions were excluded - both datasets in their entirety were used in training and evaluation. This means that for sequences with just one click, we require the trained embeddings to accurately describe the item, and the time of viewing by the user, to accurately classify the session, while for longer sessions we can rely more on the RNN model to extract information from the sequence. This decision makes the training task much harder for our RNN model, but is a fairer comparison to previous work using GBM where all session lengths were also included [26], [34], [29].
The RecSys 2015 Challenge dataset includes a dedicated test set, while the Retail Rocket dataset does not. We reserved 20% of the Retail Rocket dataset for use in prediction / evaluation - the same proportion as the RecSys 2015 Challenge test dataset.

Table 2: A short comparison of the two datasets used - RecSys 2015 and Retail Rocket.

                   RecSys 2015   Retail Rocket
  Sessions         9,249,729     1,398,795
  Buyer sessions   5.5%          0.7%
  Unique items     52,739        227,006

4.2 Data preparation
The RecSys 2015 challenge dataset consists of 9.2 million user-item click sessions. Sessions are anonymous and classes are imbalanced, with only 5% of sessions ending in one or more buy events. Each user session captures the interactions between a single user and items or products: $S_n = e_1, e_2, \ldots, e_k$, where $e_k$ is either a click or buy event. An example 2-event session is shown in Table 3.

Table 3: An example of a clicker session from the RecSys 2015 dataset.

  SID   Timestamp                  Item ID     Cat ID
  1     2014-04-07T10:51:09.277Z   214536502   0
  1     2014-04-07T10:57:09.868Z   214536500   0

Both datasets contain missing or obfuscated data - presumably for commercially sensitive reasons. Where sessions end with one or more purchase events, the item price and quantity values are provided only 30% of the time in the case of the RecSys 2015 dataset, while prices are obfuscated for commercial reasons in the Retail Rocket dataset. Therefore these elements of the data provide limited value.

Table 4: An example of the buy events from a buyer session (timestamp column elided for brevity).

  SID      Item ID     Price   Quantity
  420374   214537888   12462   1
  420374   214537850   10471   1

The Retail Rocket dataset consists of 1.4 million sessions. Sessions are also anonymous and are even more imbalanced - just 0.7% of the sessions end in a buy event. This dataset also provides item metadata, but in order to standardize our approach across both datasets, we chose not to use any information that was not common to both. In particular we discard and do not use the additional "addtobasket" event that is present in the Retail Rocket dataset. Since it is so closely correlated with the buy event (users add to a basket before purchasing that basket), it renders the buyer prediction task trivial and an AUC of 0.97 is easily achievable for both our RNN and GBM models.
Our approach in preparing the data for training is as follows. We process each column as follows:
  • Session IDs are discarded (of course we retain the sequence grouping indicated by the IDs).
  • Timestamps are quantized into bins 4 hours in duration.
  • Item IDs are unchanged.
  • Category IDs are unchanged.
  • Purchase prices are unchanged. We calculate price variance per item to convey price movements to our model (e.g. a merchant special offer).
  • Purchase quantities are unchanged.

Each field is then converted to an embedding vocabulary - simply a lookup table mapping values to integer IDs. We do not impose a minimum occurrence limit on any field - a value occurring even once will be represented in the respective embedding. This ensures that even "long tail" items will be presented to the model during training. Lookup tables are then converted to an embedding with embedding weights initialized from the range {-0.075, +0.075} - Table 5 identifies the number of unique values per embedding and the width used. The test dataset contains both item IDs and category IDs that are not present in the training set - however only a very small number of sessions are affected by this data in the test set.
This approach, combined with the use of Artificial Neural Networks, provides a learnable capacity to encode more information than just the original numeric value. For example, an item price of $100 vs $150 is not simply a numeric price difference; it can also signify learnable information on brand, premium vs value, and so on.
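The sketch below illustrates the column processing just described - 4-hour timestamp bins and per-field vocabularies with no minimum-count cutoff. It is illustrative only: the function and variable names are ours, and reserving ID 0 for values seen only at test time is an assumption, not a detail from the paper.

# Illustrative sketch of the data preparation pipeline described in Section 4.2.
from collections import defaultdict

BIN_SECONDS = 4 * 60 * 60  # 4-hour quantization for timestamps

def time_bin(unix_ts):
    return int(unix_ts) // BIN_SECONDS

def build_vocab(values):
    # Every observed value gets an ID, so "long tail" items are kept;
    # ID 0 is reserved here (our assumption) for values unseen during training.
    vocab = defaultdict(lambda: 0)
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab) + 1
    return vocab

def encode(events, item_vocab, cat_vocab):
    # events: list of (session_id, unix_ts, item_id, category_id) tuples,
    # grouped into sessions elsewhere by session_id.
    return [(sid, time_bin(ts), item_vocab[item], cat_vocab[cat])
            for sid, ts, item, cat in events]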

Table 5: Data field embeddings and dimensions, along with unique value counts for the training and test splits of the RecSys 2015 dataset.

  Data name     Unique values (train)   Unique values (train+test)   Embedding width
  Item ID       52,739                  54,287                       100
  Category ID   340                     348                          10
  Timestamp     4,368                   4,368                        10
  Price         667                     667                          10
  Quantity      1                       1                            10

Table 6: Effect of unrolling by dwelltime on the RecSys 2015 and Retail Rocket datasets. There is a clear difference in the mean / median session duration of each dataset.

  Dataset         Events before   Events after   % increase
  RecSys 2015     41,255,735      56,059,913     36%
  Retail Rocket   2,351,354       11,224,267     377%

4.3 Event Unrolling
In [3], a more explicit representation of user dwelltime or interest in a particular item $i_k$ in a sequence $e_{i_1}, \ldots, e_{i_k}$ is provided to the model by repeating the presentation of the event containing the item in proportion to the elapsed time between $e_{i_k}$ and $e_{i_{k+1}}$. In the example 2-event session displayed previously, the elapsed time between the first and second event is 6 minutes, therefore the first event is replayed 3 times during training and inference ($\lceil 360/150 \rceil$). In contrast to [3], we found that session unrolling provided a smaller improvement in model performance - for example, on the RecSys 2015 dataset our best AUC increased from 0.837 to 0.839 when using the optimal unrolling value (which we discovered empirically using grid search) of 150 seconds. Unrolling also comes with a cost of increasing session length and thus training time - Table 6 demonstrates the effect of session unrolling on the size of the training / validation and test sets for both datasets.
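A minimal sketch of this unrolling step is given below: each event is repeated ceil(dwell / 150) times, where dwell is the gap in seconds to the next event. The function name and session representation are illustrative, not taken from the paper's code.

# Sketch of dwelltime unrolling as described in Section 4.3.
import math

UNROLL_SECONDS = 150  # optimal value found by grid search in the paper

def unroll(session):
    """session: list of (timestamp_seconds, event) tuples in chronological order."""
    out = []
    for (ts, ev), nxt in zip(session, session[1:] + [None]):
        if nxt is None:
            out.append(ev)  # last event: no dwelltime available
        else:
            dwell = nxt[0] - ts
            out.extend([ev] * max(1, math.ceil(dwell / UNROLL_SECONDS)))
    return out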
4.4 Sequence Reversal
From [29], we know that the most important item in a user session is the last item (followed by the first item). Intuitively, the last and first items a user browses in a session are most predictive of purchase intent. To capitalize on this, we reversed the sequence order for each session before presenting sessions as batches to the model. Re-ordering the sequences provided an increase in test AUC on the RecSys 2015 dataset of 0.005 - from 0.834 to 0.839.

4.5 Model
4.5.1 Model Architecture. The data embedding modules are concatenated and presented to a configurable number of RNN layers (typically 3), with a final linear layer combining the output of the hidden units from the last layer. A sigmoid function is then applied to calculate a confidence probability in class membership. The model is trained by minimizing an unweighted binary cross entropy loss:

    $l_n = -[y_n \cdot \log x_n + (1 - y_n) \cdot \log(1 - x_n)]$    (3)

where:
$x_n$ is the output label value from the model [0..1],
$y_n$ is the target label value {0, 1}.

We conducted a grid search over the number of layers and layer size by RNN type, as indicated in Table 7.

Figure 2: Model architecture used - the output is interpreted as the log probability that the input contains either a clicker or buyer session. Skip connections are used to combine the original input with successive layer outputs, and each layer shares the same hidden layer parameters. (Inputs: Item (width 100), Time (width 10), Price (width 10), Price variance (width 10), Item category (width 10), Item quantity (width 10) → LSTM 1 (256) → LSTM 2 (256) → LSTM 3 (256) → Linear (256), with shared h(t) and c(t) across layers.)
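The following is a sketch of this architecture under our own assumptions, not the exact released model: skip connections are realised by concatenating the original embedded input onto each layer's output, hidden-state sharing is realised by passing the previous layer's final (h, c) as the next layer's initial state, and input_dim = 150 assumes the six embedding widths shown in Figure 2 (PyTorch assumed).

# Sketch of the Figure 2 architecture: embeddings -> 3 LSTM layers -> linear + sigmoid.
import torch
import torch.nn as nn

class PurchaseIntentRNN(nn.Module):
    def __init__(self, input_dim=150, hidden=256):
        super().__init__()
        self.rnn1 = nn.LSTM(input_dim, hidden, batch_first=True)
        # Layers 2 and 3 see the previous layer's output plus the original input (skip connection).
        self.rnn2 = nn.LSTM(hidden + input_dim, hidden, batch_first=True)
        self.rnn3 = nn.LSTM(hidden + input_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) of concatenated, reversed-session embeddings.
        out1, state = self.rnn1(x)
        out2, state = self.rnn2(torch.cat([out1, x], dim=-1), state)  # re-use h(t), c(t)
        out3, _ = self.rnn3(torch.cat([out2, x], dim=-1), state)
        return torch.sigmoid(self.head(out3[:, -1, :])).squeeze(-1)   # confidence of a buyer session

The sigmoid output of this sketch feeds the unweighted binary cross entropy loss of equation (3).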
4.5.2 Hidden layer parameter sharing. One model design decision worthy of elaboration is how hidden information (and cell state for LSTM) is shared between the RNN layers. We found that the best results were obtained by re-using hidden state across RNN layers - i.e. re-using the hidden state vector (and cell state vector for LSTM) from the previous layer in the architecture and initializing the next layer with these state vectors before presenting the next layer with the output from the previous layer. Our intuition here is that for Ecommerce datasets the features learned by each layer are closely related, because Ecommerce log / clickstream data is very structured; therefore re-using hidden and cell states from lower layers helps higher layers to converge on important abstractions. Aggressive sharing of the hidden layers led to a very significant improvement in AUC, increasing from 0.75 to 0.84 in our single best model.

5 EXPERIMENTS AND RESULTS
In this section we describe the experimental setup, the results obtained when comparing our best RNN model to the GBM State of the Art, and a comparison of different RNN variants (vanilla, GRU, LSTM).

5.1 Training Details
Both datasets were split into a training set and validation set in a 90:10 ratio. The model was trained using the Adam optimizer [14], coupled with a binary cross entropy loss metric and a learning rate annealer. Training was halted after 2 successive epochs of worsening validation AUC. Table 8 lists the main hyperparameters and settings used during training.
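A hedged sketch of this training regime (Adam, unweighted BCE, learning rate annealing, early stopping after 2 epochs of worsening validation AUC) is shown below; the model is assumed to map encoded sessions directly to buy probabilities, and all names are illustrative.

# Sketch of the Section 5.1 training loop (PyTorch and scikit-learn assumed).
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

def train(model, train_loader, val_loader, max_epochs=20, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max")  # anneal on val AUC
    criterion = nn.BCELoss()                                             # unweighted, eq. (3)
    best_auc, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = criterion(model(x), y.float())
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            scores = [(model(x), y) for x, y in val_loader]
        y_hat = torch.cat([s for s, _ in scores]).cpu().numpy()
        y_true = torch.cat([y for _, y in scores]).cpu().numpy()
        auc = roc_auc_score(y_true, y_hat)
        sched.step(auc)
        if auc > best_auc:
            best_auc, bad_epochs = auc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= 2:  # stop after 2 successive worsening epochs
                break
    return best_auc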

Table 7: Model grid search results for number and size of RNN layers by RNN type on the RecSys 2015 dataset. The State of the Art baseline for comparison is 0.853.

  Layer size \ Layers   RNN (1 / 2 / 3)        GRU (1 / 2 / 3)           LSTM (1 / 2 / 3)
  64                    0.72 / 0.81 / 0.81     0.741 / 0.832 / 0.833     0.735 / 0.832 / 0.831
  128                   0.72 / 0.80 / 0.80     0.755 / 0.833 / 0.833     0.729 / 0.834 / 0.834
  256                   0.71 / 0.80 / 0.80     0.732 / 0.834 / 0.834     0.724 / 0.834 / 0.839
  512                   0.69 / 0.80 / 0.77     0.746 / 0.832 / 0.833     0.759 / 0.835 / 0.839

Table 8: Hyperparameters and setup employed during model training.

  Dataset split        90/10 (training / validation)
  Hidden units range   128 - 512 (256 optimal)
  Embedding width      10 - 300 (100 optimal for items)
  Embedding weight     -0.075 to +0.075
  Batch size           32 - 256 (256 optimal for speed and regularization)
  Optimizer            Adam, learning rate 1e-3

Table 9: Classification performance measured using Area under the ROC curve (AUC) of the GBM and LSTM models on the RecSys 2015 and Retail Rocket datasets.

          RecSys 2015   Retail Rocket
  LSTM    0.839         0.838
  GBM     0.853         0.834

We tested three main types of recurrent cells (vanilla RNN, GRU, LSTM), as well as varying the number of cells per layer and layers per model. While a 3-layer LSTM achieved the best performance, vanilla RNNs, which possess no memory mechanism, are able to achieve a competitive AUC. The datasets are heavily weighted towards shorter session lengths (even after unrolling - see Figures 4 and 9). We posit that the power of LSTM and GRU is not needed for the shorter sequences of the dataset, and colloquial recurrence with embeddings has the capacity to model sessions over a short enough sequence length.

5.2 Overall results
The metric we used in our analysis was Area Under the ROC Curve, or AUC. AUC is insensitive to class imbalance, and the raw predictions from [26] were available to us, so a detailed, like-for-like AUC comparison using the test set is the best model comparison. The organizers of the challenge also released the solution to the challenge, enabling the test set to be used. After training, the LSTM model AUC obtained on the test set was 0.839 - 98.4% of the AUC (0.853) obtained by the SOTA model. As the subsequent experiments demonstrate, a combination of feature embeddings and model architecture decisions contributes to this performance. For all model architecture variants tested (see Table 7), the best performance was achieved after training for a small number of epochs (2 - 3). This held true for both datasets.
Our LSTM model achieved within 98% of the SOTA GBM performance on the RecSys 2015 dataset, and outperformed our GBM model by 0.5% on the Retail Rocket dataset, as Table 9 shows.

Figure 3: ROC curves for the LSTM and State of the Art models on the RecSys 2015 test set.
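The like-for-like comparison reduces to scoring both models on identical test sessions, as in the sketch below; the file names are hypothetical and scikit-learn is assumed.

# Sketch of the Section 5.2 comparison using ROC AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.load("test_labels.npy")                 # 1 = buyer session, 0 = clicker session
lstm_scores = np.load("lstm_test_scores.npy")
gbm_scores = np.load("sota_gbm_test_scores.npy")    # released predictions from [26]

print("LSTM AUC:", roc_auc_score(y_true, lstm_scores))
print("GBM  AUC:", roc_auc_score(y_true, gbm_scores))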
                                                                          consistent, although the LSTM model does start to close the gap
5.3 Analysis
We constructed a number of tests to analyze model performance based on interesting subsets of the test data.

5.3.1 Session length. Figure 4 graphs the best RNN model (a 3-layer LSTM with 256 cells per layer) and the SOTA model, with AUC scores broken down by session length. For context, the number of sessions at each session length in the test set is also provided. Both models underperform for sessions with just one click - clearly it is difficult to split clickers from buyers with such a small input signal. For the remaining session lengths, the relative model performance is consistent, although the LSTM model does start to close the gap for sessions with length > 10.

Figure 4: AUC by session length for the LSTM and SOTA models, with session quantities by length also provided for context - clearly showing the bias towards short sequence / session lengths in the RecSys 2015 dataset.
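The per-length breakdown behind Figure 4 amounts to grouping test sessions by length and scoring each group separately, as in this hedged sketch (variable names are ours; buckets containing a single class are skipped because AUC is undefined for them).

# Sketch of AUC broken down by session length, as used for Figure 4.
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def auc_by_length(lengths, y_true, y_score, max_len=15):
    groups = defaultdict(lambda: ([], []))
    for n, y, s in zip(lengths, y_true, y_score):
        t, p = groups[min(n, max_len)]
        t.append(y)
        p.append(s)
    return {n: roc_auc_score(t, p) for n, (t, p) in sorted(groups.items())
            if len(set(t)) == 2}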


5.3.2 User dwelltime. Given that we unrolled long-running events in order to provide more input to the RNN models, we evaluated the relative performance of each model when presented with sessions with any dwelltime > 1. As Figure 5 shows, LSTM is closer to SOTA for this subset of sessions, and indeed outperforms SOTA for session length = 14, but the volume of sessions affected (5,433) is not enough to materially impact the overall AUC.

Figure 5: AUC by session length for the LSTM and SOTA models, for any sessions where unrolling by dwelltime was employed. There are no sessions of length 1 as unrolling is inapplicable for these sessions (RecSys 2015 dataset).

5.3.3 Item price. Like most Ecommerce catalogues, the catalogue under consideration here displays a considerable range of item prices. We first selected all sessions where any single item price was > 10,000 (capturing 544,014 sessions) and then user sessions where the price was <= 750 (roughly 25% of the maximum price - capturing 1,063,034 sessions). Figures 6 and 7 show the relative model performance for each session grouping. As with other selections, the relative model performance is broadly consistent - there is no region where LSTM either dramatically outperforms or underperforms the SOTA model.

Figure 6: Model performance for sessions containing low price items, split by session length (RecSys 2015 dataset).

Figure 7: Model performance for sessions containing high price items, split by session length (RecSys 2015 dataset).

5.3.4 Gated vs un-gated RNNs. Table 7 shows that while gated RNNs clearly outperform ungated or vanilla RNNs, the difference is 0.02 of AUC, which is less than might be expected. We believe the reason for this is that the dataset contains many more shorter sequences (< 5) than longer sequences. This is to be expected for anonymized data - user actions are only aggregated based on a current session token and there is no "lifetime" set of user events. For many real-world cases then, using ungated RNNs may deliver acceptable performance.

5.3.5 End-to-end learning. To measure the effect of allowing (or not) the gradients from the output loss to flow unencumbered throughout the model (including the embeddings), we froze the embedding layer so that no gradient updates were applied, and then trained the network. Model performance decreased to an AUC of 0.808 and training time increased by 3x to reach this AUC. Depriving the model of the ability to dynamically modify the input data representation, using gradients derived from the output loss metric, reduces its ability to solve the classification problem posed.
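Freezing the embeddings for this ablation is a one-line change in most frameworks; the sketch below assumes PyTorch and an embedder module like the one sketched in Section 3.1.

# Sketch of the Section 5.3.5 ablation: stop gradient updates to all embedding tables.
import torch.nn as nn

def freeze_embeddings(embedder: nn.Module):
    for module in embedder.modules():
        if isinstance(module, nn.Embedding):
            module.weight.requires_grad = False  # no gradient updates reach the lookup tables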
5.4 Transferring to another dataset
The GBM model used in [26] is not publicly available; however, we were able to use the GBM model described in [29]. Figure 8 shows the respective ROC curves for the RNN (LSTM) and GBM models when they are ported to the Retail Rocket dataset. Both models still perform well; however, the LSTM model slightly outperforms the GBM model (AUC of 0.837 vs 0.834).
A deeper analysis of the Area under the ROC curve demonstrates how the characteristics of the dataset can impact model performance. The Retail Rocket dataset is heavily weighted towards single-click sessions, as Figure 9 shows. LSTM outperforms GBM for these sessions - which can be attributed more to the learned embeddings, since there is no sequence to process. GBM by contrast can extract only limited value from single-click sessions, as important feature components such as dwelltime, similarity etc. are unavailable.


[Figure 8 plot omitted: ROC curves for the LSTM and GBM models on the Retail Rocket dataset; x-axis False Positive Rate, y-axis True Positive Rate.]

Figure 8: Area under ROC curves for LSTM and GBM models when ported to the Retail Rocket dataset. On this dataset, the LSTM model slightly outperforms the GBM model overall.

[Figure 9 plot omitted: number of sessions and AUC for the LSTM and GBM models plotted against session length.]

Figure 9: AUC by session length for the LSTM and GBM models when tested on the Retail Rocket dataset. The bias towards shorter sessions is even more prevalent versus the RecSys 2015 dataset.

5.5 Training time and resources used
We used PyTorch [20] to construct and train the LSTM models while XGBoost [4] was used to train the GBM models. The LSTM implementation was trained on a single Nvidia GeForce GTX TITAN Black (circa 2014 and with single-precision performance of 5.1 TFLOPs vs a 2017 GTX 1080 Ti with 11.3 TFLOPs) and consumed between 1 and 2 GB RAM. The results reported were obtained after 3 epochs of training on the full dataset - an average of 6 hours (2 hours per epoch). This compares favourably to the training times and resources reported in [26], where 150 machines were used for 12 hours. However, 12 hours denotes the time needed to train two models (session purchase and item ranking) whereas we train just one model. While we cannot compare GBM (which was trained on a CPU cluster) to RNN (trained on a single GPU with a different parallelization strategy) directly, we note that the hardware resources required for our model are modest and hence accessible to almost any commercial or academic setup. In addition, real-world Ecommerce datasets are large [32] and change rapidly, therefore usable models must be able to consume large datasets and be re-trained readily to cater for new items / documents.
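As a rough indication of the setup, the sketch below shows a minimal PyTorch embedding-plus-LSTM purchase classifier with a three-epoch Adam training loop. The layer sizes, class and variable names, and the data loader are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch (illustrative, not our exact configuration) of an
# embedding + LSTM purchase classifier trained for 3 epochs with Adam.
import torch
import torch.nn as nn


class SessionClassifier(nn.Module):
    def __init__(self, n_items: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_items, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, clicks):  # clicks: (batch, seq_len) tensor of item indices
        h, _ = self.lstm(self.embed(clicks))
        return self.out(h[:, -1, :]).squeeze(-1)  # one purchase logit per session


def train(model, loader, epochs=3, lr=1e-3,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    """loader yields (padded item-index tensor, 0/1 purchase label) batches."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for clicks, labels in loader:
            clicks, labels = clicks.to(device), labels.float().to(device)
            opt.zero_grad()
            loss_fn(model(clicks), labels).backward()
            opt.step()
```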
5.6 Conclusions and future work
We presented a Recurrent Neural Network (RNN) model which recovers 98.4% of current SOTA performance on the user purchase prediction problem in Ecommerce without using explicit features. On a second dataset, our model fractionally exceeds SOTA performance. The model is straightforward to implement, generalizes to different datasets with comparable performance and can be trained with modest hardware resource requirements.

It is promising that gated RNNs with no feature engineering can be competitive with Gradient Boosted Machines on short session lengths and structured data - GBM is a more established model choice in the domain of Recommender Systems and Ecommerce in general. We believe additional work on input representation (while still avoiding feature engineering) can further improve results for both gated and non-gated RNNs. One area of focus will be to investigate how parameter sharing at the hidden layer helps RNNs to operate on short sequences of structured data prevalent in Ecommerce.

Lastly, we note that although our approach requires no feature engineering, it is also inherently transductive - we plan to investigate embedding generation and maintenance approaches for new unseen items / documents to add an inductive capability to the architecture.
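As a purely illustrative sketch of that direction (the reserved indices, helper names and session-length cap below are assumptions, not part of the model described in this paper), reserving an out-of-vocabulary embedding row lets new item IDs be consumed before their embeddings have been trained:

```python
# Illustrative sketch: reserve index 0 for padding and index 1 for
# out-of-vocabulary (unseen) items so new item ids can be encoded
# without retraining; known items are mapped to indices 2 and above.
PAD_IDX, OOV_IDX = 0, 1


def build_vocab(known_item_ids):
    return {item: i + 2 for i, item in enumerate(sorted(known_item_ids))}


def encode_session(item_ids, vocab, max_len=15):
    idxs = [vocab.get(item, OOV_IDX) for item in item_ids[:max_len]]
    return idxs + [PAD_IDX] * (max_len - len(idxs))


# Example: item 999 was never seen at training time and falls back to OOV_IDX.
vocab = build_vocab([101, 202, 303])
print(encode_session([101, 999, 303], vocab))  # [2, 1, 4, 0, ..., 0]
```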
6 ACKNOWLEDGEMENTS
We would like to thank the authors of [26] for making their original test submission available and the organizers of the original challenge for releasing the solution file after the competition ended, enabling us to carry out our comparisons.
REFERENCES
 [1] Oren Barkan and Noam Koenigstein. 2016. Item2Vec: Neural Item Embedding for Collaborative Filtering. CoRR abs/1603.04259 (2016). arXiv:1603.04259 http://arxiv.org/abs/1603.04259
 [2] David Ben-Shimon, Alexander Tsikinovsky, Michael Friedmann, Bracha Shapira, Lior Rokach, and Johannes Hoerle. 2015. RecSys Challenge 2015 and the YOOCHOOSE Dataset. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys ’15). ACM, New York, NY, USA, 357–358. https://doi.org/10.1145/2792838.2798723
 [3] Veronika Bogina and Tsvi Kuflik. 2017. Incorporating Dwell Time in Session-Based Recommendations with Recurrent Neural Networks. In RecTemp@RecSys.
 [4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
 [5] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. CoRR abs/1409.1259 (2014). arXiv:1409.1259 http://arxiv.org/abs/1409.1259
 [6] Xiao Ding, Ting Liu, Junwen Duan, and Jian-Yun Nie. 2015. Mining User Consumption Intention from Social Media Using Domain Adaptive Convolutional Neural Network. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15). AAAI Press, 2389–2395. http://dl.acm.org/citation.cfm?id=2886521.2886653
 [7] Jerome H. Friedman. 2000. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (2000), 1189–1232.
 [8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
 [9] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2016. E-commerce in Your Inbox: Product Recommendations at Scale. CoRR abs/1606.07154 (2016). arXiv:1606.07154 http://arxiv.org/abs/1606.07154
[10] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based Recommendations with Recurrent Neural Networks. CoRR abs/1511.06939 (2015). arXiv:1511.06939 http://arxiv.org/abs/1511.06939
[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[12] Dietmar Jannach, Malte Ludewig, and Lukas Lerche. 2017. Session-based item recommendation in e-commerce: on short-term intents, reminders, trends and discounts. User Modeling and User-Adapted Interaction 27 (2017), 351–392.
[13] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and Understanding Recurrent Networks. CoRR abs/1506.02078 (2015). arXiv:1506.02078 http://arxiv.org/abs/1506.02078
[14] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimiza-
     tion. CoRR abs/1412.6980 (2014). arXiv:1412.6980 http://arxiv.org/abs/1412.6980
[15] Yehuda Koren. 2009. Collaborative Filtering with Temporal Dynamics. In
     Proceedings of the 15th ACM SIGKDD International Conference on Knowledge
     Discovery and Data Mining (KDD ’09). ACM, New York, NY, USA, 447–456.
     https://doi.org/10.1145/1557019.1557072
[16] Zachary Chase Lipton. 2015. A Critical Review of Recurrent Neural Networks
     for Sequence Learning. CoRR abs/1506.00019 (2015). arXiv:1506.00019 http:
     //arxiv.org/abs/1506.00019
[17] Malte Ludewig and Dietmar Jannach. 2018. Evaluation of Session-based Rec-
     ommendation Algorithms. CoRR abs/1803.09587 (2018). arXiv:1803.09587
     http://arxiv.org/abs/1803.09587
[18] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
     Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
     arXiv:1301.3781 http://arxiv.org/abs/1301.3781
[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the Difficulty
     of Training Recurrent Neural Networks. In Proceedings of the 30th International
     Conference on International Conference on Machine Learning - Volume 28 (ICML’13).
     JMLR.org, III–1310–III–1318. http://dl.acm.org/citation.cfm?id=3042817.3043083
[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
     Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
     2017. Automatic differentiation in PyTorch. (2017).
[21] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe:
     Global Vectors for Word Representation. In Empirical Methods in Natural Lan-
     guage Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/
     D14-1162
[22] R.L. Phillips. 2005. Pricing and Revenue Optimization. Stanford University Press.
     https://books.google.co.uk/books?id=bXsyO06qikEC
[23] Reid Pryzant, Young joo Chung, and Dan Jurafsky. 2017. Predicting Sales from
     the Language of Product Descriptions. In ACM SIGIR Forum. ACM.
[24] Steffen Rendle. 2010. Factorization Machines. In Proceedings of the 2010 IEEE
     International Conference on Data Mining (ICDM ’10). IEEE Computer Society,
     Washington, DC, USA, 995–1000. https://doi.org/10.1109/ICDM.2010.127
[25] Retailrocket. 2017. Retailrocket recommender system dataset. https://www.
     kaggle.com/retailrocket/ecommerce-dataset. (2017). [Online; accessed 01-Feb-
     2018].
[26] Peter Romov and Evgeny Sokolov. 2015. RecSys Challenge 2015: Ensemble
     Learning with Categorical Features. In Proceedings of the 2015 International ACM
     Recommender Systems Challenge (RecSys ’15 Challenge). ACM, New York, NY,
     USA, Article 1, 4 pages. https://doi.org/10.1145/2813448.2813510
[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Parallel Distributed
     Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press,
     Cambridge, MA, USA, Chapter Learning Internal Representations by Error Prop-
     agation, 318–362. http://dl.acm.org/citation.cfm?id=104279.104293
[28] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. 2014. Machine Learning: The High Interest Credit Card of Technical Debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).
[29] Humphrey Sheil and Omer Rana. 2017. Classifying and Recommending Using Gradient Boosted Machines and Vector Space Models. In Advances in Computational Intelligence Systems. UKCI 2017, Chao F., Schockaert S., and Zhang Q. (Eds.), Vol. 650. Springer, Cham. https://doi.org/10.1007/978-3-319-66939-7_18
[30] Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved Recurrent Neural Networks for Session-based Recommendations. CoRR abs/1606.08117 (2016). arXiv:1606.08117 http://arxiv.org/abs/1606.08117
[31] Arthur Toth, Louis Tan, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Predicting Shopping Behavior with Mixture of RNNs. In ACM SIGIR Forum. ACM.
[32] Andrew Trotman, Jon Degenhardt, and Surya Kallumadi. 2017. The Architecture of eBay Search. In ACM SIGIR Forum. ACM.
[33] Alexander Vezhnevets and Olga Barinova. 2007. Avoiding Boosting Overfitting by Removing Confusing Samples. In Proceedings of the 18th European Conference on Machine Learning (ECML ’07). Springer-Verlag, Berlin, Heidelberg, 430–441. https://doi.org/10.1007/978-3-540-74958-5_40
[34] Maksims Volkovs. 2015. Two-Stage Approach to Item Recommendation from User Sessions. In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys ’15 Challenge). ACM, New York, NY, USA, Article 3, 4 pages. https://doi.org/10.1145/2813448.2813512
[35] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. 2017. Recurrent Recommender Networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, USA, 495–503. https://doi.org/10.1145/3018661.3018689
[36] Peng Yan, Xiaocong Zhou, and Yitao Duan. 2015. E-Commerce Item Recommendation Based on Field-aware Factorization Machine. In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys ’15 Challenge). ACM, New York, NY, USA, Article 2, 4 pages. https://doi.org/10.1145/2813448.2813511
[37] Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to Do Next: Modeling User Behaviors by Time-LSTM. In IJCAI.