<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting purchasing intent: Automatic Feature Learning using Recurrent Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Humphrey Sheil</string-name>
          <email>sheilh@cardiff.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Omer Rana</string-name>
          <email>RanaOF@cardiff.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ronan Reilly</string-name>
          <email>Ronan.Reilly@mu.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cardiff University</institution>
          ,
          <addr-line>Cardiff, Wales</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maynooth University</institution>
          ,
          <addr-line>Maynooth</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>We present a neural network for predicting purchasing intent in an Ecommerce setting. Our main contribution is to address the significant investment in feature engineering that is usually associated with state-of-the-art methods such as Gradient Boosted Machines. We use trainable vector spaces to model varied, semi-structured input data comprising categoricals, quantities and unique instances. Multi-layer recurrent neural networks capture both session-local and dataset-global event dependencies and relationships for user sessions of any length. An exploration of model design decisions, including parameter sharing and skip connections, further increases model accuracy. Results on benchmark datasets deliver classification accuracy within 98% of state-of-the-art on one dataset and exceed state-of-the-art on the second, without the need for any domain- or dataset-specific feature engineering, on both short and long event sequences.</p>
      </abstract>
      <kwd-group>
        <kwd>Ecommerce</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Recurrent Neural Networks</kwd>
        <kwd>Long Short Term Memory (LSTM)</kwd>
        <kwd>Embedding</kwd>
        <kwd>Vector Space Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In the Ecommerce domain, merchants can increase their sales
volume and profit margin by acquiring better answers for two
questions:
• Which users are most likely to purchase (predict purchasing
intent).
• Which elements of the product catalogue do users prefer
(rank content).</p>
      <p>By how much can merchants realistically increase profits? Table
1 illustrates that merchants can improve profit by between 2% and
11% depending on the contributing variable. In the fluid and highly
competitive world of online retailing, these margins are
significant, and understanding a user’s shopping intent can positively
influence three out of four major variables that affect profit. In
addition, merchants increasingly rely on (and pay advertising to) much
larger third-party portals (for example eBay, Google, Bing, Taobao,
Amazon) to achieve their distribution, so any direct measure the
merchant group can use to increase profit is sorely needed.</p>
      <p>SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).</p>
      <sec id="sec-1-1">
        <title>Table 1 (flattened during extraction)</title>
        <p>Profit improvement attributable to each contributing variable,
as estimated by two consultancies. McKinsey: 11.1%, 7.8%, 3.3%,
2.3%. A.T. Kearney: 8.2%, 5.1%, 3.0%, 2.0%. The first three
variables are affected by shopping intent; the fourth is not. (Row
labels were lost during extraction.)</p>
      </sec>
      <sec id="sec-1-3">
        <p>Virtually all Ecommerce systems can be thought of as a
generator of clickstream data - a log of {item - userid - action} tuples
which captures user interactions with the system. A chronological
grouping of these tuples by user ID is commonly known as a session.</p>
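        <p>For illustration only (not taken from the original implementation), the chronological grouping of {item - userid - action} tuples into sessions can be sketched as follows; the field names are hypothetical:</p>

```python
from collections import defaultdict

def group_sessions(events):
    """Group {item - userid - action} tuples into per-user sessions,
    ordered chronologically by timestamp."""
    sessions = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        sessions[event["userid"]].append((event["item"], event["action"]))
    return dict(sessions)

events = [
    {"userid": 1, "timestamp": 20, "item": "B", "action": "click"},
    {"userid": 1, "timestamp": 10, "item": "A", "action": "click"},
    {"userid": 2, "timestamp": 15, "item": "C", "action": "buy"},
]
sessions = group_sessions(events)
```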
        <p>
          Predicting a user's intent to purchase is more difficult than
ranking content for the following reasons [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]: Clickers (users who only
click and never purchase within a session) and buyers (users who
click and also purchase at least one item within a single session)
can appear to be very similar, right up until a purchase action
occurs. Additionally, the ratio between clickers and buyers is always
heavily imbalanced - it can be 20:1 in favour of clickers or higher.
An uninterested user will often click on an item during browsing as
there is no cost to doing so; an uninterested user will not, however,
purchase an item. In our opinion, this user behaviour is in stark
contrast to other settings such as predicting if a user will "like" or
"pin" a piece of content hosted on a social media platform after
viewing it, where there is no monetary amount at stake for the user.
As noted in [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], shoppers behave differently when visiting online
vs physical stores, and online conversion rates are substantially
lower, for a variety of reasons.
        </p>
        <p>When a merchant has increased confidence that a subset of
users are more likely to purchase, they can use this information
in the form of proactive actions to maximize conversion and yield.
The merchant may offer a time-limited discount, spend more on
targeted (and relevant) advertising to re-engage these users, create
bundles of complementary products to push the user to complete
their purchase, or even offer a lower-priced own-brand alternative
if the product is deemed to be fungible.</p>
        <p>
          However, there are counterweights to the desire to create more
and more accurate models of online user behaviour - namely user
privacy and ease of implementation. Users are increasingly
reluctant to share personal information with online services, while
complex Machine Learning models are difficult to implement and
maintain in a production setting [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ].
        </p>
        <p>
          We surveyed existing work in this area and found that
well-performing approaches have a number of factors in common:
• Heavy investment in dataset-specific feature engineering
was necessary, regardless of the model implementation
chosen.
• Model choices favour techniques such as Gradient Boosted
Machines (GBM) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Field-aware Factorization Machines
(FFM) [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] which are well-suited to creating representations
of semi-structured clickstream data once good features have
been developed [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ].
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], an important feature class employed the notion of item
similarity, modelled as a learned vector space generated by word2vec
[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] and calculated using a standard pairwise cosine metric between
item vectors. In an Ecommerce context, items are more similar if
they co-occur frequently over all user sessions in the corpus and are
dissimilar if they infrequently co-occur. The items themselves may
be physically dissimilar (for example - headphones and batteries),
but they are often browsed and purchased together.
        </p>
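        <p>The similarity computation described above can be sketched as follows (an illustration with hand-picked vectors, not learned ones):</p>

```python
import numpy as np

def cosine_similarity(u, v):
    """Standard pairwise cosine metric between two item vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical learned vectors: items that frequently co-occur across
# sessions (e.g. headphones and batteries) end up close together, even
# though the items themselves are physically dissimilar.
headphones = np.array([0.9, 0.1, 0.4])
batteries = np.array([0.8, 0.2, 0.5])
similarity = cosine_similarity(headphones, batteries)
```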
        <p>
          However, in common with other work, [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] still requires a heavy
investment in feature engineering. The drawback of specific
features is how tied they are to either a domain, dataset or both. The
ability of Deep Learning to discover good representations without
explicit feature engineering is well-known [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In addition, Artificial
neural networks (ANNs) perform well with distributed
representations such as embeddings, and ANNs with a recurrence capability
to model events over time - Recurrent Neural Networks (RNNs)
are well-suited to sequence processing and labelling tasks [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>Our motivation then is to build a good model of user intent
prediction which does not rely on private user data, and is also
straightforward to implement in a real-world environment. What
performance can RNNs with an appropriate input representation
and end-to-end training regime achieve on the prediction of
purchasing intent task? Can this performance be achieved within the
constraint of only processing anonymous session data and
remaining straightforward to implement on other Ecommerce datasets?
</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        The problem of user intent or session classification in an online
setting has been heavily studied, with a variety of classic Machine
Learning and Deep Learning modelling techniques employed. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]
was the original winner of the competition on one of the
datasets considered here, using a commercial implementation of
GBM with extensive feature engineering and is still to our
knowledge the State of the Art (SOTA) implementation for this dataset.
However, the paper authors made their model predictions freely
available and we use these in the Experiments section to compare
our model performance to theirs.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] uses RNNs on a subset of the same dataset to predict the next
session click (regardless of user intent) so removed 1-click sessions
and merged clickers and buyers, whereas this work remains
focused on the user intent classification problem. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] compares [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
to a variety of classical Machine Learning algorithms on multiple
datasets and finds that performance varies considerably by dataset.
[
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] extends [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with a variant of LSTM to capture variations in
dwelltime between user actions. User dwelltime is considered an
important factor in multiple implementations and has been addressed
in multiple ways. For shopping behaviour prediction, [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] uses a
mixture of Recurrent Neural Networks and treats the problem as a
sequence-to-sequence translation problem, effectively combining
two models (prediction and recommendation) into one. However,
only sessions of length 4 or greater are considered - removing the
bulk from consideration. From [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we know that short sessions
are very common in Ecommerce datasets; moreover, a user’s most
recent actions are often more important in deciphering their
intent than older actions. Therefore we argue that all session lengths
should be included. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] adopts a tangential approach - still focused
on predicting purchases, but using textual product metadata to
correlate words and terms that suit a particular geographic market
better than others. Broadening our focus to include the general
use of RNNs in the Ecommerce domain, Recurrent Recommender
Networks are used in [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] to incorporate temporal features with
user preferences to improve recommendations, to predict future
behavioural directions, but not purchase intent. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] further extends
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] by focusing on data augmentation and compensating for shifts
in the underlying distribution of the data.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the authors augment a more classical Machine Learning
approach (Singular Value Decomposition or SVD) to better capture
temporal information to predict user behaviour - an alternative
approach to the unrolling methodology used in this paper.
      </p>
      <p>
        Using embeddings as a learned representation is a common
technique. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], embeddings are used to model items in a low
dimensional space to calculate a similarity metric, however
temporal ordering is discarded. Learnable embeddings are also used
in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to model items; purchase confirmation emails are used as
a high-quality signal of user intent. Unrolling events that exceed
an arbitrary threshold to create a better input representation for
user dwelltime or interest is addressed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Convolutional
Neural Networks (CNNs) are used as the model implementation
and micro-blogging content is analyzed rather than an Ecommerce
clickstream.
      </p>
    </sec>
    <sec id="sec-3">
      <title>OUR APPROACH</title>
      <p>
        Classical Machine Learning approaches such as GBM work well
and are widely used on Ecommerce data, at least in part because
the data is structured. GBM is an efficient model as it enables an
additive expansion in a set of basis functions or weak learners to
continually minimize a residual error. One weakness of GBM is a
propensity for overly-deep or wide decision trees to over-fit the
training data and thus record poor performance on the validation
and test set due to high variance [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], although this can be
controlled using hyperparameters (namely tree depth, learning rate,
minimum weight to split a tree node (min_child_weight) and data
sub-sampling). GBM also requires significant feature engineering
effort and does not naturally process the sequence in order; rather,
it consumes a compressed version of it (although it is possible to
provide a one-hot vector representation of the input sequence as a
feature). Our approach is dual in nature - first, we construct an
input representation for clickstream / session data that eliminates
the need for feature engineering. Second, we design a model which
can consume this input representation and predict user purchase
intent in an end-to-end, sequence-to-prediction manner.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Embeddings as item / word representations</title>
      <p>Natural Language Processing (NLP) tasks, such as information
retrieval, part-of-speech tagging and chunking, operate by assigning
a probability value to a sequence of words. To this end, language
models have been developed, defining a mathematical model to
capture statistical properties of words and the dependencies among
them.</p>
      <p>
        Learning good representations of input data is a central task
in designing a machine learning model that can perform well. An
embedding is a vector space model where words are converted
to a low-dimensional vector. Vector space models embed words
where semantically similar words are mapped to nearby points.
Popular generators of word to vector mappings such as [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], operate in an unsupervised manner - predicting similarity or
minimizing a perplexity metric using word co-occurrence counts
over a target corpus. We decided to employ embeddings as our
target representation since:
• We can train the embeddings layer at the same time as
training the model itself - promoting simplicity.
• Ecommerce data is straightforward to model as a dictionary
of words.
• Embedding size can be increased or decreased based on
dictionary size and word complexity during the architecture
tuning / hyper parameter search phase.
      </p>
      <p>
        Unlike [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we chose not to pre-train the embeddings to
minimize a perplexity error measure. Instead we allow the model to
modify the embedding weights at training time by back-propagating
the loss from a binary classification criterion.
      </p>
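      <p>A minimal sketch of this end-to-end training regime (illustrative only, using PyTorch; the toy sizes and names are ours, not the paper's): the embedding weights receive gradients from the downstream binary classification loss rather than from a pre-training objective.</p>

```python
import torch
import torch.nn as nn

# Toy sizes; the real vocabulary sizes and widths are given in Table 5.
vocab_size, embed_dim = 10, 4
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, 1)
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(classifier.parameters()))
loss_fn = nn.BCEWithLogitsLoss()

item_ids = torch.tensor([3, 7])   # a toy 2-event session
target = torch.tensor([[1.0]])    # buyer = 1, clicker = 0

before = embedding.weight.detach().clone()
logit = classifier(embedding(item_ids).mean(dim=0, keepdim=True))
loss = loss_fn(logit, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# The classification loss has moved the embedding rows for items 3 and 7.
```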
    </sec>
    <sec id="sec-5">
      <title>Recurrent Neural Networks</title>
      <p>
        Recurrent neural networks [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] (RNNs) are a specialized class of
neural networks for processing sequential data. A recurrent
network is deep in time rather than space and arranges hidden state
vectors hlt in a two-dimensional grid, where t = 1 . . . T is thought
of as time and l = 1 . . . L is the depth. All intermediate vectors hlt
are computed as a function of hlt −1 and hlt−1. Through these hidden
vectors, each output y at some particular time step t becomes an
approximating function of all input vectors up to that time, x1, . . . , xt
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        3.2.1 LSTM and GRU. Long Short-Term Memory (LSTM) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is
an extension to colloquial or vanilla RNNs designed to address the
twin problems of vanishing and exploding gradients during
training [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Vanishing gradients make learning difficult as the correct
(downward) trajectory of the gradient is difficult to discern, while
exploding gradients make training unstable - both are undesirable
outcomes. Long-term dependencies in the input data, which cause a
deep computational graph that must iterate over the data, are
the root cause of vanishing / exploding gradients. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explain this phenomenon succinctly. Like all deep learning models, RNNs
require multiplication by a matrix W. After t steps, this equates to
multiplying by W^t. Therefore:
      </p>
      <p>W^t = (V diag(λ) V^{-1})^t = V diag(λ)^t V^{-1}</p>
      <p>Eigenvalues (λ) that are not more or less equal to 1 will either
explode if they are &gt; 1, or vanish if they are &lt; 1. Gradients will
then be scaled by diag(λ)^t.</p>
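      <p>This scaling behaviour can be checked numerically (an illustrative sketch with arbitrarily chosen eigenvalues):</p>

```python
import numpy as np

# Construct W with eigenvalues 1.1 and 0.9 via W = V diag(lam) V^-1.
lam = np.array([1.1, 0.9])
V = np.array([[1.0, 1.0], [0.0, 1.0]])
W = V @ np.diag(lam) @ np.linalg.inv(V)

t = 50
Wt = np.linalg.matrix_power(W, t)            # repeated multiplication
direct = V @ np.diag(lam ** t) @ np.linalg.inv(V)

# The two agree, and diag(lam)^t shows the explode / vanish behaviour:
# 1.1^50 is roughly 117, while 0.9^50 is roughly 0.005.
```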
      <p>LSTM solves this problem by possessing an internal recurrence,
which stabilizes the gradient flow, even over long sequences.
However this comes at a price of complexity. For each element in the
input sequence, each layer computes the following function:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ∗ c_{t-1} + i_t ∗ g_t
h_t = o_t ∗ tanh(c_t)
where:
h_t is the hidden state at time t,
c_t is the cell state at time t,
x_t is the hidden state of the previous layer at time t (or the input
for the first layer),
i_t, f_t, g_t, o_t are the input, forget, cell, and output gates, respectively,
σ is the sigmoid function.</p>
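      <p>The step above can be written directly from these equations (an illustrative NumPy sketch with randomly initialized parameters; W, U and b group the input-to-hidden weights, hidden-to-hidden weights and biases by gate):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step, following the gate equations in the text."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # cell gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_in)) for k in 'ifgo'}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in 'ifgo'}
b = {k: np.zeros(n_hid) for k in 'ifgo'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```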
      <p>
        Gated Recurrent Units, or GRU [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are a simplification of LSTM,
with one fewer gate and the hidden state and cell state vectors
combined. In practice, LSTM and GRU are used interchangeably,
and the performance difference between the two cell types is often
minimal and / or dataset-specific.
      </p>
    </sec>
    <sec id="sec-6">
      <title>IMPLEMENTATION</title>
    </sec>
    <sec id="sec-7">
      <title>Datasets used</title>
      <p>
        The RecSys 2015 Challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the Retail Rocket Kaggle [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]
datasets provide anonymous Ecommerce clickstream data well
suited to testing purchase prediction models. Both datasets are
reasonable in size - consisting of 9.2 million and 1.4 million user
sessions respectively. These sessions are anonymous and consist of
a chronological sequence of time-stamped events describing user
interactions (clicks) with content while browsing and shopping
online. The logic used to mark the start and end of a user session
is dataset-specific - the RecSys 2015 dataset contains more
sessions with a small item catalogue, while the Retail Rocket dataset
contains fewer sessions with an item catalogue 5x larger than the
RecSys 2015 dataset. Both datasets contain a very high proportion
of short sessions (&lt;= 3 events), making this problem setting
quite difficult for RNNs to solve. The Retail Rocket dataset contains
much longer sessions when measured by duration - the RecSys
2015 user sessions are much shorter. In summary, the
datasets differ in important respects, and provide a good test of
model generalization ability.
      </p>
      <p>
        For both datasets, no sessions were excluded - both datasets in
their entirety were used in training and evaluation. This means that
for sequences with just one click, we require the trained embeddings
to accurately describe the item, and time of viewing by the user
to accurately classify the session, while for longer sessions, we
can rely more on the RNN model to extract information from the
sequence. This decision makes the training task much harder for
our RNN model, but is a fairer comparison to previous work using
GBM where all session lengths were also included [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ],[
        <xref ref-type="bibr" rid="ref34">34</xref>
        ],[
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>The RecSys 2015 Challenge dataset includes a dedicated test set,
while the Retail Rocket dataset does not. We reserved 20% of the
Retail Rocket dataset for use in prediction / evaluation - the same
proportion as the RecSys 2015 Challenge test dataset.</p>
      <p>The RecSys 2015 challenge dataset consists of 9.2 million user-item
click sessions. Sessions are anonymous and classes are imbalanced,
with only 5% of sessions ending in one or more buy events. Each
user session captures the interactions between a single user and
items or products: Sn = e1, e2, ..., ek, where ek is either a click or
buy event. An example 2-event session is shown in Table 3.</p>
      <p>SID Timestamp Item ID Cat ID
1 2014-04-07T10:51:09.277Z 214536502 0
1 2014-04-07T10:57:09.868Z 214536500 0
Table 3: An example of a clicker session from the RecSys
2015 dataset.</p>
      <p>Both datasets contain missing or obfuscated data - presumably
for commercially sensitive reasons. Where sessions end with one
or more purchase events, the item price and quantity values are
provided only 30% of the time in the case of the RecSys 2015 dataset,
while prices are obfuscated for commercial reasons in the Retail
Rocket dataset. Therefore these elements of the data provide limited
value.</p>
      <sec id="sec-7-1">
        <title>Buyer session example</title>
        <p>SID Item ID Price Quantity
420374 214537888 12462 1
420374 214537850 10471 1
Table 4: An example of the buy events from a buyer session
(timestamp column elided for brevity).</p>
        <p>The Retail Rocket dataset consists of 1.4 million sessions.
Sessions are also anonymous and are even more imbalanced - just
0.7% of the sessions end in a buy event. This dataset also provides
item metadata but in order to standardize our approach across both
datasets, we chose not to use any information that was not
common to both datasets. In particular we discard and do not use the
additional "addtobasket" event that is present in the Retail Rocket
dataset. Since it is so closely correlated with the buy event (users
add to a basket before purchasing that basket), it renders the buyer
prediction task trivial and an AUC of 0.97 is easily achievable for
both our RNN and GBM models.</p>
        <p>Our approach in preparing the data for training is as follows. We
process each column in turn:
• Session IDs are discarded (of course we retain the sequence
grouping indicated by the IDs).
• Timestamps are quantized into bins 4 hours in duration.
• Item IDs are unchanged.
• Category IDs are unchanged.
• Purchase prices are unchanged. We calculate price variance
per item to convey price movements to our model (e.g. a
merchant special offer).</p>
        <p>• Purchase quantities are unchanged.</p>
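        <p>The timestamp quantization step can be sketched as follows (illustrative only; we assume bins are anchored at the Unix epoch, which the text does not specify):</p>

```python
from datetime import datetime

BIN_SECONDS = 4 * 60 * 60  # 4-hour bins

def quantize(ts):
    """Map an ISO-8601 timestamp string to its 4-hour bin index."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return int(dt.timestamp()) // BIN_SECONDS

# The two events of the example session fall into the same bin.
a = quantize("2014-04-07T10:51:09.277000Z")
b = quantize("2014-04-07T10:57:09.868000Z")
```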
        <p>Each field is then converted to an embedding vocabulary - simply
a lookup table mapping values to integer IDs. We do not impose
a minimum occurrence limit on any field - a value occurring even
once will be represented in the respective embedding. This ensures
that even "long tail" items will be presented to the model during
training. Lookup tables are then converted to an embedding with
embedding weights initialized from a range {-0.075, +0.075} - Table
5 identifies the number of unique items per embedding and the
width used. The testing dataset contains both item IDs and category
IDs that are not present in the training set - however, only a very
small number of sessions are affected by this data in the test set.</p>
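        <p>The lookup-table construction can be sketched as follows (illustrative only; the function name is ours):</p>

```python
def build_vocab(values):
    """Map every distinct field value to an integer ID, with no minimum
    occurrence limit, so even one-off 'long tail' values get an ID."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab)
    return vocab

item_ids = [214536502, 214536500, 214536502, 999]
vocab = build_vocab(item_ids)
# len(vocab) rows (plus any reserved for unseen test-time values) can
# then back an embedding of the width given in Table 5.
```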
        <p>This approach, combined with the use of Artificial Neural
Networks, provides a learnable capacity to encode more information
than just the original numeric value. For example, an item price of
$100 vs $150 is not simply a numeric price difference; it can also
signify learnable information on brand, premium vs value and so
on.</p>
        <p>Data name | Train | Train+Test | Embedding Width
Item ID | 52,739 | 54,287 | 100
Category ID | 340 | 348 | 10
Timestamp | 4,368 | 4,368 | 10
Price | 667 | 667 | 10
Quantity | 1 | 1 | 10
Table 5: Data field embeddings and dimensions, along with
unique value counts for the training and test splits of the
RecSys 2015 dataset.</p>
        <p>Dataset | Events before | Events after | % increase
RecSys 2015 | 41,255,735 | 56,059,913 | 36%
Retail Rocket | 2,351,354 | 11,224,267 | 377%
Table 6: Effect of unrolling by dwelltime on the RecSys 2015
and Retail Rocket datasets. There is a clear difference in the
mean / median session duration of each dataset.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Event Unrolling</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a more explicit representation of user dwelltime or interest
in a particular item ik in a sequence ei1 , . . . , eik is provided to
the model by repeating the presentation of the event containing
the item to the model in proportion to the elapsed time between
eik and eik+1 . In the example 2-event session displayed previously,
the elapsed time between the first and second event is 6 minutes,
therefore the first event is replayed 3 times during training and
inference (⌈360/150⌉). In contrast to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we found that session
unrolling provided a smaller improvement in model performance:
for example, on the RecSys 2015 dataset our best AUC increased from
0.837 to 0.839 when using the optimal unrolling value (which we
discovered empirically using grid search) of 150 seconds. Unrolling
also comes at a cost of increased session length and thus training
time - Table 6 demonstrates the effect of session unrolling on the
size of the training / validation and test sets on both datasets.
      </p>
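      <p>The unrolling scheme can be sketched as follows (illustrative only; emitting the final event, which has no successor, exactly once is our assumption):</p>

```python
import math

UNROLL_SECONDS = 150  # the optimal value found by grid search

def unroll(session):
    """Repeat each event in proportion to the dwelltime before the
    next event; the final event (with no successor) is emitted once."""
    out = []
    for cur, nxt in zip(session, session[1:]):
        repeats = max(1, math.ceil((nxt["t"] - cur["t"]) / UNROLL_SECONDS))
        out.extend([cur] * repeats)
    out.append(session[-1])
    return out

# A 360-second gap yields ceil(360 / 150) = 3 copies of the first event,
# matching the worked example in the text.
session = [{"t": 0, "item": "A"}, {"t": 360, "item": "B"}]
unrolled = unroll(session)
```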
    </sec>
    <sec id="sec-9">
      <title>Sequence Reversal</title>
      <p>
        From [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], we know that the most important item in a user session
is the last item (followed by the first item). Intuitively, the last
and first items a user browses in a session are most predictive of
purchase intent. To capitalize on this, we reversed the sequence
order for each session before presenting them as batches to the
model. Re-ordering the sequences provided an increase in test AUC
on the RecSys 2015 dataset of 0.005 - from 0.834 to 0.839.
      </p>
      <p>4.5.1 Model Architecture. The data embedding modules are
concatenated and presented to a configurable number of RNN layers
(typically 3), with a final linear layer combining the output of the
hidden units from the last layer. A sigmoid function is then
applied to calculate a confidence probability in class membership. The
model is trained by minimizing an unweighted binary cross entropy
loss:
l_n = − [y_n · log x_n + (1 − y_n) · log(1 − x_n)]
where:
x_n is the output label value from the model [0..1],
y_n is the target label value {0, 1}.</p>
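      <p>The loss behaviour can be checked numerically (an illustrative sketch):</p>

```python
import numpy as np

def bce(x, y, eps=1e-12):
    """Unweighted binary cross entropy for a single prediction:
    l_n = -[y_n * log(x_n) + (1 - y_n) * log(1 - x_n)]."""
    x = np.clip(x, eps, 1.0 - eps)
    return float(-(y * np.log(x) + (1.0 - y) * np.log(1.0 - x)))

# A confident correct prediction incurs near-zero loss, while a
# confident wrong prediction is penalized heavily.
low = bce(0.99, 1.0)
high = bce(0.99, 0.0)
```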
      <p>We conducted a grid search over the number of layers and layer
size by RNN type, as indicated in Table 7 below.</p>
      <sec id="sec-9-1">
        <title>Skip connections</title>
        <p>[Architecture figure, flattened during extraction: embeddings for
Time (width 10), Item (width 100), Price (width 10), Price variance
(width 10), Item category (width 10) and Item quantity (width 10)
feed LSTM1 (256), LSTM2 (256) and LSTM3 (256), followed by a
Linear (256) layer.]</p>
      </sec>
      <sec id="sec-9-2">
        <title>Shared h(t) and c(t)</title>
        <p>4.5.2 Hidden layer parameter sharing. One model design
decision worthy of elaboration is how hidden information (and cell
state for LSTM) is shared between the RNN layers. We found that
best results were obtained by re-using hidden state across RNN
layers - i.e. re-using the hidden state vector (and cell state vector for
LSTM) from the previous layer in the architecture and initializing
the next layer with these state vectors before presenting the next
layer with the output from the previous layer. Our intuition here
is that for Ecommerce datasets, the features learned by each layer
are closely related due to the fact that Ecommerce log / clickstream
data is very structured, therefore re-using hidden and cell states
from lower layers helps higher layers to converge on important
abstractions. Aggressive sharing of the hidden layers led to a very
significant improvement in AUC, increasing from 0.75 to 0.84 in
our single best model.</p>
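        <p>This sharing scheme can be sketched in PyTorch (illustrative only; toy sizes rather than the 256-unit layers, and the exact wiring is our reading of the description above):</p>

```python
import torch
import torch.nn as nn

hidden = 8  # toy size; the paper's layers use 256 units
layer1 = nn.LSTM(input_size=6, hidden_size=hidden, batch_first=True)
layer2 = nn.LSTM(input_size=hidden, hidden_size=hidden, batch_first=True)

x = torch.randn(2, 5, 6)            # (batch, sequence, features)
out1, (h1, c1) = layer1(x)
# Initialize the next layer with the previous layer's final hidden and
# cell state, then present it with the previous layer's output sequence.
out2, _ = layer2(out1, (h1, c1))
```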
      </sec>
    </sec>
    <sec id="sec-10">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>In this section we describe the experimental setup, and the results
obtained when comparing our best RNN model to GBM State of
the Art, and a comparison of diferent RNN variants (vanilla, GRU,
LSTM).
5.1</p>
    </sec>
    <sec id="sec-11">
      <title>Training Details</title>
      <p>
        Both datasets were split into a training set and validation set in a
90:10 ratio. The model was trained using the Adam optimizer [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
coupled with a binary cross entropy loss metric and a learning rate
annealer. Training was halted after 2 successive epochs of
worsening validation AUC. Table 8 illustrates the main hyperparameters
and settings used during training.
      </p>
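The early-stopping rule described above can be sketched as follows (our own minimal illustration; the actual training loop, model and AUC computation are elided):

```python
def should_stop(val_aucs, patience=2):
    """Stop once validation AUC has worsened for `patience`
    successive epochs relative to the best value seen so far."""
    best = float("-inf")
    worse_streak = 0
    for auc in val_aucs:
        if auc > best:
            best = auc
            worse_streak = 0   # improvement resets the streak
        else:
            worse_streak += 1
        if worse_streak >= patience:
            return True
    return False
```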
      <p>RNN
2</p>
      <p>GRU
2</p>
      <sec id="sec-11-1">
        <title>LSTM</title>
        <p>2</p>
        <p>We tested three main types of recurrent cells (vanilla RNN, GRU,
LSTM) as well as varying the number of cells per layer and layers
per model. While a 3-layer LSTM achieved the best performance,
vanilla RNNs which possess no memory mechanism are able to
achieve a competitive AUC. The datasets are heavily weighted
towards shorter session lengths (even after unrolling - see Figure 4
and 9). We posit that the power of LSTM and GRU is not needed
for the shorter sequences of the dataset, and colloquial recurrence
with embeddings has the capacity to model sessions over a short
enough sequence length.
5.2</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Overall results</title>
      <p>
        The metric we used in our analysis was Area Under the ROC Curve,
or AUC. AUC is insensitive to class imbalance, and the raw
predictions from [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] were available to us, so a detailed, like-for-like
AUC comparison on the test set is the fairest model comparison.
The organizers of the challenge also released the solution to the
challenge, enabling the test set to be used. After training, the LSTM
model AUC obtained on the test set was 0.839 - 98.4% of the AUC
(0.853) obtained by the SOTA model. As the subsequent experiments
demonstrate, a combination of feature embeddings and model
architecture decisions contribute to this performance. For all model
architecture variants tested (see Table 7), the best performance was
achieved after training for a small number of epochs (2 - 3). This
held true for both datasets.
      </p>
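For reference, AUC can be computed directly from predicted scores via the Mann-Whitney pairwise-ranking formulation (a minimal sketch of our own; names are illustrative):

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    scores: predicted probabilities; labels: 0 (clicker) / 1 (buyer).
    Counts the fraction of positive/negative pairs ranked correctly,
    crediting ties with 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes present")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))
```

Because only the ranking of scores matters, the measure is unaffected by how few positives (buyers) there are relative to negatives.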
      <p>Our LSTM model achieved within 98% of the SOTA GBM
performance on the RecSys 2015 dataset, and outperformed our GBM
model by 0.5% on the Retail Rocket dataset, as table 9 shows.
5.3</p>
    </sec>
    <sec id="sec-13">
      <title>Analysis</title>
      <p>We constructed a number of tests to analyze model performance
based on interesting subsets of the test data.</p>
      <sec id="sec-13-1">
        <title>Detailed analysis</title>
        <p>Table 9: Classification performance measured using Area
under the ROC curve (AUC) of the GBM and LSTM models
on the RecSys 2015 and Retail Rocket datasets.
Model | RecSys 2015 | Retail Rocket
LSTM  | 0.839       | 0.838
GBM   | 0.853       | 0.834</p>
        <p>ROC curves for LSTM and State of the art models
1.0
0.8</p>
        <p>5.3.1 Session length. Figure 4 graphs the best RNN model (a
3layer LSTM with 256 cells per layer) and the SOTA model, with AUC
scores broken down by session length. For context, the quantities
for each session length in the test set is also provided. Both models
underperform for sessions with just one click - clearly it is dificult
to split clickers from buyers with such a small input signal. For
the remaining session lengths, the relative model performance is
consistent, although the LSTM model does start to close the gap
after sessions with length &gt; 10.</p>
        <p>5.3.2 User dwelltime. Given that we unrolled long-running events
in order to provide more input to the RNN models, we evaluated
1,200,000
1,000,000
0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15</p>
        <p>Session length
the relative performance of each model when presented with
sessions with any dwelltime &gt; 1. As Figure 5 shows, LSTM is closer to
SOTA for this subset of sessions and indeed outperforms SOTA for
session length = 14, but the volume of sessions afected ( 5,433 ) is
not enough to materially impact the overall AUC.</p>
        <p>5.3.3 Item price. Like most Ecommerce catalogues, the
catalogue under consideration here displays a considerable range of
item prices. We first selected all sessions where any single item
price was &gt; 10,000 (capturing 544,014 sessions) and then user
sessions where the price was &lt;= 750 (roughly 25% of the maximum
price - capturing 1,063,034 sessions). Figures 6 and 7 show the
relative model performance for each session grouping. As with other
selections, the relative model performance is broadly consistent
there is no region where LSTM either dramatically outperforms or
underperforms the SOTA model.</p>
        <p>5.3.4 Gated vs un-gated RNNs. Table 7 shows that while gated
RNNs clearly outperform ungated or vanilla RNNs, the diference
is 0.02 of AUC which is less than might be expected. We believe
the reason for this is that the dataset contains many more shorter</p>
        <p>5.3.5 End-to-end learning. To measure the efect of allowing
(or not) the gradients from the output loss to flow unencumbered
throughout the model (including the embeddings), we froze the
embedding layer so no gradient updates were applied and then
trained the network. Model performance decreased to an AUC of
0.808 and training time increased by 3x to reach this AUC.
Depriving the model of the ability to dynamically modify the input data
representation using gradients derived from the output loss metric
reduces its ability to solve the classification problem posed.
5.4</p>
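The freezing experiment can be illustrated schematically (a toy gradient-descent loop of our own devising, not the paper's PyTorch code; in PyTorch the equivalent is setting requires_grad = False on the embedding weights):

```python
# Toy illustration of freezing one parameter group during training.
# Gradient descent on the objective p^2 for each parameter, but
# parameters in the frozen set receive no updates.
params = {"embedding": 1.0, "output": 1.0}
frozen = {"embedding"}  # no gradient updates applied here

def grad(p):
    return 2.0 * p  # gradient of the toy objective p^2

for step in range(5):
    for name in params:
        if name in frozen:
            continue  # frozen parameters are skipped
        params[name] -= 0.1 * grad(params[name])

# After training, the frozen embedding is unchanged while the
# output weight has moved toward the optimum.
```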
      </sec>
    </sec>
    <sec id="sec-14">
      <title>Transferring to another dataset</title>
      <p>
        The GBM model itself used in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] is not publicly available; however, we were able to use the GBM model described in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Figure
8 shows the respective ROC curves for the RNN (LSTM) and GBM
models when they are ported to the Retail Rocket dataset. Both
models still perform well; however, the LSTM model slightly
outperforms the GBM model (AUC of 0.837 vs 0.834).
      </p>
      <p>A deeper analysis of the Area under the ROC curve demonstrates
how the characteristics of the dataset can impact on model
performance. The Retail Rocket dataset is heavily weighted towards
single-click sessions as Figure 9 shows. LSTM out-performs GBM
for these sessions - which can be attributed more to the learned
embeddings since there is no sequence to process. GBM by
contrast can extract only limited value from single-click sessions as
important feature components such as dwelltime, similarity etc. are
unavailable.</p>
      <p>ROC curves: LSTM and GBM models (Retail Rocket)
LSTM</p>
      <p>GBM
0.0
0.2
0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15</p>
      <p>Session length</p>
    </sec>
    <sec id="sec-15">
      <title>5.5 Training time and resources used</title>
      <p>
        We used PyTorch [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] to construct and train the LSTM models
while XGBoost [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was used to train the GBM models. The LSTM
implementation was trained on a single Nvidia GeForce GTX TITAN
Black (circa 2014 and with single-precision performance of 5.1
TFLOPs vs a 2017 GTX 1080 Ti with 11.3 TFLOPs) and consumed
between 1 and 2 GB of RAM. The results reported were obtained
after 3 epochs of training on the full dataset - an average of 6
hours (2 hours per epoch). This compares favourably to the training
times and resources reported in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], where 150 machines were
used for 12 hours. However, 12 hours denotes the time needed
to train two models (session purchase and item ranking) whereas
we train just one model. While we cannot compare GBM (which
was trained on a CPU cluster) to RNN (trained on a single GPU
with a different parallelization strategy) directly, we note that the
hardware resources required for our model are modest and hence
accessible to almost any commercial or academic setup. In addition,
real-world Ecommerce datasets are large [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and change rapidly,
therefore usable models must be able to consume large datasets
and be re-trained readily to cater for new items / documents.
      </p>
    </sec>
    <sec id="sec-16">
      <title>5.6 Conclusions and future work</title>
      <p>We presented a Recurrent Neural Network (RNN) model which
recovers 98.4% of current SOTA performance on the user purchase
prediction problem in Ecommerce without using explicit features.
On a second dataset, our model fractionally exceeds SOTA
performance. The model is straightforward to implement, generalizes to
different datasets with comparable performance, and can be trained
with modest hardware resource requirements.</p>
      <p>It is promising that gated RNNs with no feature engineering can
be competitive with Gradient Boosted Machines on short session
lengths and structured data - GBM is a more established model
choice in the domain of Recommender Systems and Ecommerce
in general. We believe additional work on input representation
(while still avoiding feature engineering) can further improve
results for both gated and non-gated RNNs. One area of focus will
be to investigate how parameter sharing at the hidden layer helps
RNNs to operate on short sequences of structured data prevalent
in Ecommerce.</p>
      <p>Lastly, we note that although our approach requires no feature
engineering, it is also inherently transductive - we plan to
investigate embedding generation and maintenance approaches for new
unseen items / documents to add an inductive capability to the
architecture.</p>
    </sec>
    <sec id="sec-17">
      <title>6 ACKNOWLEDGEMENTS</title>
      <p>
        We would like to thank the authors of [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] for making their
original test submission available, and the organizers of the original
challenge for releasing the solution file after the competition ended,
enabling us to carry out our comparisons.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Barkan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Noam</given-names>
            <surname>Koenigstein</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Item2Vec: Neural Item Embedding for Collaborative Filtering</article-title>
          .
          <source>CoRR abs/1603</source>
          .04259 (
          <year>2016</year>
          ). arXiv:
          <volume>1603</volume>
          .04259 http: //arxiv.org/abs/1603.04259
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>David</given-names>
            <surname>Ben-Shimon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Tsikinovsky</surname>
          </string-name>
          , Michael Friedmann, Bracha Shapira, Lior Rokach, and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Hoerle</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>RecSys Challenge 2015 and the YOOCHOOSE Dataset</article-title>
          .
          <source>In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys '15)</source>
          . ACM, New York, NY, USA,
          <fpage>357</fpage>
          -
          <lpage>358</lpage>
          . https://doi.org/10. 1145/2792838.2798723
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Veronika</given-names>
            <surname>Bogina</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tsvi</given-names>
            <surname>Kuflik</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Incorporating Dwell Time in SessionBased Recommendations with Recurrent Neural Networks</article-title>
          . In RecTemp@RecSys.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>XGBoost: A Scalable Tree Boosting System</article-title>
          .
          <source>In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16)</source>
          . ACM, New York, NY, USA,
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          . https://doi.org/10.1145/2939672.2939785
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>KyungHyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merrienboer, Dzmitry Bahdanau, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>On the Properties of Neural Machine Translation: Encoder-Decoder Approaches</article-title>
          .
          <source>CoRR abs/1409</source>
          .1259 (
          <year>2014</year>
          ). arXiv:
          <volume>1409</volume>
          .1259 http://arxiv.org/abs/1409. 1259
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Xiao</given-names>
            <surname>Ding</surname>
          </string-name>
          , Ting Liu, Junwen Duan, and
          <string-name>
            <surname>Jian-Yun Nie</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Mining User Consumption Intention from Social Media Using Domain Adaptive Convolutional Neural Network</article-title>
          .
          <source>In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15)</source>
          . AAAI Press,
          <fpage>2389</fpage>
          -
          <lpage>2395</lpage>
          . http://dl.acm.org/citation.cfm? id=
          <volume>2886521</volume>
          .
          <fpage>2886653</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Jerome</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Greedy Function Approximation: A Gradient Boosting Machine</article-title>
          .
          <source>Annals of Statistics</source>
          <volume>29</volume>
          (
          <year>2000</year>
          ),
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Learning</article-title>
          . MIT Press. http://www.deeplearningbook.org.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Mihajlo</given-names>
            <surname>Grbovic</surname>
          </string-name>
          , Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and
          <string-name>
            <given-names>Doug</given-names>
            <surname>Sharp</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>E-commerce in Your Inbox: Product Recommendations at Scale</article-title>
          .
          <source>CoRR abs/1606</source>
          .07154 (
          <year>2016</year>
          ). arXiv:
          <volume>1606</volume>
          .07154 http://arxiv.org/abs/1606.07154
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Balázs</surname>
            <given-names>Hidasi</given-names>
          </string-name>
          , Alexandros Karatzoglou, Linas Baltrunas, and
          <string-name>
            <given-names>Domonkos</given-names>
            <surname>Tikk</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Session-based Recommendations with Recurrent Neural Networks</article-title>
          .
          <source>CoRR abs/1511</source>
          .06939 (
          <year>2015</year>
          ). arXiv:
          <volume>1511</volume>
          .06939 http://arxiv.org/abs/1511.06939
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Comput. 9</source>
          ,
          <issue>8</issue>
          (Nov.
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . https://doi.org/10.1162/neco.
          <year>1997</year>
          .
          <volume>9</volume>
          . 8.
          <fpage>1735</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Dietmar</surname>
            <given-names>Jannach</given-names>
          </string-name>
          , Malte Ludewig, and
          <string-name>
            <given-names>Lukas</given-names>
            <surname>Lerche</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Session-based item recommendation in e-commerce: on short-term intents, reminders, trends and discounts. User Modeling and User-Adapted Interaction 27 (</article-title>
          <year>2017</year>
          ),
          <fpage>351</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Andrej</surname>
            <given-names>Karpathy</given-names>
          </string-name>
          , Justin Johnson, and
          <string-name>
            <surname>Fei-Fei Li</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Visualizing and Understanding Recurrent Networks</article-title>
          .
          <source>CoRR abs/1506</source>
          .
          <year>02078</year>
          (
          <year>2015</year>
          ).
          <source>arXiv:1506</source>
          .
          <year>02078</year>
          http://arxiv.org/abs/1506.02078
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>CoRR abs/1412</source>
          .6980 (
          <year>2014</year>
          ). arXiv:
          <volume>1412</volume>
          .6980 http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Collaborative Filtering with Temporal Dynamics</article-title>
          .
          <source>In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09)</source>
          . ACM, New York, NY, USA,
          <fpage>447</fpage>
          -
          <lpage>456</lpage>
          . https://doi.org/10.1145/1557019.1557072
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Zachary</surname>
            <given-names>Chase</given-names>
          </string-name>
          <string-name>
            <surname>Lipton</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A Critical Review of Recurrent Neural Networks for Sequence Learning</article-title>
          .
          <source>CoRR abs/1506</source>
          .00019 (
          <year>2015</year>
          ). arXiv:
          <volume>1506</volume>
          .00019 http: //arxiv.org/abs/1506.00019
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Malte</given-names>
            <surname>Ludewig</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Evaluation of Session-based Recommendation Algorithms</article-title>
          . CoRR abs/
          <year>1803</year>
          .09587 (
          <year>2018</year>
          ). arXiv:
          <year>1803</year>
          .09587 http://arxiv.org/abs/
          <year>1803</year>
          .09587
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR abs/1301</source>
          .3781 (
          <year>2013</year>
          ). arXiv:
          <volume>1301</volume>
          .3781 http://arxiv.org/abs/1301.3781
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Razvan</surname>
            <given-names>Pascanu</given-names>
          </string-name>
          , Tomas Mikolov, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>On the Difficulty of Training Recurrent Neural Networks</article-title>
          .
          <source>In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML'13)</source>
          . JMLR.org, III-1310 - III-1318. http://dl.acm.org/citation.cfm?id=
          <volume>3042817</volume>
          .
          <fpage>3043083</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Adam</surname>
            <given-names>Paszke</given-names>
          </string-name>
          , Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
          <string-name>
            <surname>Zachary</surname>
            <given-names>DeVito</given-names>
          </string-name>
          , Zeming Lin, Alban Desmaison, Luca Antiga, and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lerer</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automatic differentiation in PyTorch</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <volume>1532</volume>
          -
          <fpage>1543</fpage>
          . http://www.aclweb.org/anthology/ D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.L.</given-names>
            <surname>Phillips</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <source>Pricing and Revenue Optimization</source>
          . Stanford University Press. https://books.google.co.uk/books?id=bXsyO06qikEC
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Reid</surname>
            <given-names>Pryzant</given-names>
          </string-name>
          , Young joo Chung, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Predicting Sales from the Language of Product Descriptions</article-title>
          .
          <source>In ACM SIGIR Forum. ACM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Rendle</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Factorization Machines</article-title>
          .
          <source>In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM '10)</source>
          . IEEE Computer Society, Washington, DC, USA,
          <fpage>995</fpage>
          -
          <lpage>1000</lpage>
          . https://doi.org/10.1109/ICDM.
          <year>2010</year>
          .127
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Retailrocket</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Retailrocket recommender system dataset</article-title>
          . https://www. kaggle.com/retailrocket/ecommerce-dataset. (
          <year>2017</year>
          ). [Online; accessed 01-
          <fpage>Feb2018</fpage>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Romov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Evgeny</given-names>
            <surname>Sokolov</surname>
          </string-name>
          .
          <year>2015</year>
          . RecSys Challenge 2015:
          <article-title>Ensemble Learning with Categorical Features</article-title>
          .
          <source>In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge)</source>
          . ACM, New York, NY, USA, Article
          <volume>1</volume>
          , 4 pages. https://doi.org/10.1145/2813448.2813510
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <source>1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition</source>
          , Vol.
          <volume>1</volume>
          . MIT Press, Cambridge, MA, USA,
          <source>Chapter Learning Internal Representations by Error Propagation</source>
          ,
          <fpage>318</fpage>
          -
          <lpage>362</lpage>
          . http://dl.acm.org/citation.cfm?id=
          <volume>104279</volume>
          .
          <fpage>104293</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          , Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Machine Learning: The High Interest Credit Card of Technical Debt</article-title>
          .
          <source>In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Humphrey</given-names>
            <surname>Sheil</surname>
          </string-name>
          and
          <string-name>
            <given-names>Omer</given-names>
            <surname>Rana</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Classifying and Recommending Using Gradient Boosted Machines and Vector Space Models</article-title>
          .
          <source>In Advances in Computational Intelligence Systems. UKCI</source>
          <year>2017</year>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.), Vol.
          <volume>650</volume>
          . Springer, Cham. https://doi.org/10.1007/978-3-319-66939-7_18
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Yong Kiam</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xinxing</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yong</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Improved Recurrent Neural Networks for Session-based Recommendations</article-title>
          .
          <source>CoRR abs/1606.08117</source>
          (
          <year>2016</year>
          ). arXiv:1606.08117 http://arxiv.org/abs/1606.08117
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Arthur</given-names>
            <surname>Toth</surname>
          </string-name>
          , Louis Tan, Giuseppe Di Fabbrizio, and
          <string-name>
            <given-names>Ankur</given-names>
            <surname>Datta</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Predicting Shopping Behavior with Mixture of RNNs</article-title>
          .
          <source>In ACM SIGIR Forum. ACM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Trotman</surname>
          </string-name>
          , Jon Degenhardt, and
          <string-name>
            <given-names>Surya</given-names>
            <surname>Kallumadi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The Architecture of eBay Search</article-title>
          .
          <source>In ACM SIGIR Forum. ACM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Vezhnevets</surname>
          </string-name>
          and
          <string-name>
            <given-names>Olga</given-names>
            <surname>Barinova</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Avoiding Boosting Overfitting by Removing Confusing Samples</article-title>
          .
          <source>In Proceedings of the 18th European Conference on Machine Learning (ECML '07)</source>
          . Springer-Verlag, Berlin, Heidelberg,
          <fpage>430</fpage>
          -
          <lpage>441</lpage>
          . https://doi.org/10.1007/978-3-540-74958-5_40
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Maksims</given-names>
            <surname>Volkovs</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Two-Stage Approach to Item Recommendation from User Sessions</article-title>
          .
          <source>In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge)</source>
          . ACM, New York, NY, USA, Article
          <volume>3</volume>
          , 4 pages. https://doi.org/10.1145/2813448.2813512
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Chao-Yuan</given-names>
            <surname>Wu</surname>
          </string-name>
          , Amr Ahmed, Alex Beutel,
          <string-name>
            <given-names>Alexander J.</given-names>
            <surname>Smola</surname>
          </string-name>
          , and
          <string-name>
            <given-names>How</given-names>
            <surname>Jing</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Recurrent Recommender Networks</article-title>
          .
          <source>In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17)</source>
          . ACM, New York, NY, USA,
          <fpage>495</fpage>
          -
          <lpage>503</lpage>
          . https://doi.org/10.1145/3018661.3018689
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Peng</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaocong</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yitao</given-names>
            <surname>Duan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>E-Commerce Item Recommendation Based on Field-aware Factorization Machine</article-title>
          .
          <source>In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge)</source>
          . ACM, New York, NY, USA, Article
          <volume>2</volume>
          , 4 pages. https://doi.org/10.1145/2813448.2813511
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Yu</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hao</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yikang</given-names>
            <surname>Liao</surname>
          </string-name>
          , Beidou Wang, Ziyu Guan, Haifeng Liu, and
          <string-name>
            <given-names>Deng</given-names>
            <surname>Cai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>What to Do Next: Modeling User Behaviors by Time-LSTM</article-title>
          .
          <source>In IJCAI.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>