<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Context Features in Sequence-Aware Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarai Mizrachi</string-name>
          <email>sarai.mizrachi@booking.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Levin</string-name>
          <email>pavel.levin@booking.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Booking.com</institution>
          , Tel Aviv,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>There are several important design choices that machine learning practitioners need to make when incorporating predictors into RNN-based contextual recommender systems. Many of the currently reported findings about these decisions focus on the setting where predicted items take on values from the space of sequence items. This work provides an empirical evaluation of some straightforward approaches to dealing with such problems on a real-world, large-scale prediction problem from the travel domain, where predicted entities do not live in the space of sequence items.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Similarly, when recommending a travel destination we can use the ordered sequence of previously
booked destinations as input.</p>
      <p>However, some situations require predictions in an output space which is different from the input
space. A classic example from the field of natural language processing is document classification: the
document is represented by a sequence of words and the prediction happens in the space of possible
topics, intents or sentiments. In the travel domain we may want to recommend a country to visit next
based on a user’s past history of accommodation bookings (cities, accommodation types, lengths of
stay, etc.). User history items take on different values from prediction items.</p>
      <p>
        In both situations (sequence completion and different-domain prediction) recurrent neural networks
(RNNs), including their gated variants (e.g. GRU [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], LSTM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), are commonly used. Both problems
become more complex if we have token-level and/or sequence-level features that we want to factor in.
In our destination prediction example we could use a user-level feature such as the user’s home country,
as well as booking-specific features (lengths of stay, time since last booking, etc.).
      </p>
      <p>This work focuses on the second type of sequence-aware recommendation problem, specifically
when we do not assume the predicted items to come from the same space as sequence items. We
look at several basic ways of incorporating context into RNN-based recommendation systems and
benchmark their performance.</p>
    </sec>
    <sec id="sec-2">
      <title>THE SETUP</title>
      <p>
        Our goal is to compare different approaches to accounting for token- and sequence-level context in
RNN-based recommendation systems. An important distinction between this work and much of the previously
reported work (e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) is that we do not assume that the items in our sequence and the
predicted items come from the same space. This setup can be thought of as RNN Encoder → Decoder,
where the decoder can be as simple as softmax regression in the case of a categorical prediction.
      </p>
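      <p>Each user’s data point is represented by a sequence of <inline-formula><tex-math>n_u</tex-math></inline-formula> items <inline-formula><tex-math>x^{(u)} = x^{(u)}_{1:n_u}</tex-math></inline-formula>, I item-level
feature sequences <inline-formula><tex-math>f^{(u)} = \{ f^{(u)}_{i,1:n_u} : i \in 1, \ldots, I \}</tex-math></inline-formula> and sequence-level features <inline-formula><tex-math>s^{(u)}</tex-math></inline-formula>. One comparison we
make is between two popular techniques for fusing the embedded feature information <inline-formula><tex-math>f^{(u)}_{i,k}</tex-math></inline-formula> with the item
embeddings <inline-formula><tex-math>x^{(u)}_k</tex-math></inline-formula> at each time step k (concatenation vs element-wise multiplication). Another issue we look at is how
best to input the sequence-level information <inline-formula><tex-math>s^{(u)}</tex-math></inline-formula>: by fusing it at each time step with the items along with the
token-level features, or by embedding it in the item space and simply using it as an additional
token in the sequence.</p>
      <p>Throughout, <inline-formula><tex-math>e_t</tex-math></inline-formula> denotes the RNN input for time step t and W the linear model for multiclass item prediction. The
rest of this section looks at how exactly the RNN inputs <inline-formula><tex-math>e_t</tex-math></inline-formula> should be derived.</p>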
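      <p>The encoder and decoder roles described above can be illustrated with a small, dependency-free sketch. It is not the model used in the experiments: it assumes a plain tanh RNN cell in place of a GRU, randomly initialised weights, and toy sizes, and all names (rnn_step, predict, w_in, w_rec, w_out) are hypothetical. The matrix w_out plays the role of W, and each inner list in sequence plays the role of one RNN input.</p>
      <preformat preformat-type="code">
```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(e_t, h_t, w_in, w_rec):
    """One step of a plain tanh RNN cell (an assumed stand-in for a GRU)."""
    pre = [a + b for a, b in zip(matvec(w_in, e_t), matvec(w_rec, h_t))]
    return [math.tanh(p) for p in pre]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def predict(inputs, w_in, w_rec, w_out):
    """Encode the inputs step by step, then decode with softmax regression."""
    h = [0.0] * len(w_rec)              # initial hidden state
    for e_t in inputs:
        h = rnn_step(e_t, h, w_in, w_rec)
    return softmax(matvec(w_out, h))    # class probabilities from the last state


def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only (no training here)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_hidden, n_classes = 4, 3, 5     # toy sizes, not the paper's settings
w_in = random_matrix(d_hidden, d_in, rng)
w_rec = random_matrix(d_hidden, d_hidden, rng)
w_out = random_matrix(n_classes, d_hidden, rng)

sequence = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(6)]
probs = predict(sequence, w_in, w_rec, w_out)
```
      </preformat>
      <p>Here probs holds one probability per output class. The output space (5 classes) is deliberately unrelated to the input dimension, mirroring the setting where predicted items do not come from the space of sequence items.</p>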
    </sec>
    <sec id="sec-3">
      <title>Baselines</title>
      <p>We look at two simple baselines which do not use any sequence or past-history information, and
one sequence-aware baseline with no context features. The first baseline recommends the most
popular items based on the last token of the sequence: <inline-formula><tex-math>\hat{y}^{(u)} = \arg\max_y P(y \mid x^{(u)}_{n_u})</tex-math></inline-formula>. The second
baseline recommends the most popular items according to the sequence-level features: <inline-formula><tex-math>\hat{y}^{(u)} = \arg\max_y P(y \mid s^{(u)})</tex-math></inline-formula>. Our third baseline is a standard GRU sequence classification model with no
context features.</p>
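      <p>The two popularity baselines amount to conditional frequency counts. The sketch below is one way to compute them, assuming the conditional probabilities are estimated by simple counting on training data. The function names and the toy data are hypothetical.</p>
      <preformat preformat-type="code">
```python
from collections import Counter, defaultdict


def fit_popularity(keys, targets):
    """Count how often each target occurs for each conditioning key."""
    counts = defaultdict(Counter)
    for key, target in zip(keys, targets):
        counts[key][target] += 1
    return counts


def recommend(counts, key, k=1):
    """Return the k most frequent targets for a key (empty list if unseen)."""
    return [target for target, _ in counts[key].most_common(k)]


# Hypothetical toy data: last booked city, home country, and target country.
last_city = ["Paris", "Paris", "Rome", "Paris", "Rome"]
home_country = ["IL", "IL", "NL", "NL", "IL"]
target = ["Spain", "Italy", "Greece", "Spain", "Greece"]

by_last_item = fit_popularity(last_city, target)        # last-item baseline
by_seq_feature = fit_popularity(home_country, target)   # sequence-feature baseline

top_for_paris = recommend(by_last_item, "Paris")        # ["Spain"]
top_for_nl = recommend(by_seq_feature, "NL", k=2)
```
      </preformat>
      <p>Taking the most frequent target per key is the argmax of the empirical conditional distribution, so no normalisation is needed for ranking.</p>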
    </sec>
    <sec id="sec-4">
      <title>Embedding sequence-level features in the items space</title>
      <p>The idea behind this approach is that sequence- or user-level features can simply be treated as extra
tokens in the sequence. This means that in our RNN architectures those features should be represented
in the same space as the items, i.e. we need a single embedding matrix <inline-formula><tex-math>E \in \mathbb{R}^{(K+M) \times d_{items}}</tex-math></inline-formula> to represent
K items and M levels of sequence-level features in <inline-formula><tex-math>d_{items}</tex-math></inline-formula> dimensions. All token-level features would
still be embedded in separate vector spaces and represented by matrices <inline-formula><tex-math>E_j \in \mathbb{R}^{|F_j| \times d_{F_j}}</tex-math></inline-formula>, where <inline-formula><tex-math>d_{F_j}</tex-math></inline-formula> is
the embedding dimension of feature j and <inline-formula><tex-math>|F_j|</tex-math></inline-formula> is its cardinality. The following two approaches discuss
how we merge token-level embeddings with item embeddings.</p>
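      <p>A shared embedding matrix with K + M rows implies a shared index space for items and feature levels. The sketch below shows one possible indexing scheme. It assumes that feature levels are placed after the items and that the sequence-level token is put at the front of the sequence. Neither choice is stated in the text, and the names are hypothetical.</p>
      <preformat preformat-type="code">
```python
def build_shared_vocab(items, feature_levels):
    """Map K items to rows 0..K-1 and M feature levels to rows K..K+M-1."""
    item_row = {item: i for i, item in enumerate(items)}
    offset = len(items)
    level_row = {lvl: offset + j for j, lvl in enumerate(feature_levels)}
    return item_row, level_row


def encode_sequence(item_seq, seq_feature, item_row, level_row):
    """Place the sequence-level feature in front of the items as one more token.

    Putting it first is an assumption made for this sketch.
    """
    return [level_row[seq_feature]] + [item_row[item] for item in item_seq]


items = ["Amsterdam", "Paris", "Rome"]      # K = 3 (toy values)
home_countries = ["IL", "NL"]               # M = 2 (toy values)
item_row, level_row = build_shared_vocab(items, home_countries)

token_ids = encode_sequence(["Paris", "Rome"], "NL", item_row, level_row)
n_rows = len(item_row) + len(level_row)     # number of rows of E, i.e. K + M
```
      </preformat>
      <p>With these toy values token_ids is [4, 1, 2]: the home country occupies row 4 of the shared matrix and the two cities occupy rows 1 and 2.</p>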
      <p>Concatenation merge. One of the more popular and straightforward approaches for merging item and
feature embeddings is simply by concatenating them (see Fig. 1).</p>
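      <p><disp-formula><tex-math>h_{k+1} = \mathrm{RNN}(\mathrm{concat}(x^{(u)}_k, f^{(u)}_{1,k}, \ldots, f^{(u)}_{I,k}), h_k)</tex-math></disp-formula>
One obvious advantage of this approach is the ability to choose different embedding dimensions for
each feature <inline-formula><tex-math>F_i</tex-math></inline-formula> according to its cardinality and distribution of values.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>Precision@k results for the models.</p></caption>
        <table>
          <thead>
            <tr><th>Model</th><th colspan="2">Precision@k</th></tr>
          </thead>
          <tbody>
            <tr><th colspan="3">Baselines</th></tr>
            <tr><td>Last item only</td><td>0.206</td><td>0.460</td></tr>
            <tr><td>Seq features only</td><td>0.196</td><td>0.481</td></tr>
            <tr><td>Items only (no features)</td><td>0.608</td><td>0.788</td></tr>
            <tr><th colspan="3">Seq-level features as seq items</th></tr>
            <tr><td>Concatenation</td><td>0.657</td><td>0.823</td></tr>
            <tr><td>Multiplication</td><td>0.648</td><td>0.811</td></tr>
            <tr><th colspan="3">Seq-level features as token-level features</th></tr>
            <tr><td>Concatenation</td><td>0.656</td><td>0.822</td></tr>
            <tr><td>Multiplication</td><td>0.644</td><td>0.808</td></tr>
          </tbody>
        </table>
      </table-wrap>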
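      <p>The concatenation merge described above can be sketched for a single time step as follows. The embedding values and dimensions are made up for illustration and concat_merge is a hypothetical helper name. The point is only that each feature may keep its own dimension.</p>
      <preformat preformat-type="code">
```python
def concat_merge(item_vec, feature_vecs):
    """Concatenate the item embedding with every token-level feature embedding."""
    merged = list(item_vec)
    for vec in feature_vecs:
        merged.extend(vec)
    return merged


# Toy embeddings for one time step; dimensions may differ per feature.
item_vec = [0.1, 0.2, 0.3, 0.4]             # item embedding, dimension 4
days_since_vec = [0.5, 0.6]                 # first feature, dimension 2
accommodation_vec = [0.7, 0.8, 0.9]         # second feature, dimension 3

e_k = concat_merge(item_vec, [days_since_vec, accommodation_vec])
```
      </preformat>
      <p>The resulting RNN input has dimension 4 + 2 + 3 = 9, i.e. the sum of the individual embedding dimensions.</p>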
      <p>
        Multiplication merge. Another popular way of fusing embeddings is through element-wise
multiplication (Fig. 2). This approach forces us to have <inline-formula><tex-math>d_{items} = d_{F_j}, j \in \{1, \ldots, I\}</tex-math></inline-formula>. When <inline-formula><tex-math>I &gt; 1</tex-math></inline-formula>, i.e. we
have more than one token-level feature, we follow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and first apply element-wise summation to all
features, and only then element-wise multiply the result with the item embedding.
      </p>
      <p><disp-formula><tex-math>h_{k+1} = \mathrm{RNN}(x^{(u)}_k \odot [f^{(u)}_{1,k} + \ldots + f^{(u)}_{I,k}], h_k)</tex-math></disp-formula></p>
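      <p>A minimal sketch of the multiplication merge for one time step is given below, with toy values and a hypothetical helper name. It sums the feature embeddings first and then multiplies element-wise with the item embedding, and it rejects inputs whose dimensions differ, since the merge is only defined when all embeddings share one dimension.</p>
      <preformat preformat-type="code">
```python
def multiply_merge(item_vec, feature_vecs):
    """Sum the feature embeddings, then multiply element-wise with the item."""
    dim = len(item_vec)
    if any(len(vec) != dim for vec in feature_vecs):
        raise ValueError("all embeddings must share one dimension")
    summed = [sum(values) for values in zip(*feature_vecs)]
    return [x * f for x, f in zip(item_vec, summed)]


# Toy embeddings for one time step, all of dimension 3.
item_vec = [1.0, 2.0, 3.0]
days_since_vec = [0.5, 0.5, 0.5]
accommodation_vec = [1.5, 0.5, -0.5]

e_k = multiply_merge(item_vec, [days_since_vec, accommodation_vec])
# summed features: [2.0, 1.0, 0.0], so e_k is [2.0, 2.0, 0.0]
```
      </preformat>
      <p>Unlike concatenation, the RNN input keeps the dimension of the item embedding no matter how many features are merged.</p>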
    </sec>
    <sec id="sec-5">
      <title>Fusing sequence-level features with the items</title>
      <p>Another approach to sequence-level features that we consider is treating them as additional
token-level features. Since sequence-level features do not change across time steps, we
merge the same values with each sequence item. As before, we consider two basic merge functions:
concatenation and element-wise multiplication.</p>
      <p>Concatenation merge. The only difference here from the concatenation model above is that now we
concatenate an extra feature embedding to our items (Fig. 3). This lets us have shorter sequences, but
the input dimension of our RNN needs to be bigger.</p>
      <p>Multiplication merge. In this scenario (Fig. 4) all embedding dimensions need to be equal. As before, we
first sum the feature embedding vectors and then element-wise multiply them with item embeddings.</p>
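      <p>Treating a sequence-level feature as a token-level feature means repeating its embedding at every time step before merging. The sketch below illustrates this for both merge functions, using toy two-dimensional embeddings and hypothetical helper names.</p>
      <preformat preformat-type="code">
```python
def concat(item_vec, feature_vecs):
    """Concatenation merge for one time step."""
    return list(item_vec) + [v for vec in feature_vecs for v in vec]


def multiply(item_vec, feature_vecs):
    """Sum the features, then multiply element-wise with the item embedding."""
    summed = [sum(values) for values in zip(*feature_vecs)]
    return [x * f for x, f in zip(item_vec, summed)]


def fuse_per_step(item_vecs, token_feature_vecs, seq_vec, merge):
    """Apply a merge at each step, reusing the same sequence-level embedding."""
    inputs = []
    for item_vec, step_features in zip(item_vecs, token_feature_vecs):
        inputs.append(merge(item_vec, list(step_features) + [seq_vec]))
    return inputs


item_vecs = [[1.0, 2.0], [3.0, 4.0]]                 # two time steps
token_feature_vecs = [[[1.0, 0.0]], [[0.0, 1.0]]]    # one feature per step
home_country_vec = [1.0, 1.0]                        # constant across steps

concat_inputs = fuse_per_step(
    item_vecs, token_feature_vecs, home_country_vec, concat)
mult_inputs = fuse_per_step(
    item_vecs, token_feature_vecs, home_country_vec, multiply)
```
      </preformat>
      <p>The sequence length stays at the number of items in both cases. With the settings reported in the experiments (dimension 50 for items and all features, two token-level features and one sequence-level feature), concatenation would give an RNN input of 4 × 50 = 200 values per step, while multiplication keeps it at 50.</p>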
    </sec>
    <sec id="sec-6">
      <title>DATASET AND EXPERIMENTS</title>
      <p>We run our experiments on a proprietary travel dataset of 30 million travellers from 250 different
countries or territories, sampled from the last 3.5 years. All users in our dataset made at least three
international bookings. To benchmark the performance of our approaches we predict each user’s last visited
country based on a sequence of visited destinations (cities, villages, etc.). The gap between the last
booking in the sequence and the target (last visited country) is at least one week.</p>
      <p>We used one sequence-level feature, the traveller’s home country, and two token-level features (days
since last trip, accommodation type). Our sequence-level feature clearly takes on different values
from our items (countries vs cities); however, both are geographical entities, so it is unclear a priori
whether embedding them in the same vector space would hurt performance. We
evaluate how models perform by measuring precision@k (k ∈ {1, 4, 10}) on data from an additional
1,500,000 users. To be consistent we use an embedding dimension of 50 for all items and features in all
models. The GRUs have two layers and 500 hidden units. Our main sequence tokens (destinations)
have a cardinality of 50,000. The sequence-level home country feature is of size 250. For token-level
features there are 1,244 possible values for “days since last booking” and 31 different accommodation
types. The target is one of the 250 possible countries (or territories).
      </p>
      <p>[Figure: plot labelled with country names, including Brazil, Togo, Montenegro, Bosnia and Herzegovina, Macedonia, Albania, Serbia, South Sudan, San Marino, French Guiana, Andorra, Liechtenstein and Kenya.]</p>
      <p>Figure 6: Accommodation types.
Visualization of the token-level feature “accommodation type” from
the model in Fig. 4. Similar property types tend to be closer together
in the embedding space. Point labels include Guest_house, Hostel, Homestay, Capsule_Hotel, Student_accommodation and Riad.</p>
      <p>Table 1 shows the precision@k results for the models. Concatenation seems to perform better than
multiplication for both ways of inputting sequence-level features. All sequence recommenders do
significantly better than our naive baselines. Our featureless sequence recommender baseline also
significantly outperforms the naive baselines, but performs noticeably worse than the context-aware models.
On the other hand, the choice of inputting sequence-level features as items, although slightly better,
seems to matter much less in terms of model accuracy.</p>
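      <p>For reference, the metric can be sketched as below. The text does not spell out the exact formula, so this assumes that precision@k is the share of users whose single target country appears among the top k predictions. The function name and the toy data are hypothetical.</p>
      <preformat preformat-type="code">
```python
def precision_at_k(ranked_predictions, targets, k):
    """Share of users whose target is among their top-k predictions.

    Assumed definition: one target per user, a hit counts as 1.
    """
    hits = sum(
        1 for ranked, target in zip(ranked_predictions, targets)
        if target in ranked[:k]
    )
    return hits / len(targets)


# Hypothetical ranked country predictions for three users.
ranked = [
    ["Spain", "Italy", "Greece"],
    ["France", "Spain", "Italy"],
    ["Italy", "Greece", "France"],
]
actual = ["Spain", "Italy", "Spain"]

p_at_1 = precision_at_k(ranked, actual, 1)   # one hit out of three users
p_at_3 = precision_at_k(ranked, actual, 3)   # two hits out of three users
```
      </preformat>
      <p>Under this definition the metric cannot decrease as k grows, which is consistent with the larger values in the second column of Table 1.</p>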
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION</title>
      <p>We have looked at simple ways of merging item and feature embeddings in RNN-based
recommendation problems where sequence items and prediction items take on values from different spaces. Our
main conclusion is that for simple RNN-based sequence models concatenating features seems to work
better than merging them element-wise, while the choice of how we input our sequence-level features
(as extra items or as token features) matters less. Despite the limited scope of our study, we believe it
will help guide machine learning practitioners in designing more effective architectures that are
able to incorporate both sequence- and item-level context into RNN-based recommender systems.
We have only analyzed the case of a single sequence-level feature of relatively small cardinality. In
follow-up work it would be beneficial to look at more general cases of multiple sequence-level features
and various strategies to fuse them together, along with item-level information. It is also important
to look at more complex merge functions, such as feedforward neural networks or bilinear forms, in
future research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Alex</given-names> <surname>Beutel</surname></string-name>,
          <string-name><given-names>Paul</given-names> <surname>Covington</surname></string-name>,
          <string-name><given-names>Sagar</given-names> <surname>Jain</surname></string-name>,
          <string-name><given-names>Can</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>Jia</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Vince</given-names> <surname>Gatto</surname></string-name>, and
          <string-name><given-names>Ed H.</given-names> <surname>Chi</surname></string-name>.
          <year>2018</year>.
          <article-title>Latent Cross: Making Use of Context in Recurrent Recommender Systems</article-title>.
          In <source>Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18)</source>.
          <publisher-name>ACM</publisher-name>, <publisher-loc>New York, NY, USA</publisher-loc>,
          <fpage>46</fpage>-<lpage>54</lpage>.
          https://doi.org/10.1145/3159652.3159727
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Kyunghyun</given-names> <surname>Cho</surname></string-name>,
          <string-name><given-names>Bart</given-names> <surname>van Merriënboer</surname></string-name>,
          <string-name><given-names>Caglar</given-names> <surname>Gulcehre</surname></string-name>,
          <string-name><given-names>Dzmitry</given-names> <surname>Bahdanau</surname></string-name>,
          <string-name><given-names>Fethi</given-names> <surname>Bougares</surname></string-name>,
          <string-name><given-names>Holger</given-names> <surname>Schwenk</surname></string-name>, and
          <string-name><given-names>Yoshua</given-names> <surname>Bengio</surname></string-name>.
          <year>2014</year>.
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>.
          In <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>.
          <publisher-name>Association for Computational Linguistics</publisher-name>, <publisher-loc>Doha, Qatar</publisher-loc>,
          <fpage>1724</fpage>-<lpage>1734</lpage>.
          https://doi.org/10.3115/v1/D14-1179
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Sepp</given-names> <surname>Hochreiter</surname></string-name> and
          <string-name><given-names>Jürgen</given-names> <surname>Schmidhuber</surname></string-name>.
          <year>1997</year>.
          <article-title>Long Short-Term Memory</article-title>.
          <source>Neural Comput.</source>
          <volume>9</volume>, <issue>8</issue> (Nov. 1997),
          <fpage>1735</fpage>-<lpage>1780</lpage>.
          https://doi.org/10.1162/neco.1997.9.8.1735
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, and
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>.
          <year>2016</year>.
          <article-title>Context-Aware Sequential Recommendation</article-title>.
          In <source>2016 IEEE 16th International Conference on Data Mining (ICDM)</source>.
          <fpage>1053</fpage>-<lpage>1058</lpage>.
          https://doi.org/10.1109/ICDM.2016.0135
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Massimo</given-names> <surname>Quadrana</surname></string-name>,
          <string-name><given-names>Paolo</given-names> <surname>Cremonesi</surname></string-name>, and
          <string-name><given-names>Dietmar</given-names> <surname>Jannach</surname></string-name>.
          <year>2018</year>.
          <article-title>Sequence-Aware Recommender Systems</article-title>.
          <source>ACM Comput. Surv.</source>
          <volume>51</volume>, <issue>4</issue>, Article 66 (July 2018), 36 pages.
          https://doi.org/10.1145/3190616
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Elena</given-names> <surname>Smirnova</surname></string-name> and
          <string-name><given-names>Flavian</given-names> <surname>Vasile</surname></string-name>.
          <year>2017</year>.
          <article-title>Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks</article-title>.
          In <source>Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017)</source>.
          <publisher-name>ACM</publisher-name>, <publisher-loc>New York, NY, USA</publisher-loc>,
          <fpage>2</fpage>-<lpage>9</lpage>.
          https://doi.org/10.1145/3125486.3125488
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>