=Paper=
{{Paper
|id=Vol-2431/paper3
|storemode=property
|title=Combining Context Features in Sequence-Aware Recommender Systems
|pdfUrl=https://ceur-ws.org/Vol-2431/paper3.pdf
|volume=Vol-2431
|authors=Sarai Mizrachi,Pavel Levin
|dblpUrl=https://dblp.org/rec/conf/recsys/MizrachiL19
}}
==Combining Context Features in Sequence-Aware Recommender Systems==
Sarai Mizrachi, Booking.com, Tel Aviv, Israel (sarai.mizrachi@booking.com)
Pavel Levin, Booking.com, Tel Aviv, Israel (pavel.levin@booking.com)
ABSTRACT
There are several important design choices that machine learning practitioners need to make when
incorporating predictors into RNN-based contextual recommender systems. A great deal of currently
reported findings about these decisions focuses on the setting where predicted items take on values
from the space of sequence items. This work provides an empirical evaluation of some straightforward
approaches to dealing with such problems on a real-world, large-scale prediction problem from the
travel domain, where predicted entities do not live in the space of sequence items.
CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Neural
networks.
KEYWORDS
recommender systems; recurrent neural networks; context-aware recommendations; sequence-aware
recommendations
INTRODUCTION
Many types of recommendation problems can be naturally viewed as sequence extension problems.
A classic example is a language model offering real-time next word recommendations while typing.
ACM RecSys 2019 Late-breaking Results, 16th-20th September 2019, Copenhagen, Denmark
Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International
(CC BY 4.0).
Similarly, when recommending a travel destination we can use the ordered sequence of previously
booked destinations as input.
However, some situations require predictions in an output space which is different from the input
space. A classic example from the field of natural language processing is document classification: the
document is represented by a sequence of words and the prediction happens in the space of possible
topics, intents or sentiments. In the travel domain we may want to recommend a country to visit next
based on the user's past history of accommodation bookings (cities, accommodation types, lengths of
stay, etc.). User history items take on different values from prediction items.
In both situations (sequence completion and different-domain prediction) recurrent neural networks
(RNNs), including their gated variants (e.g. GRU [2], LSTM [3]), are commonly used. Both problems
become more complex if we have token-level and/or sequence-level features that we want to factor in.
In our destination prediction example we could use a user-level feature such as the traveler's home country,
as well as booking-specific features (lengths of stay, time since last booking, etc.).
Figure 1: Sequence-level tokens embedded with items, concatenation. When we embed items and
sequence-level information in the same vector space, we need to "pad" sequence feature embeddings
to be of dimension $d_{items} + \sum_{i=1}^{I} d_{f_i}$, since that would be the expected RNN input
dimension. The simplest way to achieve this is by introducing extra "dummy" values for each token
feature, whose embeddings would be concatenated to the sequence-level feature embedding at the
beginning of the sequence.
This work focuses on the second type of sequence-aware recommendation problem, specifically
when we do not assume the predicted items to come from the same space as sequence items. We
look at several basic ways of incorporating context into RNN-based recommendation systems and
benchmark their performance.
THE SETUP
Our goal is to compare different approaches to account for token- and sequence-level context in RNN-
based recommendation systems. An important distinction of this work from much of the previously
reported results (e.g. [4], [5], [6]) is that we do not assume that the items in our sequence and the
predicted items come from the same space. This setup can be thought of as an RNN Encoder → Decoder,
where the decoder can be as simple as softmax regression in the case of a categorical prediction.
Each user's data point is represented by a sequence of $n_u$ items $x^{(u)} = x^{(u)}_{1:n_u}$,
$I$ item-level feature sequences $f^{(u)} = \{ f^{(u)}_{i,1:n_u} : i \in 1, \dots, I \}$ and
sequence-level features $s^{(u)}$. One comparison we do is between two popular techniques for fusing
embedded feature information $f^{(u)}_{k,1:n_u}$ with the item embeddings $x^{(u)}_k$ (concatenation
vs element-wise multiplication). Another issue we look at is how to best input sequence-level
information $s^{(u)}$: by fusing it at each time step with the items along with the token-level
features, or by embedding it in the items space and simply using it as additional tokens in the
sequence.
Figure 2: Sequence-level tokens embedded with items, multiplication. Unlike in the concatenation
example, we do not need to introduce dummy values for token features, or pad embeddings in any way,
since this approach forces all embeddings to be of the same size and we only perform element-wise
operations to merge them.
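To make the setup concrete, the per-user inputs can be pictured as parallel integer-encoded sequences plus one sequence-level value and a target label. The following is a minimal Python sketch of that layout under our own naming assumptions (`UserSequence`, `item_ids`, etc. are illustrative, not the authors' code):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UserSequence:
    """One user's data point: items x^(u), I token-level feature
    sequences f^(u), and a sequence-level feature s^(u)."""
    item_ids: List[int]           # x^(u)_{1:n_u}, e.g. destination city ids
    feature_ids: List[List[int]]  # f^(u)_{i,1:n_u} for i = 1..I (same length as item_ids)
    seq_feature_id: int           # s^(u), e.g. the traveler's home country
    target_id: int                # label, e.g. the next visited country

# A toy example with two token-level features (days-since-last-trip bucket,
# accommodation type) and n_u = 3 bookings.
example = UserSequence(
    item_ids=[1042, 7, 588],
    feature_ids=[[3, 0, 12], [5, 5, 2]],
    seq_feature_id=81,
    target_id=17,
)
```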
MODEL ARCHITECTURES
We do our comparisons using four different context-aware RNN-based models, one RNN model with
no context and two naive sequence-agnostic baselines.
The RNN architectures use a GRU cell as the base and softmax regression to decode the sequence
representation into a product prediction:
$h^{(u)}_{t+1} = \mathrm{GRU}(e_t, h_t)$
$\hat{y}^{(u)} = \mathrm{Softmax}(W h_{n_u})$
where $e_t$ is the RNN input for time step $t$ and $W$ is the linear model for multiclass item
prediction. The rest of this section will look into how exactly the RNN inputs $e_t$ should be derived.
Figure 3: Sequence-level tokens merged with token-level features, concatenation. Here we do not use
separate sequence tokens to encode sequence-level features. Instead we treat sequence features the
same way as token features and concatenate them to the item embeddings.
Baselines
We look at two simple baselines which do not use any sequence or past-history information, and
one sequence-aware baseline with no context features. The first baseline is recommending the most
popular items based on the last token of the sequence: $\hat{y}^{(u)} = \mathrm{argmax}_y P(y \mid x^{(u)}_{n_u})$.
The second baseline is recommending the most popular items according to the sequence-level features:
$\hat{y}^{(u)} = \mathrm{argmax}_y P(y \mid s^{(u)})$. Our third baseline is a standard GRU sequence
classification model with no context features.
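For reference, the third (items-only) baseline can be sketched as an embedding layer, a GRU, and softmax regression over target countries, matching $\hat{y}^{(u)} = \mathrm{Softmax}(W h_{n_u})$. This is a minimal PyTorch sketch under our own assumptions; the class and variable names are ours, and the layer sizes follow the dimensions reported later in the experiments section:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Items-only baseline: embed items, run a GRU, decode the last
    hidden state with softmax regression over target classes."""
    def __init__(self, n_items, n_targets, d_items=50, d_hidden=500, n_layers=2):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_items)
        self.gru = nn.GRU(d_items, d_hidden, num_layers=n_layers, batch_first=True)
        self.decoder = nn.Linear(d_hidden, n_targets)  # W in the softmax regression

    def forward(self, item_ids):
        # item_ids: (batch, n_u) integer-encoded destination sequence
        e = self.item_emb(item_ids)   # (batch, n_u, d_items)
        out, _ = self.gru(e)          # (batch, n_u, d_hidden)
        h_last = out[:, -1, :]        # h_{n_u}
        return self.decoder(h_last)   # logits; softmax applied inside the loss

model = GRUClassifier(n_items=50_000, n_targets=250)
logits = model(torch.randint(0, 50_000, (4, 6)))  # 4 users, 6 bookings each
```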
Embedding sequence-level features in the items space
The idea behind this approach is that sequence- or user-level features can be simply treated as extra
tokens in the sequence. This means that in our RNN architectures those features should be represented
in the same space as the items, i.e. we need a single embedding matrix $E \in \mathbb{R}^{(K+M) \times d_{items}}$
to represent $K$ items and $M$ levels of sequence-level features in $d_{items}$ dimensions. All
token-level features would still be embedded in separate vector spaces and represented by matrices
$E_j \in \mathbb{R}^{|F_j| \times d_{F_j}}$, where $d_{F_j}$ is the embedding dimension of feature $j$
and $|F_j|$ is its cardinality. The following two approaches describe how we merge token-level
embeddings with item embeddings.
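In code, this amounts to keeping one embedding table of size $(K+M) \times d_{items}$ and prepending the sequence-level feature, offset by $K$, as the first token of each sequence. A hypothetical sketch of just that indexing trick (the offset convention and names are our assumptions):

```python
import torch
import torch.nn as nn

K, M, d_items = 50_000, 250, 50             # destinations, home countries, embedding dim
shared_emb = nn.Embedding(K + M, d_items)   # E in R^{(K+M) x d_items}

def with_seq_token(item_ids, seq_feature_id):
    """Prepend the sequence-level feature (e.g. home country) as an extra
    token; its ids are shifted by K so they index the shared table."""
    seq_token = seq_feature_id + K                      # (batch,)
    return torch.cat([seq_token.unsqueeze(1), item_ids], dim=1)

item_ids = torch.randint(0, K, (4, 6))                  # 4 users, 6 bookings
home_country = torch.randint(0, M, (4,))
tokens = with_seq_token(item_ids, home_country)         # (4, 7)
embedded = shared_emb(tokens)                           # (4, 7, 50) -> RNN input
```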
Figure 4: Sequence-level tokens merged with token-level features, multiplication. When we merge more
than two embeddings through multiplication, we first sum all feature embeddings together and then
element-wise multiply the result with the item embeddings.
Concatenation merge. One of the more popular and straightforward approaches for merging item and
feature embeddings is simply by concatenating them (see Fig. 1).
$h_{k+1} = \mathrm{RNN}(\mathrm{concat}(x^{(u)}_k, f^{(u)}_{1,k}, \dots, f^{(u)}_{I,k}), h_k)$
One obvious advantage of this approach is the ability to choose different embedding dimensions for
each feature $F_i$ according to its cardinality and distribution of values.
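A minimal sketch of the concatenation merge, assuming one embedding table per token-level feature: at each step the item embedding and all feature embeddings are concatenated, so the GRU input size becomes $d_{items} + \sum_j d_{F_j}$. Sizes and names below are illustrative only:

```python
import torch
import torch.nn as nn

d_items, d_f = 50, [16, 8]                    # item dim and per-feature dims (illustrative)
item_emb = nn.Embedding(50_000, d_items)
feat_embs = nn.ModuleList([nn.Embedding(1_244, d_f[0]),   # days since last booking
                           nn.Embedding(31, d_f[1])])     # accommodation type
gru = nn.GRU(d_items + sum(d_f), 500, num_layers=2, batch_first=True)

def concat_inputs(item_ids, feature_ids):
    # item_ids: (batch, n_u); feature_ids: list of I tensors, each (batch, n_u)
    parts = [item_emb(item_ids)] + [emb(f) for emb, f in zip(feat_embs, feature_ids)]
    return torch.cat(parts, dim=-1)           # e_k = concat(x_k, f_{1,k}, ..., f_{I,k})

e = concat_inputs(torch.randint(0, 50_000, (4, 6)),
                  [torch.randint(0, 1_244, (4, 6)), torch.randint(0, 31, (4, 6))])
out, _ = gru(e)                               # (4, 6, 500)
```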
Multiplication merge. Another popular way of fusing embeddings is through element-wise multiplication
(Fig. 2). This approach forces us to have $d_{items} = d_{F_j}$ for $j \in \{1, \dots, I\}$. In the
case when $I > 1$, i.e. we have more than one token-level feature, we follow [1] and first apply
element-wise summation to all features, and only then element-wise multiply the result with the item
embedding.
$h_{k+1} = \mathrm{RNN}(x^{(u)}_k \odot [f^{(u)}_{1,k} + \dots + f^{(u)}_{I,k}], h_k)$
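The multiplication merge can be sketched in the same style: all embeddings share one dimension, the feature embeddings are summed, and the result is multiplied element-wise with the item embedding, in the spirit of the latent-cross construction of [1]. Again, names and sizes are illustrative:

```python
import torch
import torch.nn as nn

d = 50                                        # shared embedding dimension
item_emb = nn.Embedding(50_000, d)
feat_embs = nn.ModuleList([nn.Embedding(1_244, d), nn.Embedding(31, d)])
gru = nn.GRU(d, 500, num_layers=2, batch_first=True)

def multiply_inputs(item_ids, feature_ids):
    # e_k = x_k * (f_{1,k} + ... + f_{I,k}), element-wise
    feat_sum = sum(emb(f) for emb, f in zip(feat_embs, feature_ids))
    return item_emb(item_ids) * feat_sum

e = multiply_inputs(torch.randint(0, 50_000, (4, 6)),
                    [torch.randint(0, 1_244, (4, 6)), torch.randint(0, 31, (4, 6))])
out, _ = gru(e)                               # (4, 6, 500)
```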
Table 1: Precision@k results for baselines and proposed models

                                               Prec@1   Prec@4   Prec@10
Baselines
  Last item only                               0.206    0.460    0.686
  Seq features only                            0.196    0.481    0.714
  Items only (no features)                     0.608    0.788    0.889
Seq-level features as seq items
  Concatenation                                0.657    0.823    0.912
  Multiplication                               0.648    0.811    0.904
Seq-level features as token-level features
  Concatenation                                0.656    0.822    0.911
  Multiplication                               0.644    0.808    0.902

Fusing sequence-level features with the items
Another approach toward sequence-level features that we consider is treating them as additional
token-level features. Of course, since sequence-level features do not change across time steps, we
merge the same values to each sequence item. As before, we consider two basic merge functions:
concatenation and element-wise multiplication.
Concatenation merge. The only difference here from the concatenation model above is that now we
concatenate an extra feature embedding to our items (Fig. 3). This lets us have shorter sequences, but
the input dimension of our RNN needs to be bigger.
Multiplication merge. In this scenario (Fig. 4) all embedding dimensions need to be equal. As before, we
first sum the feature embedding vectors and then element-wise multiply them with item embeddings.
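When a sequence-level feature is treated as an extra token-level feature, its embedding is simply repeated across all time steps before being concatenated with (or multiplied into) the per-step item embeddings. A small hypothetical helper shows the broadcasting step:

```python
import torch
import torch.nn as nn

d_seq = 50
seq_emb = nn.Embedding(250, d_seq)               # e.g. home country

def broadcast_seq_feature(seq_feature_id, n_steps):
    """Repeat the sequence-level embedding across all n_u time steps so it
    can be concatenated (or multiplied) with per-step item embeddings."""
    s = seq_emb(seq_feature_id)                  # (batch, d_seq)
    return s.unsqueeze(1).expand(-1, n_steps, -1)  # (batch, n_steps, d_seq)

s_steps = broadcast_seq_feature(torch.randint(0, 250, (4,)), n_steps=6)
print(s_steps.shape)   # torch.Size([4, 6, 50])
```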
DATASET AND EXPERIMENTS
We run our experiments on a proprietary travel dataset of 30 million travellers from 250 different
countries or territories, sampled from the last 3.5 years. All users in our dataset made at least three
international bookings. To benchmark the performance of our approaches we predict the user's last visited
country based on a sequence of visited destinations (cities, villages, etc.). The gap between the last
booking in the sequence and the target (last visited country) is at least one week.
We used one sequence-level feature, the traveler's home country, and two token-level features (days
since last trip, accommodation type). Our sequence-level feature clearly takes on different values
from our items (countries vs cities); however, both are geographical entities, so it is unclear a priori
whether embedding them in the same vector space would hurt the performance. We
evaluate how models perform by measuring precision@k (k ∈ {1, 4, 10}) on data from an additional
1,500,000 users. To be consistent we use an embedding dimension of 50 for all items and features in all
models. The GRUs have two layers and 500 hidden units. Our main sequence tokens (destinations)
have a cardinality of 50,000. The sequence-level home country feature is of size 250. For token-level
features there are 1,244 possible values for "days since last booking" and 31 different accommodation
types. The target is one of the 250 possible countries (or territories).
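For orientation, the reported model dimensions can be collected into a single configuration; the dictionary layout below is only an illustrative convention, not the authors' code:

```python
# Experiment configuration as reported in the paper; the dict layout itself
# is only an illustrative convention.
config = {
    "n_items": 50_000,        # destination (city/village) vocabulary
    "n_targets": 250,         # countries or territories to predict
    "n_seq_feature": 250,     # home country (sequence-level feature)
    "n_days_since": 1_244,    # "days since last booking" values
    "n_accommodation": 31,    # accommodation types
    "d_embedding": 50,        # embedding dimension for all items and features
    "gru_layers": 2,
    "gru_hidden": 500,
    "metric_ks": (1, 4, 10),  # precision@k cutoffs
}
```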
Table 1 shows the precision@k results for the models. Concatenation seems to perform better than
multiplication for both ways of inputting sequence-level features. All sequence recommenders do
significantly better than our naive baselines. Our featureless sequence recommender baseline also
significantly outperforms the naive baselines, but is noticeably worse than the context-aware models.
On the other hand, the choice of inputting sequence-level features as items, although slightly better,
seems to matter much less in terms of model accuracy.
DISCUSSION
We have looked at simple ways of merging item and feature embeddings in RNN-based recommendation
problems where sequence items and prediction items take on values from different spaces. Our
main conclusion is that for simple RNN-based sequence models concatenating features seems to work
better than merging them element-wise, while our choice of how we input our sequence-level features
(as extra items or as token features) matters less. Despite the limited scope of our study, we believe it
will help to guide machine learning practitioners in designing more effective architectures that are
able to incorporate both sequence- and item-level context into RNN-based recommender systems.
We have only analyzed the case of a single sequence-level feature of relatively small cardinality. In
follow-up work it would be beneficial to look at more general cases of multiple sequence-level features
and various strategies to fuse them together, along with item-level information. It is also important
to look at more complex merge functions, such as feedforward neural networks or bilinear forms, in
future research.
Figure 5: Country embeddings. Country embeddings as sequence-level features from the model in
Fig. 4, visualized in 2D. The model successfully learned reasonable clustering of countries without any
direct supervision related to their locations.
Figure 6: Accommodation types. Visualization of the token-level feature "accommodation type" from
the model in Fig. 4. Similar property types tend to be closer together in the embedding space.
REFERENCES
[1] Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H. Chi. 2018. Latent Cross: Making Use of
Context in Recurrent Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search
and Data Mining (WSDM '18). ACM, New York, NY, USA, 46–54. https://doi.org/10.1145/3159652.3159727
[2] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for
Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780.
https://doi.org/10.1162/neco.1997.9.8.1735
[4] Q. Liu, S. Wu, D. Wang, Z. Li, and L. Wang. 2016. Context-Aware Sequential Recommendation. In 2016 IEEE 16th International
Conference on Data Mining (ICDM). 1053–1058. https://doi.org/10.1109/ICDM.2016.0135
[5] Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. ACM Comput.
Surv. 51, 4, Article 66 (July 2018), 36 pages. https://doi.org/10.1145/3190616
[6] Elena Smirnova and Flavian Vasile. 2017. Contextual Sequence Modeling for Recommendation with Recurrent Neural
Networks. In Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). ACM, New York,
NY, USA, 2–9. https://doi.org/10.1145/3125486.3125488