=Paper=
{{Paper
|id=Vol-2855/challenge_short_2
|storemode=property
|title=Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations
|pdfUrl=https://ceur-ws.org/Vol-2855/challenge_short_2.pdf
|volume=Vol-2855
|authors=Benedikt Schifferer,Chris Deotte,Jean-Francois Puget,Gabriel de Souza Pereira Moreira,Gilberto Titericz, Jiwei Liu,Ronay Ak
|dblpUrl=https://dblp.org/rec/conf/wsdm/SchiffererDPMTL21
}}
==Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations==
Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations

Benedikt Schifferer∗ (bschifferer@nvidia.com), NVIDIA, New York City, New York, United States
Chris Deotte∗ (cdeotte@nvidia.com), NVIDIA, San Diego, California, United States
Jean-François Puget∗ (jpuget@nvidia.com), NVIDIA France Developpement, Saint Raphael, France
Gabriel de Souza Pereira Moreira∗ (gmoreira@nvidia.com), NVIDIA, São Paulo, Brazil
Gilberto Titericz∗ (gtitericz@nvidia.com), NVIDIA, Curitiba, Parana, Brazil
Jiwei Liu∗ (jiweil@nvidia.com), NVIDIA, Pittsburgh, Pennsylvania, United States
Ronay Ak∗ (ronaya@nvidia.com), NVIDIA, Sarasota, Florida, United States
ABSTRACT
In this paper we present our 1st place solution of the WSDM WebTour 21 Challenge. The competition task was to predict the city of the last booked hotel in a user trip, based on the previously visited cities. For our final solution, we designed three deep learning architectures based on Multilayer Perceptron (MLP), Gated Recurrent Units (GRU) and XLNet (Transformer) building blocks. A shared component among the architectures was a Session-based Matrix Factorization head (SMF), which learns a linear mapping between the item (city) embeddings and the session (trip) embeddings to generate recommendations by a dot product operation. Our final leaderboard result was 0.5939 for precision@4, 2.8% better than the second-place solution. We published our implementation, using RAPIDS cuDF, TensorFlow and PyTorch, on github.com1.

CCS CONCEPTS
• Information systems → Personalization; Recommender systems; Learning to rank.

KEYWORDS
Recommender Systems, WSDM WebTour 21 Challenge, Deep Learning, Sequential recommendation, MLP, GRU, Transformer, Matrix Factorization

∗ Authors contributed equally to this research.
1 https://github.com/rapidsai/deeplearning/tree/main/WSDM2021

1 INTRODUCTION
Recommender systems (RecSys) have become a key component in many online services, such as e-commerce, social media, news services, and online video streaming. Sequence-based recommendation is a highly relevant task in real-world applications, like predicting the next item in an online shopping cart, the next online course a student will choose, or the next travel destination of a traveller.

In this paper, we present our 1st place solution for the Booking.com WSDM WebTour 21 Challenge organized by Booking.com for the WebTour workshop at ACM WSDM 2021 [1]. The task was to recommend (predict) the final city destination of a traveller's trip given their previous booking history within the trip. The competition evaluation metric was Precision@4, which scored true only if the correct next city was among the top-4 recommendations (regardless of the order) out of the about 40,000 cities found in the dataset.

Recent RecSys competitions have often required identifying the positive interactions given a set of positives and negatives (e.g. click-through rate (CTR) prediction), which shifts the task from generating recommendations to a binary classification problem. In that setting, the focus is often on creating features to separate positive from negative examples rather than on model design [4]. Boosted trees algorithms often performed well in such binary classification tasks, as observed in the last RecSys Challenge competitions [3] [5].

Differently from previous RecSys competitions, this dataset contained only positive examples, and deep learning based solutions worked best for us, challenging the status quo highlighted in [4].

For our final solution, we developed three different neural architectures: (1) MLP with Session-based Matrix Factorization net (MLP-SMF), (2) GRU with MultiStage Session-based Matrix Factorization net (GRU-MS-SMF), and (3) XLNet with Session-based Matrix Factorization net (XLNet-SMF). The final submission was a simple ensemble (weighted average) of the predictions from models of each of those three architectures.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 PREPROCESSING AND FEATURE ENGINEERING
The competition dataset from the Booking.com challenge has a tabular data structure, consisting of 1.5M anonymized reservations from 269k trips. Each reservation is part of a customer's trip (identified by utrip_id), which includes at least 4 consecutive reservations. A reservation contains the metadata and city/country information but does not include information about the hotel. Thus, the sequence of trip reservations can be obtained by sorting on the check-in date.

2.1 Data splitting for Cross-Validation
Unlike most competitions, this competition allowed only two submissions per team: an intermediate one and the final one. It was crucial to design a good validation set, because it would be key to selecting the best features and models for the final submission. We used k-fold cross-validation, predicting the last city of the trips on the train set split into 5 folds. For each fold, the fold train data was used as the validation set, and data (both train and test) from the other folds (Out-Of-Fold, OOF) was used to train the model. We used the full OOF dataset, predicting the next city given the previous ones. The cross-validation (CV) score was the average precision@4 over the five folds.
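This trip-level split can be sketched in a few lines. The following is a minimal illustration (not the competition code), assuming a pandas-style DataFrame `df` with a `utrip_id` column:

```python
# Sketch of the 5-fold split at the trip level, so all reservations of a
# trip land in the same fold. Fold assignment details are illustrative.
import pandas as pd
from sklearn.model_selection import KFold

def assign_folds(df: pd.DataFrame, n_folds: int = 5, seed: int = 0) -> pd.DataFrame:
    trips = df["utrip_id"].unique()
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_of_trip = {}
    for fold, (_, valid_idx) in enumerate(kf.split(trips)):
        for trip in trips[valid_idx]:
            fold_of_trip[trip] = fold
    df = df.copy()
    df["fold"] = df["utrip_id"].map(fold_of_trip)
    return df

# For fold k: validate on trips in fold k (predicting their last city) and
# train on out-of-fold (OOF) data from the remaining folds.
```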
2.2 Feature Engineering
The original dataset has 9 features: user_id, utrip_id, checkin, checkout, city_id (label), hotel_country, device_class, affiliate_id and booker_country. For the last reservation of each trip in the test set, city_id and hotel_country were not provided.
In our initial experiments, building new features improved the accuracy of our models. The main features included in our models are listed in Table 2 in Appendix A. All features that used information from other rows (e.g. user features, city features) were computed OOF, to avoid leakage from the fold labels. Depending on the architecture, a subset of the features was used.
We implemented our preprocessing and feature engineering pipeline with RAPIDS cuDF, a library for GPU-accelerated dataframe transformations.
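As an illustration, two of the date features from Table 2 could be computed with the pandas-compatible cuDF API roughly as follows. This is a simplified sketch, not our full pipeline; only the column names follow the original dataset:

```python
# Sketch of date-based feature engineering on GPU with RAPIDS cuDF.
import cudf

def add_date_features(df: cudf.DataFrame) -> cudf.DataFrame:
    df = df.sort_values(["utrip_id", "checkin"])
    # stay length (checkout - checkin), in days
    df["stay_length"] = (df["checkout"] - df["checkin"]).dt.days
    # days since last booking (checkin - previous checkout within the trip)
    prev_checkout = df.groupby("utrip_id")["checkout"].shift(1)
    df["days_since_last_booking"] = (df["checkin"] - prev_checkout).dt.days
    df["checkin_dow"] = df["checkin"].dt.dayofweek
    df["checkin_month"] = df["checkin"].dt.month
    return df
```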
2.3 Data augmentation with reversed trips
The dataset with 1.5M bookings is relatively small in comparison to other recommendation datasets. We explored techniques to increase the training data by data augmentation and discovered that doubling the dataset with reversed trips improved the models' accuracy. A trip is an ordered sequence of cities, and although there are many permutations in which a set of cities could be visited, there is a logical ordering implied by the distances between cities and the available transportation connections. This ordering is symmetric: it remains plausible when reversed. For example, a trip of Boston->New York->Princeton->Philadelphia->Washington DC can be booked in reverse order, but not many people would book a random order like Boston->Princeton->Washington DC->New York->Philadelphia.
2.4 Low frequency cities hashing
The distribution of city reservation frequency was long-tailed, as one would expect: some cities are much more popular for tourism and business than others. To keep the models from focusing too much on very unpopular cities (city reservation frequency < 9 in our case), we assigned their city ids to a single value for training and ensured that those cities were never ranked high during inference. This approach reduced the cardinality of the cities from more than 44,000 to about 11,000 (see Figure 4 in Appendix B).
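A sketch of this remapping, assuming a pandas-style DataFrame; the shared id of -1 for rare cities is an illustrative choice:

```python
# Sketch of mapping low-frequency cities (< 9 reservations) to one shared id.
import pandas as pd

def remap_rare_cities(df: pd.DataFrame, min_count: int = 9):
    counts = df["city_id"].value_counts()
    rare = set(counts[counts < min_count].index)
    rare_token = -1  # single shared id for all low-frequency cities (illustrative)
    df = df.copy()
    df["city_id_mapped"] = df["city_id"].where(~df["city_id"].isin(rare), rare_token)
    # `rare` is also used at inference time to keep these cities out of the top-4
    return df, rare
```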
3 MODEL ARCHITECTURES
This section describes the final three neural architectures designed for this competition, which were ensembled for the final submission. Each architecture can be trained with a single NVIDIA V100 GPU. As the cardinality of the label (city) is not large, all models treated the recommendation as a multi-class classification problem, using a softmax cross-entropy loss function. A common component of all architectures was the Session-based Matrix Factorization (SMF) head, which is described in the next section.

3.1 Session-based Matrix Factorization head (SMF)
Framing this problem under the recommender systems taxonomy, the cities are the items we want to recommend for a trip. The trip is analogous to a session in the session-based recommendation task, which is generally a sequence of user interactions, hotel reservations in this case.
For this competition, we designed a Session-based Matrix Factorization (SMF) head to produce the scores (logits) for all items, instead of a regular Multi-Layer Perceptron (MLP) output layer. This design was inspired by MF-based Collaborative Filtering, which learns latent factors for users and items by performing a dot product of their embeddings to predict the relevance of an item for a user.
The large majority of users had only a single trip available in the dataset. So, instead of trying to model the user preference, we used the last layer of the network to represent the session (trip) embedding. Then, we compute the dot product between the session embedding s and the set I of item embeddings i, where i ∈ I, to model the probability distribution r of each item (city) being the next one for that session (trip), such that r = softmax(s · I).
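A minimal PyTorch sketch of the SMF head under this formulation; dimensions are illustrative, and the softmax is applied inside the cross-entropy loss:

```python
# Session-based Matrix Factorization (SMF) head: dot product between the
# trip (session) embedding and every city embedding, r = softmax(s · I).
import torch
import torch.nn as nn

class SMFHead(nn.Module):
    def __init__(self, num_cities: int, dim: int):
        super().__init__()
        self.city_embeddings = nn.Embedding(num_cities, dim)

    def forward(self, session_embedding: torch.Tensor) -> torch.Tensor:
        # session_embedding: (batch, dim); weight: (num_cities, dim)
        logits = session_embedding @ self.city_embeddings.weight.T
        return logits  # (batch, num_cities); softmax is inside the loss

# usage sketch:
# head = SMFHead(num_cities=11000, dim=512)
# loss = nn.CrossEntropyLoss()(head(trip_emb), next_city_ids)
```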
3.2 MLP with Session-based Matrix Factorization head (MLP-SMF)
The MLP with Session-based Matrix Factorization head uses feed-forward and embedding layers, as shown in Figure 1. Categorical input features are fed through an embedding layer, and continuous input features are individually projected to embeddings via a linear layer, followed by batch normalization and a Rectified Linear Unit (ReLU) activation. All embedding dimensions are made equal. The embeddings of continuous and categorical input features, except the lag features, are combined via summation. The output is concatenated with the embeddings of the 5 last cities and countries (lag features).
The embedding tables for the city lags are shared, and similarly for the hotel country lags. The lag embeddings are concatenated, but the model should still be able to learn the sequential patterns of cities from the order of the lag features, i.e., city lag1's embedding vector is always in the same position of the concatenated vector.
The concatenated vector is fed through 3 feed-forward layers with batch normalization, Parametric Rectified Linear Unit (PReLU) activation and dropout, to form the session (trip) embedding. It is used by the Session-based Matrix Factorization head to produce the scores for all cities. More details can be found in Appendix C.

Figure 1: Visualization of MLP with Session-based Matrix Factorization head architecture.
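An illustrative PyTorch sketch of the MLP-SMF forward pass as described above. The hyperparameters, the number of features, and the reuse of the lag city-embedding table for the output dot product are assumptions; country lags and continuous features are omitted for brevity:

```python
# Simplified MLP-SMF: summed feature embeddings, concatenated shared
# city-lag embeddings, three feed-forward layers, then the SMF dot product.
import torch
import torch.nn as nn

class MLPSMF(nn.Module):
    def __init__(self, num_cities=11000, num_feat=10, feat_card=100,
                 dim=128, hidden=1024):
        super().__init__()
        self.feat_emb = nn.ModuleList(
            [nn.Embedding(feat_card, dim) for _ in range(num_feat)]
        )
        self.city_emb = nn.Embedding(num_cities, dim)  # shared across the 5 lags
        layers, in_dim = [], dim + 5 * dim
        for _ in range(3):
            layers += [nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden),
                       nn.PReLU(), nn.Dropout(0.2)]
            in_dim = hidden
        self.mlp = nn.Sequential(*layers)
        self.to_session = nn.Linear(hidden, dim)  # project into city-embedding space

    def forward(self, feats, city_lags):
        # feats: (batch, num_feat) ints; city_lags: (batch, 5) last visited cities
        summed = torch.stack(
            [emb(feats[:, i]) for i, emb in enumerate(self.feat_emb)], dim=0
        ).sum(dim=0)                                  # (batch, dim)
        lags = self.city_emb(city_lags).flatten(1)    # (batch, 5*dim), order kept
        session = self.to_session(self.mlp(torch.cat([summed, lags], dim=1)))
        return session @ self.city_emb.weight.T       # logits over all cities
```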
3.3 GRU with MultiStage Session-based Matrix Factorization head (GRU-MS-SMF)
The GRU with MultiStage Session-based Matrix Factorization head uses a GRU cell for embedding the historical user actions (previous 5 visited cities), similar to GRU4Rec [2]. Missing values (sequences shorter than 5) are padded with 0s.
The last GRU hidden state is concatenated with the embeddings of the other categorical input features, as shown in Figure 2. The embedding tables for the previous cities are shared. The model uses only categorical input features, including some numerical features modeled as embeddings, such as trip length and reservation order.
The concatenated embeddings are fed through a MultiStage Session-based Matrix Factorization head. The first stage is a softmax head over all items, which is used to select the top-50 cities for the next stage. In the second stage, the Session-based Matrix Factorization head is applied using only the top-50 cities of the first stage and two representations of the trip embedding (the outputs of the last and second-to-last MLP layers, after the concatenation), resulting in two additional heads. The final output is a weighted sum of all three heads, with trainable weights.
The multi-stage head works as a 2-stage ranking problem. The first head ranks all items, and the other two heads can focus on reranking the top-50 items from the first stage. This approach can potentially scale to large item catalogs, i.e., in the order of millions of items. This dataset did not require such scalability, but this multi-stage design might be effective for deployment in production. Training details can be found in Appendix C.

Figure 2: Visualization of GRU with MultiStage Session-based Matrix Factorization head architecture.
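A sketch of one way to realize this multi-stage head in PyTorch. Combining the three heads by scattering the reranking scores back onto the stage-1 logits is our illustrative choice, not necessarily the exact implementation:

```python
# Two-stage head: a softmax head over all cities selects top-50 candidates,
# then SMF-style dot products rescore only those candidates; the three heads
# are combined with trainable weights.
import torch
import torch.nn as nn

class MultiStageSMFHead(nn.Module):
    def __init__(self, num_cities: int, dim: int, top_k: int = 50):
        super().__init__()
        self.top_k = top_k
        self.stage1 = nn.Linear(dim, num_cities)   # head over all items
        self.city_emb = nn.Embedding(num_cities, dim)
        self.head_weights = nn.Parameter(torch.ones(3))

    def forward(self, sess_a: torch.Tensor, sess_b: torch.Tensor) -> torch.Tensor:
        # sess_a / sess_b: trip representations from the last and
        # second-to-last MLP layers, both (batch, dim)
        logits1 = self.stage1(sess_a)                    # (batch, num_cities)
        _, top_idx = logits1.topk(self.top_k, dim=1)     # candidate cities
        cand = self.city_emb(top_idx)                    # (batch, k, dim)
        logits2 = (cand * sess_a.unsqueeze(1)).sum(-1)   # rescore candidates
        logits3 = (cand * sess_b.unsqueeze(1)).sum(-1)
        w = self.head_weights
        out = w[0] * logits1
        out.scatter_add_(1, top_idx, w[1] * logits2 + w[2] * logits3)
        return out
```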
3.4 XLNet with Session-based Matrix Factorization head (XLNet-SMF)
The XLNet with Session-based Matrix Factorization head (XLNet-SMF) uses a Transformer architecture named XLNet [7], originally proposed for the permutation-based language modeling task in Natural Language Processing (NLP). In our case, instead of modeling the sequence of word tokens, we model the sequence of items in the session (trip).
We adapted the XLNet training task to Masked Language Modeling (also known as the Cloze task), as proposed by BERT4Rec [6] for sequential recommendation. In that approach, for each training step, a proportion of all items are masked from the input sequence (i.e., replaced with a single trainable embedding), and then the original ids of the masked items are predicted using the other items of the sequence, from both the left and right sides. When a masked item is not the last one, this approach allows the usage of privileged information from future reservations in the trip during training. Therefore, during inference, only the last item of the sequence is masked, to match the sequential recommendation task and not leak future information about the trip [6].
For this network, each example is a trip, represented as a sequence of its reservations. The reservation embedding is generated by concatenating the features and projecting them using MLP layers.
The sequence of reservation embeddings is fed to the stacked XLNet Transformer blocks, in which the output of each block is the input of the next block. The final Transformer block output generates the trip embedding.
Finally, we use the Matrix Factorization head (dot product between the trip embedding and the city embeddings) to produce a probability distribution over the cities for the masked items in the sequence. Model hyperparameters can be found in Appendix C.

Figure 3: Visualization of XLNet with Session-based Matrix Factorization head architecture.
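The masking scheme can be sketched as follows; this is an illustration assuming padded sequences and the 10% masking probability from Appendix C:

```python
# Masking for training vs. inference: random ~10% of items per trip during
# training (at least one), only the last item at inference time.
import torch

MASK_PROB = 0.1

def build_masks(seq_lens: torch.Tensor, max_len: int, training: bool) -> torch.Tensor:
    # seq_lens: (batch,) number of reservations per trip; returns a boolean
    # (batch, max_len) mask where True marks positions to predict.
    positions = torch.arange(max_len).unsqueeze(0)      # (1, max_len)
    valid = positions < seq_lens.unsqueeze(1)           # ignore padding
    last_pos = (seq_lens - 1).clamp(min=0)
    if training:
        mask = (torch.rand(seq_lens.size(0), max_len) < MASK_PROB) & valid
        # guarantee at least one masked item per trip: mask the last item
        # of any trip that got none
        none_masked = ~mask.any(dim=1)
        mask[none_masked, last_pos[none_masked]] = True
    else:
        mask = torch.zeros_like(valid)
        mask[torch.arange(seq_lens.size(0)), last_pos] = True
    return mask
```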
4 ENSEMBLING
Ensembling is a proven approach to improve the accuracy of models, by combining their predictions and improving generalization.
We used k-fold cross-validation and bagging (training the model many times with different random seeds, in our case) to ensemble models from our three architectures, as shown in Algorithm 1 in Appendix D.
In general, the higher the diversity of the models' predictions, the more an ensembling technique can potentially improve the final scores. In our case, the correlation of the predicted city scores between each pair of the three architectures was around 80%, which resulted in a meaningful improvement of our final CV score with ensembling.
In addition to ensembling the three models by averaging their predictions, we also tried to add another ensembling layer: a model to rerank the top-20 cities from the averaged predictions. That insight came from the fact that the precision@20 of our averaged predictions was 0.777, a much higher score than for the competition metric precision@4 (around 0.57) (see Figure 5 in Appendix E). The reranker was a gradient boosted decision trees model for binary classification of cities (0=incorrect, 1=correct last city of the trip), taking as input features the averaged predictions and additional features like the frequency of the city. Our initial experiments showed promising results, improving precision@4 by 0.5%. Due to the lack of a public leaderboard and having only one final submission, we opted for the conservative ensembling approach of weighted averaged predictions to rank the cities for submission, as described in Algorithm 1 in Appendix D.
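A minimal sketch of this weighted-average ensembling step; the per-architecture weights and the data layout are illustrative:

```python
# Blend per-city probabilities across bags and architectures, then take the
# top-4 cities per trip for the Precision@4 submission.
import numpy as np

def ensemble(preds_by_arch: dict, weights: dict) -> np.ndarray:
    # preds_by_arch: {"mlp-smf": [p1, p2, ...], ...} where each p is a
    # (num_trips, num_cities) probability matrix from one bag
    blended = []
    for arch, bag_preds in preds_by_arch.items():
        blended.append(weights[arch] * np.mean(bag_preds, axis=0))
    scores = np.sum(blended, axis=0)
    top4 = np.argsort(-scores, axis=1)[:, :4]
    return top4
```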
5 RESULTS AND DISCUSSION
To make clear the contribution of each architecture and of the ensembling algorithm to our final result, we present in Table 1 the cross-validation Precision@4 by architecture, individually and after bagging ensembling, as well as the final ensemble of the three architectures combined. We can see that XLNet-SMF was the most accurate single model.

Model            Single bag CV   Ensemble CV   Final LB
MLP-SMF          0.5667          0.5756        -
GRU-MS-SMF       0.5664          0.5762        -
XLNet-SMF        0.5681          0.5751        -
Final Ensemble   -               0.5825        0.5939

Table 1: Final results by architecture and for the final ensemble. MLP-SMF uses 8 bags, GRU-MS-SMF uses 7 bags and XLNet-SMF uses 5 bags.

Interestingly, the MLP-SMF architecture achieved a CV score comparable with the other two architectures that explicitly model the sequences using GRU and XLNet (Transformer). In MLP-SMF the embeddings of the last 5 cities are simply concatenated and combined by MLP layers, while for GRU-MS-SMF the last 5 cities are processed by a GRU layer, and for XLNet-SMF the last 15 cities are processed by the Transformer. As the MLP-SMF model was lightweight, it was much faster to train than the other architectures, which sped up its experimentation cycle and improvements.
As all three neural network architectures use the Session-based Matrix Factorization head and the data augmentation by reversed trips, we wanted to understand the contribution of each component with a small ablation study. Table 3 in Appendix F summarizes the results of training the MLP architecture with and without the Session-based Matrix Factorization head (MLP-SMF). We found that the SMF head improved the baseline by 2.1%. Analogously, we trained the MLP-SMF model with and without data augmentation and noticed an improvement of 0.8% (see Table 4 in Appendix F).
6 CONCLUSION
In this paper we presented our 1st place solution of the Booking.com WSDM WebTour 21 Challenge. We designed three different deep learning architectures based on MLP, GRU and Transformer neural building blocks. Some techniques resulted in improved performance for all models, like our proposed Session-based Matrix Factorization head and the data augmentation with reversed trips. The diversity of our architectures resulted in significant accuracy improvements by ensembling the model predictions.
We hope our final solution, which is provided on github.com2, is useful to others in the RecSys field who build sequential recommendation systems.

2 https://github.com/rapidsai/deeplearning/tree/main/WSDM2021

ACKNOWLEDGMENTS
The authors wish to thank our colleagues on the NVIDIA Merlin, KGMON and RAPIDS.AI teams for their support in this competition.
REFERENCES
[1] Dmitri Goldenberg, Kostia Kofman, Pavel Levin, Sarai Mizrachi, Maayan Kafry, and Guy Nadav. 2021. Booking.com WSDM WebTour 2021 Challenge. https://www.bookingchallenge.com/. In ACM WSDM Workshop on Web Tourism (WSDM WebTour'21).
[2] Balázs Hidasi, Alexandros Karatzoglou, L. Baltrunas, and D. Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. CoRR abs/1511.06939 (2016).
[3] Paweł Jankiewicz, Liudmyla Kyrashchuk, Paweł Sienkowski, and Magdalena Wójcik. 2019. Boosting Algorithms for a Session-Based, Context-Aware Recommender System in an Online Travel Domain. In Proceedings of the Workshop on ACM Recommender Systems Challenge.
[4] Dietmar Jannach, Gabriel de Souza P. Moreira, and Even Oldridge. 2020. Why Are Deep Learning Models Not Consistently Winning Recommender Systems Competitions Yet? A Position Paper. In Proceedings of the Recommender Systems Challenge 2020 (Virtual Event, Brazil) (RecSysChallenge '20). Association for Computing Machinery, New York, NY, USA, 44–49. https://doi.org/10.1145/3415959.3416001
[5] Benedikt Schifferer, Gilberto Titericz, Chris Deotte, Christof Henkel, Kazuki Onodera, Jiwei Liu, Bojan Tunguz, Even Oldridge, Gabriel De Souza Pereira Moreira, and Ahmet Erdem. 2020. GPU Accelerated Feature Engineering and Training for Recommender Systems. In Proceedings of the Recommender Systems Challenge 2020 (Virtual Event, Brazil) (RecSysChallenge '20). Association for Computing Machinery, New York, NY, USA, 16–23. https://doi.org/10.1145/3415959.3415996
[6] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. Association for Computing Machinery, New York, NY, USA, 1441–1450. https://doi.org/10.1145/3357384.3357895
[7] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
A FEATURES

- original categorical features: city_id, hotel_country, booker_country, device_class, affiliate_id
- check-in and check-out date features: day-of-week, week-of-year, month, isweekend, season, stay length (checkout - checkin), days since last booking (checkin - previous checkout)
- sequence features: first city in the trip, lagged (previous 5) cities and countries from the trip
- current trip stats: trip length (# reservations), trip duration (days), reservation order (in ascending and descending orders)
- past user trips stats: # user's past reservations, # user's past trips, # user's past visited cities, # user's past visited countries
- geographic-seasonal city popularity: features based on the conditional probabilities of a city c from a country co being visited in a month m or in a week-of-year w: P(c | m), P(c | m,co), P(c | w), P(c | w,co)

Table 2: Features used by at least one of the architectures.
B DISTRIBUTION FREQUENCY CITIES

Figure 4: Distribution of the frequency of city_id in the train and test datasets. Around 10,000 city ids appeared only once in the dataset.

C TRAINING DETAILS
MLP-SMF: In this architecture, embedding tables were initialized with N(0, 0.01). Low-frequency cities are ignored by the categorical cross-entropy loss. The model is trained with the Adam optimizer for 12 epochs with a batch size of 1024 and a 1-cycle learning rate scheduler.
GRU-MS-SMF: The model is trained with the Adam optimizer for 5 epochs, with a batch size of 512, and a learning rate step decay schedule (1e-3, 1e-3, 1e-4, 1e-5, 1e-6). A variation of bagging is applied by training the architecture 7 times with different seeds and a slight variation of the used input features. The final predictions are generated by averaging the probabilities over all 7 models.
XLNet-SMF: We stacked 4 XLNet Transformer blocks, with 4 heads each and dimension 512. The model was trained for 30 epochs, with an initial learning rate of 2e-4 and linear decay. The batch size was 128, where each training instance is a session (trip) truncated to a maximum length of 15. For regularization, we used dropout of 0.1 and the AdamW optimizer with a weight decay of 1e-5. The masking probability during training was 0.1, while ensuring that we always have at least one masked item per session (trip).
D ENSEMBLING ALGORITHM

Algorithm 1: Ensembling algorithm
split data in k folds;
for each architecture do
    j = number of bags for this architecture;
    for each fold in folds do
        for i in [1, ..., j] do
            concatenate train and test out-of-fold data (OOF);
            train model on OOF;
            evaluate model based on last city of in-fold train data;
            predict last city of all folds of test data;
        end
    end
    ensemble by averaging over each fold and each bag;
end
ensemble by averaging over architectures;
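Read as Python, Algorithm 1 corresponds roughly to the following sketch, where `train_model` and `predict` are hypothetical callables wrapping one architecture; it mirrors the loop structure above, not the authors' exact code:

```python
# Python rendering of Algorithm 1: average predictions over folds and bags
# per architecture, then average over architectures.
import numpy as np

def run_ensemble(folds, test, architectures, bags, train_model, predict):
    # folds: list of out-of-fold training sets; bags: {arch: num_bags}
    arch_preds = []
    for arch in architectures:
        preds = []
        for oof_data in folds:
            for seed in range(bags[arch]):
                model = train_model(arch, oof_data, seed)  # train on OOF
                preds.append(predict(model, test))         # last city of test
        arch_preds.append(np.mean(preds, axis=0))  # average folds and bags
    return np.mean(arch_preds, axis=0)             # average architectures
```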
E PRECISION@K

Figure 5: Precision@k for top-20 recommendations by the ensemble model.

F ABLATION STUDY

Model      1-model CV
MLP        0.5550
MLP-SMF    0.5667 (+2.1%)

Table 3: Cross-validation score for training the MLP architecture with and without the Session-based Matrix Factorization head (MLP-SMF).

Model                               1-model CV
MLP-SMF without data augmentation   0.5620
MLP-SMF                             0.5667 (+0.8%)

Table 4: Cross-validation score for training MLP-SMF with and without data augmentation.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).