=Paper=
{{Paper
|id=Vol-2855/challenge_short_2
|storemode=property
|title=Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations
|pdfUrl=https://ceur-ws.org/Vol-2855/challenge_short_2.pdf
|volume=Vol-2855
|authors=Benedikt Schifferer,Chris Deotte,Jean-Francois Puget,Gabriel de Souza Pereira Moreira,Gilberto Titericz, Jiwei Liu,Ronay Ak
|dblpUrl=https://dblp.org/rec/conf/wsdm/SchiffererDPMTL21
}}
==Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations==
Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations

Benedikt Schifferer∗ (bschifferer@nvidia.com), NVIDIA, New York City, New York, United States
Chris Deotte∗ (cdeotte@nvidia.com), NVIDIA, San Diego, California, United States
Jean-François Puget∗ (jpuget@nvidia.com), NVIDIA France Developpement, Saint Raphael, France
Gabriel de Souza Pereira Moreira∗ (gmoreira@nvidia.com), NVIDIA, São Paulo, Brazil
Gilberto Titericz∗ (gtitericz@nvidia.com), NVIDIA, Curitiba, Parana, Brazil
Jiwei Liu∗ (jiweil@nvidia.com), NVIDIA, Pittsburgh, Pennsylvania, United States
Ronay Ak∗ (ronaya@nvidia.com), NVIDIA, Sarasota, Florida, United States
ABSTRACT
In this paper we present our 1st place solution of the WSDM WebTour 21 Challenge. The competition task was to predict the city of the last booked hotel in a user trip, based on the previously visited cities. For our final solution, we designed three deep learning architectures based on Multilayer Perceptron (MLP), Gated Recurrent Units (GRU) and XLNet (Transformer) building blocks. A shared component among the architectures was a Session-based Matrix Factorization head (SMF), which learns a linear mapping between the item (city) embeddings and the session (trip) embeddings to generate recommendations by a dot product operation. Our final leaderboard result was 0.5939 for precision@4, 2.8% better than the second-place solution. We published our implementation, using RAPIDS cuDF, TensorFlow and PyTorch, on github.com1.

CCS CONCEPTS
• Information systems → Personalization; Recommender systems; Learning to rank.

KEYWORDS
Recommender Systems, WSDM WebTour 21 Challenge, Deep Learning, Sequential recommendation, MLP, GRU, Transformer, Matrix Factorization

∗ Authors contributed equally to this research.
1 https://github.com/rapidsai/deeplearning/tree/main/WSDM2021

1 INTRODUCTION
Recommender systems (RecSys) have become a key component in many online services, such as e-commerce, social media, news services, and online video streaming. Sequence-based recommendation is a highly relevant task in real-world applications, like predicting the next item in an online shopping cart, the next online course a student will choose, or the next travel destination of a traveller.

In this paper, we present our 1st place solution for the Booking.com WSDM WebTour 21 Challenge organized by Booking.com for the WebTour workshop at ACM WSDM 2021 [1]. The task was to recommend (predict) the final city destination of a traveller's trip given their previous booking history within the trip. The competition evaluation metric was Precision@4, which scored true only if the correct next city was among the top-4 recommendations (regardless of the order) out of the about 40,000 cities found in the dataset.

Recent RecSys competitions have often required identifying the positive interactions given a set of positives and negatives (e.g. click-through rate (CTR) prediction), which shifts the task from generating recommendations to a binary classification problem. In that setting, the focus is often on creating features to separate positive from negative examples rather than on model design [4]. Boosted trees algorithms often performed well in such binary classification tasks, as observed in the last RecSys Challenge competitions [3] [5].

Differently from previous RecSys competitions, this dataset contained only positive examples, and deep learning based solutions worked best for us, challenging the status quo highlighted in [4].

For our final solution, we developed three different neural architectures: (1) MLP with Session-based Matrix Factorization net (MLP-SMF), (2) GRU with MultiStage Session-based Matrix Factorization net (GRU-MS-SMF), and (3) XLNet with Session-based Matrix Factorization net (XLNet-SMF). The final submission was a simple ensemble (weighted average) of the predictions from models of each of those three architectures.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 PREPROCESSING AND FEATURE ENGINEERING
The competition dataset from the Booking.com challenge has a tabular data structure, consisting of 1.5M anonymized reservations from 269k trips. Each reservation is part of a customer's trip (identified by utrip_id), which includes at least 4 consecutive reservations. A reservation contains the metadata and city/country information but does not include information about the hotel. Thus, the sequence of trip reservations can be obtained by sorting on the check-in date.

2.1 Data splitting for Cross-Validation
Unlike most competitions, this competition allowed only two submissions per team: an intermediate one and the final one. It was crucial to design a good validation set, because it would be key to selecting the best features and models for the final submission. We used k-fold cross-validation, predicting the last city of the trips on the train set split into 5 folds. For each fold, the fold train data was used as the validation set, and data (both train and test) from the other folds (Out-Of-Fold, OOF) was used to train the model. We used the full OOF dataset, predicting the next city given the previous ones. The cross-validation (CV) score was the average precision@4 over the five folds.
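This trip-level split can be sketched in a few lines. The following is a minimal illustration (not the competition code), assuming a pandas-style DataFrame `df` with a `utrip_id` column:

```python
# Sketch of the 5-fold split at the trip level, so all reservations of a
# trip land in the same fold. Fold assignment details are illustrative.
import pandas as pd
from sklearn.model_selection import KFold

def assign_folds(df: pd.DataFrame, n_folds: int = 5, seed: int = 0) -> pd.DataFrame:
    trips = df["utrip_id"].unique()
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_of_trip = {}
    for fold, (_, valid_idx) in enumerate(kf.split(trips)):
        for trip in trips[valid_idx]:
            fold_of_trip[trip] = fold
    df = df.copy()
    df["fold"] = df["utrip_id"].map(fold_of_trip)
    return df

# For fold k: validate on trips in fold k (predicting their last city) and
# train on out-of-fold (OOF) data from the remaining folds.
```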
2.2 Feature Engineering
The original dataset has 9 features: user_id, utrip_id, checkin, checkout, city_id (label), hotel_country, device_class, affiliate_id and booker_country. For the last reservation of each trip in the test set, city_id and hotel_country were not provided.
In our initial experiments, building new features improved the accuracy of our models. The main features included in our models are listed in Table 2 in Appendix A. All features that used information from other rows (e.g. user features, city features) were computed OOF, to avoid leakage from the fold labels. Depending on the architecture, a subset of the features was used.
We implemented our preprocessing and feature engineering pipeline with RAPIDS cuDF, a library for GPU-accelerated dataframe transformations.
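As an illustration, two of the date features from Table 2 could be computed with the pandas-compatible cuDF API roughly as follows. This is a simplified sketch, not our full pipeline; only the column names follow the original dataset:

```python
# Sketch of date-based feature engineering on GPU with RAPIDS cuDF.
import cudf

def add_date_features(df: cudf.DataFrame) -> cudf.DataFrame:
    df = df.sort_values(["utrip_id", "checkin"])
    # stay length (checkout - checkin), in days
    df["stay_length"] = (df["checkout"] - df["checkin"]).dt.days
    # days since last booking (checkin - previous checkout within the trip)
    prev_checkout = df.groupby("utrip_id")["checkout"].shift(1)
    df["days_since_last_booking"] = (df["checkin"] - prev_checkout).dt.days
    df["checkin_dow"] = df["checkin"].dt.dayofweek
    df["checkin_month"] = df["checkin"].dt.month
    return df
```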
2.3 Data augmentation with reversed trips
The dataset with 1.5M bookings is relatively small in comparison to other recommendation datasets. We explored techniques to increase the training data by data augmentation and discovered that doubling the dataset with reversed trips improved the models' accuracy. A trip is an ordered sequence of cities, and although there are many permutations in which a set of cities could be visited, there is a logical ordering implied by the distances between cities and the available transportation connections. This ordering is symmetric: it remains plausible when reversed. For example, a trip of Boston->New York->Princeton->Philadelphia->Washington DC can be booked in reverse order, but not many people would book a random order like Boston->Princeton->Washington DC->New York->Philadelphia.
2.4 Low frequency cities hashing
The distribution of city reservation frequency was long-tailed, as one would expect: some cities are much more popular for tourism and business than others. To keep the models from focusing too much on very unpopular cities (city reservation frequency < 9 in our case), we assigned their city ids to a single value for training and ensured that those cities were never ranked high during inference. This approach reduced the cardinality of the cities from more than 44,000 to about 11,000 (see Figure 4 in Appendix B).
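A sketch of this remapping, assuming a pandas-style DataFrame; the shared id of -1 for rare cities is an illustrative choice:

```python
# Sketch of mapping low-frequency cities (< 9 reservations) to one shared id.
import pandas as pd

def remap_rare_cities(df: pd.DataFrame, min_count: int = 9):
    counts = df["city_id"].value_counts()
    rare = set(counts[counts < min_count].index)
    rare_token = -1  # single shared id for all low-frequency cities (illustrative)
    df = df.copy()
    df["city_id_mapped"] = df["city_id"].where(~df["city_id"].isin(rare), rare_token)
    # `rare` is also used at inference time to keep these cities out of the top-4
    return df, rare
```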
3 MODEL ARCHITECTURES
This section describes the final three neural architectures designed for this competition, which were ensembled for the final submission. Each architecture can be trained with a single NVIDIA V100 GPU. As the cardinality of the label (city) is not large, all models treated the recommendation as a multi-class classification problem, using a softmax cross-entropy loss function. A common component of all architectures was the Session-based Matrix Factorization (SMF) head, which is described in the next section.

3.1 Session-based Matrix Factorization head (SMF)
Framing this problem under the recommender systems taxonomy, the cities are the items we want to recommend for a trip. The trip is analogous to a session in the session-based recommendation task, which is generally a sequence of user interactions, hotel reservations in this case.
For this competition, we designed a Session-based Matrix Factorization (SMF) head to produce the scores (logits) for all items, instead of a regular Multi-Layer Perceptron (MLP) output layer. This design was inspired by MF-based Collaborative Filtering, which learns latent factors for users and items by performing a dot product of their embeddings to predict the relevance of an item for a user.
The large majority of users had only a single trip available in the dataset. So, instead of trying to model the user preference, we used the last layer of the network to represent the session (trip) embedding. Then, we compute the dot product between the session embedding s and the set I of item embeddings i, where i ∈ I, to model the probability distribution r of each item (city) being the next one for that session (trip), such that r = softmax(s · I).
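A minimal PyTorch sketch of the SMF head under this formulation; dimensions are illustrative, and the softmax is applied inside the cross-entropy loss:

```python
# Session-based Matrix Factorization (SMF) head: dot product between the
# trip (session) embedding and every city embedding, r = softmax(s · I).
import torch
import torch.nn as nn

class SMFHead(nn.Module):
    def __init__(self, num_cities: int, dim: int):
        super().__init__()
        self.city_embeddings = nn.Embedding(num_cities, dim)

    def forward(self, session_embedding: torch.Tensor) -> torch.Tensor:
        # session_embedding: (batch, dim); weight: (num_cities, dim)
        logits = session_embedding @ self.city_embeddings.weight.T
        return logits  # (batch, num_cities); softmax is inside the loss

# usage sketch:
# head = SMFHead(num_cities=11000, dim=512)
# loss = nn.CrossEntropyLoss()(head(trip_emb), next_city_ids)
```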
3.2 MLP with Session-based Matrix Factorization head (MLP-SMF)
The MLP with Session-based Matrix Factorization head uses feed-forward and embedding layers, as shown in Figure 1. Categorical input features are fed through an embedding layer, and continuous input features are individually projected to embeddings via a linear layer, followed by batch normalization and a Rectified Linear Unit (ReLU) activation. All embedding dimensions are made equal. The embeddings of continuous and categorical input features, except the lag features, are combined via summation. The output is concatenated with the embeddings of the 5 last cities and countries (lag features).
The embedding tables for the city lags are shared, and similarly for the hotel country lags. The lag embeddings are concatenated, but the model should still be able to learn the sequential patterns of cities from the order of the lag features, i.e., city lag1's embedding vector is always in the same position of the concatenated vector.
The concatenated vector is fed through 3 feed-forward layers with batch normalization, Parametric Rectified Linear Unit (PReLU) activation and dropout, to form the session (trip) embedding. It is used by the Session-based Matrix Factorization head to produce the scores for all cities. More details can be found in Appendix C.

Figure 1: Visualization of MLP with Session-based Matrix Factorization head architecture.
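An illustrative PyTorch sketch of the MLP-SMF forward pass as described above. The hyperparameters, the number of features, and the reuse of the lag city-embedding table for the output dot product are assumptions; country lags and continuous features are omitted for brevity:

```python
# Simplified MLP-SMF: summed feature embeddings, concatenated shared
# city-lag embeddings, three feed-forward layers, then the SMF dot product.
import torch
import torch.nn as nn

class MLPSMF(nn.Module):
    def __init__(self, num_cities=11000, num_feat=10, feat_card=100,
                 dim=128, hidden=1024):
        super().__init__()
        self.feat_emb = nn.ModuleList(
            [nn.Embedding(feat_card, dim) for _ in range(num_feat)]
        )
        self.city_emb = nn.Embedding(num_cities, dim)  # shared across the 5 lags
        layers, in_dim = [], dim + 5 * dim
        for _ in range(3):
            layers += [nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden),
                       nn.PReLU(), nn.Dropout(0.2)]
            in_dim = hidden
        self.mlp = nn.Sequential(*layers)
        self.to_session = nn.Linear(hidden, dim)  # project into city-embedding space

    def forward(self, feats, city_lags):
        # feats: (batch, num_feat) ints; city_lags: (batch, 5) last visited cities
        summed = torch.stack(
            [emb(feats[:, i]) for i, emb in enumerate(self.feat_emb)], dim=0
        ).sum(dim=0)                                  # (batch, dim)
        lags = self.city_emb(city_lags).flatten(1)    # (batch, 5*dim), order kept
        session = self.to_session(self.mlp(torch.cat([summed, lags], dim=1)))
        return session @ self.city_emb.weight.T       # logits over all cities
```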
3.3 GRU with MultiStage Session-based Matrix Factorization head (GRU-MS-SMF)
The GRU with MultiStage Session-based Matrix Factorization head uses a GRU cell for embedding the historical user actions (previous 5 visited cities), similar to GRU4Rec [2]. Missing values (sequences shorter than 5) are padded with 0s.
The last GRU hidden state is concatenated with the embeddings of the other categorical input features, as shown in Figure 2. The embedding tables for the previous cities are shared. The model uses only categorical input features, including some numerical features modeled as embeddings, such as trip length and reservation order.
The concatenated embeddings are fed through a MultiStage Session-based Matrix Factorization head. The first stage is a softmax head over all items, which is used to select the top-50 cities for the next stage. In the second stage, the Session-based Matrix Factorization head is applied using only the top-50 cities of the first stage and two representations of the trip embedding (the outputs of the last and second-to-last MLP layers, after the concatenation), resulting in two additional heads. The final output is a weighted sum of all three heads, with trainable weights.
The multi-stage head works as a 2-stage ranking problem. The first head ranks all items, and the other two heads can focus on reranking the top-50 items from the first stage. This approach can potentially scale to large item catalogs, i.e., in the order of millions of items. This dataset did not require such scalability, but this multi-stage design might be effective for deployment in production. Training details can be found in Appendix C.

Figure 2: Visualization of GRU with MultiStage Session-based Matrix Factorization head architecture.
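A sketch of one way to realize this multi-stage head in PyTorch. Combining the three heads by scattering the reranking scores back onto the stage-1 logits is our illustrative choice, not necessarily the exact implementation:

```python
# Two-stage head: a softmax head over all cities selects top-50 candidates,
# then SMF-style dot products rescore only those candidates; the three heads
# are combined with trainable weights.
import torch
import torch.nn as nn

class MultiStageSMFHead(nn.Module):
    def __init__(self, num_cities: int, dim: int, top_k: int = 50):
        super().__init__()
        self.top_k = top_k
        self.stage1 = nn.Linear(dim, num_cities)   # head over all items
        self.city_emb = nn.Embedding(num_cities, dim)
        self.head_weights = nn.Parameter(torch.ones(3))

    def forward(self, sess_a: torch.Tensor, sess_b: torch.Tensor) -> torch.Tensor:
        # sess_a / sess_b: trip representations from the last and
        # second-to-last MLP layers, both (batch, dim)
        logits1 = self.stage1(sess_a)                    # (batch, num_cities)
        _, top_idx = logits1.topk(self.top_k, dim=1)     # candidate cities
        cand = self.city_emb(top_idx)                    # (batch, k, dim)
        logits2 = (cand * sess_a.unsqueeze(1)).sum(-1)   # rescore candidates
        logits3 = (cand * sess_b.unsqueeze(1)).sum(-1)
        w = self.head_weights
        out = w[0] * logits1
        out.scatter_add_(1, top_idx, w[1] * logits2 + w[2] * logits3)
        return out
```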
3.4 XLNet with Session-based Matrix Factorization head (XLNet-SMF)
The XLNet with Session-based Matrix Factorization head (XLNet-SMF) uses a Transformer architecture named XLNet [7], originally proposed for the permutation-based language modeling task in Natural Language Processing (NLP). In our case, instead of modeling the sequence of word tokens, we model the sequence of items in the session (trip).
We adapted the XLNet training task to Masked Language Modeling (also known as the Cloze task), as proposed by BERT4Rec [6] for sequential recommendation. In that approach, for each training step, a proportion of all items are masked from the input sequence (i.e., replaced with a single trainable embedding), and then the original ids of the masked items are predicted using the other items of the sequence, from both the left and right sides. When a masked item is not the last one, this approach allows the usage of privileged information from future reservations in the trip during training. Therefore, during inference, only the last item of the sequence is masked, to match the sequential recommendation task and not leak future information about the trip [6].
For this network, each example is a trip, represented as a sequence of its reservations. The reservation embedding is generated by concatenating the features and projecting them using MLP layers.
The sequence of reservation embeddings is fed to the stacked XLNet Transformer blocks, in which the output of each block is the input of the next block. The final Transformer block output generates the trip embedding.
Finally, we use the Matrix Factorization head (dot product between the trip embedding and the city embeddings) to produce a probability distribution over the cities for the masked items in the sequence. Model hyperparameters can be found in Appendix C.

Figure 3: Visualization of XLNet with Session-based Matrix Factorization head architecture.
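The masking scheme can be sketched as follows; this is an illustration assuming padded sequences and the 10% masking probability from Appendix C:

```python
# Masking for training vs. inference: random ~10% of items per trip during
# training (at least one), only the last item at inference time.
import torch

MASK_PROB = 0.1

def build_masks(seq_lens: torch.Tensor, max_len: int, training: bool) -> torch.Tensor:
    # seq_lens: (batch,) number of reservations per trip; returns a boolean
    # (batch, max_len) mask where True marks positions to predict.
    positions = torch.arange(max_len).unsqueeze(0)      # (1, max_len)
    valid = positions < seq_lens.unsqueeze(1)           # ignore padding
    last_pos = (seq_lens - 1).clamp(min=0)
    if training:
        mask = (torch.rand(seq_lens.size(0), max_len) < MASK_PROB) & valid
        # guarantee at least one masked item per trip: mask the last item
        # of any trip that got none
        none_masked = ~mask.any(dim=1)
        mask[none_masked, last_pos[none_masked]] = True
    else:
        mask = torch.zeros_like(valid)
        mask[torch.arange(seq_lens.size(0)), last_pos] = True
    return mask
```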
4 ENSEMBLING
Ensembling is a proven approach to improve the accuracy of models, by combining their predictions and improving generalization.
We used k-fold cross-validation and bagging (training the model many times with different random seeds, in our case) to ensemble models from our three architectures, as shown in Algorithm 1 in Appendix D.
In general, the higher the diversity of the models' predictions, the more an ensembling technique can potentially improve the final scores. In our case, the correlation of the predicted city scores between each pair of the three architectures was around 80%, which resulted in a meaningful improvement of our final CV score with ensembling.
In addition to ensembling the three models by averaging their predictions, we also tried to add another ensembling layer: a model to rerank the top-20 cities from the averaged predictions. That insight came from the fact that the precision@20 of our averaged predictions was 0.777, a much higher score than for the competition metric precision@4 (around 0.57) (see Figure 5 in Appendix E). The reranker was a gradient boosted decision trees model for binary classification of cities (0=incorrect, 1=correct last city of the trip), taking as input features the averaged predictions and additional features like the frequency of the city. Our initial experiments showed promising results, improving precision@4 by 0.5%. Due to the lack of a public leaderboard and having only one final submission, we opted for the conservative ensembling approach of weighted averaged predictions to rank the cities for submission, as described in Algorithm 1 in Appendix D.
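A minimal sketch of this weighted-average ensembling step; the per-architecture weights and the data layout are illustrative:

```python
# Blend per-city probabilities across bags and architectures, then take the
# top-4 cities per trip for the Precision@4 submission.
import numpy as np

def ensemble(preds_by_arch: dict, weights: dict) -> np.ndarray:
    # preds_by_arch: {"mlp-smf": [p1, p2, ...], ...} where each p is a
    # (num_trips, num_cities) probability matrix from one bag
    blended = []
    for arch, bag_preds in preds_by_arch.items():
        blended.append(weights[arch] * np.mean(bag_preds, axis=0))
    scores = np.sum(blended, axis=0)
    top4 = np.argsort(-scores, axis=1)[:, :4]
    return top4
```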
5 RESULTS AND DISCUSSION
To make clear the contribution of each architecture and of the ensembling algorithm to our final result, we present in Table 1 the cross-validation Precision@4 by architecture, individually and after bagging ensembling, as well as the final ensemble of the three architectures combined. We can see that XLNet-SMF was the most accurate single model.

Model            Single bag CV   Ensemble CV   Final LB
MLP-SMF          0.5667          0.5756        -
GRU-MS-SMF       0.5664          0.5762        -
XLNet-SMF        0.5681          0.5751        -
Final Ensemble   -               0.5825        0.5939

Table 1: Final results by architecture and for the final ensemble. MLP-SMF uses 8 bags, GRU-MS-SMF uses 7 bags and XLNet-SMF uses 5 bags.

Interestingly, the MLP-SMF architecture achieved a CV score comparable with the other two architectures that explicitly model the sequences using GRU and XLNet (Transformer). In MLP-SMF the embeddings of the last 5 cities are simply concatenated and combined by MLP layers, while for GRU-MS-SMF the last 5 cities are processed by a GRU layer, and for XLNet-SMF the last 15 cities are processed by the Transformer. As the MLP-SMF model was lightweight, it was much faster to train than the other architectures, which sped up its experimentation cycle and improvements.
As all three neural network architectures use the Session-based Matrix Factorization head and the data augmentation by reversed trips, we wanted to understand the contribution of each component with a small ablation study. Table 3 in Appendix F summarizes the results of training the MLP architecture with and without the Session-based Matrix Factorization head (MLP-SMF). We found that the SMF head improved the baseline by 2.1%. Analogously, we trained the MLP-SMF model with and without data augmentation and noticed an improvement of 0.8% (see Table 4 in Appendix F).
6 CONCLUSION
In this paper we presented our 1st place solution of the Booking.com WSDM WebTour 21 Challenge. We designed three different deep learning architectures based on MLP, GRU and Transformer neural building blocks. Some techniques resulted in improved performance for all models, like our proposed Session-based Matrix Factorization head and the data augmentation with reversed trips. The diversity of our architectures resulted in significant accuracy improvements by ensembling the model predictions.
We hope our final solution, which is provided on github.com2, is useful to others in the RecSys field who build sequential recommendation systems.

2 https://github.com/rapidsai/deeplearning/tree/main/WSDM2021

ACKNOWLEDGMENTS
The authors wish to thank our colleagues on the NVIDIA Merlin, KGMON and RAPIDS.AI teams for their support in this competition.
REFERENCES
[1] Dmitri Goldenberg, Kostia Kofman, Pavel Levin, Sarai Mizrachi, Maayan Kafry, and Guy Nadav. 2021. Booking.com WSDM WebTour 2021 Challenge. https://www.bookingchallenge.com/. In ACM WSDM Workshop on Web Tourism (WSDM WebTour'21).
[2] Balázs Hidasi, Alexandros Karatzoglou, L. Baltrunas, and D. Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. CoRR abs/1511.06939 (2016).
[3] Paweł Jankiewicz, Liudmyla Kyrashchuk, Paweł Sienkowski, and Magdalena Wójcik. 2019. Boosting Algorithms for a Session-Based, Context-Aware Recommender System in an Online Travel Domain. In Proceedings of the Workshop on ACM Recommender Systems Challenge.
[4] Dietmar Jannach, Gabriel de Souza P. Moreira, and Even Oldridge. 2020. Why Are Deep Learning Models Not Consistently Winning Recommender Systems Competitions Yet? A Position Paper. In Proceedings of the Recommender Systems Challenge 2020 (Virtual Event, Brazil) (RecSysChallenge '20). Association for Computing Machinery, New York, NY, USA, 44–49. https://doi.org/10.1145/3415959.3416001
[5] Benedikt Schifferer, Gilberto Titericz, Chris Deotte, Christof Henkel, Kazuki Onodera, Jiwei Liu, Bojan Tunguz, Even Oldridge, Gabriel De Souza Pereira Moreira, and Ahmet Erdem. 2020. GPU Accelerated Feature Engineering and Training for Recommender Systems. In Proceedings of the Recommender Systems Challenge 2020 (Virtual Event, Brazil) (RecSysChallenge '20). Association for Computing Machinery, New York, NY, USA, 16–23. https://doi.org/10.1145/3415959.3415996
[6] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. Association for Computing Machinery, New York, NY, USA, 1441–1450. https://doi.org/10.1145/3357384.3357895
[7] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
A FEATURES

- original categorical features: city_id, hotel_country, booker_country, device_class, affiliate_id
- check-in and check-out date features: day-of-week, week-of-year, month, isweekend, season, stay length (checkout - checkin), days since last booking (checkin - previous checkout)
- sequence features: first city in the trip, lagged (previous 5) cities and countries from the trip
- current trip stats: trip length (# reservations), trip duration (days), reservation order (in ascending and descending orders)
- past user trips stats: # user's past reservations, # user's past trips, # user's past visited cities, # user's past visited countries
- geographic-seasonal city popularity: features based on the conditional probabilities of a city c from a country co being visited in a month m or in a week-of-year w: P(c | m), P(c | m,co), P(c | w), P(c | w,co)

Table 2: Features used by at least one of the architectures.
B DISTRIBUTION FREQUENCY CITIES

Figure 4: Distribution of the frequency of city_id in the train and test datasets. Around 10,000 city ids appeared only once in the dataset.

C TRAINING DETAILS
MLP-SMF: In this architecture, embedding tables were initialized with N(0, 0.01). Low-frequency cities are ignored by the categorical cross-entropy loss. The model is trained with the Adam optimizer for 12 epochs with a batch size of 1024 and a 1-cycle learning rate scheduler.
GRU-MS-SMF: The model is trained with the Adam optimizer for 5 epochs, with a batch size of 512, and a learning rate step decay schedule (1e-3, 1e-3, 1e-4, 1e-5, 1e-6). A variation of bagging is applied by training the architecture 7 times with different seeds and a slight variation of the used input features. The final predictions are generated by averaging the probabilities over all 7 models.
XLNet-SMF: We stacked 4 XLNet Transformer blocks, with 4 heads each and dimension 512. The model was trained for 30 epochs, with an initial learning rate of 2e-4 and linear decay. The batch size was 128, where each training instance is a session (trip) truncated to a maximum length of 15. For regularization, we used dropout of 0.1 and the AdamW optimizer with a weight decay of 1e-5. The masking probability during training was 0.1, while ensuring that we always have at least one masked item per session (trip).
D ENSEMBLING ALGORITHM

Algorithm 1: Ensembling algorithm
split data in k folds;
for each architecture do
    j = number of bags for this architecture;
    for each fold in folds do
        for i in [1, ..., j] do
            concatenate train and test out-of-fold data (OOF);
            train model on OOF;
            evaluate model based on last city of in-fold train data;
            predict last city of all folds of test data;
        end
    end
    ensemble by averaging over each fold and each bag;
end
ensemble by averaging over architectures;
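Read as Python, Algorithm 1 corresponds roughly to the following sketch, where `train_model` and `predict` are hypothetical callables wrapping one architecture; it mirrors the loop structure above, not the authors' exact code:

```python
# Python rendering of Algorithm 1: average predictions over folds and bags
# per architecture, then average over architectures.
import numpy as np

def run_ensemble(folds, test, architectures, bags, train_model, predict):
    # folds: list of out-of-fold training sets; bags: {arch: num_bags}
    arch_preds = []
    for arch in architectures:
        preds = []
        for oof_data in folds:
            for seed in range(bags[arch]):
                model = train_model(arch, oof_data, seed)  # train on OOF
                preds.append(predict(model, test))         # last city of test
        arch_preds.append(np.mean(preds, axis=0))  # average folds and bags
    return np.mean(arch_preds, axis=0)             # average architectures
```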
E PRECISION@K

Figure 5: Precision@k for top-20 recommendations by the ensemble model.

F ABLATION STUDY

Model      1-model CV
MLP        0.5550
MLP-SMF    0.5667 (+2.1%)

Table 3: Cross-validation score for training the MLP architecture with and without the Session-based Matrix Factorization head (MLP-SMF).

Model                               1-model CV
MLP-SMF without data augmentation   0.5620
MLP-SMF                             0.5667 (+0.8%)

Table 4: Cross-validation score for training MLP-SMF with and without data augmentation.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).