=Paper=
{{Paper
|id=Vol-2855/challenge_short_7
|storemode=property
|title=Weighted Averaging of Various LSTM Models for Next Destination Recommendation
|pdfUrl=https://ceur-ws.org/Vol-2855/challenge_short_7.pdf
|volume=Vol-2855
|authors=Shotaro Ishihara,Shuhei Goda,Yuya Matsumura
|dblpUrl=https://dblp.org/rec/conf/wsdm/IshiharaGM21
}}
==Weighted Averaging of Various LSTM Models for Next Destination Recommendation==
Shotaro Ishihara (Nikkei, Inc., Tokyo, Japan, shotaro.ishihara@nex.nikkei.com), Shuhei Goda (Wantedly, Inc., Tokyo, Japan, shu@wantedly.com), Yuya Matsumura (Wantedly, Inc., Tokyo, Japan, yuya@wantedly.com). These authors contributed equally to this work.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). ACM WSDM WebTour 2021, March 12th, 2021, Jerusalem, Israel.

ABSTRACT

This paper describes the 6th place approach to the Booking.com WSDM WebTour 2021 Challenge, whose task is to predict travellers' next destination. Our team, "hakubishin3 & u++ & yu-y4", trained four types of Long Short-Term Memory (LSTM) models and achieved a final score of 0.5399 by weighted averaging of their predictions. The models differ in feature engineering, multi-task learning, and data augmentation, and our experiments showed that this diversity boosted the final result. Our code is available at https://github.com/hakubishin3/booking-challenge-2021 and https://github.com/upura/booking-challenge-2021.

CCS CONCEPTS

• Information systems → Information systems applications; Recommender systems.

KEYWORDS

Booking.com WSDM WebTour 2021 Challenge, Recommender systems, Long short-term memory

1 INTRODUCTION

Booking.com is one of the world's largest online travel agencies. Its mission is to make it easier for everyone to experience the world, and it explores ways of using information technology to reduce the time and effort that travel requires [4]. One such application is destination recommendation: according to Booking.com, many travellers go on trips that include more than one destination, so suggesting travel destinations based on past history would be helpful for them.

In 2021, Booking.com published a dataset and organized a challenge whose task is to predict travellers' next destination. Participants consider a scenario in which a user of Booking.com makes a reservation and the system immediately suggests options for extending the trip.

The rest of this paper is organized as follows. Section 2 gives an overview of the challenge. Section 3 describes the previous works by the organizer that we followed as a baseline. Sections 4 to 6 present our solution step by step: the architecture of the neural network models, the input features, and the training methods. Section 7 reports the experimental results of our proposed method, and the final section concludes the paper.

2 CHALLENGE TASK

2.1 Metrics

The goal of the challenge is to develop a strategy for recommending cities as a traveller's next destination. The quality of the recommendations is evaluated with the Precision@4 metric: a prediction is considered correct when the true city is among the top four suggestions for each trip.

2.2 Dataset Description

The training dataset consists of over a million anonymized hotel reservations with the following columns:

• user_id: User ID
• check-in: Reservation check-in date
• checkout: Reservation check-out date
• affiliate_id: An anonymized ID of the affiliate channel the booker came from (e.g. direct, some third-party referrals, paid search engine, etc.)
• device_class: desktop/mobile
• booker_country: Country from which the reservation was made (anonymized)
• hotel_country: Country of the hotel (anonymized)
• city_id: city_id of the hotel's city (anonymized)
• utrip_id: Unique identification of the user's trip (a group of multi-destination bookings within the same trip)

The evaluation dataset is constructed similarly, except that the city_id and hotel_country of the final reservation of each trip are concealed. We are required to predict the city_id.

The distribution of city_id in the dataset is long-tailed. Although there are about 40,000 candidate cities, a score of 0.036 can already be achieved by simply suggesting the four most frequent cities in the dataset.
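As an illustration of the metric and the popularity baseline mentioned above, the following minimal Python sketch computes Precision@4 and builds the "four most frequent cities" suggestion. The function name, variable names, and file name are our own illustration and not taken from the released evaluation script.

```python
import pandas as pd

def precision_at_4(recommendations, truth):
    """Fraction of trips whose true final city_id is among the top-4 suggested city_ids."""
    hits = sum(truth[utrip] in recommendations[utrip][:4] for utrip in truth)
    return hits / len(truth)

# Popularity baseline: always suggest the four most frequent cities in the training data.
train = pd.read_csv("train_set.csv")  # hypothetical file name
top4 = train["city_id"].value_counts().index[:4].tolist()
baseline_recs = {utrip: top4 for utrip in train["utrip_id"].unique()}
```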
3 BASELINE APPROACH

The organizer's previous works [6, 9] were a good starting point, and we implemented a neural network model based on these references as a baseline for our solution. The model is a recurrent neural network (RNN) that handles a series of features. As features, affiliate_id, device_class, booker_country, and city_id are used. These sequential categorical features are encoded by a GRU cell [5], and the output is converted into probability values for each city_id through a softmax layer. The four city_ids with the highest probabilities are regarded as the model's recommendation.

The previous works also explain some beneficial ideas. First, their experiments showed that adding features other than the series of city_id greatly improved performance [9], which made us realize the importance of feature engineering in this task. The second is how the features should be combined: two basic merge functions, concatenation and element-wise multiplication, were compared [9], and concatenation performed better. Finally, some training tips were introduced; for example, the following techniques were shown in [6]:

• The data should be sorted by series length to reduce the amount of padding when making batches.
• Duplicates should be eliminated when the same city_id appears consecutively.

4 MODEL ARCHITECTURE

We prepared two types of model architecture, both of which are described in this section.

4.1 Long Short-Term Memory

As the RNN unit, we adopted Long Short-Term Memory (LSTM) [2] instead of the GRU. Figure 1 shows the architecture of the LSTM model. There are no major differences from the baseline presented in section 3 except for the replacement of the RNN unit; the differences in feature engineering and the training process are described in sections 5 and 6.

Figure 1: Model Architecture of LSTM

4.2 LSTM with Multi-task Learning

We also used an LSTM model with another type of structure, shown in Figure 2. It extends the model with the concept of multi-task learning (MTL). MTL is a training paradigm in which machine learning models are trained on data from multiple tasks simultaneously [3]. It is known to bring advantages such as improved data efficiency and reduced overfitting through shared representations.

In this challenge, we built an architecture that predicts not only the city_id but also the hotel_country at the same time. Since there is an inclusion relationship between hotel_country and city_id, we expected the prediction of hotel_country to also contribute to the quality of the city_id prediction. Predicting hotel_country is an easier task than predicting city_id, and once a hotel_country is given, the candidates for city_id are limited, which can help the model predict the correct city_id.

Figure 2: Model Architecture of LSTM with Multi-task Learning
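A minimal PyTorch sketch of the multi-task variant described above, assuming integer-encoded sequences of city_id and affiliate_id as inputs; the class name, chosen feature subset, embedding and hidden sizes, and loss combination are illustrative assumptions, not the exact code from the released repositories.

```python
import torch
import torch.nn as nn

class LSTMMultiTask(nn.Module):
    """Sequential categorical features -> shared LSTM -> two heads (city_id and hotel_country)."""

    def __init__(self, n_cities, n_countries, n_affiliates, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.city_emb = nn.Embedding(n_cities, emb_dim, padding_idx=0)
        self.affiliate_emb = nn.Embedding(n_affiliates, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim * 2, hidden_dim, batch_first=True)
        self.city_head = nn.Linear(hidden_dim, n_cities)        # main task
        self.country_head = nn.Linear(hidden_dim, n_countries)  # auxiliary task

    def forward(self, city_seq, affiliate_seq):
        # city_seq, affiliate_seq: (batch, seq_len) integer-encoded sequences
        x = torch.cat([self.city_emb(city_seq), self.affiliate_emb(affiliate_seq)], dim=-1)
        out, _ = self.lstm(x)
        last = out[:, -1, :]  # hidden state after the last known reservation
        return self.city_head(last), self.country_head(last)

# During training, the two cross-entropy losses can simply be summed, e.g.
# loss = ce(city_logits, y_city) + ce(country_logits, y_country)
```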
5 FEATURES

We generated some features from the original dataset and removed some of the original columns. The features can be divided into three types: categorical, numerical, and graph features. Each type is briefly summarized in this section.

5.1 Categorical Features

There are seven categorical features in the dataset, and six of them are used in the baseline described in section 3; the only one removed is utrip_id, a unique identifier. We added the following categorical features:

• month_checkin: Month of check-in.
• past_city_id: Previous city_id.
• past_hotel_country: Previous hotel_country.

5.2 Numerical Features

We also extracted some numerical features. Note that days_stay uses future information, so it must be handled with care. In this challenge, however, the check-in and checkout dates of the target reservations are given, so days_stay, which would not be available in practice, can be calculated; it is a useful feature since it indicates characteristics of the target.

• days_stay: The number of days stayed at the current hotel.
• days_move: The number of days between the previous hotel's check-out date and the current hotel's check-in date.
• num_checkin: The number of check-ins within the utrip_id.
• num_visit_drop_duplicates: The number of unique city_ids within the utrip_id.
• num_visit: The number of city_ids within the utrip_id.
• num_visit_same_city: The number of duplicated city_ids within the utrip_id.
• num_stay_consecutively: The number of consecutive stays in the same city_id.

5.3 Graph Features

Each sequence of trips is only a fragment of the underlying geography. We believe graph-related features are important because they can help reconstruct this geographical information. We used Word2Vec [8] and PyTorch-BigGraph [1] to create graph features from the sequences of each trip. Figure 3 is a scatter plot of the city_id vectors calculated by Word2Vec and compressed by UMAP [7]. When the MTL architecture was used, the same kind of embeddings were also calculated for hotel_country.

Figure 3: The embeddings of city_id calculated by Word2Vec, compressed by UMAP
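A minimal sketch of how such city_id embeddings could be learned with gensim's Word2Vec, treating each utrip_id as a "sentence" of visited city_ids. The file name, column names, and hyperparameter values are assumptions for illustration, not the settings used in the competition.

```python
import pandas as pd
from gensim.models import Word2Vec

train = pd.read_csv("train_set.csv")  # hypothetical file name; column names assumed below

# One "sentence" per trip: the ordered list of visited city_ids.
sentences = (
    train.sort_values("checkin")
    .groupby("utrip_id")["city_id"]
    .apply(lambda cities: [str(c) for c in cities])
    .tolist()
)

# Illustrative hyperparameters; each learned vector becomes a per-city graph feature.
w2v = Word2Vec(sentences, vector_size=64, window=5, min_count=1, sg=1, epochs=10)
city_vector = w2v.wv["12345"]  # embedding of a hypothetical city_id
```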
6 TRAINING PROCESS

6.1 Loss Functions

CrossEntropyLoss is commonly used in multi-class classification tasks. However, the city_id classes to be predicted in this challenge are imbalanced, with a long-tailed distribution. In order to suppress the bias in the loss caused by this class imbalance, we adopted FocalLoss [10] as an alternative loss function.

6.2 Validation Strategy

We chose stratified K-folds so that the distribution of trip lengths is equal in each fold. This is because the trip lengths differ, as shown in Figure 4, which may affect the quality of the prediction. The number of folds was set to five. We trained for 15 epochs in each fold and used the model with the best validation score for prediction.

Figure 4: The distribution of the length of trips

6.3 Data Augmentation

We believe that each sequence of trips can be flipped. When applying this data augmentation, booker_country must be removed in order to keep the dataset consistent.

In this challenge, we can also use information from the evaluation dataset. The final city of each trip in the evaluation dataset is masked for evaluation, so we created an additional dataset from the evaluation dataset by regarding the second city_id from the final reservation as the target of each trip and added it to the training data.

7 EXPERIMENTS

7.1 Settings

Table 1 shows the settings of each model. Each model is a combination of the proposed architectures, features, and training methods. The MTL architecture was used in LSTM 2 and 3, the graph features and the loss function were different in LSTM 4, and the data augmentation techniques were applied in LSTM 3 and 4.

Table 1: Settings of Four LSTM Models

Model  | MTL   | Graph            | Loss             | Augmentation
LSTM 1 | false | Word2Vec         | CrossEntropyLoss | false
LSTM 2 | true  | Word2Vec         | CrossEntropyLoss | false
LSTM 3 | true  | Word2Vec         | CrossEntropyLoss | flip dataset
LSTM 4 | false | PyTorch-BigGraph | FocalLoss        | test dataset

In LSTM 3, booker_country was removed from the categorical features as described in subsection 6.3.

7.2 Results for Each Model

Table 2 shows the results of our experiments for each model. The validation scores were calculated by averaging the predictions over all folds. Our proposed models outperformed the baseline. The comparison among LSTM 1 to 3 shows that MTL worked in our experiments, whereas data expansion by flipping did not. LSTM 4 gave the best single-model result.

7.3 Weighted Averaging

Weighted averaging was used for the ensembling. We chose it because of the limited number of submissions and the high computational cost of other ensembling methods. Table 2 shows that the diversity of the models leads to higher scores. It is interesting to note that even though LSTM 4 performed well on its own, the best result was obtained when its ratio was reduced to 0.4.

Table 2: Validation Scores of Four LSTM Models and Weighted Averaging

Model                | Validation score | Weights
Baseline             | 0.4629           | -
LSTM 1               | 0.4927           | -
LSTM 2               | 0.4937           | -
LSTM 3               | 0.4862           | -
LSTM 4               | 0.5043           | -
Weighted averaging 1 | 0.5156           | (0.2, 0.15, 0.05, 0.7)
Weighted averaging 2 | 0.5149           | (0.15, 0.10, 0.05, 0.7)
Weighted averaging 3 | 0.5160           | (0.25, 0.20, 0.15, 0.4)

The four numbers in Weights are the ratios of LSTM 1 to 4, respectively.
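A minimal sketch of the weighted averaging step, assuming each model produces a per-trip probability matrix over candidate city_ids; the function name, array names, and shapes are illustrative.

```python
import numpy as np

def weighted_average_top4(probs, weights):
    """Blend per-model probability matrices of shape (n_trips, n_cities) and return top-4 city indices per trip."""
    blended = sum(w * p for w, p in zip(weights, probs))
    return np.argsort(-blended, axis=1)[:, :4]

# Example with the best-performing weights from Table 2 (ratios for LSTM 1 to 4):
# top4_indices = weighted_average_top4([p1, p2, p3, p4], [0.25, 0.20, 0.15, 0.4])
```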
8 CONCLUSION

This paper described our approach to the Booking.com WSDM WebTour 2021 Challenge. We trained four types of LSTM models, and weighted averaging of their predictions increased the accuracy. The models are diverse in terms of feature engineering, multi-task learning, and data augmentation. In the end, we took 6th place on the final leaderboard of the challenge.

ACKNOWLEDGMENTS

We thank the organizers of the Booking.com WSDM WebTour 2021 Challenge for the opportunity to participate in this interesting challenge.

REFERENCES

[1] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA.
[2] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[3] Michael Crawshaw. 2020. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv preprint arXiv:2009.09796v1 (2020).
[4] Dmitri Goldenberg, Kostia Kofman, Pavel Levin, Sarai Mizrachi, Maayan Kafry, and Guy Nadav. 2021. Booking.com WSDM WebTour 2021 Challenge. https://www.bookingchallenge.com/. In ACM WSDM Workshop on Web Tourism (WSDM WebTour'21).
[5] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[6] Pavel Levin. 2018. Modeling Multi-Destination Trips with RNNs. Technical Report. DataConf 2018, October 4th, Jerusalem, Israel.
[7] Leland McInnes, John Healy, and James Melville. 2020. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426v3 (2020).
[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781v3 (2013).
[9] Sarai Mizrachi and Pavel Levin. 2019. Combining Context Features in Sequence-Aware Recommender Systems. ACM RecSys 2019 Late-breaking Results, 16th–20th September 2019, Copenhagen, Denmark (Sept. 2019).
[10] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).