=Paper= {{Paper |id=Vol-2855/challenge_short_4 |storemode=property |title=Explore next destination prediction |pdfUrl=https://ceur-ws.org/Vol-2855/challenge_short_4.pdf |volume=Vol-2855 |authors=Yuanzhe Zhou,Shikang Wu,Chenyang Zheng |dblpUrl=https://dblp.org/rec/conf/wsdm/ZhouWZ21 }} ==Explore next destination prediction== https://ceur-ws.org/Vol-2855/challenge_short_4.pdf
                                       Explore next destination prediction
                    Yuanzhe Zhou                                          Shikang Wu                                      Chenyang Zheng
              zhouyuanzhe@whu.edu.cn                               danny199607@gmail.com                            zhengchenyang01@baidu.com
                  WUHAN, CHINA                                        BEIJING, CHINA                                      BEIJING, CHINA
    Predicting a user's next destination from their travel sequence has
    recently attracted considerable research attention. It is a natural
    benchmark for graph neural networks and graph embedding algorithms,
    which have achieved state-of-the-art results on many tasks involving
    graph information. In this paper, we explore this problem with
    recurrent neural networks and achieve a top-4 accuracy of 0.5741.

    • Information systems → Recommender systems.

     Booking.com Data Challenge, neural networks, deep learning,
    network embeddings, recommendation systems

    Figure 1: Model structure

    In this paper, we introduce the model and methods we employed
    in the WebTour 2021 Challenge [2] held by Booking.com. The
    source code for training and prediction can be accessed here:

    2    DETAILS OF MODEL
    As preprocessing, we pad or crop each user's destination sequence to
    a fixed length of 20 so that it can be fed to a recurrent neural
    network. User sequences shorter than 2 are removed, since such
    sequences do not appear in our test data. We use both the training
    data and the test data for Word2Vec [5][6] pretraining, with window
    size 1, embedding sizes ranging from 64 to 256, and the sg
    (skip-gram) mode.
       The structure of our model is shown in Figure 1. The inputs of
    the model are destination sequences and manual features. We use
    LSTM [3] (Long Short-Term Memory) and GRU [1] (Gated Recurrent
    Unit) layers to extract features from the destination sequences: one
    bi-directional LSTM layer followed by one bi-directional GRU layer.
    The last output state of the GRU is used as the feature of the
    destination sequence. We then introduce the gate structure proposed
    in [4] to model the relationship between the last destination and
    the next destination. It is computed as follows:

        $$d_{\text{output}} = \alpha \cdot d_n + (1 - \alpha) \cdot \mathrm{RNN}([d_1, d_2, \ldots, d_n])$$

        $$\alpha = \mathrm{MLP}([d_n, \mathrm{RNN}([d_1, d_2, \ldots, d_n])])$$

    where the $d_i$ are the embedding vectors of the cities.
       The model has two outputs, the next destination and the next
    country. The extra country information helps decrease the loss
    considerably. We apply a softmax activation to each output and
    optimize the model with the cross-entropy loss.

    2.1     Manual features
    Since Booking.com allows us to use part of the information about
    the next destination, we built manual features from the check-out
    date of the last destination, the check-in date of the next
    destination, and the duration the traveler will stay. These extra
    features improve our top-4 accuracy by about 1.5 percentage points.
       Our time-related features are:
        • Duration of last trip.
        • Day of the month for check in.
        • Day of the week for check in.
        • Day of the month for check out.
        • Day of the week for check out.
       The sine and cosine values of the time features are used. For
    example,

        $$\text{day of the week sin} = \sin(\text{day of the week} / 7)$$

        $$\text{day of the week cos} = \cos(\text{day of the week} / 7)$$

    Other features such as affiliate_id, device_class and booker_country
    are also included. These inputs are processed by a trainable
    embedding layer. The additional information improved the final
    result by 1.5%.
       We use a 2-layer MLP to process the feature inputs. Its output is
    then concatenated with the output of the RNN. We make the final
    prediction with a densely connected layer with softmax activation.
    All cities are treated as candidate destinations (this part could
    be improved by carefully selecting positive and negative samples).
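
    The sine and cosine time features of Section 2.1 can be illustrated
    with a short sketch. The function name cyclic_features and the
    full_circle switch are hypothetical. With the default setting the
    code follows the formulas as printed (value / period); the 2*pi
    scaling is a common variant that we add only as an assumption, since
    the text does not state it.

```python
import math


def cyclic_features(value, period, full_circle=False):
    """Return (sin, cos) of value / period.

    full_circle=False follows the formulas as printed in Section 2.1.
    full_circle=True multiplies the angle by 2*pi (an assumption, not
    stated in the text) so the encoding wraps around after one period.
    """
    angle = value / period
    if full_circle:
        angle *= 2.0 * math.pi
    return math.sin(angle), math.cos(angle)


# Day of the week (here 3, on a 0..6 scale) encoded as printed in the text.
day_of_week_sin, day_of_week_cos = cyclic_features(3, 7)

# Wrapped variant: day 0 and day 7 map to (almost exactly) the same point.
wrapped_start = cyclic_features(0, 7, full_circle=True)
wrapped_end = cyclic_features(7, 7, full_circle=True)
```

    The same helper applies to the day of the month by changing the
    period argument.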

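
    The gate equations of Section 2 can be sketched in plain Python as
    follows. This is an illustration under stated assumptions, not our
    training code: the MLP is replaced by a single linear layer with a
    sigmoid, the gate is assumed to be a scalar, and all names
    (gate_alpha, gated_output, the toy vectors) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_alpha(d_n, h, weights, bias):
    """Scalar gate from the concatenation [d_n, h].

    Assumption: one linear layer plus sigmoid stands in for the MLP.
    """
    concat = list(d_n) + list(h)
    score = sum(w * x for w, x in zip(weights, concat)) + bias
    return sigmoid(score)


def gated_output(d_n, h, alpha):
    """d_output = alpha * d_n + (1 - alpha) * h, applied element-wise."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(d_n, h)]


d_n = [1.0, 0.0, 2.0]   # toy embedding of the last destination
h = [0.0, 1.0, 0.0]     # toy RNN summary of the whole sequence
weights = [0.0] * 6     # zero weights make the gate exactly 0.5
alpha = gate_alpha(d_n, h, weights, bias=0.0)
d_output = gated_output(d_n, h, alpha)
```

    With a gate of 0.5 the output is the midpoint of the two vectors; a
    trained gate would shift the mix toward the last destination or
    toward the sequence summary.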
    REFERENCES
    [4] Qiao Liu, Yifu Zeng, Refuoe Mokhosi, and Haibin Zhang. 2018. STAMP: Short-Term
        Attention/Memory Priority Model for Session-Based Recommendation. In
        Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
        Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for
        Computing Machinery, New York, NY, USA, 1831–1839.
        https://doi.org/10.1145/3219819.3219950
    [5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013.
        Distributed representations of words and phrases and their compositionality. In
        Advances in Neural Information Processing Systems. 3111–3119.
    [6] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling
        with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
        for NLP Frameworks. ELRA, Valletta, Malta, 45–50.

    3     TRAINING STRATEGY
    We train with 20-fold cross-validation and the ’reduce lr on plateau’
    learning rate decay strategy. We use the Adam optimizer with
    learning rate 0.001 and batch size 8096. What contributes most to
    our final result is the two-phase training strategy.

    3.1     Pretraining for sequential recommendation
    Pretraining has proved efficient and effective for many tasks.
    Although an RNN is not as capable as a transformer, we can still
    improve its performance through pretraining.
       We pretrain our RNN model on the sub-sequences of each complete
    sequence. Assume that the complete sequence is

        $$(d_0, d_1, d_2, \ldots, d_n)$$

    The sub-sequences are

        $$(d_0),\ (d_0, d_1),\ (d_0, d_1, d_2),\ \ldots,\ (d_0, d_1, d_2, \ldots, d_{n-1})$$

    where $d_i$ denotes a destination.
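
    The construction of the pretraining sub-sequences, together with the
    pad/crop step of Section 2, can be sketched as follows. Several
    details are assumptions rather than statements from the text: each
    prefix is paired with the destination that follows it as the
    target, shorter sequences are left-padded, longer ones keep their
    most recent items, and PAD_ID = 0 is a hypothetical padding id.

```python
PAD_ID = 0     # hypothetical padding id, assumed not to be a real city id
MAX_LEN = 20   # fixed sequence length stated in Section 2


def prefix_subsequences(sequence):
    """Return (prefix, target) pairs for (d_0), (d_0, d_1), ..., (d_0..d_{n-1}).

    Assumption: the target of each prefix is the destination that follows it.
    """
    return [(sequence[:k], sequence[k]) for k in range(1, len(sequence))]


def pad_or_crop(seq, max_len=MAX_LEN, pad_id=PAD_ID):
    """Keep the most recent max_len items and left-pad shorter sequences."""
    seq = list(seq)[-max_len:]
    return [pad_id] * (max_len - len(seq)) + seq


trip = [101, 102, 103, 104]             # toy city ids d_0..d_3
pairs = prefix_subsequences(trip)       # three (prefix, target) pairs
padded = [pad_or_crop(prefix) for prefix, _ in pairs]
```

    A trip of n + 1 destinations therefore yields n pretraining samples,
    which is where the additional training signal comes from.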

    3.2     Fine-tune
    Pretraining can also be problematic, because some intermediate
    destinations are not directly connected with the user's origin. As
    a result, some sub-sequences are noisy and the data distribution of
    the sub-sequences differs from that of the original data. To handle
    this difference, we only need to fine-tune our model on the
    original training data.
       Table 1 compares the top-4 accuracy on local validation data
    across the different settings.

        Setting                                      Top-4 accuracy
        LSTM                                         52.51%
        LSTM+features                                54.02%
        LSTM+features+pretrain                       54.95%
        LSTM+features+pretrain+finetune              55.53%
        LSTM+features+pretrain+finetune+test data    57.49%
             Table 1: Top-4 accuracy on local validation
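
    For reference, the top-4 accuracy metric reported in Table 1 can be
    computed as in the sketch below: a prediction counts as correct if
    the true destination is among the four highest-scoring cities. The
    function name top_k_accuracy and the toy scores are hypothetical and
    are not taken from our experiments.

```python
def top_k_accuracy(scores, targets, k=4):
    """Fraction of rows whose target index is among the k highest scores."""
    if not targets:
        raise ValueError("targets must not be empty")
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if target in top_k:
            hits += 1
    return hits / len(targets)


# Toy softmax outputs over 6 candidate cities for 2 trips.
scores = [
    [0.08, 0.30, 0.05, 0.20, 0.25, 0.12],   # top-4 indices: 1, 4, 3, 5
    [0.40, 0.05, 0.15, 0.10, 0.02, 0.28],   # top-4 indices: 0, 5, 2, 3
]
targets = [4, 4]                             # true city index per trip
accuracy = top_k_accuracy(scores, targets)   # first trip hits, second misses
```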

    4     CONCLUSION
    In this paper, we started from a very simple RNN model for next
    destination prediction. The main improvements came from
    understanding the data and making the most of it through data
    exploration. We have shown that pretraining is beneficial for
    sequential recommendation and that fine-tuning improves the
    performance further. Our final result is competitive with the
    state of the art.

