Combining RNN with Transformer for Modeling Multi-Leg Trips

Yoshihiro Sakatani
sakatani.yoshihiro@sdtech.co.jp
sdtech Inc.
Minato, Tokyo, Japan

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Recommending trip destinations based on user behavior is an important task for travel agencies such as Booking.com. The Booking.com Challenge at the WebTour 2021 ACM WSDM workshop was aimed at building models for this task: the goal was to predict the final destination of multi-destination trips, based on a large dataset of over a million trip reservations at Booking.com with dates, destinations, and other attributes. In this paper, I present my approach, which leverages recent advances in language modeling techniques, including the Transformer. The approach, which used only the sequence of visited cities as input, achieved a top-4 accuracy of 0.4720 on the final result leaderboard. Full code is available at https://github.com/sakatani/BookingcomChallenge2021.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Neural networks.

KEYWORDS
Recommender Systems, Recurrent Neural Network, Transformer, Sequence-Aware Recommendation

1 INTRODUCTION
Booking.com is the world's largest online travel agency and is used by millions of users to find accommodations. Recommending destinations based on user behavior is an important task for travel agencies such as Booking.com [1][6]. In the accommodation domain, where people spend a great deal of time and money, the accuracy of recommenders is highly important.

The Booking.com Challenge [5] was aimed at building a recommender system that estimates the final destination city of multi-destination trips based on the visited cities (i.e., the cities where the user stayed before the final destination) as well as additional contextual information such as the reservation date. The challenge was based on over a million anonymized real-world reservations made on Booking.com in recent years; there were therefore tens of thousands of possible final destination cities. Training data with complete itineraries and test data with hidden final destinations were released for the development of models. The evaluation dataset was not partitioned into public and private leaderboards; two submissions of predictions for the single evaluation dataset were allowed, one for the intermediate leaderboard and one for the final leaderboard, and the final results were used for the evaluation.

Every record in the data was a user's reservation and contained information such as the city of stay, country of stay, reservation ID, user ID, reservation date, check-in date, check-out date, affiliation channel, and the country where the reservation was made. Information other than the city of stay was expected to play an important role in inference performance, as shown in a previous study [8]. Due to time constraints, however, my approach did not make use of this kind of contextual information. The approach, which focused on integrating several recent natural language processing techniques for modeling sentences and used only the sequence of visited cities as input, achieved a top-4 accuracy of 0.4720 in the final leaderboard results.

2 APPROACH

2.1 Data Splitting
In this challenge, the evaluation data consisted only of travel reservations with four or more legs, so the provided training data was grouped by reservation ID and only the records of trips with four or more legs were extracted. In addition to the extracted training data, the evaluation data was also used for building models; it was grouped by reservation ID in the same way as the training data, and the records of trips with four or more legs before the final destination were extracted. Finally, the collected data was randomly split into 15% for local evaluation, 15% for validation during training, and 70% for training.
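The splitting procedure can be sketched as follows. This is a minimal illustration rather than the implementation in the repository; the column name `utrip_id` (trip/reservation ID) and the CSV file layout are assumptions, and the exact preprocessing should be checked against the released challenge files.

```python
import pandas as pd

def prepare_splits(train_csv, eval_csv, min_legs=4, seed=0):
    """Sketch of the data splitting in Section 2.1 (column names are assumptions)."""
    train = pd.read_csv(train_csv)
    evl = pd.read_csv(eval_csv)

    # Keep only trips (grouped by reservation/trip ID) with four or more legs.
    train = train.groupby("utrip_id").filter(lambda trip: len(trip) >= min_legs)
    # Evaluation trips with four or more legs before the hidden final destination
    # are grouped the same way and reused for model building.
    evl = evl.groupby("utrip_id").filter(lambda trip: len(trip) >= min_legs)
    trips = pd.concat([train, evl], ignore_index=True)

    # Random split of trip IDs: 15% local evaluation, 15% validation, 70% training.
    ids = trips["utrip_id"].drop_duplicates().sample(frac=1.0, random_state=seed).tolist()
    n = len(ids)
    local_eval_ids = set(ids[: int(0.15 * n)])
    valid_ids = set(ids[int(0.15 * n): int(0.30 * n)])

    is_local = trips["utrip_id"].isin(local_eval_ids)
    is_valid = trips["utrip_id"].isin(valid_ids)
    return trips[~is_local & ~is_valid], trips[is_valid], trips[is_local]
```

Splitting by trip ID rather than by individual records keeps all legs of a reservation in the same split.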
2.2 Models
The model used in my approach is illustrated in Figure 1; the positional encoding of a Transformer is replaced with a single-layer LSTM.

For modeling multi-destination trips, an RNN-based model has been proposed [8] that predicts the next destination in a sequence of destinations in much the same way that RNN-based language models predict the next word in a sentence. Recently, the representative models in natural language processing have been dominated by Transformer-based models [10], as seen in Google BERT [4] and OpenAI GPT-3 [2]. Hence, the Transformer architecture was adopted as the basis for my approach.

Although Transformer-based models have achieved a number of state-of-the-art results in natural language processing, they have one computational weakness compared with conventional RNN models when used generatively for destination recommendation. Because the Transformer does not preserve a hidden state as RNNs do, but instead uses positional encoding to represent the sequential nature of tokens, it is expected to be computationally more expensive than RNNs: it needs to recompute the entire history in the context window at each time step.

To address this problem, a model called LSTM + Transformer, in which the positional encoding of the Transformer is replaced with a single-layer LSTM, has been proposed in a previous study [9]. A Transformer with positional encoding must recompute all the tokens in the context window at each time step as the window slides, whereas the LSTM + Transformer model only needs to process the new token at each time step, since the LSTM keeps the hidden state. The study reported that this model ran at 86% of the computational cost of a Transformer-based model in a sentence generation task on the WikiText-103 dataset [7]. Although this LSTM-Transformer combined model is similar to the Cascaded Encoder model [3], it uses RNNs for decoding as well as for encoding.

Figure 1: Model architecture. The LSTM-Transformer combined model is used for the prediction of the final destinations. It uses a single LSTM layer to represent the sequential nature of tokens instead of the positional encoding.
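The architecture in Figure 1 can be sketched roughly as below. This is a minimal PyTorch illustration under my reading of the paper, not the code from the repository: a city embedding is fed through a single LSTM layer that stands in for the positional encoding, then through Transformer encoder layers, and finally projected to logits over candidate cities. The class and argument names are my own; the dimensions follow Section 3.

```python
import torch
import torch.nn as nn

class LSTMTransformer(nn.Module):
    """Rough sketch of the LSTM-Transformer combined model (Figure 1)."""

    def __init__(self, num_cities, d_model=512, n_heads=8, d_ff=1024, n_layers=5):
        super().__init__()
        self.embed = nn.Embedding(num_cities, d_model)
        # A single LSTM layer replaces the positional encoding: its recurrence
        # injects order information and carries a hidden state across time steps.
        self.lstm = nn.LSTM(d_model, d_model, num_layers=1, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, num_cities)

    def forward(self, city_ids, state=None):
        # city_ids: (batch, seq_len) integer-encoded visited cities.
        x = self.embed(city_ids)
        x, state = self.lstm(x, state)  # order information comes from the LSTM
        # Causal mask so each position attends only to earlier legs of the trip.
        seq_len = city_ids.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=city_ids.device),
            diagonal=1,
        )
        x = self.transformer(x, mask=mask)
        return self.out(x), state  # per-position logits over candidate cities
```

To predict a final destination, the logits at the last visited city are ranked and the four highest-scoring cities form the top-4 candidates.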
3 EXPERIMENTS
All the experiments were conducted on Google Colaboratory, on instances with an Intel(R) Xeon(R) CPU @ 2.30GHz, 250GB RAM, and NVIDIA Tesla T4 GPUs.

The LSTM-Transformer combined model consisted of one LSTM layer and five Transformer layers, with 512 dimensions for the LSTM hidden layer and the Transformer, 1024 dimensions for the Transformer feedforward layers, and eight attention heads. The batch size was set to 16. The Adam optimizer was used during training. The learning rate was linearly warmed up to 7 × 10^-4 over 4,000 iterations and then decayed in proportion to the inverse square root of the iteration number, following the formula in [10].
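As a concrete reading of this schedule (my interpretation, not code from the repository), the learning rate rises linearly to its peak of 7 × 10^-4 at iteration 4,000 and then falls with the inverse square root of the iteration number:

```python
def learning_rate(step, peak_lr=7e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then inverse-square-root decay (cf. [10])."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # linear warmup
    return peak_lr * (warmup_steps / step) ** 0.5   # proportional to 1/sqrt(step)

# Example: learning_rate(4000) == 7e-4, learning_rate(16000) == 3.5e-4.
```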
In addition to the LSTM-Transformer combined model, a model consisting of six Transformer layers and models consisting of LSTM or GRU layers were also prepared for comparison. The six-layer Transformer model was identical to the LSTM-Transformer model, except that a Transformer layer was used in place of the LSTM layer. The LSTM and GRU models had two layers and 512 hidden units. The GRU model was based on the previous study of an RNN-based multi-destination trip model [8], but the model details, including hyperparameters, could not be reproduced exactly due to the lack of sufficient information.

The models were evaluated using perplexity, which was based on the negative log-likelihood loss, and top-4 accuracy on the local evaluation data, according to the rules of the challenge. Figure 2 shows the perplexity during training on the training and validation datasets. The perplexity of each model on the training dataset decreased steadily with the number of iterations. On the validation dataset, the perplexities of both the Transformer and the LSTM-Transformer combined model decreased steadily until at least 350,000 iterations, while those of the GRU and LSTM models decreased unsteadily and then increased after 50,000 iterations.

Figure 2: Perplexities of each model on the training and validation datasets. A) Training perplexity. B) Validation perplexity.

The perplexity and top-4 accuracy on the local evaluation dataset were then evaluated using the checkpoint with the minimum perplexity on the validation dataset (Table 1). The LSTM-Transformer combined model showed the lowest perplexity and the highest top-4 accuracy. The Transformer model performed nearly as well as the LSTM-Transformer combined model. These results indicate that replacing the positional encoding of the Transformer with the LSTM layer does not adversely affect the accuracy for this type of task, but rather can have a positive effect on performance. The LSTM and GRU models showed smaller perplexities than the Transformer-based models during training on both the training and validation datasets, whereas they showed worse perplexities and top-4 accuracy on the evaluation data. A possible explanation is that the RNN-based models were overfitting to the training or validation dataset.

Table 1: Model performance on the local evaluation dataset.

Model              Loss   Perplexity   Top-4 accuracy
LSTM-Transformer   5.69   269.85       0.451
Transformer        5.62   276.98       0.440
GRU                5.80   328.74       0.397
LSTM               5.92   376.98       0.365

Finally, the local evaluation dataset was split into 18% for validation and 82% for training, and the LSTM-Transformer model was re-trained on these datasets. The predictions submitted to the final result leaderboard were generated by the retrained model and achieved a top-4 accuracy score of 0.4720.
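For reference, the perplexity and top-4 accuracy reported above can be computed as in the following sketch (my own illustration, assuming the model outputs one row of logits over all candidate cities per trip): perplexity is the exponential of the mean negative log-likelihood, and top-4 accuracy checks whether the true final destination appears among the four highest-scoring cities.

```python
import torch
import torch.nn.functional as F

def perplexity_and_top4(logits, targets):
    """logits: (num_trips, num_cities) final-destination scores;
    targets: (num_trips,) true final city IDs."""
    # Perplexity is the exponential of the mean negative log-likelihood loss.
    nll = F.cross_entropy(logits, targets)
    ppl = torch.exp(nll)
    # Top-4 accuracy: is the true city among the four highest-scoring candidates?
    top4 = logits.topk(4, dim=-1).indices
    acc = (top4 == targets.unsqueeze(-1)).any(dim=-1).float().mean()
    return ppl.item(), acc.item()
```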
While this is not relevant to the evaluation of the challenge, the time each model required to generate tokens for 10,000 steps was also measured, three times each on GPU and CPU (Table 2). The processing time of the LSTM-Transformer combined model was reduced to 59% of that of the Transformer model on the CPU and to 90% on the GPU. These results indicate that replacing the positional encoding of the Transformer with an LSTM reduces the computational cost. The processing time on the GPU, however, was not reduced as much as on the CPU; the LSTM-Transformer model may have some processing bottlenecks on the GPU, as pointed out in a previous study [9].

Table 2: Processing time to generate tokens for 10,000 steps, measured three times.

Model              CPU           GPU
LSTM-Transformer   192.5 (2.0)   56.32 (0.77)
Transformer        324.5 (3.8)   62.47 (0.94)
GRU                200.2 (1.6)   20.53 (0.18)
LSTM               237.3 (0.5)   21.55 (0.17)

4 CONCLUSION
In this paper, I described my approach to the Booking.com Challenge at the WebTour 2021 ACM WSDM workshop. The approach showed that substituting an LSTM for the positional encoding can have a positive effect on both prediction and computational performance.

Because of time constraints, the approach did not make use of any information other than the visited destinations for building the models. Although the LSTM-Transformer model used only sequences of the visited cities as input, it achieved the 13th score on the final result leaderboard. A previous study has shown that the prediction performance of RNN models can be improved by combining contextual information, such as the user's home country, with the sequence of visited destinations [8]. Thus, the model used in this paper is expected to improve in prediction performance when such contextual information is considered. A further study of how to combine contextual information with visited destinations in Transformer-based models, including the LSTM-Transformer combined model, should be conducted.

The effect of replacing the positional encoding with an LSTM on the computational cost was briefly investigated but not fully explored in this study. The LSTM-Transformer model is expected to reduce the computational cost especially when multiple destinations are predicted, as it has the advantage that expanding the context window does not increase the computational cost of processing the input. Another direction for future work is to investigate how replacing the positional encoding with an LSTM affects the computational cost of Transformer-based models in generative tasks for destination recommendation.

REFERENCES
[1] Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 2019. 150 successful machine learning models: 6 lessons learned at Booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1743–1751.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[3] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. arXiv:1804.09849 [cs.CL]
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
[5] Dmitri Goldenberg, Kostia Kofman, Pavel Levin, Sarai Mizrachi, Maayan Kafry, and Guy Nadav. 2021. Booking.com WSDM WebTour 2021 Challenge. In ACM WSDM Workshop on Web Tourism (WSDM WebTour'21). https://www.bookingchallenge.com/
[6] Julia Kiseleva, Melanie JI Mueller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, Jaap Kamps, Alexander Tuzhilin, and Djoerd Hiemstra. 2015. Where to go on your next trip? Optimizing travel destinations based on user preferences. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1097–1100.
[7] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs.CL]
[8] Sarai Mizrachi and Pavel Levin. 2019. Combining Context Features in Sequence-Aware Recommender Systems. In RecSys (Late-Breaking Results). 11–15.
[9] Akihiro Tanikawa. 2020. [Deep Learning Study] Text Generation with an LSTM + Transformer Model (in Japanese). https://note.com/diatonic_codes/n/nab29c78bbf2e
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:1706.03762 [cs.CL]