=Paper=
{{Paper
|id=Vol-2855/challenge_short_7
|storemode=property
|title=Weighted Averaging of Various LSTM Models for Next Destination Recommendation
|pdfUrl=https://ceur-ws.org/Vol-2855/challenge_short_7.pdf
|volume=Vol-2855
|authors=Shotaro Ishihara,Shuhei Goda,Yuya Matsumura
|dblpUrl=https://dblp.org/rec/conf/wsdm/IshiharaGM21
}}
==Weighted Averaging of Various LSTM Models for Next Destination Recommendation==
Shotaro Ishihara (Nikkei, Inc., Tokyo, Japan, shotaro.ishihara@nex.nikkei.com), Shuhei Goda (Wantedly, Inc., Tokyo, Japan, shu@wantedly.com), Yuya Matsumura (Wantedly, Inc., Tokyo, Japan, yuya@wantedly.com). These authors contributed equally to this work.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). ACM WSDM WebTour 2021, March 12th, 2021, Jerusalem, Israel.

ABSTRACT

This paper describes the 6th place approach to the Booking.com WSDM WebTour 2021 Challenge, whose task is to predict travellers' next destination. Our team, "hakubishin3 & u++ & yu-y4", trained four types of Long Short-Term Memory (LSTM) models and achieved a final score of 0.5399 by weighted averaging of their predictions. The models differ in feature engineering, multi-task learning, and data augmentation, and our experiments showed that this diversity boosted the final result. Our code is available at https://github.com/hakubishin3/booking-challenge-2021 and https://github.com/upura/booking-challenge-2021.

CCS CONCEPTS

• Information systems → Information systems applications; Recommender systems.

KEYWORDS

Booking.com WSDM WebTour 2021 Challenge, Recommender systems, Long short-term memory

1 INTRODUCTION

Booking.com is one of the world's largest online travel agencies. Its mission is to make it easier for everyone to experience the world, and it explores ways of using information technology to reduce the time and effort that travel requires [4]. One such application is destination recommendation: according to Booking.com, many travellers go on trips that include more than one destination, so suggesting travel destinations based on past history would be helpful for them.

In 2021, Booking.com published a dataset and organized a challenge whose task is to predict travellers' next destination. Participants consider a scenario in which a user of Booking.com makes a reservation and the system immediately suggests options for extending the trip.

The rest of this paper is organized as follows. Section 2 gives an overview of the challenge. Section 3 describes the previous works by the organizer that we followed as a baseline. Sections 4 to 6 present our solution step by step: the architecture of the neural network models, the input features, and the training methods. Section 7 reports the experimental results of our proposed method, and the final section concludes the paper.

2 CHALLENGE TASK

2.1 Metrics

The goal of the challenge is to develop a strategy for recommending cities as a traveller's next destination. The quality of the recommendations is evaluated with the Precision@4 metric: a prediction is considered correct when the true city is among the top four suggestions for each trip.

2.2 Dataset Description

The training dataset consists of over a million anonymized hotel reservations with the following columns:

• user_id: User ID
• check-in: Reservation check-in date
• checkout: Reservation check-out date
• affiliate_id: An anonymized ID of the affiliate channel the booker came from (e.g. direct, some third-party referrals, paid search engine, etc.)
• device_class: desktop/mobile
• booker_country: Country from which the reservation was made (anonymized)
• hotel_country: Country of the hotel (anonymized)
• city_id: city_id of the hotel's city (anonymized)
• utrip_id: Unique identification of the user's trip (a group of multi-destination bookings within the same trip)

The evaluation dataset is constructed similarly, except that the city_id and hotel_country of the final reservation of each trip are concealed. We are required to predict the city_id.

The distribution of city_id in the dataset is long-tailed. Although there are about 40,000 candidate cities, a score of 0.036 can already be achieved by simply suggesting the four most frequent cities in the dataset.
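As an illustration of the metric and the popularity baseline mentioned above, the following minimal Python sketch computes Precision@4 and builds the "four most frequent cities" suggestion. The function name, variable names, and file name are our own illustration and not taken from the released evaluation script.

```python
import pandas as pd

def precision_at_4(recommendations, truth):
    """Fraction of trips whose true final city_id is among the top-4 suggested city_ids."""
    hits = sum(truth[utrip] in recommendations[utrip][:4] for utrip in truth)
    return hits / len(truth)

# Popularity baseline: always suggest the four most frequent cities in the training data.
train = pd.read_csv("train_set.csv")  # hypothetical file name
top4 = train["city_id"].value_counts().index[:4].tolist()
baseline_recs = {utrip: top4 for utrip in train["utrip_id"].unique()}
```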
3 BASELINE APPROACH

The organizer's previous works [6, 9] were a good starting point, and we implemented a neural network model based on these references as a baseline for our solution. The model is a recurrent neural network (RNN) that handles a series of features. As features, affiliate_id, device_class, booker_country, and city_id are used. These sequential categorical features are encoded by a GRU cell [5], and the output is converted into probability values for each city_id through a softmax layer. The four city_ids with the highest probabilities are regarded as the model's recommendation.

The previous works also explain some beneficial ideas. First, their experiments showed that adding features other than the series of city_id greatly improved performance [9], which made us realize the importance of feature engineering in this task. The second is how the features should be combined: two basic merge functions, concatenation and element-wise multiplication, were compared [9], and concatenation performed better. Finally, some training tips were introduced; for example, the following techniques were shown in [6]:

• The data should be sorted by series length to reduce the amount of padding when making batches.
• Duplicates should be eliminated when the same city_id appears consecutively.

4 MODEL ARCHITECTURE

We prepared two types of model architecture, both of which are described in this section.

4.1 Long Short-Term Memory

As the RNN unit, we adopted Long Short-Term Memory (LSTM) [2] instead of the GRU. Figure 1 shows the architecture of the LSTM model. There are no major differences from the baseline presented in section 3 except for the replacement of the RNN unit; the differences in feature engineering and the training process are described in sections 5 and 6.

Figure 1: Model Architecture of LSTM

4.2 LSTM with Multi-task Learning

We also used an LSTM model with another type of structure, shown in Figure 2. It extends the model with the concept of multi-task learning (MTL). MTL is a training paradigm in which machine learning models are trained on data from multiple tasks simultaneously [3]. It is known to bring advantages such as improved data efficiency and reduced overfitting through shared representations.

In this challenge, we built an architecture that predicts not only the city_id but also the hotel_country at the same time. Since there is an inclusion relationship between hotel_country and city_id, we expected the prediction of hotel_country to also contribute to the quality of the city_id prediction. Predicting hotel_country is an easier task than predicting city_id, and once a hotel_country is given, the candidates for city_id are limited, which can help the model predict the correct city_id.

Figure 2: Model Architecture of LSTM with Multi-task Learning
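A minimal PyTorch sketch of the multi-task variant described above, assuming integer-encoded sequences of city_id and affiliate_id as inputs; the class name, chosen feature subset, embedding and hidden sizes, and loss combination are illustrative assumptions, not the exact code from the released repositories.

```python
import torch
import torch.nn as nn

class LSTMMultiTask(nn.Module):
    """Sequential categorical features -> shared LSTM -> two heads (city_id and hotel_country)."""

    def __init__(self, n_cities, n_countries, n_affiliates, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.city_emb = nn.Embedding(n_cities, emb_dim, padding_idx=0)
        self.affiliate_emb = nn.Embedding(n_affiliates, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim * 2, hidden_dim, batch_first=True)
        self.city_head = nn.Linear(hidden_dim, n_cities)        # main task
        self.country_head = nn.Linear(hidden_dim, n_countries)  # auxiliary task

    def forward(self, city_seq, affiliate_seq):
        # city_seq, affiliate_seq: (batch, seq_len) integer-encoded sequences
        x = torch.cat([self.city_emb(city_seq), self.affiliate_emb(affiliate_seq)], dim=-1)
        out, _ = self.lstm(x)
        last = out[:, -1, :]  # hidden state after the last known reservation
        return self.city_head(last), self.country_head(last)

# During training, the two cross-entropy losses can simply be summed, e.g.
# loss = ce(city_logits, y_city) + ce(country_logits, y_country)
```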
5 FEATURES

We generated some features from the original dataset and removed some of the original columns. The features can be divided into three types: categorical, numerical, and graph features. Each type is briefly summarized in this section.

5.1 Categorical Features

There are seven categorical features in the dataset, and six of them are used in the baseline described in section 3; the only one removed is utrip_id, a unique identifier. We added the following categorical features:

• month_checkin: Month of check-in.
• past_city_id: Previous city_id.
• past_hotel_country: Previous hotel_country.

5.2 Numerical Features

We also extracted some numerical features. Note that days_stay uses future information, so it must be handled with care. In this challenge, however, the check-in and checkout dates of the target reservations are given, so days_stay, which would not be available in practice, can be calculated; it is a useful feature since it indicates characteristics of the target.

• days_stay: The number of days stayed at the current hotel.
• days_move: The number of days between the previous hotel's check-out date and the current hotel's check-in date.
• num_checkin: The number of check-ins within the utrip_id.
• num_visit_drop_duplicates: The number of unique city_ids within the utrip_id.
• num_visit: The number of city_ids within the utrip_id.
• num_visit_same_city: The number of duplicated city_ids within the utrip_id.
• num_stay_consecutively: The number of consecutive stays in the same city_id.

5.3 Graph Features

Each sequence of trips is only a fragment of the underlying geography. We believe graph-related features are important because they can help reconstruct this geographical information. We used Word2Vec [8] and PyTorch-BigGraph [1] to create graph features from the sequences of each trip. Figure 3 is a scatter plot of the city_id vectors calculated by Word2Vec and compressed by UMAP [7]. When the MTL architecture was used, the same kind of embeddings were also calculated for hotel_country.

Figure 3: The embeddings of city_id calculated by Word2Vec, compressed by UMAP
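A minimal sketch of how such city_id embeddings could be learned with gensim's Word2Vec, treating each utrip_id as a "sentence" of visited city_ids. The file name, column names, and hyperparameter values are assumptions for illustration, not the settings used in the competition.

```python
import pandas as pd
from gensim.models import Word2Vec

train = pd.read_csv("train_set.csv")  # hypothetical file name; column names assumed below

# One "sentence" per trip: the ordered list of visited city_ids.
sentences = (
    train.sort_values("checkin")
    .groupby("utrip_id")["city_id"]
    .apply(lambda cities: [str(c) for c in cities])
    .tolist()
)

# Illustrative hyperparameters; each learned vector becomes a per-city graph feature.
w2v = Word2Vec(sentences, vector_size=64, window=5, min_count=1, sg=1, epochs=10)
city_vector = w2v.wv["12345"]  # embedding of a hypothetical city_id
```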
6 TRAINING PROCESS

6.1 Loss Functions

CrossEntropyLoss is commonly used in multi-class classification tasks. However, the city_id classes to be predicted in this challenge are imbalanced, with a long-tailed distribution. In order to suppress the bias in the loss caused by this class imbalance, we adopted FocalLoss [10] as an alternative loss function.

6.2 Validation Strategy

We chose stratified K-folds so that the distribution of trip lengths is equal in each fold. This is because the trip lengths differ, as shown in Figure 4, which may affect the quality of the prediction. The number of folds was set to five. We trained for 15 epochs in each fold and used the model with the best validation score for prediction.

Figure 4: The distribution of the length of trips

6.3 Data Augmentation

We believe that each sequence of trips can be flipped. When applying this data augmentation, booker_country must be removed in order to keep the dataset consistent.

In this challenge, we can also use information from the evaluation dataset. The final city of each trip in the evaluation dataset is masked for evaluation, so we created an additional dataset from the evaluation dataset by regarding the second city_id from the final reservation as the target of each trip and added it to the training data.

7 EXPERIMENTS

7.1 Settings

Table 1 shows the settings of each model. Each model is a combination of the proposed architectures, features, and training methods. The MTL architecture was used in LSTM 2 and 3, the graph features and the loss function were different in LSTM 4, and the data augmentation techniques were applied in LSTM 3 and 4.

Table 1: Settings of Four LSTM Models

Model  | MTL   | Graph            | Loss             | Augmentation
LSTM 1 | false | Word2Vec         | CrossEntropyLoss | false
LSTM 2 | true  | Word2Vec         | CrossEntropyLoss | false
LSTM 3 | true  | Word2Vec         | CrossEntropyLoss | flip dataset
LSTM 4 | false | PyTorch-BigGraph | FocalLoss        | test dataset

In LSTM 3, booker_country was removed from the categorical features as described in subsection 6.3.

7.2 Results for Each Model

Table 2 shows the results of our experiments for each model. The validation scores were calculated by averaging the predictions over all folds. Our proposed models outperformed the baseline. The comparison among LSTM 1 to 3 shows that MTL worked in our experiments, whereas data expansion by flipping did not. LSTM 4 gave the best single-model result.

7.3 Weighted Averaging

Weighted averaging was used for the ensembling. We chose it because of the limited number of submissions and the high computational cost of other ensembling methods. Table 2 shows that the diversity of the models leads to higher scores. It is interesting to note that even though LSTM 4 performed well on its own, the best result was obtained when its ratio was reduced to 0.4.

Table 2: Validation Scores of Four LSTM Models and Weighted Averaging

Model                | Validation score | Weights
Baseline             | 0.4629           | -
LSTM 1               | 0.4927           | -
LSTM 2               | 0.4937           | -
LSTM 3               | 0.4862           | -
LSTM 4               | 0.5043           | -
Weighted averaging 1 | 0.5156           | (0.2, 0.15, 0.05, 0.7)
Weighted averaging 2 | 0.5149           | (0.15, 0.10, 0.05, 0.7)
Weighted averaging 3 | 0.5160           | (0.25, 0.20, 0.15, 0.4)

The four numbers in Weights are the ratios of LSTM 1 to 4, respectively.
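A minimal sketch of the weighted averaging step, assuming each model produces a per-trip probability matrix over candidate city_ids; the function name, array names, and shapes are illustrative.

```python
import numpy as np

def weighted_average_top4(probs, weights):
    """Blend per-model probability matrices of shape (n_trips, n_cities) and return top-4 city indices per trip."""
    blended = sum(w * p for w, p in zip(weights, probs))
    return np.argsort(-blended, axis=1)[:, :4]

# Example with the best-performing weights from Table 2 (ratios for LSTM 1 to 4):
# top4_indices = weighted_average_top4([p1, p2, p3, p4], [0.25, 0.20, 0.15, 0.4])
```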
8 CONCLUSION

This paper described our approach to the Booking.com WSDM WebTour 2021 Challenge. We trained four types of LSTM models, and weighted averaging of their predictions increased the accuracy. The models are diverse in terms of feature engineering, multi-task learning, and data augmentation. In the end, we took 6th place on the final leaderboard of the challenge.

ACKNOWLEDGMENTS

We thank the organizers of the Booking.com WSDM WebTour 2021 Challenge for the opportunity to participate in this interesting challenge.

REFERENCES

[1] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA.
[2] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[3] Michael Crawshaw. 2020. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv preprint arXiv:2009.09796v1 (2020).
[4] Dmitri Goldenberg, Kostia Kofman, Pavel Levin, Sarai Mizrachi, Maayan Kafry, and Guy Nadav. 2021. Booking.com WSDM WebTour 2021 Challenge. https://www.bookingchallenge.com/. In ACM WSDM Workshop on Web Tourism (WSDM WebTour'21).
[5] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[6] Pavel Levin. 2018. Modeling Multi-Destination Trips with RNNs. Technical Report. DataConf 2018, October 4th, Jerusalem, Israel.
[7] Leland McInnes, John Healy, and James Melville. 2020. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426v3 (2020).
[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781v3 (2013).
[9] Sarai Mizrachi and Pavel Levin. 2019. Combining Context Features in Sequence-Aware Recommender Systems. ACM RecSys 2019 Late-breaking Results, 16th–20th September 2019, Copenhagen, Denmark (Sept. 2019).
[10] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).