Hotel2vec: Learning Hotel Embeddings from User Click Sessions with Side Information

IOANNIS PARTALAS, Expedia Group, Switzerland
ANNE MORVAN, Expedia Group, Switzerland
ALI SADEGHIAN∗, University of Florida, USA
SHERVIN MINAEE∗, Snap, Inc., USA
XINXIN LI∗, Expedia Group, USA
BROOKE COWAN∗, Apex Clearing Corporation, USA
DAISY ZHE WANG, University of Florida, USA

∗Work completed while the author was at Expedia Group.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

We propose a new neural network architecture for learning vector representations of items with attributes, specifically hotels. Unlike previous works, which typically rely only on modeling user-item interactions for learning item embeddings, we propose a framework that combines several sources of data: user clicks, hotel attributes (e.g., property type, star rating, average user rating), amenity information (e.g., whether the hotel has free Wi-Fi or free breakfast), and geographic information that leverages a hexagonal geospatial system as well as spatial encoders. During model training, a joint embedding is learned from all of the above information. We show that including structured attributes about hotels enables us to make better predictions in a downstream task than when we rely exclusively on click data. We train our embedding model on more than 60 million user click sessions from a leading online travel platform and learn embeddings for more than one million hotels. Our final learned embeddings integrate distinct sub-embeddings for user clicks, hotel attributes, and geographic information, providing a representation that can be used flexibly depending on the application. An important advantage of the proposed neural model is that it addresses the cold-start problem for hotels with insufficient historical click information by incorporating additional hotel attributes, which are available for all hotels. We show through the results of an online A/B test that our model generates high-quality representations that boost the performance of a hotel recommendation system on a large online travel platform.

CCS Concepts: • Computing methodologies → Ranking; Learning latent representations; Learning from implicit feedback; Neural networks.

Additional Key Words and Phrases: neural networks, embeddings

1 INTRODUCTION
Learning semantic representations of different kinds of entities, whether textual, commercial, or physical, has been an active area of research in recent years. Such representations can facilitate applications that rely on a notion of similarity, for example recommendation systems and ranking algorithms in e-commerce [2, 5, 6, 9, 18, 22, 39]. In natural language processing, word2vec [28] learns vector representations of words from large quantities of text, where each word is mapped to a d-dimensional vector such that semantically similar words have geometrically closer vectors. This is achieved by predicting either the context words appearing in a window around a given target word (the skip-gram model), or the target word given the context (the CBOW model). The main assumption is that words appearing frequently in similar contexts share statistical properties (the distributional hypothesis). Crucially, word2vec models, like many other word embedding models, preserve the sequential information encoded in text so as to leverage word co-occurrence statistics. The skip-gram model has been adapted to other domains in order to learn dense representations
of items other than words. For example, product embeddings in e-commerce [12] or vacation rental embeddings in the hospitality domain [11] can be learned by treating purchase histories or user click sequences as sentences and applying a word2vec approach.

Most prior work on item embeddings exploits the co-occurrence of items in a sequence as the main signal for learning the representation. One disadvantage of this approach is that it fails to incorporate the rich structured information associated with the embedded items. For example, in the travel domain, where we seek to embed hotels and other travel-related entities, it could be helpful to encode explicit information such as user ratings, star ratings, hotel amenities, and location, in addition to the implicit information encoded in the click-stream.

In this work, we propose an algorithm for learning hotel embeddings that combines sequential user click information in a skip-gram approach with additional structured information about hotels. We propose a neural architecture that adopts and extends the skip-gram model to accommodate arbitrary relevant information about the embedded items, including but not limited to geographic information, ratings, and item attributes. Our experimental results show that a neural network that jointly encodes click and supplemental structured information outperforms a skip-gram model that encodes the click information alone. The proposed architecture also naturally handles the cold-start problem for hotels with little or no click history: we can infer an embedding for these properties by leveraging their supplemental structured metadata.

Compared to previous work on item embeddings, the novel contributions of this paper are as follows:
(1) We propose a novel yet straightforward framework for fusing multiple sources of information about an item (such as user click sequences and item-specific information) to learn item embeddings via self-supervised learning.
(2) We generate an embedding that consists of three sub-embeddings for clicks, geography, and amenity attributes, which can be employed either as separate component embeddings or as a single, unified embedding.
(3) We address the cold-start problem by including hotel metadata, which is independent of user click-stream interactions and available for all hotels. This helps us better impute embeddings for sparse items/hotels.
(4) We show significant gains over previous work based on click embeddings in several experimental studies.

The remainder of this paper is structured as follows. Section 2 gives an overview of recent work on neural embeddings. Section 3 provides details of the proposed framework, including the neural network architecture, the training methodology, and how the cold-start problem is addressed. In Section 4, we present experimental results on several different tasks and a comparison with previous state-of-the-art work. Section 5 presents online A/B tests for ranking hotels on a search results page, obtained by including these embeddings as features in the search ranking model. Section 6 concludes the paper.

2 RELATED WORK
2.1 Embeddings from user sequences for different application domains
Recommendation is an inherently challenging task that requires learning user interests and behavior. There has been a significant body of research on advancing it using various frameworks [3, 13, 25, 30, 42].
Learning a semantic representation/embedding of the items being recommended is a critical piece of most of these frameworks. Building recommender systems for hotels is an especially hard task due to challenges such as balancing popular hotels against newly added ones (which lack clicks) and the very large candidate space.

Neural network models have been widely used for learning embeddings from user sessions [37, 38]. One prominent use case is learning product embeddings for e-commerce. In [4, 12], the authors develop an approach based on the skip-gram model [28], frequently used in natural language processing. They leverage users' purchase histories obtained from their e-mail receipts to learn a dense representation of products. Each user's complete purchase history is represented as a sequence, which is treated as a sentence in which the items are considered words. For music recommendation, the authors of [16] learn ground-truth track representations with a word2vec continuous bag-of-words model and represent a music session (a collection of tracks) as the average of the track embeddings it contains. A session-level user embedding is then learned via an LSTM that maximizes the cosine similarity between the predicted session-level user embedding and the observed ground-truth music session representation.

In the online travel space, the authors of [11] use the skip-gram framework to learn embeddings for vacation rental properties. They extend the ideas in [12] to take into account a user's click-stream data during a session. A key contribution of their method is a modification of the skip-gram model that always includes the booked hotels in the context of each target token, so that special attention is paid to bookings. They also improve negative sampling by sampling from the same market, which leads to better within-market listing similarities. Nevertheless, their model relies exclusively on large amounts of historical user engagement data, which is a major drawback when such data are sparse.

2.2 Link to graph approaches
The skip-gram loss can also be viewed as a graph-based loss, which paves the way to graph embedding approaches that share the same similarity assumption. Graph embedding methods aim to learn, in an unsupervised manner, node embeddings such that pairs of edge-linked nodes are more similar to each other than pairs of nodes without an edge between them. In our setting, the graph is constructed from co-clicks: nodes are the items, and an edge designates a co-click between two items. In this area, graphSAGE [15] was the first approach to operate in the inductive setting: by learning aggregation functions, it predicts the embedding of a new node from its features and neighborhood, without retraining. Two recent methods, PyTorch-BigGraph (PBG) [24] and Cleora [33], offer more scalability by partitioning the graph. PBG proposes a margin-based ranking objective between positive and negative pairs of nodes. Cleora relies exclusively on the graph structure and does not use a contrastive learning objective, hence it does not require sampling positive or negative examples. Instead, Cleora obtains node embeddings by iteratively aggregating each node's neighbor embeddings, followed by an L2-normalization. Cleora prevents the embeddings from collapsing through a careful initialization and the normalization step.
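As a rough illustration, the Cleora-style iteration can be sketched in a few lines of NumPy (a schematic sketch under our own simplifying assumptions, such as a dense co-click adjacency matrix and a fixed iteration count; this is not the authors' implementation):

```python
import numpy as np

def cleora_like_embeddings(adj, dim=32, n_iter=3, seed=0):
    """Schematic Cleora-style embedding: repeatedly average each node's
    neighbor embeddings, then L2-normalize every row."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    emb = rng.normal(size=(n, dim))                 # careful init matters in practice
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(n_iter):
        emb = (adj @ emb) / deg                     # aggregate neighbor embeddings
        emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12  # row-wise L2 norm
    return emb

# adj[i, j] = 1 if hotels i and j were co-clicked in some session.
```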
2.3 Merging all side information to capture the "context" of user activity
In another relevant work [7], the authors propose a framework for YouTube video recommendation that fuses multiple features (e.g., video watches, search tokens, geo embeddings) into a unified representation via a neural architecture. They then use these embeddings for candidate generation and ranking. The main limitation of this work is that the individual embeddings are learned separately and then combined via a neural network to perform classification. There are also several works that use attention mechanisms to capture the "context" of users' activities based on their recent actions, including a contextual self-attention network for sequential recommendation [19], multi-pointer co-attention networks [35], a multi-order attentive ranking model [40], a neural attentive interpretable recommendation system [41], and self-attentive sequential recommendation [21, 29].

Similar to our work on hotel2vec, some works attempt to include explicit item attributes (e.g., size, artist, model, color) within the sequence prediction framework using various strategies. In [36], the item metadata is injected into the model as side information to regularize the item embeddings. Their approach uses only one feature (singer id) in the experiments; in addition, it does not accommodate learning independent embedding vectors for each attribute group. Most recently, [34] proposed a method that trains separate encoders for text data, click-stream session data, and product image data, and then unifies these embeddings with a simple weighted average, whose weights are learned via grid search on the downstream task. While their approach allows for exploring independent embedding vectors, the sub-embeddings of the different attribute groups are learned independently rather than jointly.

In addition to efforts extending the skip-gram framework, emerging research attempts to extend GloVe [31] by incorporating various attributes. The authors of [20] incorporate attribute information into GloVe by modifying the loss function such that the representation of a location can be learned by combining both textual and structured data.

3 THE PROPOSED FRAMEWORK
Similar to word2vec [28], by treating the clicks made by users within an interactive web session as words, and sequences of clicks as sentences, we seek to predict the context hotels (words) given a target hotel (word) in the session (sentence). At a high level, this is the approach proposed in [11, 12]; we refer to it as a session-only model. As mentioned earlier, one drawback of this approach is that it uses no information apart from the click data, making it very challenging to make predictions for unseen hotels or hotels with sparse click data. In addition, the model may be forced to learn semantic features that capture aspects of user interest, hotel geographic information, hotel attributes, and so on as latent variables, as opposed to leveraging them as explicitly provided input features. To address these shortcomings, we propose adding more explicit information about the hotel as model input. Intuitively, this should make the model more efficient during training as well as provide information that it can use when making predictions on unseen or sparse hotels.
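For reference, the session-only baseline can be sketched in a few lines with gensim's skip-gram implementation (an illustrative sketch with toy sessions and hyperparameters; it omits the booking-aware context and market-level negatives of [11] and is not the configuration used in our experiments):

```python
from gensim.models import Word2Vec

# Each click session is a "sentence"; each hotel id is a "word".
sessions = [
    ["hotel_12", "hotel_7", "hotel_90"],
    ["hotel_7", "hotel_33", "hotel_12", "hotel_5"],
]

model = Word2Vec(
    sentences=sessions,
    vector_size=32,  # embedding dimension
    window=2,        # context window around the target hotel
    sg=1,            # skip-gram: predict context hotels from the target
    negative=10,     # negative samples per positive pair
    min_count=1,
)
print(model.wv.most_similar("hotel_12", topn=3))
```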
Another major advantage of our model is its use of different projection layers for the various hotel/item attributes. This enables us to learn independent embedding vectors representing different facets of the property, in addition to an enriched, unified embedding for each hotel. The model also provides a dynamic framework for updating the embedding of a hotel once its user rating or other attribute information changes over time. This is not trivial in session-only models, which must be retrained on click data collected after the attribute change.

In the remainder of the paper, we refer to our proposed hotel2vec model as the enriched model, in contrast to the session-only model introduced above.

3.1 Neural Network Architecture for the enriched hotel2vec
Figure 1 illustrates the proposed architecture of the enriched hotel2vec model. Each aspect of the hotel, namely 1) the one-hot encoding of the property ids clicked within the same session, 2) the associated amenities, and 3) the geographical information, is embedded separately; these representations are later concatenated and further compressed before being used for context prediction.

Fig. 1. The three-block diagram of the enriched hotel2vec model with a single encoding layer.

Formally, a click session is defined as a sequence of hotels (items) {h_1, h_2, ..., h_n} clicked on by a user during a defined window of time or visit. For a given hotel, we denote the click, amenity, geographic, and enriched embedding vectors by V_c, V_a, V_g, and V_e, respectively. These are defined as follows:

$$V_c = f(I_c; W_c), \quad V_a = f(I_a; W_a), \quad V_g = f(I_g; W_g), \quad V_e = \mathrm{ReLU}([V_c, V_a, V_g]^\top W_e) \qquad (1)$$

where I_c is the one-hot encoding of hotels in the click session, and I_g ∈ R^g is a continuous vector with geographical information about the hotel, whose contents we explain below. f(x; W) is a normalized projection layer parameterized by trainable weights W, i.e., $f(x; W) = \mathrm{ReLU}\left(\frac{xW}{\|xW\|_2}\right)$, where ReLU is the rectified linear unit activation function [10].

3.1.1 Geographical features. As geographical features we use a spatial encoder [26] (referred to below as space2vec) as well as the H3 hierarchical geospatial system (https://eng.uber.com/h3/). H3 maps the world into hexagons, which can be defined at different resolutions. The space2vec and H3 embeddings are concatenated to form I_g. To obtain the space2vec embedding, we encode a spatial point x as a concatenation of multi-scale representations $[PE^0_k(x); \ldots; PE^s_k(x); \ldots; PE^{S-1}_k(x)]$ (PE for Point Encoder), where S is the total number of grid scales and each component of x (in our case the latitude and longitude of the hotel) is used separately:

$$\forall s \in \{0, \ldots, S-1\}: \quad PE^s_k(x) = [PE^{s,1}_k(x); PE^{s,2}_k(x)]$$
$$\forall l \in \{1, 2\}: \quad PE^{s,l}_k(x) = \left[\cos\left(\frac{x[l]}{\lambda_{min} \cdot k^{s/(S-1)}}\right); \sin\left(\frac{x[l]}{\lambda_{min} \cdot k^{s/(S-1)}}\right)\right]$$

where λ_min and λ_max are the minimum and maximum grid scales and k = λ_max/λ_min = 1000/10. The multi-scale representation is encoded with a fully connected layer in the hotel2vec model. As the activation function we tried both ReLU and linear, and found that linear works best.
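A minimal NumPy sketch of the point encoder defined above (schematic; the trainable fully connected layer that follows it is omitted, and the scale values mirror those reported here):

```python
import numpy as np

def point_encode(x, S=16, lam_min=10.0, lam_max=1000.0):
    """Multi-scale sinusoidal encoding of a point x = (lat, lon).
    Returns the concatenation [PE^0_k(x); ...; PE^{S-1}_k(x)]."""
    k = lam_max / lam_min
    feats = []
    for s in range(S):
        scale = lam_min * k ** (s / (S - 1))        # grid scale at level s
        for coord in x:                             # each coordinate separately
            feats += [np.cos(coord / scale), np.sin(coord / scale)]
    return np.array(feats)                          # shape: (S * 2 * 2,)

pe = point_encode((40.7, -74.0))  # a hotel's latitude/longitude
# A fully connected layer (linear activation worked best for us) then maps
# this 64-dimensional vector (for S = 16) to the space2vec part of V_g.
```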
To obtain the H3 embedding, we use the index at resolution 8, which we find reasonable for our use case. From the latitude and longitude of a hotel we obtain the unique id (for a given resolution) of the H3 hexagon containing it, which we embed in the model.

3.1.2 Amenity features. Amenity features (e.g., PetsAllowed, GuestRating, SpaServices) can be categorical or numerical, with possible missing values. Thus, I_a ∈ R^58 is partitioned per feature: for a numerical feature we simply assign one element of I_a the value of that feature, and for a categorical feature with m categories we assign m elements of I_a, setting the element of the corresponding category to 1 and the others to 0. If a feature is missing, we set all of its elements to 0.

3.1.3 Loss function. We train our model by optimizing the Noise Contrastive Estimation (NCE) loss [14]. More formally, given h_t as the target, we estimate the log-probability of h_c being a context hotel as

$$\log P(h_c \mid h_t) = \log \sigma(V_{e_t}^\top W_{c,:}) \qquad (2)$$

where σ is the sigmoid function, V_{e_t} is the enriched embedding of h_t, and W_{c,:} is the c-th row of the output projection weights W_NCE. The I_{t_1}, ..., I_{t_ω} vectors in Figure 1 represent the one-hot encodings of the other hotels in the training window around I_c (the window covers 2 hotels before and 2 after the target). We find the parameters of the model by maximizing the probability of correct predictions. We train the model using backpropagation, minimizing the following loss function (L2-regularization is also applied):

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \Big[ \log P(h_{c_t} \mid h_t) + \sum_{h_i \in \mathcal{N}_c} \log \sigma(-V_{e_t}^\top W_{h_i,:}) \Big] \qquad (3)$$

where θ includes all the parameters of the model, T is the batch size, W_{i,:} is the i-th row of W_NCE, $\mathcal{N}_c = \{h_i \mid 1 \le i \le N, h_i \sim P_n(h_c)\}$ is the set of negative examples, and P_n(h_c) is the distribution from which we draw negative samples, discussed in Section 3.2. Finally, we minimize Eq. 3 using mini-batch stochastic gradient descent.

3.2 Negative Sampling
It is well known [14, 28] that negative sampling, a version of noise contrastive estimation, significantly decreases the time required to train a classifier with a large number of possible classes. In recommendation, there is typically a large inventory of items available to recommend to the user, so we train our skip-gram model using negative sampling. However, users frequently search exclusively within a particular subdomain. For example, in hotel search, a customer looking to stay in Miami will focus on that market and rarely search across different markets. This motivates a more targeted strategy when selecting negative samples: we select negative samples from within the market to which the target and context hotels belong. Throughout this paper, a market is defined as a set of hotels in the same geographic region; note that there may be multiple markets in the same city or other geographical region.
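To make Eq. 3 and the market-based sampling concrete, here is a minimal NumPy sketch for a single (target, context) pair (the uniform within-market sampling and the helper names are illustrative assumptions, not our production implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_pair_loss(v_e_t, W_nce, context_id, market_hotel_ids, n_neg=2000, rng=None):
    """One (target, context) term of Eq. 3: positive score for the observed
    context hotel plus scores of negatives drawn from the same market."""
    rng = rng or np.random.default_rng()
    candidates = [h for h in market_hotel_ids if h != context_id]
    negatives = rng.choice(candidates, size=n_neg, replace=True)
    pos_term = np.log(sigmoid(v_e_t @ W_nce[context_id]))           # log P(h_c | h_t)
    neg_term = np.log(sigmoid(-(W_nce[negatives] @ v_e_t))).sum()   # negative samples
    return -(pos_term + neg_term)
```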
3.3 Cold Start Problem
In practice, many hotels/items appear infrequently or never in historical data. Recommender systems typically have difficulty handling these items effectively due to the lack of relevant training data. Apart from the obvious negative impacts on searchability and sales, neglecting these items can introduce a Matthew effect ("the rich get richer and the poor get poorer"): the less these items are recommended, or the more they are recommended in inappropriate circumstances, the more the data reinforces their apparent lack of popularity.

Dealing with such hotels/items and choosing appropriate weights for them is referred to as the "cold start problem". One of the main advantages of the enriched hotel2vec model over session-only approaches is its ability to better handle cold-start cases. Although an item might lack sufficient prior user engagement, other attributes are often available. For example, in our use case, thousands of new properties are added to the lodging platform's inventory each quarter. While we have no prior user engagement data from which to learn a click embedding V_c for them, we do have other attributes such as geographical location, star rating, and amenities. hotel2vec can take advantage of this supplemental information to provide a better cold-start embedding. For newly listed hotels with no click-session information, one can simply initialize V_c at random, and hotel2vec computes V_e from the randomly initialized V_c and the other hotel attributes, which are known even for recently listed hotels. In Section 4.4, we compare against the session-only model [11] when setting V_c to the average of other hotels' embeddings in the same market, and show a 70% gain in Hits@10 for cold-start hotels.

4 EXPERIMENTAL RESULTS
In this section, we present several experiments to evaluate the performance of the trained hotel2vec embeddings. We refer the reader to Section 5 for the results of an online A/B test. Before diving into the details of the experiments, we first describe the dataset and model parameters.

4.1 Experimental Framework
4.1.1 Real-world large-scale dataset from Expedia Group, a leading Online Travel Agency (OTA). Our dataset, collected in 2019 (pre-Covid period), contains more than 65 million user click sessions covering more than 1.4 million unique hotels. A click session is defined as a span of clicks performed by a user with no gap of more than 7 days for the same destination and search parameters. The data are summarized in Table 1. We randomly split the sessions into training, validation, and test sets with a ratio of 8:1:1.

Table 1. Dataset statistics.
  Number of user click sessions       65M
  Number of unique hotels             1.4M
  Avg. # of clicks per user session   6
  Min. # of clicks per user session   2
  Max. # of clicks per user session   50

4.1.2 Experiment configuration. We use a system with 64GB RAM, 8 CPU cores, and a Tesla V100 GPU. We use Python 3 as the programming language and the TensorFlow [1] library for the neural network architecture and gradient calculations. All weight matrices are initialized with a he_normal [17] initializer, and click embedding vectors are initialized uniformly at random. As mentioned previously, L2-regularization is applied to the weights and added to the loss function.

4.1.3 Downstream tasks. In the following sections, we evaluate our trained embeddings on several experimental tasks. We start with quantitative results on the next-item prediction task based on the model's output probabilities (Section 4.2.1) and on cosine similarity (Section 4.2.2), then present some qualitative results. We also evaluate each model's performance on the cold-start problem and provide insights on the effect of some of the hyper-parameters.
4.1.4 Comparison against state-of-the-art embedding baselines. For the next-item prediction task based on cosine similarity (Section 4.2.2), we compare results against the following embedding methods:
• the state-of-the-art session-only model proposed in [11]. As explained in Section 2, this model learns only from historical user click sessions, without direct use of item attributes;
• our hotel2vec model, which combines hotel/item attributes with the click-session data;
• a Matrix Factorization (MF) approach [23], where we factorize the matrix of co-occurrences of the clicked hotels. Specifically, we factorize the log of the co-occurrence matrix;
• Cleora [33]; and
• graphSAGE [15], both presented in Section 2. For graphSAGE, we used the implementation from Stellargraph (https://github.com/stellargraph/stellargraph).

For all experiments, we tune the hyperparameters of all models on the validation set. In particular, for the state-of-the-art session-only baseline [11], we search for a learning rate in {0.01, 0.1, 0.5, 1.0, 2.5} and embedding dimensions in {32, 128}. To train the model weights, we use stochastic gradient descent (SGD) with exponential decay (decay rate 0.99, applied in staircase fashion every 40k training steps), since it performed better than other optimizers in our case, and a batch size of 1024. We found that a learning rate of 0.5 and an embedding dimension of 32 worked best; hence, throughout the remainder of the paper, all embeddings have dimension 32. For hotel2vec, an initial learning rate of 0.05 worked best; for the dimensions of the embedding vectors, we found that V_c, V_e ∈ R^32, V_a ∈ R^15, and V_g ∈ R^36 worked best. For the multi-scale parameter of the space2vec module, we tuned S to 16 and the output dimension to 28, with linear activation. For the H3 layer, we set the embedding size to 8. These two representations are concatenated to form the geo-embedding part of the model. For both the session-only [11] and hotel2vec models, the number of negative samples is 2000.

For the MF approach, we constructed the co-click matrix using a skip window of size 2 and factorized it with the Alternating Least Squares algorithm, with the maximum number of iterations tuned to 10 and the regularization parameter to 0.02. Note that since the factorization yields two matrices, one for the target hotel and one for the context hotel, we obtained the best results by averaging the two corresponding vectors for each hotel to form its final representation. For a fair comparison with graphSAGE, we sample 200 neighbors from the direct neighborhood and 10 from the 2-hop neighborhood. Stellargraph's implementation of graphSAGE samples as many negative as positive neighbors. Following the authors' recommendations, we use no dropout, L2-regularization, and the ADAM optimizer. The binary cross-entropy loss, with a sigmoid activation, is used for the link-prediction task through which the embeddings are learned.

4.2 Quantitative Analysis: Next-item prediction task
A robust metric for evaluating a set of hotel embeddings (or, more generally, any set of items displayed to a user in response to an information need) is its ability to predict a user's next click/selection. In this section, we compare models based on the Hits@k and MRR@k metrics in various settings:
• Hits@k measures the average number of times the correct selection (i.e., the hotels clicked by the user in a session) appears among the top k predicted hotels (i.e., the hotels with the highest predicted probabilities, conditioned on the current hotel).
• MRR@k (Mean Reciprocal Rank) evaluates the average quality of the list of k items returned by the model (ordered by predicted probability) by looking at the rank of the first correctly predicted item.
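For clarity, the two metrics can be computed per session as in the following sketch (hypothetical helper names; ranked is the model's ranked list of candidate hotel ids and clicked is the set of hotels the user actually clicked):

```python
def hits_at_k(ranked, clicked, k):
    """Fraction of the clicked hotels that appear in the top-k predictions."""
    top_k = set(ranked[:k])
    return sum(h in top_k for h in clicked) / len(clicked)

def mrr_at_k(ranked, clicked, k):
    """Reciprocal rank of the first correctly predicted hotel within the top k."""
    for rank, h in enumerate(ranked[:k], start=1):
        if h in clicked:
            return 1.0 / rank
    return 0.0

# Both quantities are averaged over all evaluation sessions to obtain the
# Hits@k and MRR@k values reported in the tables below.
```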
4.2.1 Next-item prediction task based on the model's output probabilities. We consider two main scenarios:
• Raw evaluation: given the current hotel clicked by the user, we predict the next clicked hotel among all approximately 1.4M hotels.
• Filtered evaluation: identical, except that we limit the candidates to hotels within the same market.

For the filtered scenario, three simple baselines are also included, in which we rank the properties according to their average guest rating and according to their last-week and last-year popularity, i.e., the raw number of bookings the property received over the last seven days and the last twelve months, respectively.

Table 2 shows Hits@k and MRR@k for k ∈ {10, 100} for hotel2vec and the Session-32 [11] approach. We also present results for hotel2vec with the geographical features dropped. hotel2vec outperforms the session-only model by a large margin, demonstrating the utility of including item attributes when learning embeddings; removing the geographical part causes a drop in all metrics for hotel2vec. We also compare both models in the filtered scenario. This is a more realistic case, because limiting hotels to the same market reduces the effect of other information the recommender system can use to provide more relevant suggestions to the user. Table 2b shows the prediction results in the filtered scenario: the proposed hotel2vec model outperforms the "highest rated" and "most popular" baselines by a large margin.

Table 2. Prediction results of the most likely hotel the user will click next among all possible hotels.

(a) Results for the raw evaluation (no restriction on the candidates).
  Methods            Hits@10   Hits@100   MRR@10   MRR@100
  Session-32 [11]    0.1565    0.5352     0.0512   0.0689
  hotel2vec-no-geo   0.1763    0.5585     0.0587   0.0728
  hotel2vec          0.1807    0.5671     0.0604   0.0746

(b) Results for the filtered evaluation, where hotel candidates are restricted to the same market as the current hotel.
  Methods                    Hits@10   Hits@100   MRR@10   MRR@100
  Highest Rated              0.0158    0.0102     0.0029   0.0032
  Most Popular (last year)   0.0739    0.1789     0.0187   0.0230
  Most Popular (last week)   0.0928    0.2397     0.0233   0.0292
  Session-32 [11]            0.1583    0.5562     0.0605   0.0752
  hotel2vec                  0.1998    0.5987     0.0675   0.0825

As Table 2 demonstrates, the hotel2vec model significantly outperforms the baseline session model from [11] in both scenarios. This shows the effectiveness of hotel2vec in incorporating both click sessions and item/hotel attributes for better recommendations.
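A schematic sketch of how candidates can be ranked in the two scenarios from the model's outputs (the market_of lookup and the brute-force masking are illustrative assumptions):

```python
import numpy as np

def rank_candidates(v_e_t, W_nce, target_id, market_of=None, k=100):
    """Score every hotel as in Eq. 2 and return the top-k candidate ids.
    With market_of given (filtered evaluation), candidates are restricted
    to the target's market; otherwise all ~1.4M hotels are ranked (raw)."""
    scores = W_nce @ v_e_t  # sigmoid is monotone, so raw scores rank identically
    if market_of is not None:
        mask = np.array([market_of[h] != market_of[target_id]
                         for h in range(len(scores))])
        scores = np.where(mask, -np.inf, scores)
    return np.argsort(-scores)[:k]
```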
4.2.2 Next-item prediction task based on cosine similarity. In this section, rather than using the model's output probabilities to induce a ranking over hotels, we measure Hits@k and MRR@k over the ranking induced by the cosine similarity of the embedding vectors. This is useful in scenarios where directly using the model's probabilities is not feasible. In particular, it is easier to compare the different baselines based solely on the embeddings, which is why we include more competing baselines here. Table 3 shows the results for the various embeddings: Session-32 [11] and hotel2vec, as well as Cleora [33], MF [23], and graphSAGE [15]. The simple baselines "Highest Rated", "Most Popular (last year)", and "Most Popular (last week)", which are not embedding approaches, are omitted since they already performed poorly in the previous experiment. For conciseness, we focus only on the raw evaluation scenario. hotel2vec embeddings achieve the highest performance. We believe the poor performance of graphSAGE is due to its scalability issues: increasing the number of sampled neighbors would probably improve the metrics, but training time would be prohibitive (about 150 hours of training were already needed to obtain these results).

Table 3. Results of predicting the next click among all possible hotels using the cosine similarity of the vectors (raw evaluation scenario).
  Vector used in cosine similarity   Hits@10   Hits@100   MRR@10   MRR@100
  graphSAGE [15]                     0.0040    0.0212     0.0009   0.0014
  MF [23]                            0.0964    0.4689     0.0297   0.0417
  Cleora [33]                        0.1360    0.4160     0.0350   0.0449
  Session-32 [11]                    0.141     0.491      0.0389   0.0512
  hotel2vec                          0.1676    0.5341     0.0413   0.0546

We also see from Table 3 that using cosine similarity instead of the full network does not result in a large decrease in performance.

4.3 Qualitative Analysis
The learned hotel embeddings can be used to recommend similar hotels in various situations. In this section, we show how these embeddings are helpful, with real examples of hotels from our dataset.

4.3.1 Visualization of embedding clusters. To further illuminate the nature of the embeddings learned by the hotel2vec model, we examine a low-dimensional projection (UMAP [27]) of hotel embeddings in the Miami market (Figs. 2a and 2b). The colors signify the grouping of hotels into various competing subcategories (i.e., similar hotels), manually annotated by a human domain expert. The hotel2vec model is significantly better at clustering similar hotels than the session-only model [11].

(a) hotel2vec embeddings. (b) Session-32 [11] embeddings.
Fig. 2. Low-dimensional visualization (UMAP [27]) of hotel embeddings from the Miami area. Different colors represent expert annotations of competing hotels. Our model has successfully captured most of the similarities.

4.3.2 Finding the top-k most similar hotels. A common scenario is finding hotels similar to a target hotel in other destinations. For example, when the user searches for a specific hotel name (e.g., Hotel Beacon, NY), we would like to recommend a few similar hotels. The learned embeddings can be used to find the top-k hotels most similar to a given one: given a target hotel h, we compute the cosine similarity of every other hotel's embedding with h's and pick the most similar hotels. Rigorous evaluation of this system requires A/B testing; here we show a few examples comparing our hotel2vec embeddings with the session-only [11] embeddings (Fig. 3) to provide some intuition for the behavior of the two models.
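A minimal sketch of this retrieval step over a matrix of learned embeddings (illustrative; the rows of emb are assumed to be the enriched hotel vectors V_e):

```python
import numpy as np

def top_k_similar(h, emb, k=5):
    """Indices of the k hotels whose embeddings have the highest cosine
    similarity with hotel h (excluding h itself)."""
    normed = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    sims = normed @ normed[h]
    sims[h] = -np.inf  # never recommend the query hotel itself
    return np.argsort(-sims)[:k]
```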
Fig. 3. Example of recommendations based on the cosine similarity of hotel2vec embedding vectors. Ranking by the Session-32 [11] model placed the 3rd hotel before the 1st (order 3, 1, 2), even though it is a hostel, is cheaper, and has a lower user rating than the target hotel.

4.4 Addressing the Cold Start Problem
Here we analyze how well the model learns embeddings for hotels with no presence in the training data. To demonstrate the effectiveness of our model, we compare hotel2vec's Hits@k with that of the session-only model [11] for target hotels that were absent during training, in the same-market setting (filtered evaluation). For hotel2vec, cold start concerns the V_c embedding associated with co-clicked properties within the same session. We use a simple heuristic for cold-start imputation and compare the results with the hotel2vec model for cold-start hotels. To impute vectors for cold-start hotels, we borrow the idea in [11] and use price, star rating, geodesic distance, type of property (e.g., hotel, vacation rental, etc.), size in terms of number of rooms, and geographic market information. For each imputed property, we collect the top 100 most similar properties in the same market based on the above features, considering only properties that fall within a radius of 5 km of the target hotel and for which we have an existing V_c (resp. Session-32) embedding. We then average these embeddings to obtain the V_c (resp. final) embedding of the hotel to impute. Results are shown in Table 4: the proposed enriched embedding of the hotel2vec model significantly outperforms the session-based embeddings for cold-start hotels.

Table 4. Cold-start experiments: same-market prediction results when the target hotel is an unseen hotel; click embeddings are imputed by averaging the top-100 similar hotel embeddings in the market.
  Imputed           Hits@10   Hits@100   MRR@10   MRR@100
  Cleora [33]       0.0131    0.0067     0.0019   0.0022
  MF [23]           0.0196    0.0380     0.0053   0.0061
  Session-32 [11]   0.0296    0.0632     0.0079   0.0093
  hotel2vec         0.0513    0.1248     0.0132   0.0162
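The imputation heuristic can be sketched as follows (the property interface, the attr_sim similarity function, and the helper names are assumptions for illustration, not our production code):

```python
import numpy as np

def impute_click_embedding(target, properties, attr_sim, V_c,
                           radius_km=5.0, top_n=100):
    """Average the click embeddings of the top-n attribute-similar
    properties in the target's market within radius_km of it."""
    pool = [p for p in properties
            if p.market == target.market
            and p.distance_km(target) <= radius_km
            and p.hotel_id in V_c]               # must already have an embedding
    pool.sort(key=lambda p: attr_sim(target, p), reverse=True)
    top = pool[:top_n]
    return np.mean([V_c[p.hotel_id] for p in top], axis=0)
```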
4.5 Training Convergence Analysis
In this section, we first look at the learning curves for both the Session-32 [11] and hotel2vec models. Then, we analyze the effect of N (the number of negative samples) and of the learning rate, alongside the optimization algorithm, on the performance of our model.

4.5.1 Learning curves. Fig. 4 shows the overall training progress of both the Session-32 [11] and hotel2vec models with their respective best hyperparameters optimized for Hits@100. Our model achieves similar performance with less data.

Fig. 4. Training progress of the Session-32 [11] and hotel2vec models with their respective best hyperparameters, optimized for negative sampling loss on the validation set and for Hits@100.

4.5.2 Number of negative samples. An interesting phenomenon is the effect of increasing the number of negative samples on training time and accuracy. Although it takes more time to create a large number of negative samples, as Fig. 5a shows, using more negative samples results in faster training.

4.5.3 Learning rate and optimization techniques. We report empirical experiments with various optimization algorithms and learning rates, summarized in Fig. 5b. Surprisingly, SGD with exponential learning-rate decay outperforms most optimizers with sophisticated learning-rate adaptations. We believe this is due to large variance and overfitting in the early stages of training. Similar issues have been observed in other tasks [8, 32], suggesting the need for tricks such as warm-up heuristics when using momentum-based optimization algorithms to learn embeddings on large, diverse datasets such as ours.

Fig. 5. Effect of negative sampling and optimization methods (validation Hits@100 vs. training time in hours). (a) Effect of negative sampling on prediction, for N ∈ {10, 100, 200, 500}: a higher number of negative samples results in faster training times. (b) Various optimization algorithms and learning rates (SGD, RMSProp, Adam at several learning rates): sophisticated momentum methods seem to overfit to the early batches too quickly.

5 ONLINE A/B TESTS
We performed online tests on search ranking in order to evaluate the embeddings. Specifically, we use the embeddings as input features in the ranking model. The model implements a neural architecture and takes as input search features (destination, dates, number of travelers, etc.) and property features (price, geographical information, ratings, etc.). The model is trained in a learning-to-rank setting with a pairwise loss, and the main offline metric is NDCG.

We performed two tests in order to validate the effectiveness of the embeddings. In the first test, we compared the ranking model with hotel embeddings learned with the MF approach (described in Section 4) against the ranker trained without hotel embeddings; this was our initial version of hotel embeddings. In the second test, we compared the ranker trained with hotel2vec embeddings against the ranker that leverages the MF embeddings. Note that these tests were run sequentially, because the two approaches were designed one after the other.

Tables 5a and 5b present the results of the tests in terms of the main metrics, conversion rate (CVR) and gross profit (GP). The first test had no effect on CVR while significantly improving GP (disclosure of specific numbers was not permitted by the Legal Department). Our analysis showed that this was because the embeddings favor better-quality hotels, which are slightly more expensive. Table 5b shows the results of the second online test in terms of CVR and GP uplift. In this case we observed a positive uplift in CVR without hurting GP. hotel2vec captures better similarities and consequently helps the ranker propose properties with higher utility for the user. Also, the fact that we could impute embeddings for hotels not seen during hotel2vec training had a positive effect, as we observed an uplift in the main metrics for new and less popular hotels.
After the tests were completed, since the ranking model is linear, we also examined feature importances on logged searches in order to understand the difference in behavior. We found that the hotel2vec features were 4 times more important than those of the MF approach. This result reinforced the conclusion that hotel2vec captures better similarities among hotels, which the ranker then leverages to propose higher-utility hotels to the user.

Table 5. Results of the online tests.
(a) MF embeddings approach.
  Metric   CVR       GP
  Uplift   Neutral   Positive
(b) hotel2vec approach.
  Metric   CVR        GP
  Uplift   Positive   Neutral

6 CONCLUSION
In this work, we propose a framework to learn a semantic representation of hotels by jointly embedding hotel click data, geographic information, user ratings, and attributes (such as star rating, whether the hotel offers free breakfast, whether pets are allowed, etc.). Our neural network architecture extends the skip-gram model to accommodate multiple features, encoding each one separately. We then fuse the sub-embeddings to predict hotels in the same session. Through experimental results, we show that enriching the neural network with supplemental, structured hotel information results in superior embeddings compared to a model that relies solely on click information. Our final embedding is composed of multiple stacked sub-embeddings, each encoding the representation of a different hotel aspect, resulting in a modular representation. It is also adaptive, in the sense that if one of the attributes or the user rating of a hotel changes, we can feed the updated data to the model and easily obtain a new embedding. Although we focus mainly on learning embeddings for hotels, the same framework can be applied to general item embedding, such as product embedding at Amazon, eBay, Netflix, or Spotify.

ACKNOWLEDGMENTS
The authors would like to thank Ion Lesan, Peter Barszczewski, Daniele Donghi, and Ankur Aggrawal for helping us collect hotels' attribute, click, and geographical data. We would also like to thank Dan Friedman and Thomas Mulc for providing useful comments and feedback.

REFERENCES
[1] Martin Abadi, Ashish Agarwal, Paul Barham, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
[2] Jens Adamczak, Gerard-Paul Leyson, Peter Knees, Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Julia Neidhardt, Wolfgang Wörndl, and Philipp Monreal. 2019. Session-Based Hotel Recommendations: Challenges and Future Directions. arXiv preprint arXiv:1908.00071 (2019).
[3] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, and Raghu Ramakrishnan. 2013. Content recommendation on web portals. Commun. ACM 56, 6 (2013), 92–101.
[4] Oren Barkan and Noam Koenigstein. 2016. Item2vec: neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
[5] Mostafa Bayomi, Annalina Caputo, Matthew Nicholson, Anirban Chakraborty, and Sèamus Lawless. 2019. CoRE: a cold-start resistant and extensible recommender system. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. 1679–1682.
[6] Hugo Caselles-Dupré, Florian Lesaint, and Jimena Royo-Letelier. 2018. Word2vec Applied to Recommendation: Hyperparameters Matter. In Proceedings of the 12th ACM Conference on Recommender Systems (Vancouver, British Columbia, Canada) (RecSys '18). Association for Computing Machinery, New York, NY, USA, 352–356. https://doi.org/10.1145/3240323.3240377
[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys '16). ACM, New York, NY, USA, 191–198. https://doi.org/10.1145/2959100.2959190
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9] Kallirroi Dogani, Matteo Tomassetti, Sofie De Cnudde, Saúl Vargas, and Ben Chamberlain. 2019. Learning Embeddings for Product Size Recommendations. (2019).
[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 315–323.
[11] Mihajlo Grbovic and Haibin Cheng. 2018. Real-Time Personalization Using Embeddings for Search Ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD '18). Association for Computing Machinery, New York, NY, USA, 311–320. https://doi.org/10.1145/3219819.3219885
[12] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1809–1818.
[13] Rachid Guerraoui, Erwan Le Merrer, Rhicheek Patra, and Jean-Ronan Vigouroux. 2017. Sequences, items and latent links: Recommendation with consumed item packs. arXiv preprint arXiv:1711.06100 (2017).
[14] Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297–304.
[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf
[16] Casper Hansen, Christian Hansen, Lucas Maystre, Rishabh Mehrotra, Brian Brost, Federico Tomasi, and Mounia Lalmas. 2020. Contextual and Sequential User Embeddings for Large-Scale Music Recommendation. In Fourteenth ACM Conference on Recommender Systems (Virtual Event, Brazil) (RecSys '20). Association for Computing Machinery, New York, NY, USA, 53–62. https://doi.org/10.1145/3383313.3412248
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision. 1026–1034.
[18] Peng Hu, Rong Du, Yao Hu, and Nan Li. 2019. Hybrid item-item recommendation via semi-parametric embedding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI. 10–16.
[19] Xiaowen Huang, Shengsheng Qian, Quan Fang, Jitao Sang, and Changsheng Xu. 2018. CSAN: Contextual self-attention network for user sequential recommendation. In Proceedings of the 26th ACM International Conference on Multimedia. 447–455.
[20] Shelan Jeawak, Christopher Jones, and Steven Schockaert. 2018. Embedding Geographic Locations for Modelling the Natural Environment using Flickr Tags and Structured Data. arXiv preprint arXiv:1810.12091 (2018).
[21] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
[22] Buket Kaya. 2019. Hotel recommendation system by bipartite networks and link prediction. Journal of Information Science (2019), 0165551518824577.
[23] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (Aug. 2009), 30–37. https://doi.org/10.1109/MC.2009.263
[24] Adam Lerer, Ledell Wu, Jiajun Shen, Timothée Lacroix, Luca Wehrstedt, Abhijit Bose, and Alexander Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference.
[25] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces. ACM, 31–40.
[26] Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. 2020. Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells. In International Conference on Learning Representations. https://openreview.net/forum?id=rJljdh4KDH
[27] Leland McInnes, John Healy, and James Melville. 2020. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 [stat.ML]
[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[29] Zhiqiang Pan, Fei Cai, Yanxiang Ling, and Maarten de Rijke. 2020. Rethinking Item Importance in Session-Based Recommendation. Association for Computing Machinery, New York, NY, USA, 1837–1840. https://doi.org/10.1145/3397271.3401274
[30] Michael J. Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The Adaptive Web. Springer, 325–341.
[31] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[32] Martin Popel and Ondřej Bojar. 2018. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics 110, 1 (2018), 43–70.
[33] Barbara Rychalska, Piotr Bąbel, Konrad Gołuchowski, Andrzej Michałowski, and Jacek Dąbrowski. 2021. Cleora: A Simple, Strong and Scalable Graph Embedding Scheme. arXiv:2102.02302 [cs.LG]
[34] Loveperteek Singh, Shreya Singh, Sagar Arora, and Sumit Borar. 2019. One Embedding To Do Them All. arXiv preprint arXiv:1906.12120 (2019).
[35] Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Multi-pointer co-attention networks for recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2309–2318.
[36] Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec: Product embeddings using side-information for recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 225–232.
[37] Shoujin Wang, Liang Hu, Longbing Cao, Xiaoshui Huang, Defu Lian, and Wei Liu. 2018. Attention-based transactional context embedding for next-item recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.
[38] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 346–353.
[39] Da Xu, Chuanwei Ruan, Evren Körpeoglu, Sushant Kumar, and Kannan Achan. 2020. Modeling Complementary Products and Customer Preferences with Context Knowledge for Online Recommendation. In WSDM 2020.
[40] Lu Yu, Chuxu Zhang, Shangsong Liang, and Xiangliang Zhang. 2019. Multi-order attentive ranking model for sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5709–5716.
[41] Shuai Yu, Yongbo Wang, Min Yang, Baocheng Li, Qiang Qu, and Jialie Shen. 2019. NAIRS: A neural attentive interpretable recommendation system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 790–793.
[42] Angelina Ziesemer and J. Oliveira. 2011. How to know what do you want? A survey of recommender systems and the next generation. In Proceedings of the Eighth Brazilian Symposium on Collaborative Systems, SBSC. 104–111.