A Simple Deep Personalized Recommendation System

Pavlos Mitsoulis-Ntompos* (pntompos@expediagroup.com), Meisam Hejazinia* (mnia@expediagroup.com), Serena Zhang* (shuazhang@expediagroup.com), Travis Brady (tbrady@expediagroup.com)
Vrbo, part of Expedia Group

*Equal contribution to this research.

RecTour 2019, September 19th, 2019, Copenhagen, Denmark. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Recommender systems are critical tools for matching listings and travelers in two-sided vacation rental marketplaces. Such systems require high capacity to extract user preferences for items from implicit signals at scale. To learn those preferences, we propose a Simple Deep Personalized Recommendation System that computes travelers' conditional embeddings. Our method combines listing embeddings in a supervised structure to build short-term historical context and personalize recommendations for travelers. Deployed in the production environment, this approach is computationally efficient and scalable, and allows us to capture non-linear dependencies. Our offline evaluation indicates that traveler embeddings created using a Deep Average Network can improve the precision of a downstream conversion prediction model by seven percent, outperforming more complex benchmark methods for online shopping experience personalization.

KEYWORDS
travel, recommender system, deep learning, embeddings, e-commerce

1 INTRODUCTION
Personalizing recommender systems is the cornerstone of two-sided marketplace platforms in the vacation rental sector. Such a system needs to be scalable enough to serve millions of travelers and listings. On one side, travelers show complex non-linear behavior. For example, during a shopping cycle travelers might collect and weight different signals based on their heterogeneous preferences across various days, searching either sequentially or simultaneously. Furthermore, travelers might forget and revisit items in their consideration set [5, 7]. On the other side, marketplace platforms should match each traveler with the most personalized listing out of millions of heterogeneous listings. Many of these listings have never been viewed by any traveler or have only recently been onboarded, imposing data sparsity issues. In addition, the context of each trip might differ for travelers within and across seasons and destinations (e.g. a winter trip to the mountains with friends, a summer trip to the beach with family). Moreover, such a personalized recommender system should always be available and trained on the most relevant data, allowing quick test-and-learn iterations that adapt to ever-changing business requirements. This personalized recommender system should suggest a handful of relevant listings to the millions of travelers visiting site pages (e.g. home page, landing page, or listing detail page), travelers receiving targeted marketing emails, or travelers facing cancelled bookings for various reasons.

To develop such a recommender system, we need to extract travelers' preferences from the implicit signals of their interactions using machine learning or statistical-economics models. Given the complexity and scale of this problem, we require high capacity models. While powerful, high-capacity models frequently require prohibitive amounts of computing power and memory, particularly for big data problems. Many approaches have been proposed to learn item embeddings for recommender systems [3, 4, 14, 21], yet learning travelers' preferences from those listing embeddings at scale is still an open problem. Indeed, such a solution needs to capture traveler heterogeneity while remaining generic and robust to cold start problems. We propose a modular solution that learns listing and traveler embeddings non-linearly using a combination of shallow and deep networks. We used down-funnel booking signals, in addition to implicit signals (such as listing page views), to validate our extracted traveler embeddings.
We deployed this system in the production environment. We compared our model with three benchmark models and found that adding these traveler features to the extant feature set of the already-existing Traveler Booking Intent model adds significant marginal value. Our findings suggest that this simple approach can outperform LSTM models, which have significantly higher time complexity. In the next sections we review related work, explain our model, review the results, and conclude.

2 RELATED WORKS
Representation learning has been widely explored for large-scale session-based recommender systems (SBRS) [9, 12, 21], among which collaborative filtering and content-based settings are most commonly used to generate user and item representations [9, 14, 18]. Recent works have addressed the cold start and adaptability problems in factorization machine and latent factor based approaches [11, 17, 22]. Other works have employed non-linear functions and neural models to learn the complex relationships and interactions between users and items on e-commerce platforms [12, 22]. In particular, word2vec techniques with shallow neural networks [16] from the Natural Language Processing (NLP) community have inspired authors to generate non-linear entity embeddings [9] using historical contextual information. State-of-the-art methods have used attention neural networks to aggregate representations in order to focus on relevant inputs and select the most important portion of the context [6]. Attention has been found effective in assigning weights to user-item interactions within encoder-decoder and Long Short Term Memory (LSTM) architectures and the collaborative filtering framework, capturing both long and short term preferences [8, 12, 20]. Similar in spirit to our work, recent studies have suggested simple neural networks, showing promising results in terms of performance, computational efficiency and scalability [2, 10, 26].

3 ARCHITECTURE AND MODEL
In this section we describe our model, which is based on a session-based local embedding model. Our model has two modular stages. In the first stage, we train a skip-gram sequence model to capture a local embedding representation for each listing; we then extrapolate latent embeddings for listings subject to the cold start problem. In the second stage, we train a Deep Average Network (DAN), stacked with encoder and decoder layers, that predicts purchase events in order to capture a given traveler's embedding, i.e. their latent preference over listing embeddings. We also mention a couple of alternatives we evaluated for traveler embeddings. We denote each listing by x_i, so each traveler session s_k(t_j) is defined as a sequence x_1, x_2, ... for traveler t_j. We denote a booking event conditional on listings recently viewed by the traveler with b_k(t_j | x_{j1}, x_{j2}, ..., x_{jt}). Our contribution in this paper is mainly the second stage, which we validate using a downstream shopping funnel signal.

Skip-gram Sequence Model
The skip-gram model [16] in our context attempts to predict the listings x_{i-c}, ..., x_{i+c} surrounding a listing x_i viewed in a traveler session s_k, based on the premise that a traveler viewing listings in the same session signals the similarity of those listings. We use a shallow neural network with one lower-dimensional hidden layer for this purpose. The training objective is to find a local representation for each listing that captures the manifold of its most similar neighbors. More formally, the objective can be specified as the log probability maximization problem

\frac{1}{S} \sum_{s=1}^{S} \sum_{-c \le j \le c,\, j \ne 0} \log p(x_{i+j} \mid x_i)

where c is the window size representing the listing context. The basic skip-gram formulation defines p(x_{i+j} | x_i) using the softmax function

p(x_{i+j} \mid x_i) = \frac{\exp(\nu_{x_{i+j}}^{T} \nu_{x_i})}{\sum_{x=1}^{X} \exp(\nu_{x}^{T} \nu_{x_i})}

where \nu_x and \nu_{x_i} are the input and output representation vectors (i.e. the neural network weights), and X is the number of listings available on our platform. To simplify the task, we used the sigmoid formulation, which turns the model into a binary classifier with negative samples drawn randomly from the list of all available listings on our platform. Formally, we use p(x_{i+j} \mid x_i) = \frac{\exp(\nu_{x_{i+j}}^{T} \nu_{x_i})}{1 + \exp(\nu_{x_{i+j}}^{T} \nu_{x_i})} for positive samples, and p(x_{i+j} \mid x_i) = \frac{1}{1 + \exp(\nu_{x_{i+j}}^{T} \nu_{x_i})} for negative ones.
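To make the first-stage training concrete, here is a minimal sketch of skip-gram training with negative sampling over listing-view sessions, in plain numpy rather than the paper's TensorFlow pipeline. The toy sessions, dimensions, and hyperparameters are illustrative assumptions, not the production setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sessions: each is the sequence of listing ids one traveler viewed.
sessions = [[0, 1, 2, 3], [2, 3, 4], [0, 4, 1]]
NUM_LISTINGS = 5   # X, the number of listings on the platform
EMBED_DIM = 8      # width of the single hidden layer
WINDOW = 2         # c, the context window size
NEG_SAMPLES = 3    # negatives drawn per positive pair
LR = 0.05

# Input and output representations (the nu vectors in the text).
v_in = rng.normal(scale=0.1, size=(NUM_LISTINGS, EMBED_DIM))
v_out = rng.normal(scale=0.1, size=(NUM_LISTINGS, EMBED_DIM))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):
    for session in sessions:
        for i, target in enumerate(session):
            lo, hi = max(0, i - WINDOW), min(len(session), i + WINDOW + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # One positive (target, context) pair plus uniform random
                # negatives (a toy version; production sampling would exclude
                # the positive context from the negatives).
                pairs = [(session[j], 1.0)]
                pairs += [(int(n), 0.0)
                          for n in rng.integers(0, NUM_LISTINGS, NEG_SAMPLES)]
                for listing, label in pairs:
                    score = sigmoid(v_in[target] @ v_out[listing])
                    err = score - label          # gradient of the BCE loss
                    g_in = err * v_out[listing]
                    g_out = err * v_in[target]
                    v_in[target] -= LR * g_in
                    v_out[listing] -= LR * g_out

# Rows of v_in serve as the learned listing embeddings.
print(v_in.shape)  # (5, 8)
```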
We have two more issues to address: sparsity and heterogeneity in views per item. It is not uncommon to observe a long tail distribution of views over listings. For this purpose we leverage the approach of [16], wherein especially frequent items are downsampled using the inverse square root of their frequency. Additionally, we removed listings with very low frequency. To resolve the cold start issue, we leverage the contextual information that relates destinations (or search terms) to listings based on booking information. Formally, considering that the destinations d_1, d_2, ..., d_D drive proportions p_{id_1}, ..., p_{id_D} of the demand for a given listing i, we form the expectation of the latent representation of each destination as \nu_d = \frac{1}{N} \sum_{l=1}^{L} p_{ld} \nu_{x_l}, where N is the normalizing factor and L is the total number of listings. Then, given the latitude and longitude of a cold listing (for which we have no interaction data), we form a belief about the proportion of demand driven by each of the search terms, p_{jd_1}, ..., p_{jd_D}. Finally, we use the destination embeddings from the previous step to find the expected embedding of the cold listing as \nu_{x_j} = \sum_{d=1}^{D} p_{jd} \nu_d.
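A small sketch of how this cold-start fallback could be computed, assuming trained listing embeddings and booking-derived demand proportions are already available. The array names and shapes, and the choice of N as the per-destination total weight, are our assumptions for illustration.

```python
import numpy as np

def destination_embeddings(v_listings, p_ld):
    # nu_d = (1/N) * sum_l p_ld * nu_xl: one embedding per destination,
    # with N taken here as the total demand weight per destination.
    weighted = p_ld.T @ v_listings               # (D, dim)
    norm = p_ld.sum(axis=0, keepdims=True).T     # (D, 1), the N factor
    return weighted / np.clip(norm, 1e-9, None)

def cold_listing_embedding(v_dest, p_jd):
    # nu_xj = sum_d p_jd * nu_d: demand-weighted expectation for the
    # cold listing, given its believed destination demand shares.
    return p_jd @ v_dest

# Toy usage: 4 warm listings with 8-dim embeddings, 3 destinations.
rng = np.random.default_rng(1)
v_listings = rng.normal(size=(4, 8))
p_ld = rng.dirichlet(np.ones(3), size=4)   # each row sums to 1
v_dest = destination_embeddings(v_listings, p_ld)

p_jd = np.array([0.7, 0.2, 0.1])           # belief for one cold listing
v_cold = cold_listing_embedding(v_dest, p_jd)
print(v_cold.shape)  # (8,)
```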
Deep Average Network and Alternatives
In the second stage, given the listing embeddings from the previous stage, we model traveler embeddings using a sandwiched encoder-decoder of non-linear ReLU layers. In contrast to the relatively weak implicit view signals, in this stage we leverage strong booking signals as the target variable, based on historical traveler-listing interactions. We have various choices for this purpose, including a Deep Average Network with an auto-encoder-decoder, a Long Short Term Memory (LSTM) network, and attention networks. The simplest approach is to take the point-wise average of the embedding vectors and use it directly in the model. The second approach is to feed the average embedding into a dimensionality expansion and reduction non-linear encoder-decoder architecture, i.e. a Deep Average Network, to extract the signals [10]. The third approach incorporates an LSTM network [13, 19], testing the hypothesis that travelers accumulate information by looking at different listings in the shopping funnel. The fourth approach adds an attention layer on top of the LSTM [25], hypothesizing that travelers allocate different weights to various latent features before booking.

We take a probabilistic approach to model traveler booking events P(Y_j) based on the embedding vectors \nu_{j1}, ..., \nu_{jt} of the historical units the traveler has interacted with. Formally, given the traveler embedding (the last layer f(\nu_{j\cdot}) of the traveler booking prediction network), the probability of booking is defined as

P(Y_j \mid \nu_{j1}, \nu_{j2}, \ldots, \nu_{jt}) = \mathrm{sigmoid}(f(\nu_{j\cdot}))    (1)

where the Deep Average Network layers and f are defined as

f(\nu_{j\cdot}) = \mathrm{relu}(\omega_1 \cdot h_2(\nu_{j\cdot}) + \beta_1)    (2)
h_2(\nu_{j\cdot}) = \mathrm{relu}(\omega_2 \cdot h_1(\nu_{j\cdot}) + \beta_2)    (3)
h_1(\nu_{j\cdot}) = \mathrm{relu}\Big(\omega_3 \cdot \frac{1}{t} \sum_{i=1}^{t} \nu_{ji} + \beta_3\Big)    (4)

(a runnable sketch of this architecture appears at the end of this section). Alternatively, we can use an LSTM network with forget, input, and output gates:

f(\nu_{jt}) = \mathrm{sigmoid}(\omega_f [h_t, \nu_{jt}] + \beta_f) \cdot f(\nu_{j,t-1}) + \mathrm{sigmoid}(\omega_i [h_t, \nu_{jt}] + \beta_i) \cdot \tanh(\omega_c [h_{t-1}, \nu_{jt}] + \beta_c)    (5)

And finally, we can use an attention network on top of the LSTM:

f(\nu_j) = \mathrm{softmax}(\omega_T \cdot h_T) \tanh(h_T)    (6)

where the \omega and \beta terms are weight and bias parameters to estimate, and h_t represents the hidden state at step t.

Among these models, DAN is the most consistent with Occam's razor principle: it is more parsimonious and faster to train. However, the LSTM and the attention network on top of it are more theoretically appealing. From a pragmatic standpoint, for millions of listings and travelers, DAN is the most appealing for deployment, as depicted in Figure 1.

[Figure 1: Deep Average Network (DAN) on top of the skip-gram network.]

We use an adaptive stochastic gradient descent method to train these networks under a binary cross entropy loss. The last question to answer is how we plan to combine the traveler and listing embeddings for personalized recommendations. This is a particularly challenging task, as the traveler embedding is a non-linear projection of listing embeddings with a different dimension; as a result, the two are not in the same space, and cosine similarity cannot be computed directly. We have various choices here, including approaches such as factorization machines and SVMs with kernels that allow modeling higher level interactions at scale. We defer this to our next study.
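To make Eqs. (1)-(4) concrete, the following is a minimal Keras sketch of a DAN of this shape: average the session's listing embeddings, pass them through stacked ReLU layers, and predict the booking event through a sigmoid. The layer widths, the zero-padding scheme, and the variable names are illustrative assumptions, not the production configuration.

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 32   # dimension of the first-stage listing embeddings

inputs = tf.keras.Input(shape=(None, EMBED_DIM))           # variable-length session
avg = tf.keras.layers.GlobalAveragePooling1D()(inputs)     # (1/t) * sum_i nu_ji
h1 = tf.keras.layers.Dense(64, activation="relu")(avg)     # expansion (encoder)
h2 = tf.keras.layers.Dense(16, activation="relu")(h1)      # reduction (decoder)
out = tf.keras.layers.Dense(1, activation="sigmoid")(h2)   # P(Y_j | nu_j1..nu_jt)

model = tf.keras.Model(inputs, out)
# "adam" stands in for the adaptive SGD method, on binary cross entropy.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy batch: 4 travelers, sessions padded to 10 listing views each.
x = np.random.normal(size=(4, 10, EMBED_DIM)).astype("float32")
y = np.array([0.0, 1.0, 0.0, 1.0], dtype="float32")
model.fit(x, y, epochs=1, verbose=0)

# The penultimate activations (h2 here) can be exported as the traveler
# embedding and concatenated to downstream features.
traveler_embed = tf.keras.Model(inputs, h2)
print(traveler_embed(x).shape)   # (4, 16)
```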
4 EXPERIMENTS AND RESULTS
In this section we describe the experimental setup and the results obtained when comparing the accuracy uplift of our Deep Average Network based approach against various baselines on a downstream conversion prediction model. The Traveler Booking Intent model is such a downstream model: a gradient boosted tree model trained using LightGBM [15] that uses a rich set of hand-crafted historical session-based product interaction features to predict the booking intent probability.¹ To evaluate our proposed methodology offline, we concatenated the hand-crafted features with the traveler embeddings generated by each of the different model settings.

¹We call it booking intent because our model predicts booking requests from travelers, which need a couple more steps to be confirmed as bookings.

The three baseline methods that we compare against our proposed Deep Average Network on top of the skip-gram include the following:

(1) Random: a heuristic rule that chooses a random listing embedding among those listings a traveler has previously interacted with in the current session.
(2) Averaging Embeddings: a simple point-wise average of the listing embeddings a traveler has previously interacted with in the current session.
(3) LSTM with Attention: a recurrent neural network, inspired by [13, 19, 23], that uses LSTM units and an attention mechanism on top of them to combine the embeddings of listings a user has previously interacted with in the current session.

Datasets
For the experiments, anonymized clickstream data was collected for millions of users from two different seven-day periods. Specifically, the clickstream data includes user views and clicks on listing detail pages, search requests, responses, views and clicks, homepage views, landing page logs, and conversion events, per visitor and session. The first clickstream dataset was used to generate embeddings using the Deep Average Network and the LSTM with Attention. The second clickstream dataset was used to evaluate the learned embeddings on the Traveler Booking Intent model. We split each dataset into train and test sets in a 70:30 proportion, randomly, based on users: users in the train set are excluded from the test set, and vice versa (a minimal sketch of this split appears at the end of this section).

Results
We ran our training pipeline on both CPU and GPU production systems using TensorFlow [1]. We cleaned the data using Apache Spark [24], and the input to the training pipeline had observations from millions of traveler sessions. Training the LSTM models typically took 3 full days, while training the DAN took less than 8 hours on CPU. Given that our recommender system needs fast iteration for improvement and real-time inference with high coverage, the DAN model scales better. Moreover, we modified the cost function to give more weight to the minority class (i.e. positive booking intent) to combat the imbalanced classes in the datasets.

We evaluated the performance of the Traveler Booking Intent model under the different settings on the test dataset using AUC, Precision, Recall and F1 scores. The best results of each model are shown in Table 1. It shows that our proposed Deep Average Network approach contributes the most uplift to the downstream Traveler Booking Intent model.

Table 1: Comparison between Model Settings

Algorithm              AUC    Precision  Recall  F-Score
Random                 0.973  0.821      0.633   0.715
Averaging Embeddings   0.971  0.816      0.628   0.710
LSTM + Attention       0.976  0.877      0.620   0.727
DAN                    0.978  0.888      0.628   0.735

Moreover, Table 2 shows the performance improvement to the Traveler Booking Intent (TBI) model when the traveler embeddings generated by the Deep Average Network are concatenated to the initial hand-crafted features.

Table 2: Performance Uplift to TBI Model

Settings                  AUC    Precision  Recall  F-Score
Only Hand-Crafted Feat.   0.975  0.817      0.651   0.724
Hand-Crafted + DAN Feat.  0.978  0.888      0.628   0.735

We noticed that the Deep Average Network traveler embeddings have predictive power competitive with the hand-crafted features in the downstream TBI model. Based on randomly re-sampling the dataset and re-running the pipeline, we find that our results are reproducible.
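As referenced above, here is a minimal sketch of the user-disjoint 70:30 split: users are assigned wholly to train or test, so no user's sessions appear in both. The column names and toy data frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def split_by_user(df, user_col="user_id", train_frac=0.7, seed=0):
    # Shuffle the unique users, then assign the first 70% (and all of
    # their rows) to train and the remaining 30% to test.
    users = df[user_col].unique()
    rng = np.random.default_rng(seed)
    rng.shuffle(users)
    n_train = int(len(users) * train_frac)
    train_users = set(users[:n_train])
    mask = df[user_col].isin(train_users)
    return df[mask], df[~mask]

sessions = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10],
    "listing_id": [11, 12, 13, 11, 14, 15, 16, 11, 17, 18, 19, 20],
})
train_df, test_df = split_by_user(sessions)
# No user may appear on both sides of the split.
assert not set(train_df.user_id) & set(test_df.user_id)
```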
5 CONCLUSION
We presented a method that combines deep and shallow neural networks to learn traveler and listing embeddings for a large online two-sided vacation rental marketplace platform. We deployed this system in the production environment. Our results show that Deep Average Networks can outperform more complex neural networks in this context. There are various avenues to extend our study. First, we plan to test an attention network without the LSTM. Second, we plan to infuse other contextual information into our model. Third, we want to build a scoring layer that combines traveler and listing embeddings to personalize recommendations. Finally, we plan to evaluate numerous spatio-temporal features, representation learning approaches, and bidirectional recurrent neural networks in our framework.

6 ACKNOWLEDGMENTS
This project is a collaborative effort between the recommendation, marketing data science and growth marketing teams. The authors would like to thank Ali Miraftab, Ravi Divvela, Chandri Krishnan and Wenjun Ke for their contributions to this paper.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/
[2] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. (2016).
[3] Veronika Bogina and Tsvi Kuflik. 2017. Incorporating Dwell Time in Session-Based Recommendations with Recurrent Neural Networks. In RecTemp@RecSys. 57–59.
[4] Hugo Caselles-Dupré, Florian Lesaint, and Jimena Royo-Letelier. 2018. Word2vec applied to recommendation: Hyperparameters matter. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 352–356.
[5] Hector Chade, Jan Eeckhout, and Lones Smith. 2017. Sorting through search and matching models in economics. Journal of Economic Literature 55, 2 (2017), 493–544.
[6] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. 2019. An Attentive Survey of Attention Models. arXiv preprint arXiv:1904.02874 (2019).
[7] Babur De los Santos, Ali Hortaçsu, and Matthijs R Wildenbeest. 2012. Testing models of consumer search using data on web browsing and purchasing behavior. American Economic Review 102, 6 (2012), 2955–80.
[8] Simen Eide and Ning Zhou. 2018. Deep neural network marketplace recommenders in online experiments. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 387–391.
[9] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1809–1818.
[10] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1. 1681–1691.
[11] Christopher C Johnson. 2014. Logistic matrix factorization for implicit feedback data. Advances in Neural Information Processing Systems 27 (2014).
[12] Thom Lake, Sinead A Williamson, Alexander T Hawk, Christopher C Johnson, and Benjamin P Wing. 2019. Large-scale Collaborative Filtering with Product Embeddings. arXiv preprint arXiv:1901.04321 (2019).
[13] Tobias Lang and Matthias Rettenmeier. 2017. Understanding consumer behavior with recurrent neural networks. In Workshop on Machine Learning Methods for Recommender Systems.
[14] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M Blei. 2016. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 59–66.
[15] Microsoft. 2019. LightGBM. https://lightgbm.readthedocs.io
[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[17] Andriy Mnih and Ruslan R Salakhutdinov. 2008. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems. 1257–1264.
[18] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. ACM, 111–112.
[19] Humphrey Sheil, Omer Rana, and Ronan G. Reilly. 2018. Predicting Purchasing Intent: Automatic Feature Learning using Recurrent Neural Networks. CoRR abs/1807.08207 (2018).
[20] Chu Wang, Lei Tang, Shujun Bian, Da Zhang, Zuohua Zhang, and Yongning Wu. 2019. Reference Product Search. arXiv preprint arXiv:1904.05985 (2019).
[21] Shoujin Wang, Longbing Cao, and Yan Wang. 2019. A Survey on Session-based Recommender Systems. arXiv preprint arXiv:1902.04864 (2019).
[22] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2018. Session-based Recommendation with Graph Neural Networks. arXiv preprint arXiv:1811.00855 (2018).
[23] Yuan Xia, Jingbo Zhou, Jingjia Cao, Yanyan Li, Fei Gao, Kun Liu, Haishan Wu, and Hui Xiong. 2019. Intent-Aware Audience Targeting for Ride-Hailing Service. In Machine Learning and Knowledge Discovery in Databases, Ulf Brefeld, Edward Curry, Elizabeth Daly, Brian MacNamee, Alice Marascu, Fabio Pinelli, Michele Berlingerio, and Neil Hurley (Eds.). Springer International Publishing, Cham, 136–151.
[24] Matei Zaharia, Reynold Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59 (2016), 56–65.
[25] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2. 207–212.
[26] Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1079–1088.