=Paper=
{{Paper
|id=Vol-3177/paper17
|storemode=property
|title=Inverse Reinforcement Learning and Point of Interest Recommendations
|pdfUrl=https://ceur-ws.org/Vol-3177/paper17.pdf
|volume=Vol-3177
|authors=David Massimo,Francesco Ricci
|dblpUrl=https://dblp.org/rec/conf/iir/Massimo022
}}
==Inverse Reinforcement Learning and Point of Interest Recommendations==
Inverse Reinforcement Learning and Point of Interest Recommendations⋆

Discussion Paper

David Massimo, Francesco Ricci
Free University of Bozen-Bolzano, Italy

Abstract

We focus here on Points of Interest (POI) Recommender Systems (RSs), aimed at helping users visiting a city to discover new and relevant POIs. RSs are often assessed in offline settings, i.e., by measuring the system's precision in predicting previously observed user behaviour. However, when deployed, the recommendations produced by such systems are often of limited use because they lack novelty. We conjecture that this phenomenon is primarily due to the limited capability of RSs to extract, from the observed behaviour, general characteristics of POIs that are relevant for different classes of users (tourist types). We compare an Inverse Reinforcement Learning (IRL) based RS algorithm with more traditional nearest-neighbour and popularity-based ones. Through an offline evaluation, we show that the nearest-neighbour and popularity-based RSs excel in (offline) precision but are perceived as not novel by the users of a live-user study. On the contrary, despite a lower offline precision, the IRL-based RS, which learns the preferences of tourists for POI characteristics, can give better support to a tourist.

Keywords: recommender systems, tourism, user behaviour learning, evaluation

1. Introduction

This paper focuses on Recommender Systems (RSs) in the tourism domain and summarises the results of Massimo and Ricci on Points of Interest (POI) recommendation [1]. In [2], a next-POI-visit recommendation strategy named Q-BASE, which uses Inverse Reinforcement Learning, was introduced. It consists of three steps. Firstly, it identifies different tourist clusters based on observed POI-visit sequences.
Then, by analysing the sequential consumption patterns of the clustered users' POI visits, it learns their preferences for POI features, i.e., the reward a (generic) user belonging to a cluster obtains from conducting certain POI visits. Finally, Q-BASE computes the state-action value function 𝑄, which tells how much total reward the user will gain if she selects a POI to visit and then keeps choosing POIs according to the visit-selection policy that is optimal for her cluster. In an offline evaluation [2], Q-BASE was compared to the sequence-aware recommendation strategy SKNN [3]. Given a set of users' POI-visit sequences and a target user's (partial) POI-visit sequence, SKNN recommends to the target user the next POI visits that are prevalent in the POI-visit sequences of the most similar users. The offline evaluation showed that SKNN offers recommendations that more precisely predict the expected user behaviour, but are not novel. Besides, Q-BASE excelled in suggesting novel POI visits that are also more rewarding than those of SKNN, at the cost of a lower precision. It was conjectured that in an online RS, when Q-BASE is again compared to SKNN, its recommendations will be perceived as relevant because they contain novel POIs with a larger reward.

IIR2022: 12th Italian Information Retrieval Workshop, June 29 - July 1, 2022, Milan, Italy

⋆ This paper presents some of the results published in: Massimo, D., Ricci, F. Popularity, novelty and relevance in point of interest recommendation: an experimental analysis. Inf Technol Tourism 23, 473–508 (2021).

davmassimo@unibz.it (D. Massimo); fricci@unibz.it (F. Ricci)
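The Q-BASE pipeline described above ends with a standard state-action value computation. As a minimal, purely illustrative sketch (the cluster rewards, states, and transition model here are invented toy values, not the ones learned in [2]), tabular value iteration over a "last visited POI" state space looks like this:

```python
import numpy as np

# Toy sketch of the Q-BASE idea: given a per-POI reward (as learned by
# IRL for one tourist cluster), compute the state-action value function
# Q with tabular value iteration. All numbers here are invented.

n_pois = 4
gamma = 0.9  # discount factor (assumed)

# reward[p]: learned reward a cluster member gets from visiting POI p
reward = np.array([0.2, 1.0, 0.5, 0.1])

# transition[s, a] = next state after choosing POI a in state s;
# in this toy model the state is simply the last POI visited
transition = np.tile(np.arange(n_pois), (n_pois, 1))

Q = np.zeros((n_pois, n_pois))
for _ in range(200):  # iterate Q(s,a) = r(a) + gamma * max_a' Q(s',a')
    Q_new = reward[None, :] + gamma * Q[transition].max(axis=2)
    if np.abs(Q_new - Q).max() < 1e-8:
        Q = Q_new
        break
    Q = Q_new

# Recommend, from state 0, the POIs ranked by their Q-value
ranking = np.argsort(-Q[0])
print(ranking)  # the highest-reward POI (index 1) comes first
```

Ranking candidate next POIs by their Q-value, rather than by immediate reward alone, is what lets the policy account for the whole remainder of the visit sequence.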
Besides, it was supposed that SKNN's higher precision is due to its bias towards suggesting popular items. We therefore investigated two research questions. (RQ1) If SKNN achieves higher precision by being biased towards popular items, can Q-BASE be modified, by biasing its recommendations towards more popular items, to achieve a precision similar to that of SKNN? (RQ2) Will online users like the precise recommendations of SKNN more than those generated by Q-BASE, which are more novel and yet relevant?

In order to investigate these questions we also proposed a novel and adaptable RS, called Q-POP PUSH, derived from Q-BASE, which generates recommendations that simultaneously optimise two criteria: the reward of the recommendations (as for Q-BASE) and also their popularity. Q-POP PUSH computes the harmonic mean of the score given by Q-BASE to a POI visit and the popularity of the POI (its visit occurrences in the observed data). To weigh the relative importance of the original Q-BASE score and the popularity-based score, Q-POP PUSH uses a parameter 𝛼. With 𝛼 > 0.5 the popularity of the POI visit has a higher importance, whereas with 𝛼 < 0.5 the Q-BASE component is weighted more. Equal importance is given with 𝛼 = 0.5.

We tested Q-POP PUSH in an offline experiment by comparing its performance with Q-BASE, a popularity baseline and two nearest-neighbour next-item RSs: SKNN and s-SKNN [3, 4]. Finally, we assessed the user perception of the recommendations generated by the best performing (offline) models in a user study. The rest of this paper summarises the offline experiment and the user study, and concludes with a discussion.

2. Experimental analysis

The offline experiment was conducted along the classical Machine Learning evaluation approach, with train and test sets [5, 6, 7]. The train set is used to train the models¹ and the test set is used for recommendation generation and evaluation.
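The 𝛼-weighted harmonic-mean scoring of Q-POP PUSH described above can be sketched as follows. The exact formula is given in [1]; the weighted form below (and the assumption that both scores are normalised to positive values) is an assumed reading, not the authors' definitive implementation:

```python
def qpop_push_score(q_score, popularity, alpha=0.5, eps=1e-9):
    """Hypothetical sketch of Q-POP PUSH scoring: a weighted harmonic
    mean of the Q-BASE score of a POI visit and the POI's popularity
    (share of visit occurrences in the observed data). alpha > 0.5
    weighs popularity more, alpha < 0.5 weighs the Q-BASE score more.
    Both inputs are assumed normalised to (0, 1]."""
    return 1.0 / (alpha / (popularity + eps) + (1.0 - alpha) / (q_score + eps))

# Toy usage: a niche high-reward POI vs. a popular mediocre one
niche = qpop_push_score(q_score=0.9, popularity=0.1, alpha=0.5)
popular = qpop_push_score(q_score=0.3, popularity=0.8, alpha=0.5)
print(niche < popular)   # at alpha = 0.5 the popular POI wins

# With a very small alpha (cf. alpha = 0.009 in Table 1) the
# Q-BASE component dominates and the niche POI is ranked first
print(qpop_push_score(0.9, 0.1, alpha=0.009)
      > qpop_push_score(0.3, 0.8, alpha=0.009))
```

The harmonic mean is a natural choice here because it is pulled towards the smaller of the two scores, so a POI must do reasonably well on both criteria to be ranked highly.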
In particular, 70% of a test POI-visit sequence is used to identify the recommendations and the remaining 30% to assess the recommendation performance. We used a dataset of 1663 POI-visit sequences (over 500 POIs) in the Italian city of Florence, produced by 1163 anonymous users. The metrics used to assess Q-BASE, Q-POP PUSH, SKNN, s-SKNN and the popularity baseline POP are: reward, as in [2]; popularity, intended as a proxy of novelty; precision; and similarity of a recommendation list to the list generated by SKNN. Since SKNN is very precise but also affected by a popularity bias, it is interesting to measure how much Q-BASE and Q-POP PUSH deviate from SKNN. All the metrics range in [0, 1], where values close to 0 (1) indicate a low (high) performance. For more details about procedures and metrics, we refer to [1].

¹ Nearest-neighbour models and POP do not need any training, but use the train data at test time.

In Table 1 we report the metric values for top-5 recommendations, obtained in the offline evaluation.

Table 1: Performance (top-5) of the considered RSs. The best value for each metric is shown in boldface.

{| class="wikitable"
! !! Q-BASE !! Q-POP PUSH (𝛼 = 0.009) !! Q-POP PUSH (𝛼 = 0.1) !! Q-POP PUSH (𝛼 = 0.5) !! Q-POP PUSH (𝛼 = 0.8) !! SKNN !! s-SKNN !! POP
|-
| Reward || '''0.032'''* || 0.020 || -0.001 || -0.009 || -0.009 || -0.010 || -0.010 || -0.015
|-
| Precision || 0.045 || 0.062 || 0.063 || 0.060 || 0.060 || '''0.068''' || 0.063 || 0.050
|-
| Popularity || '''0.319'''* || 0.517 || 0.634 || 0.643 || 0.643 || 0.528 || 0.570 || 0.733
|-
| SimKNN || 0.061 || 0.307 || 0.441 || 0.451 || 0.450 || - || 0.530 || 0.352
|}

* There is a significant difference (𝑝 < 0.05) between the best performing model and SKNN (two-tailed paired t-test). The test was run for the Reward, Precision and Popularity metrics over 5 repeated train-test splits.

SKNN and s-SKNN perform essentially the same on the reward, precision and popularity metrics. Not surprisingly, s-SKNN produces, on average, recommendation lists that overlap up to 53% with the lists of SKNN (SimKNN).
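The 70/30 evaluation protocol described above is straightforward to implement. A minimal sketch, with an invented toy sequence and a placeholder recommendation list (the POI names and the example recommendations are not from the paper's dataset):

```python
# Sketch of the offline protocol: the first 70% of each test POI-visit
# sequence serves as the user profile from which recommendations are
# generated; the remaining 30% is the ground truth for precision.

def split_sequence(seq, profile_ratio=0.7):
    cut = max(1, int(len(seq) * profile_ratio))
    return seq[:cut], seq[cut:]

def precision_at_k(recommended, ground_truth, k=5):
    hits = len(set(recommended[:k]) & set(ground_truth))
    return hits / k

seq = ["duomo", "uffizi", "ponte_vecchio", "boboli", "bargello",
       "accademia", "santa_croce", "pitti", "san_marco", "orsanmichele"]
profile, held_out = split_sequence(seq)
print(len(profile), len(held_out))  # 7 3

# A hypothetical top-5 recommendation list scored against the held-out 30%
recs = ["pitti", "santo_spirito", "san_marco", "fiesole", "badia"]
print(precision_at_k(recs, held_out))  # 2 hits out of 5 -> 0.4
```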
When comparing SKNN to POP, we note that the overlap between their recommendations is quite low; still, POP has a reasonably good precision, considering the simplicity of the approach. Q-BASE suggests much less popular items than SKNN, and with a higher reward.

The analysis of Q-POP PUSH's performance allows us to address research question RQ1. By introducing a popularity bias into Q-BASE, as is done in Q-POP PUSH, the generated recommendations become more similar to those of SKNN. In fact, when 𝛼 is increased, the popularity of the recommendations produced by Q-POP PUSH rises, reaching and surpassing that of SKNN. However, it is interesting to note that with a rather small value of 𝛼 = 0.009, Q-POP PUSH obtains a precision of 0.062, which is very close to the precision of SKNN (0.068) and s-SKNN (0.063). With this setting Q-POP PUSH still has a positive reward (0.020), while SKNN and s-SKNN have negative rewards. This means that a small popularity bias in Q-BASE can be beneficial to improve the system's precision. We must also note that with larger values of 𝛼 the reward of Q-POP PUSH becomes smaller and smaller, approaching that of the two nearest-neighbour methods. Clearly, Q-POP PUSH can be tuned to balance two objectives: precision and reward.

To answer the second research question, RQ2, we designed a live user study aimed at measuring the user's perceived novelty and appreciation of the recommendations generated by Q-BASE, by Q-POP PUSH with parameter 𝛼 = 0.5 (to give equal importance to POI popularity and reward), and by the SKNN baseline. We developed an online system to assess the quality of next-POI recommendations offered to users who have already visited some POIs. The system initially asks the user to declare visited POIs, which are used to create a hypothetical itinerary the user is supposed to have already followed.
Then, the system generates next-POI recommendations with the three tested RS algorithms, combines them in a single list, and asks the user to evaluate them based on a description of the recommended items. The user does not know which algorithm produced each displayed recommendation. The training data of the online system is the same used in the offline study. The user evaluates each POI by marking it with one or more of the following labels: “I already visited it”, “I like it” (for a next visit) and “I didn't know it”. The participants of the online user study were recruited via social media and mailing lists. Out of the 202 users that accessed the application, we identified 158 reliable recommendation sessions.

Table 2: Probability of evaluating a POI recommendation as visited, novel, liked, and both novel and liked.

{| class="wikitable"
! Recommender System !! Visited !! Novel !! Liked !! Liked & Novel
|-
| Q-BASE || 0.165* || 0.517* || 0.361* || 0.091
|-
| Q-POP PUSH || 0.245 || 0.376 || 0.464 || 0.076
|-
| SKNN || 0.238 || 0.371 || 0.466 || 0.082
|}

* indicates a significant difference from the performance of the other two RSs (two-proportion z-test, 𝑝 < 0.05).

Table 2 shows the estimated probability that a user marks a POI recommended by a specific RS as “visited”, “novel”, “liked”, or both “liked” and “novel”. Q-BASE recommends POIs that are less likely to have been already visited by the user and more likely to be novel than those suggested by Q-POP PUSH and SKNN. As in the offline experiment, Q-POP PUSH and SKNN perform similarly. Interestingly, Q-BASE suggests fewer POIs that are liked when compared to the other two strategies. Hence, apparently, an RS that is more precise in an offline test also recommends, online, items that the user will like more. Besides, the obtained results falsify our hypothesis that optimising the reward of a recommendation, as Q-BASE does, will produce recommendations that the user will like more.
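The significance marks in Table 2 come from a standard two-proportion z-test. A self-contained sketch follows; the per-recommender counts are invented for illustration (only the rates echo Table 2, the denominators are assumed):

```python
from math import sqrt, erf

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test: does the rate of a label
    (e.g. "novel") differ between two recommenders?"""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                    # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-tailed
    return z, p_value

# Hypothetical counts: 517 of 1000 Q-BASE recommendations marked novel
# vs. 371 of 1000 for SKNN (rates from Table 2, counts assumed)
z, p = two_proportion_z_test(517, 1000, 371, 1000)
print(round(z, 2), p < 0.05)
```

With such counts the difference in novelty rates would be clearly significant, consistent with the star on Q-BASE's "Novel" column in Table 2.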
However, Q-BASE suggests more novel POIs and, interestingly, more recommendations that are both liked and novel (last column of Table 2). In a subsequent analysis we computed the probability that a user will like a recommendation under three conditions: she knows the POI but has not yet visited it; she has already visited it; the POI is novel (unknown). We derived the following conclusions. Users liked the novel POI-visit suggestions generated by SKNN and Q-POP PUSH more than those produced by Q-BASE. This is due to the tendency of Q-BASE to suggest POI visits that, even if they have the properties typically liked by the user (e.g., they are of the same type as the POIs the user liked), are not popular POIs; hence, these suggested POIs are hard to appreciate. The conclusion we derived from the user study is that users tend to like more the items they are familiar with, e.g., previously visited items or items that are not novel.

3. Discussion

The results of our experiments seem to confirm that RSs that precisely predict user choices (offline) are also liked most by real users. In our case, this means that Q-POP PUSH and SKNN are better RSs than Q-BASE. Our explanation of this result is that both high offline precision and a large probability of liked recommendations (online) are influenced by the popularity of the recommended items. In fact, these popular items are often in the users' test sets, and users are likely to be familiar with them. Despite its lower offline precision and the lower extent of liked recommendations in the user study, Q-BASE is the RS that may better accomplish the true goal of a tourism RS: it suggests more next POIs that are both liked and novel. So, by optimising the reward, Q-BASE is capable of discovering novel items that are also appreciated (when users are able to assess them).

References

[1] D. Massimo, F.
Ricci, Popularity, novelty and relevance in point of interest recommendation: an experimental analysis, J. Inf. Technol. Tour. 23 (2021) 473–508. URL: https://doi.org/10.1007/s40558-021-00214-5. doi:10.1007/s40558-021-00214-5.

[2] D. Massimo, F. Ricci, Harnessing a generalised user behaviour model for next-POI recommendation, in: S. Pera, M. D. Ekstrand, X. Amatriain, J. O'Donovan (Eds.), Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, ACM, 2018, pp. 402–406.

[3] D. Jannach, I. Kamehkhosh, L. Lerche, Leveraging multi-dimensional user models for personalized next-track music recommendation, in: A. Seffah, B. Penzenstadler, C. Alves, X. Peng (Eds.), Proceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Morocco, April 3-7, 2017, ACM, 2017, pp. 1635–1642. URL: https://doi.org/10.1145/3019612.3019756. doi:10.1145/3019612.3019756.

[4] M. Ludewig, D. Jannach, Evaluation of session-based recommendation algorithms, User Model. User-Adapt. Interact. 28 (2018) 331–390.

[5] A. Bellogín, A. Said, Recommender Systems Evaluation, Springer New York, New York, NY, 2018, pp. 2095–2112. URL: https://doi.org/10.1007/978-1-4939-7131-2_110162. doi:10.1007/978-1-4939-7131-2_110162.

[6] J. L. Herlocker, J. A. Konstan, L. G. Terveen, J. T. Riedl, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems (TOIS) 22 (2004) 5–53.

[7] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in: Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 39–46. URL: https://doi.org/10.1145/1864708.1864721. doi:10.1145/1864708.1864721.