Finding the best ranking model for spatial objects Hadi Fanaee Tork1 Abstract.1 Top-k spatial preference queries has a wide range of applications in service recommendation and decision support Table 1 Candidate Selection Criteria systems. In this work we first introduce three state of the art Method algorithms and apply them on a real data set which includes geographic coordinates and quality data of over 355 hotels, 276 Nearest Neighbor min(d (np, f mn )) point of interests and 563 restaurants in Lisbon, Portugal extracted Range Score d ( p, f m ) < R from well-known TripAdvisor2. This is the first time that Influence Score All mentioned algorithms are evaluated on a real data set. We also use some optimization tasks for the estimation of algorithms As we can see, Nearest Neighbor, from each type m retrieves n-th parameters. Finally we rank the hotels using the best obtained n element of that ( f m ) which has the minimum distance with p. ranking model. Result reveals that influence score with a particular Range score retrieves a list of items which have at least distance(d) radius is able to rank spatial objects very near to the real rankings. of pre-defined R with P. Influence score retrieves all the items for further computation. Afterwards, We define Score of point P 1 INTRODUCTION according to the following equation: m There exists an wide range of location-based applications that rely S p = ∑ Agg{wCim × α Cim } (1) on spatial preference queries. For instance, the tourist species a 1 spatial constraint (for instance the range around a hotel) to retrieve the facilities around the hotel. Then, if the eligible facilities are Where, Agg denotes the aggregation function which can be rated, the result of the query might be the top-k hotels which have maximum or sum. w is equal to the weight or quality of item(e.g. the best ranked facilities [3]. Top-k spatial preference query hotel with 5 star can have weight of 5 and hotel with one star can answers such kind of questions. It returns a ranked set of the k best have weight of 1) and i is an index of retrieved candidates. α is data objects based on the non-spatial score (quality) of feature influence function which is equal to 1 for Nearest Neighbor and objects and spatial score (distance) in its spatial neighborhood Range score and is equal to the equation 2 for Influence score. [1,2]. Several approaches have been proposed for ranking spatial data objects based on defining the score of a spatial data object p − ( d p , f mi ) based on the scores of feature objects that have p as their nearest α =2 R (2) neighbor. In the rest of the paper we first introduce a general framework of three algorithms entitled Range Score, Nearest Where d denotes the distance between point P and facility i of neighbor (NN) and Influence Score. Then in section 3 we present category m. and R is a pre-defined radius. the data set used in the paper. In the section 4 we explain our performed experiments. Later in section 5 we express the results. in Then the result of Top-K spatial preference query is a sorted list section 6 we show how we rank hotels of Lisbon based on the best of Sp for all point of interests (P). ranking model obtained and finally in section 7 we discuss the results and bring the conclusion of the paper. 3 DATA SET Data set is extracted from a well-known online tourism information 2 TOP-K SPATIAL FRAMEWORK source TripAdvisor which is the most biggest and richest source for A Spatial preference query, ranks the spatial objects based on travelers around the world to find the relevant information and quality of its neighbor facilities. For instance a tourist might other user feedbacks about hotels, restaurants and point of interests. retrieve a sorted list of hotels based on the facilities around that One of interesting service of TripAdvisor is providing a raking of (e.g. restaurant, hospital , market, etc.). Assume that p is our point all tourism locations. The ranking criteria are not visible to the of interest (e.g. a hotel) and we have m type of facilities(e.g. users but in general is a combination of on users opinions and restaurant means m=1 and park means m=2). Then assume that ratings and other sources. Nowadays many users around the world f mn is n-th facility from type m (e.g. Restaurant A). First we choose their destination, hotels and places to visit based on this retrieve a list of candidates for P according to Table 1. Table 1 ranking. shows how one of the methods choose the primary candidates. We extracted all hotels and all near restaurants and point of interests(POI) corresponding to city of Lisbon, Portugal. All GPS 1 LIAAD-INESC Porto, University of Porto , hadi.fanaee@fe.up.pt 2 http://www.tripadvisor.com 58 coordinates and quality factors were extracted from the Raw crawled HTML pages 1 Spearman Rank Correlation 0.8 We then transferred extracted records to the MySQL databases for Coefficient 0.6 NN further process. Finally we had three tables hotels, restaurants and RNG attractions with 355, 563 and 276 records respectively. 0.4 INFSUM INFMAX0 0.2 INFMAX1 Since for some locations , the GPS coordinates were not available, 0 we employed Google Map API[5] and Yahoo Map API[6] 0 5000 10000 15000 20000 25000 Geocoding service to fetch GPS coordinates. Then we removed the -0.2 R places which their coordinate was not available after the Geocoding step. We also removed those hotels which for them Figure 2. Spearman's rank correlation for different R from 100m to ranking was not available in TripAdvisor. 20000m for 4 rankers NN (nearest neighbor), range score(RNG), influence score with sum module(INFSUM), influence score with max module with considering the rating of attrac-tions(INFMAX0), influence score with max module and not considering the rating of attractions(INFMAX1) On of the important problem regarding the Influence Score approach is determination of R. In order to estimate the best R we generated 5 ranking set for R from 100 to 19900 meter by granularity of 100 meter. For both RNG and NN we used maximum aggregation while for INF we tested both maximum(INFMAX0 and INFMAX1) and sum function (INFSUM). Then we compute spearman rank correlation coefficient for each 5 generated rankings sets to the TripAdvisor Real ranking set. Figure 1. Experiment overview Results are shown in figure 2. The vertical axis represents the 4 EXPERIMENTS spearman rank correlation coefficient and the horizontal axis shows the R value. The best rankers are those that have the biggest area Two significant problems regarding the Top-k spatial preference under their curve. Therefore green curve which is related to the query is that first no evaluation on the ranking results is presented INFMAX0 would be identified as the best model. INFMAX1 which yet and second there is not any solution for estimating the radius do not consider the facilities quality is also placed at the second value in two of algorithms range score and influence score. In other place. The maximum correlation (73.4%) is obtained at R=700m words, when a ranking is made how we can make sure about the for INFMAX0 ranker and for INFMAX1 77% correlation is correctness of that, or better say how the ranking model correctly obtained at R=7500m. In terms of RNG have a constant behavior assign the spatial objects to the true ranks. between 0.522 and 0.526 very near to NN which is always equal to 0.519 and doesn’t change by the increasing of R. Solving this problem is impossible unless we could compare two generated and real ranking sets together. TripAdvisor real ranking set enable us to perform such comparison and measurement. Our performed experiments are illustrated in figure 1. we first apply Top-K spatial preference query algorithms on the data set and generate three ranking set namely NN, RNG and INF which stands for Nearest Neighbor , Range Score and Influence score respectively. Then in order to evaluate the ranking model we benefit from Spearman's rank correlation coefficient[7]. After this step we find out that which model with which parameters is the best model for predicting the ranking of a hotel. Thus in the next step we employ our best model to rank all the hotels in Lisbon. As mentioned in the section 2, Nearest neighbor is not dependant Figure 3. Spearman's rank correlation for R=7500m and Influence on radius R, so this algorithm doesn’t have any input parameters, Score with Max module and considering the attractions rating instead, two other algorithms Range score and Influence score has radius R as their input. In order to study the impact of quality weight on Influence Score method, we defined two kind of 5 Ranking of Lisbon Hotels Influence score, INFMAX0 and INFMAX1 so that in the latter one, We used our best ranking model (INFMAX0 with R=7500m) to w is considered to be equal to 1. it means INFMAX1 just consider rank all hotels in Lisbon. Figure 6 shows Spearman's rank the spatial property of place and ignores the weight(w). correlation computed between all generated rankings sets. P and R at the end of titles stands for attractions and restaurants respectively. For instance InfMaxR means that the corresponding 59 generated ranking set is obtained by just taking into account the RNG is affected by the rating of them and thus doesn’t change a restaurants and by using Influence score method. Best column lot. Because always it is possible to find one high quality attraction represents our best ranking model. The columns that doesn’t have near to the hotel. any R or P at the end of their title are those which both restaurants and attractions are considered in the ranking generation. Also The reason why influence score with sum aggregation gets another two columns review and TPrank denote the number of negative correlation is this fact that it counts all attractions and thus reviews done for that item in TripAdvisor and the corresponding consider very far attractions and thus distance in equation 2 goes rank in TripAdvisor respectively. upper and deduct the overall score. Some interesting facts can be extracted from this table. For instance intersection of InfMax and TPrank shows that generated REFERENCES ranking set by InfMax has +0.77 correlation to the real ranking [1] Man Lung Yiu; Hua Lu; Mamoulis, N.; Vaitis, M.; , "Ranking Spatial provided by TripAdvisor. Also some other interesting results can Data by Quality Preferences", IEEE Transactions on Knowledge and be obtained from this table. For example we can realize that Data Engineering, vol.23, no.3, pp.433-446, March 2011 Influence score with max aggregation if applied on just restaurant [2] J.B. Rocha-Junior , A.Vlachou , C.Doulkeridis , K.Nørvåg , Efficient data set has +0.94 correlation with ranking set generated with processing of top-k spatial preference queries , Journal Proceedings Nearest Neighbor. We also understand that influence score with of the VLDB Endowment ,Volume 4 Issue 2, November 2010 [3] Hauke J., Kossowski T., Comparison of values of Pearson’s and sum aggregation never performs good and always show a negative Spearman’s correlation coefficient on the same sets of data. correlation to TPrank. If we look the correlation between NN and Quaestiones Geographicae 30(2), Bogucki Wy-dawnictwo Naukowe, RNG we discover an interesting fact. It reveals that by using [4] 4. Barcelona Field Studies Centre S.L. Spearman's Rank R=7500m ranking set get highly correlated to nearest neighbor Correlation Coefficient ranker with 99.9% confidence. http://geographyfieldwork.com/SpearmansRank.htm [5] https://developers.google.com/maps/ [6] http://developer.yahoo.com/maps/ 6 DISCUSSIONS & CONCLUSION [7] C. Spearman, The Proof and Measurement of Association between Two Things, The American Journal of Psychology, Vol. 15, No. 1 In this paper we presented a new method for evaluation of Top-k (Jan., 1904), pp. 72-101 spatial preference query. One of the direct result we obtained was the high performance of original influence score ranker with max aggregation function that shows 77% correlation to real ranking of TripAdvisor. It means that when there is no ranking set available, this method can be a good alternative since it generates close ranking set. Second we proved that despite by a first glance, influence score with sum aggregation could have a wide cover on all attractions and thus could have a better ranking result, the opposite happened and it generally didn’t provide a good result. When we are dealing with very large data set, the computation cost will be the most important factor to choose a solution. Nearest neighbor and Range score can be a good choice since provide constant correlation of approximately 50%. As we also observed there is not considerable difference between INFMAX0 and INFMAX1 them. Even in R<700m not considering INFMAX1 that doesn’t consider the quality of facilities performs better. It reveal an important fact. Tourist usually use to visits close attractions to their hotel without considering the quality of them. However when distance goes upper than 700m the quality of that attraction gets important and they pay attention to the rating of that place with the goal of not wasting their time and money in transfer. In other words, tolerance threshold of travelers is the intersection of two curves InfMax0 and InfMax1 which is 2700m. It means that by increasing the distance from 700m to 2700m from the hotel, the motivation of travelers to look for rating of the attractions is increased. The reason why RNG and NN show a constant value is this fact that most hotel owners establish their hotel in a place that is close to at least some attractions. Except some minor cases, no hotel company invests on a place that is very far from all attractions. So when there is for example 4-5 attractions near to the hotels, their NN and 60