Finding the best ranking model for spatial objects
                                                              Hadi Fanaee Tork1

Abstract.1 Top-k spatial preference queries has a wide range of
applications in service recommendation and decision support                                       Table 1 Candidate Selection Criteria
systems. In this work we first introduce three state of the art
                                                                              Method
algorithms and apply them on a real data set which includes
geographic coordinates and quality data of over 355 hotels, 276               Nearest Neighbor                            min(d (np, f mn ))
point of interests and 563 restaurants in Lisbon, Portugal extracted          Range Score                                 d ( p, f m ) < R
from well-known TripAdvisor2. This is the first time that                     Influence Score                             All
mentioned algorithms are evaluated on a real data set. We also use
some optimization tasks for the estimation of algorithms                      As we can see, Nearest Neighbor, from each type m retrieves n-th
parameters. Finally we rank the hotels using the best obtained                                     n
                                                                              element of that ( f m ) which has the minimum distance with p.
ranking model. Result reveals that influence score with a particular
                                                                              Range score retrieves a list of items which have at least distance(d)
radius is able to rank spatial objects very near to the real rankings.
                                                                              of pre-defined R with P. Influence score retrieves all the items for
                                                                              further computation. Afterwards, We define Score of point P
1           INTRODUCTION                                                      according to the following equation:
                                                                                                        m
There exists an wide range of location-based applications that rely                              S p = ∑ Agg{wCim × α Cim }                (1)
on spatial preference queries. For instance, the tourist species a                                       1
spatial constraint (for instance the range around a hotel) to retrieve
the facilities around the hotel. Then, if the eligible facilities are         Where, Agg denotes the aggregation function which can be
rated, the result of the query might be the top-k hotels which have           maximum or sum. w is equal to the weight or quality of item(e.g.
the best ranked facilities [3]. Top-k spatial preference query                hotel with 5 star can have weight of 5 and hotel with one star can
answers such kind of questions. It returns a ranked set of the k best         have weight of 1) and i is an index of retrieved candidates. α is
data objects based on the non-spatial score (quality) of feature              influence function which is equal to 1 for Nearest Neighbor and
objects and spatial score (distance) in its spatial neighborhood              Range score and is equal to the equation 2 for Influence score.
[1,2]. Several approaches have been proposed for ranking spatial
data objects based on defining the score of a spatial data object p                                             −
                                                                                                                      (
                                                                                                                    d p , f mi   )
based on the scores of feature objects that have p as their nearest                                     α =2              R
                                                                                                                                     (2)
neighbor. In the rest of the paper we first introduce a general
framework of three algorithms entitled Range Score, Nearest                      Where d denotes the distance between point P and facility i of
neighbor (NN) and Influence Score. Then in section 3 we present               category m. and R is a pre-defined radius.
the data set used in the paper. In the section 4 we explain our
performed experiments. Later in section 5 we express the results. in             Then the result of Top-K spatial preference query is a sorted list
section 6 we show how we rank hotels of Lisbon based on the best              of Sp for all point of interests (P).
ranking model obtained and finally in section 7 we discuss the
results and bring the conclusion of the paper.
                                                                              3         DATA SET
                                                                              Data set is extracted from a well-known online tourism information
2           TOP-K SPATIAL FRAMEWORK                                           source TripAdvisor which is the most biggest and richest source for
A Spatial preference query, ranks the spatial objects based on                travelers around the world to find the relevant information and
quality of its neighbor facilities. For instance a tourist might              other user feedbacks about hotels, restaurants and point of interests.
retrieve a sorted list of hotels based on the facilities around that          One of interesting service of TripAdvisor is providing a raking of
(e.g. restaurant, hospital , market, etc.). Assume that p is our point        all tourism locations. The ranking criteria are not visible to the
of interest (e.g. a hotel) and we have m type of facilities(e.g.              users but in general is a combination of on users opinions and
restaurant means m=1 and park means m=2). Then assume that                    ratings and other sources. Nowadays many users around the world
 f mn is n-th facility from type m (e.g. Restaurant A). First we              choose their destination, hotels and places to visit based on this
retrieve a list of candidates for P according to Table 1. Table 1             ranking.
shows how one of the methods choose the primary candidates.
                                                                              We extracted all hotels and all near restaurants and point of
                                                                              interests(POI) corresponding to city of Lisbon, Portugal. All GPS
1
    LIAAD-INESC Porto, University of Porto , hadi.fanaee@fe.up.pt
2
    http://www.tripadvisor.com


                                                                         58
coordinates and quality factors were extracted from the Raw
crawled HTML pages                                                                                             1


                                                                                 Spearman Rank Correlation
                                                                                                             0.8
We then transferred extracted records to the MySQL databases for


                                                                                        Coefficient
                                                                                                             0.6                                                   NN
further process. Finally we had three tables hotels, restaurants and                                                                                               RNG
attractions with 355, 563 and 276 records respectively.                                                      0.4                                                   INFSUM
                                                                                                                                                                   INFMAX0
                                                                                                             0.2                                                   INFMAX1
Since for some locations , the GPS coordinates were not available,
                                                                                                               0
we employed Google Map API[5] and Yahoo Map API[6]                                                                  0   5000   10000       15000   20000   25000

Geocoding service to fetch GPS coordinates. Then we removed the                                              -0.2
                                                                                                                                       R
places which their coordinate was not available after the
Geocoding step. We also removed those hotels which for them                     Figure 2. Spearman's rank correlation for different R from 100m to
ranking was not available in TripAdvisor.                                    20000m for 4 rankers NN (nearest neighbor), range score(RNG), influence
                                                                             score with sum module(INFSUM), influence score with max module with
                                                                             considering the rating of attrac-tions(INFMAX0), influence score with max
                                                                             module and not considering the rating of attractions(INFMAX1)

                                                                             On of the important problem regarding the Influence Score
                                                                             approach is determination of R. In order to estimate the best R we
                                                                             generated 5 ranking set for R from 100 to 19900 meter by
                                                                             granularity of 100 meter. For both RNG and NN we used
                                                                             maximum aggregation while for INF we tested both
                                                                             maximum(INFMAX0 and INFMAX1) and sum function (INFSUM).
                                                                             Then we compute spearman rank correlation coefficient for each 5
                                                                             generated rankings sets to the TripAdvisor Real ranking set.
                     Figure 1. Experiment overview
                                                                             Results are shown in figure 2. The vertical axis represents the
4         EXPERIMENTS                                                        spearman rank correlation coefficient and the horizontal axis shows
                                                                             the R value. The best rankers are those that have the biggest area
Two significant problems regarding the Top-k spatial preference              under their curve. Therefore green curve which is related to the
query is that first no evaluation on the ranking results is presented        INFMAX0 would be identified as the best model. INFMAX1 which
yet and second there is not any solution for estimating the radius           do not consider the facilities quality is also placed at the second
value in two of algorithms range score and influence score. In other         place. The maximum correlation (73.4%) is obtained at R=700m
words, when a ranking is made how we can make sure about the                 for INFMAX0 ranker and for INFMAX1 77% correlation is
correctness of that, or better say how the ranking model correctly           obtained at R=7500m. In terms of RNG have a constant behavior
assign the spatial objects to the true ranks.                                between 0.522 and 0.526 very near to NN which is always equal to
                                                                             0.519 and doesn’t change by the increasing of R.
   Solving this problem is impossible unless we could compare
two generated and real ranking sets together. TripAdvisor real
ranking set enable us to perform such comparison and
measurement. Our performed experiments are illustrated in figure
1. we first apply Top-K spatial preference query algorithms on the
data set and generate three ranking set namely NN, RNG and INF
which stands for Nearest Neighbor , Range Score and Influence
score respectively. Then in order to evaluate the ranking model we
benefit from Spearman's rank correlation coefficient[7]. After this
step we find out that which model with which parameters is the
best model for predicting the ranking of a hotel. Thus in the next
step we employ our best model to rank all the hotels in Lisbon.

As mentioned in the section 2, Nearest neighbor is not dependant                           Figure 3. Spearman's rank correlation for R=7500m and Influence
on radius R, so this algorithm doesn’t have any input parameters,                          Score with Max module and considering the attractions rating
instead, two other algorithms Range score and Influence score has
radius R as their input. In order to study the impact of quality
weight on Influence Score method, we defined two kind of                     5                                 Ranking of Lisbon Hotels
Influence score, INFMAX0 and INFMAX1 so that in the latter one,              We used our best ranking model (INFMAX0 with R=7500m) to
w is considered to be equal to 1. it means INFMAX1 just consider             rank all hotels in Lisbon. Figure 6 shows Spearman's rank
the spatial property of place and ignores the weight(w).                     correlation computed between all generated rankings sets. P and R
                                                                             at the end of titles stands for attractions and restaurants
                                                                             respectively. For instance InfMaxR means that the corresponding


                                                                        59
generated ranking set is obtained by just taking into account the               RNG is affected by the rating of them and thus doesn’t change a
restaurants and by using Influence score method. Best column                    lot. Because always it is possible to find one high quality attraction
represents our best ranking model. The columns that doesn’t have                near to the hotel.
any R or P at the end of their title are those which both restaurants
and attractions are considered in the ranking generation. Also                  The reason why influence score with sum aggregation gets
another two columns review and TPrank denote the number of                      negative correlation is this fact that it counts all attractions and thus
reviews done for that item in TripAdvisor and the corresponding                 consider very far attractions and thus distance in equation 2 goes
rank in TripAdvisor respectively.                                               upper and deduct the overall score.

   Some interesting facts can be extracted from this table. For
instance intersection of InfMax and TPrank shows that generated                 REFERENCES
ranking set by InfMax has +0.77 correlation to the real ranking                 [1]   Man Lung Yiu; Hua Lu; Mamoulis, N.; Vaitis, M.; , "Ranking Spatial
provided by TripAdvisor. Also some other interesting results can                      Data by Quality Preferences", IEEE Transactions on Knowledge and
be obtained from this table. For example we can realize that                          Data Engineering, vol.23, no.3, pp.433-446, March 2011
Influence score with max aggregation if applied on just restaurant              [2]   J.B. Rocha-Junior , A.Vlachou , C.Doulkeridis , K.Nørvåg , Efficient
data set has +0.94 correlation with ranking set generated with                        processing of top-k spatial preference queries , Journal Proceedings
Nearest Neighbor. We also understand that influence score with                        of the VLDB Endowment ,Volume 4 Issue 2, November 2010
                                                                                [3]   Hauke J., Kossowski T., Comparison of values of Pearson’s and
sum aggregation never performs good and always show a negative
                                                                                      Spearman’s correlation coefficient on the same sets of data.
correlation to TPrank. If we look the correlation between NN and
                                                                                      Quaestiones Geographicae 30(2), Bogucki Wy-dawnictwo Naukowe,
RNG we discover an interesting fact. It reveals that by using                   [4]   4. Barcelona Field Studies Centre S.L. Spearman's Rank
R=7500m ranking set get highly correlated to nearest neighbor                         Correlation                                               Coefficient
ranker with 99.9% confidence.                                                         http://geographyfieldwork.com/SpearmansRank.htm
                                                                                [5]   https://developers.google.com/maps/
                                                                                [6]   http://developer.yahoo.com/maps/
6         DISCUSSIONS & CONCLUSION                                              [7]   C. Spearman, The Proof and Measurement of Association between
                                                                                      Two Things, The American Journal of Psychology, Vol. 15, No. 1
In this paper we presented a new method for evaluation of Top-k
                                                                                      (Jan., 1904), pp. 72-101
spatial preference query. One of the direct result we obtained was
the high performance of original influence score ranker with max
aggregation function that shows 77% correlation to real ranking of
TripAdvisor. It means that when there is no ranking set available,
this method can be a good alternative since it generates close
ranking set. Second we proved that despite by a first glance,
influence score with sum aggregation could have a wide cover on
all attractions and thus could have a better ranking result, the
opposite happened and it generally didn’t provide a good result.

When we are dealing with very large data set, the computation cost
will be the most important factor to choose a solution. Nearest
neighbor and Range score can be a good choice since provide
constant correlation of approximately 50%.

As we also observed there is not considerable difference between
INFMAX0 and INFMAX1 them. Even in R<700m not considering
INFMAX1 that doesn’t consider the quality of facilities performs
better. It reveal an important fact. Tourist usually use to visits close
attractions to their hotel without considering the quality of them.
However when distance goes upper than 700m the quality of that
attraction gets important and they pay attention to the rating of that
place with the goal of not wasting their time and money in transfer.
In other words, tolerance threshold of travelers is the intersection
of two curves InfMax0 and InfMax1 which is 2700m. It means that
by increasing the distance from 700m to 2700m from the hotel, the
motivation of travelers to look for rating of the attractions is
increased.

The reason why RNG and NN show a constant value is this fact that
most hotel owners establish their hotel in a place that is close to at
least some attractions. Except some minor cases, no hotel company
invests on a place that is very far from all attractions. So when
there is for example 4-5 attractions near to the hotels, their NN and


                                                                           60