<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finding the best ranking model for spatial objects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hadi Fanaee Tork</string-name>
          <email>hadi.fanaee@fe.up.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIAAD-INESC Porto, University of Porto</institution>
        </aff>
      </contrib-group>
      <fpage>58</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>Top-k spatial preference queries have a wide range of applications in service recommendation and decision support systems. In this work we first introduce three state-of-the-art algorithms and apply them to a real data set that includes geographic coordinates and quality data for over 355 hotels, 276 points of interest and 563 restaurants in Lisbon, Portugal, extracted from the well-known TripAdvisor website. This is the first time that these algorithms have been evaluated on a real data set. We also use optimization to estimate the algorithms' parameters. Finally, we rank the hotels using the best ranking model obtained. The results reveal that the influence score with a particular radius is able to rank spatial objects very close to the real rankings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>There exists a wide range of location-based applications that rely on spatial preference queries. For instance, a tourist specifies a spatial constraint (for instance, the range around a hotel) to retrieve the facilities around the hotel. Then, if the eligible facilities are rated, the result of the query might be the top-k hotels that have the best-ranked facilities [3]. A top-k spatial preference query answers this kind of question. It returns a ranked set of the k best data objects based on the non-spatial score (quality) of the feature objects in their spatial neighborhood and on the spatial score (distance) [1,2]. Several approaches have been proposed for ranking spatial data objects by defining the score of a spatial data object p based on the scores of feature objects that have p as their nearest neighbor.</p>
      <p>In the rest of the paper we first introduce a general framework of three algorithms entitled Range Score, Nearest Neighbor (NN) and Influence Score. Then in section 3 we present the data set used in the paper. In section 4 we explain the experiments we performed, and in section 5 we present the results. In section 6 we show how we rank the hotels of Lisbon based on the best ranking model obtained, and finally in section 7 we discuss the results and conclude the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>TOP-K SPATIAL FRAMEWORK</title>
      <p>A spatial preference query ranks spatial objects based on the
quality of their neighboring facilities. For instance, a tourist might
retrieve a list of hotels sorted by the facilities around them
(e.g. restaurants, hospitals, markets). Assume that p is our point
of interest (e.g. a hotel) and that the facilities are grouped into types indexed by m (e.g.
restaurant means m=1 and park means m=2). Then assume that
f<sub>mn</sub> is the n-th facility of type m (e.g. Restaurant A). First we
retrieve a list of candidates for p according to Table 1, which
shows how each of the methods chooses its primary candidates.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Primary candidates chosen by each method.</p></caption>
        <table>
          <thead>
            <tr><th>Method</th><th>Candidates retrieved for p</th></tr>
          </thead>
          <tbody>
            <tr><td>Nearest Neighbor</td><td>min(d(p, f<sub>mn</sub>))</td></tr>
            <tr><td>Range score</td><td>d(p, f<sub>mn</sub>) &lt; R</td></tr>
            <tr><td>Influence score</td><td>All</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As we can see, Nearest Neighbor retrieves, from each type m, the
element f<sub>mn</sub> that has the minimum distance to p.
Range score retrieves the items whose distance d to p is less than
a predefined radius R. Influence score retrieves all the items for
further computation. Afterwards, we define the score of point p
according to the following equation:</p>
      <p><disp-formula id="eq1"><label>(1)</label><tex-math>S_p = \sum_{1}^{m} \mathrm{Agg}\left\{ w_{C_{mi}} \times \alpha_{C_{mi}} \right\}</tex-math></disp-formula></p>
      <p>Here Agg denotes the aggregation function, which can be
maximum or sum; w is the weight or quality of an item (e.g. a
hotel with 5 stars can have a weight of 5 and a hotel with one star
a weight of 1); and i indexes the retrieved candidates C<sub>mi</sub> of type m. α is the
influence function, which is equal to 1 for Nearest Neighbor and
Range score and is given by Equation 2 for Influence score.</p>
      <p><disp-formula id="eq2"><label>(2)</label><tex-math>\alpha = 2^{-\frac{d(p, f_{mi})}{R}}</tex-math></disp-formula></p>
      <p>Here d denotes the distance between point p and facility i of
category m, and R is a predefined radius.</p>
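      <p>As a minimal sketch of Table 1 and Equations 1 and 2, the Python below scores a single point p under each method. The data layout (a mapping from facility type to (distance, weight) pairs), the function names and the toy numbers are illustrative assumptions, not part of the original experiments. A facility type with no candidate is assumed to contribute nothing to the score.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

# Each facility is (distance_to_p_in_meters, quality_weight); facilities are
# grouped by type m. All names and the toy data are illustrative assumptions.
Facility = Tuple[float, float]


def candidates(facilities: List[Facility], method: str, radius: float) -> List[Facility]:
    """Select the candidates of one facility type, following Table 1."""
    if method == "NN":    # nearest neighbor: the single closest facility
        return [min(facilities, key=lambda f: f[0])] if facilities else []
    if method == "RNG":   # range score: facilities with d(p, f) < R
        return [f for f in facilities if f[0] < radius]
    if method == "INF":   # influence score: all facilities
        return list(facilities)
    raise ValueError(f"unknown method: {method}")


def influence(distance: float, method: str, radius: float) -> float:
    """Equation 2 for INF; the constant 1 for NN and RNG."""
    return 2.0 ** (-distance / radius) if method == "INF" else 1.0


def score(by_type: Dict[str, List[Facility]], method: str, radius: float, agg=max) -> float:
    """Equation 1: sum over facility types of Agg over candidates of w * alpha."""
    total = 0.0
    for facilities in by_type.values():
        terms = [w * influence(d, method, radius)
                 for d, w in candidates(facilities, method, radius)]
        if terms:         # assumption: a type without candidates adds nothing
            total += agg(terms)
    return total


# Hypothetical hotel with two restaurants and one attraction nearby.
hotel = {
    "restaurant": [(200.0, 3.0), (1000.0, 5.0)],
    "attraction": [(500.0, 4.0)],
}
nn = score(hotel, "NN", radius=500.0)
rng = score(hotel, "RNG", radius=500.0)
inf_max = score(hotel, "INF", radius=500.0, agg=max)
inf_sum = score(hotel, "INF", radius=500.0, agg=sum)
```
]]></preformat>
      <p>Passing the aggregation as a function (max or sum) keeps the three methods in one code path, so the only differences between them are the candidate filter and the influence factor.</p>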
      <p>The result of the top-k spatial preference query is then a list
of all points of interest p sorted by S<sub>p</sub>.</p>
    </sec>
    <sec>
      <title>DATA SET</title>
      <p>The data set is extracted from TripAdvisor, a well-known online tourism information
source and one of the largest and richest sources for
travelers around the world to find relevant information and
other users' feedback about hotels, restaurants and points of interest.
One interesting service of TripAdvisor is that it provides a ranking of
all tourism locations. The ranking criteria are not visible to the
users but in general are a combination of user opinions,
ratings and other sources. Nowadays many users around the world
choose their destinations, hotels and places to visit based on this
ranking.</p>
      <p>We extracted all hotels and all nearby restaurants and points of
interest (POI) for the city of Lisbon, Portugal. All GPS
coordinates and quality factors were extracted from the raw
crawled HTML pages.
We then transferred the extracted records to a MySQL database for
further processing. In the end we had three tables, hotels, restaurants and
attractions, with 355, 563 and 276 records respectively.
Since GPS coordinates were not available for some locations,
we used the geocoding services of the Google Map API [5] and the Yahoo Map API [6]
to fetch them. We then removed the
places whose coordinates were still not available after the
geocoding step. We also removed the hotels for which no
ranking was available in TripAdvisor.</p>
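      <p>The text does not state how the distance d between two GPS coordinates is computed. One common choice, assumed here purely for illustration, is the great-circle distance given by the haversine formula. The function name and the mean Earth radius constant are assumptions of this sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical hotel and restaurant coordinates, for illustration only.
d = haversine_m(38.7139, -9.1394, 38.7223, -9.1393)
```
]]></preformat>
      <p>At city scale the difference between this and a planar approximation is small, so either would serve for comparing distances against a radius R given in meters.</p>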
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTS</title>
      <p>There are two significant problems regarding the top-k spatial preference
query: first, no evaluation of the ranking results has been presented
yet, and second, there is no solution for estimating the radius
value in two of the algorithms, Range score and Influence score. In other
words, once a ranking is made, how can we be sure of its
correctness, that is, how do we know that the ranking model
assigns the spatial objects to their true ranks?</p>
      <p>Solving this problem is impossible unless we can compare
the generated ranking set with a real one. The real TripAdvisor
ranking set enables us to perform such a comparison and
measurement. Our experiments are illustrated in Figure
1. We first apply the top-k spatial preference query algorithms to the
data set and generate three ranking sets, namely NN, RNG and INF,
which stand for Nearest Neighbor, Range score and Influence
score respectively. Then, to evaluate each ranking model, we
use Spearman's rank correlation coefficient [7]. After this
step we know which model, with which parameters, is the
best model for predicting the ranking of a hotel. In the next
step we therefore employ our best model to rank all the hotels in Lisbon.
As mentioned in section 2, Nearest Neighbor does not depend
on the radius R, so this algorithm has no input parameters;
the two other algorithms, Range score and Influence score, take the
radius R as their input. To study the impact of the quality
weight on the Influence score method, we defined two variants of the
Influence score, INFMAX0 and INFMAX1, where in the latter
w is set to 1. This means that INFMAX1 considers only
the spatial property of a place and ignores the weight w.
</p>
      <p>[Figure 2: Spearman rank correlation coefficient versus radius R for the NN, RNG, INFSUM, INFMAX0 and INFMAX1 rankers.]</p>
      <p>
One of the important problems regarding the Influence score
approach is the determination of R. To estimate the best R, we
generated the five ranking sets for R from 100 to 19900 meters in
steps of 100 meters. For both RNG and NN we used
maximum aggregation, while for INF we tested both the
maximum (INFMAX0 and INFMAX1) and the sum function (INFSUM).
We then computed the Spearman rank correlation coefficient between each of the five
generated ranking sets and the real TripAdvisor ranking set.
The results are shown in Figure 2. The vertical axis represents the
Spearman rank correlation coefficient and the horizontal axis shows
the R value. The best rankers are those that have the largest area
under their curve. Therefore the green curve, which corresponds to
INFMAX0, is identified as the best model. INFMAX1, which
does not consider the quality of the facilities, takes second
place. The maximum correlation (73.4%) is obtained at R=700m
for the INFMAX0 ranker, and for INFMAX1 a 77% correlation is
obtained at R=7500m. RNG behaves almost constantly, staying
between 0.522 and 0.526, very close to NN, which is always equal to
0.519 and does not change as R increases.</p>
      <p>[...] generated ranking set is obtained by taking into account only the
restaurants and by using the Influence score method. The Best column
represents our best ranking model. The columns that do not have
an R or P at the end of their title are those for which both restaurants
and attractions are considered in generating the ranking. The
two other columns, review and TPrank, denote the number of
reviews of the item in TripAdvisor and its
rank in TripAdvisor respectively.</p>
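      <p>The radius search described above can be sketched as follows: for each candidate R, turn the hotel scores into a ranking and measure its Spearman rank correlation with a reference ranking, then keep the R with the highest correlation. This is a self-contained illustration, not the code used in the experiments. The function names and the three toy score lists are assumptions; a real sweep would iterate over range(100, 20000, 100).</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def best_radius(scores_by_radius: Dict[int, List[float]],
                reference_rank: List[int]) -> Tuple[int, Dict[int, float]]:
    """Return the radius whose scores correlate best with the reference ranking."""
    # Rank 1 is the best hotel, so a higher score should map to a smaller rank:
    # negate the scores before correlating them with the reference ranks.
    rho = {r: spearman([-s for s in scores], reference_rank)
           for r, scores in scores_by_radius.items()}
    return max(rho, key=rho.get), rho


# Toy data (hypothetical): four hotels, three candidate radii.
reference_rank = [1, 2, 3, 4]
scores_by_radius = {
    100: [4.0, 3.0, 2.0, 1.0],
    200: [1.0, 2.0, 3.0, 4.0],
    300: [4.0, 2.0, 3.0, 1.0],
}
best, rho = best_radius(scores_by_radius, reference_rank)
```
]]></preformat>
      <p>Selecting by the single best R is one option; comparing rankers by the area under their correlation curve, as done above, instead rewards a ranker that is good across the whole range of R.</p>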
      <p>Some interesting facts can be extracted from this table. For
instance, the intersection of InfMax and TPrank shows that the
ranking set generated by InfMax has a +0.77 correlation with the real ranking
provided by TripAdvisor. We can also see that the
Influence score with max aggregation, when applied to the restaurant
data set only, has a +0.94 correlation with the ranking set generated by
Nearest Neighbor. We also see that the Influence score with
sum aggregation never performs well and always shows a negative
correlation with TPrank. The correlation between NN and
RNG reveals another interesting fact: with
R=7500m the RNG ranking set becomes highly correlated with the Nearest Neighbor
ranker, with 99.9% confidence.</p>
    </sec>
    <sec id="sec-4">
      <title>DISCUSSIONS &amp; CONCLUSION</title>
      <p>In this paper we presented a new method for evaluating top-k
spatial preference queries. One direct result was
the high performance of the original Influence score ranker with the max
aggregation function, which shows a 77% correlation with the real
TripAdvisor ranking. This means that when no ranking set is available,
this method can be a good alternative, since it generates a close
ranking set. Second, we showed that although, at first glance, the
Influence score with sum aggregation covers
all attractions and might therefore be expected to give a better ranking, the
opposite happened and it generally did not provide good results.
When dealing with very large data sets, the computation cost
becomes the most important factor in choosing a solution. In that case Nearest
Neighbor and Range score can be a good choice, since they provide a
constant correlation of approximately 50%.</p>
      <p>As we also observed, there is no considerable difference between
INFMAX0 and INFMAX1. For R &lt; 700m, INFMAX1,
which does not consider the quality of facilities, even performs
better. This reveals an important fact: tourists usually visit
attractions close to their hotel without considering their quality.
However, when the distance exceeds 700m, the quality of the
attraction becomes important, and they pay attention to its rating
so as not to waste time and money on transfer.
In other words, the tolerance threshold of travelers is the intersection
of the two curves INFMAX0 and INFMAX1, which is at 2700m. This means that
as the distance from the hotel increases from 700m to 2700m, the
motivation of travelers to look up the rating of the attractions
increases.</p>
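      <p>The intersection of two correlation curves can be located numerically. The sketch below, with a hypothetical function name and toy values that are not taken from the experiments, returns the first radius at which one curve, previously below the other, reaches or exceeds it, using linear interpolation between neighboring samples.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Optional, Sequence


def crossover_radius(radii: Sequence[float],
                     rho_a: Sequence[float],
                     rho_b: Sequence[float]) -> Optional[float]:
    """First radius where curve a, previously below curve b, reaches or exceeds it.

    Returns None if curve a never crosses curve b from below.
    """
    for i in range(1, len(radii)):
        before = rho_a[i - 1] - rho_b[i - 1]
        after = rho_a[i] - rho_b[i]
        if before < 0 <= after:
            # Fraction of the interval at which the gap between the curves is zero
            t = before / (before - after)
            return radii[i - 1] + t * (radii[i] - radii[i - 1])
    return None


# Toy curves (hypothetical values): a starts below b and crosses it halfway.
crossing = crossover_radius([0.0, 100.0], [0.0, 1.0], [0.5, 0.5])
no_crossing = crossover_radius([0.0, 100.0], [0.9, 0.9], [0.5, 0.5])
```
]]></preformat>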
      <p>The reason why RNG and NN show a constant value is that
most hotel owners establish their hotel in a place that is close to at
least some attractions. Apart from a few minor cases, no hotel company
invests in a place that is very far from all attractions. So when
there are, for example, 4-5 attractions near a hotel, its NN and
RNG scores are determined by the ratings of those attractions and thus do not change
much, because it is always possible to find one high-quality attraction
near the hotel.</p>
      <p>The reason why the Influence score with sum aggregation gets a
negative correlation is that it counts all attractions, including
very distant ones, so the distance in Equation 2
grows and lowers the overall score.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <mixed-citation><string-name><given-names>Man Lung</given-names> <surname>Yiu</surname></string-name>, <string-name><given-names>Hua</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Mamoulis</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Vaitis</surname></string-name>, <article-title>Ranking Spatial Data by Quality Preferences</article-title>, <source>IEEE Transactions on Knowledge and Data Engineering</source>, vol. <volume>23</volume>, no. <issue>3</issue>, pp. <fpage>433</fpage>-<lpage>446</lpage>, <month>March</month> <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <mixed-citation><string-name><given-names>J.B.</given-names> <surname>Rocha-Junior</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Vlachou</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Doulkeridis</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Nørvåg</surname></string-name>, <article-title>Efficient processing of top-k spatial preference queries</article-title>, <source>Proceedings of the VLDB Endowment</source>, vol. <volume>4</volume>, no. <issue>2</issue>, <month>November</month> <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <mixed-citation><string-name><given-names>J.</given-names> <surname>Hauke</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Kossowski</surname></string-name>, <article-title>Comparison of values of Pearson's and Spearman's correlation coefficient on the same sets of data</article-title>, <source>Quaestiones Geographicae</source>, vol. <volume>30</volume>, no. <issue>2</issue>, <publisher-name>Bogucki Wydawnictwo Naukowe</publisher-name>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <mixed-citation>Barcelona Field Studies Centre S.L., <article-title>Spearman's Rank Correlation Coefficient</article-title>, <uri>http://geographyfieldwork.com/SpearmansRank.htm</uri></mixed-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <mixed-citation><uri>https://developers.google.com/maps/</uri></mixed-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <mixed-citation><uri>http://developer.yahoo.com/maps/</uri></mixed-citation>
      </ref>
      <ref id="ref7">
        <label>7</label>
        <mixed-citation><string-name><given-names>C.</given-names> <surname>Spearman</surname></string-name>, <article-title>The Proof and Measurement of Association between Two Things</article-title>, <source>The American Journal of Psychology</source>, vol. <volume>15</volume>, no. <issue>1</issue>, pp. <fpage>72</fpage>-<lpage>101</lpage>, <month>Jan.</month> <year>1904</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>