<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LearNext: Learning to Predict Tourists Movements</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ranieri Baraglia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Ioana Muntean</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franco Maria Nardini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Silvestri</string-name>
          <email>silvestr@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Yahoo! Research Labs</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>75</fpage>
      <lpage>79</lpage>
      <abstract>
        <p>In this paper, we tackle the problem of predicting the \next" geographical position of a tourist given her history (i.e., the prediction is done accordingly to the tourist's current trail) by means of supervised learning techniques, namely Gradient Boosted Regression Trees and Ranking SVM. The learning is done on the basis of an object space represented by a 68 dimension feature vector, speci cally designed for tourism related data. Furthermore, we propose a thorough comparison of several methods that are considered state-of-the-art in touristic recommender and trail prediction systems as well as a strong popularity baseline. Experiments show that the methods we propose outperform important competitors and baselines thus providing strong evidence of the performance of our solutions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>LearNext works by predicting touristic places according to the current position
of a tourist that is visiting a city and a history of previously visited places (i.e.,
visit patterns ) from other users. For the selection of tourist sites, the system uses a
set of Points of Interest (PoI s) identi ed a priori. In particular, the contributions
of this paper are the following:</p>
    </sec>
    <sec id="sec-2">
      <title>Our Solution</title>
      <p>
        The LearNext problem can be cast into a learning to rank formulation that
allows to build models able to order PoIs following their decreasing likelihood
of being visited as the next PoI for a given user. A trail, initially a sequence of
PoIs visited by a user, is represented in a 68-dimension feature space, described
in Table 1. The aim is to learn from data the function that minimizes the error
of a given loss function. This way, the LearNext problem becomes a supervised
Machine Learning problem that is solved by building a model that ranks highest
the PoI with the highest likelihood of being visited next. We build the ranking
models by relying on two well-know techniques: Ranking SVM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Gradient
Boosted Regression Trees (GBRT) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Features of PoIs and tourist trails. An important aspect to take into account
for an accurate solution of the LearNext problem using learning to rank consists
of carefully designing the feature space so that the main characteristics of the
dataset are captured. When visiting a city, in fact, a tourist takes into account
the popularity of a PoI, the distance of a given PoI with respect to her current
position, how much a particular PoI matches her interests, the time needed to
reach it, the time needed to visit it, etc. To model all these dimensions of tourist
behavior we de ne a set of 68 di erent features, broadly classi ed in two main
categories, namely \Session" and \PoI". Session features are meant to model the
tourist behavior and capture concepts like groups of PoI visited, distances among
PoIs, etc. It is based on the characteristics of each PoI within that user session
(trail). On the other hand, PoI features model the characteristics of a candidate
PoI, also taking into account the past activities of the tourist. Accordingly, PoI
features model the characteristics of the PoI to be suggested.</p>
      <p>Session features</p>
      <p>Candidate PoI features
User preferences userSessLen Avg/Max/Min/Tot ratioPoIInUserPhotos
&amp; type userSessRatio photosPerUser
Distance &amp; time actualTransferTime distFromFirstPoI Eucl
features actualVisitTime distFromLastPoI Eucl
euclidDist Avg/Max/Min/Tot visitTime Avg/Max/Min
uniqueCategsPerSess
phPoISess Avg/Max/Min/Tot
Session
characteristics
PoI
characteristics
Frequent
sequences
Popularity</p>
      <p>Session features are based on the current trail of the user; they can be, for
example, the transfer time and the actual visit time spent by a tourist in her
session, the number of unique categories for all PoIs in that session, the euclidean
and latitude/longitude distance of consecutively visited PoIs in a session (average,
max, min, total), time and length of the current session, number of photos per
PoI in a session (average, max, min, total), length of the sessions belonging to
the same tourist (average, max, min, total) making the current visit.</p>
      <p>On the other hand, PoI features are based on the next PoI to be suggested
and model the distance of the next PoI from the rst PoI of the session, whether
the PoI belongs to the top ten categories visited by users, the number of times a
tourist visits that PoI in the training set, the conditional probability of observing
that PoI given the last PoI visited by a user, the probability of observing the PoI
as rst (resp. last) PoI in the training set, number of photos of the PoI (average,
max, min), number of past photos of the PoI from the same user, and the visit
time of the PoI (average, max, min, total).
3</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Evaluation</title>
      <p>To assess the e ectiveness of our proposed techniques, we use three di erent
datasets built in a fully automatic process by exploiting both photos from Flickr,
a photo sharing portal, and Wikipedia pages. We build three datasets containing
tourist movements covering three Italian cities, chosen so as to guarantee a variety
of topologies and sizes: small (Pisa), medium (Florence), and large (Rome, i.e., a
capital city). The datasets have been made available for download1.</p>
      <p>To devise tourist trails in an area of interest we query Flickr to retrieve the
metadata (user id, timestamp, tags, geographic coordinates, etc.) of the photos
taken in the given area. The assumption we are making is that photo albums
made by Flickr users implicitly represent touristic itineraries within a given city.
To strengthen the assumption and thus the accuracy of our method, we retrieve
only photos having the highest geo-referenced precision in the given area of
interest. Then, we collect geo-tagged photo albums from Flickr users. We discard
photo albums containing only one photo and those containing photos with no
GPS information associated. Eventually, photos are mapped to the set of PoIs
previously collected from Wikipedia. This is done by associating a photo to a PoI
if that photo is in the ball having the PoI as its center and r = 100 meters as its
radius. Moreover, since several photos by the same user are usually taken close to
the same PoI, we collapse them by considering the timestamps associated with
the rst and last of these photos as the starting and ending time of the user visit
to the PoI. The results of the assignment above produce, for each Flickr user, a
stream of PoIs she visited.</p>
      <p>Finally, in order to build the trail sets, we need a way to split the stream of PoIs
visited by each user in a meaningful and realistic time-wise set of trails. We employ
a time-based cutting method that produces the list of trails a user performed,
1 Links to the trail datasets: http://hpc.isti.cnr.it/~muntean/datasets/LearNext.
tar.gz
by considering the inter-arrival time of each pair of sequential photos in her
stream. To do so, for each city, we compute the distribution of probability of the
inter-arrival time x to be less then a given time threshold k, i.e., P (x k). Then
for each dataset we devise the time threshold k corresponding to P (x k) = 0:9.
Regarding Rome, it corresponds to 5 hours, for Florence 6 hours, while for Pisa
3 hours.</p>
      <p>
        For each of the three cities, we generate a training set (80%) and a test
set (20%) of trails. The e ectiveness of the methods is assessed by means of
Success@k (i.e., the percentage of times that the correct answer is in the top-k
ranked PoIs), MRR (@k), and total MRR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, we compare our solutions
against a probability baseline and two important state-of-the-art techniques:
{ \Prob" uses the training set to build a directed graph where nodes are PoIs
of the given city and edges are transactions from a source PoI to a destination
PoI.Given the PoI currently visited by a tourist, Prob predicts the most likely
PoI to be visited next.
{ \WhereNext" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses trajectory pattern mining to devise T-Patterns, i.e.,
frequent behaviors of movement in the city, from data. T-Patterns constitute
the knowledge model used to compute the prediction.
{ \Random Walk" [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] employs a graph-based representation of the PoIs in a city.
      </p>
      <p>Authors named it \itinerary graph" and exploit it by using a random walk
with restart to select the most relevant PoIs for a given tourist.</p>
      <p>The evaluation strategy we use to assess how the proposed techniques behave
in terms of e ectiveness is the following: each model for the three cities has been
trained on the corresponding training set. A training set contains positive and
negative examples of candidate next PoI, represented by its features. Given a trail
of length N , training set contains both session features (computed on the rst
N 1 PoIs of the trail) and PoI features. The latter are computed considering
both the actual next PoI visited by the tourist, i.e., the N -th PoI of the trail (as
a positive example) and a few negative examples, with PoIs di erent from the
ones seen in the actual trail. Negative examples have been selected on a distance
basis. Two negative examples have been selected from PoIs close to the N -th
one while one has been selected far from the N -th one. For building the test set
we adopt the following process. Given a trail of length N in the test set, we use
the rst N 1 PoIs of the trail to pro le the tourist history and re-rank all nal
PoIs observed in the training, according to the prediction model.</p>
      <p>Table 2 shows the results of the experiments. WhereNext and Random
Walk never outperform Prob in terms of Success@1. Instead, the techniques we
propose consistently outperform all the baselines. For Pisa, in terms of Success@1,
Ranking SVM scores 32:66% and GBRT scores 40:70%, while Prob scores 16:08%.
Important results should be highlighted also for Success@2. Here, our methods
are able to score 49:74% (Ranking SVM), and 55:27% (GBRT). Roughly speaking,
in half of the cases our methods are able to rank the actual next PoI in the two
highest positions of the list, as global MRR shows.</p>
      <p>GBRT is the technique showing the best performance, while Ranking SVM is
second, and both techniques perform considerably better than the baselines we
City</p>
      <p>
        Predictor
Prob 16.08% - -
WhereNext [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] 12.56% - -
Random Walk [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] 15.07% (0.15) 20.60% (0.17) 25.12% (0.19)
Ranking SVM 32.66% (0.32) 49.74% (0.41) 55.77% (0.43) 0.47
GBRT 40.70% (0.40) 55.27% (0.47) 63.81% (0.50) 0.56
Prob 12.93% - -
      </p>
      <p>
        WhereNext [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] 6.37% - -
Rome Random Walk [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] 8.43% (0.08) 13.76% (0.11) 19.22% (0.12)
Ranking SVM 21.88% (0.21) 30.24% (0.26) 36.37% (0.28) 0.33
      </p>
      <p>GBRT 30.95% (0.30) 40.07% (0.34) 47.15% (0.38) 0.42
Table 2. E ectiveness (%) in terms of Success@k, MRR@k, and total MRR of the
proposed techniques along with the competitors.
chose. The same behavior can be observed for Florence and Rome where both
Ranking SVM and GBRT are always outperforming the baselines.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We proposed to apply machine learning techniques to tackle the problem of
predicting the \next" touristic attraction a user will visit on the basis of her
visit history (i.e., the prediction is done accordingly to what the user has already
visited in the touristic attraction). We modeled the problem as an instance of
learning to rank and we de ned a feature space composed of 68 features capturing
both the touristic behavior and the peculiar characteristics of candidate PoIs.
GBRT and Ranking SVM outperform baselines in terms of prediction accuracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro-Neto</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.:
          <article-title>Modern information retrieval</article-title>
          , vol.
          <volume>463</volume>
          . ACM press New York. (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Training linear svms in linear time</article-title>
          .
          <source>In: Proc. SIGKDD</source>
          . ACM, New York, NY, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lucchese</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perego</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silvestri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vahabi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venturini</surname>
          </string-name>
          , R.:
          <article-title>How random walks can help tourism</article-title>
          .
          <source>In: Proc. ECIR</source>
          .
          <string-name>
            <surname>LNCS</surname>
          </string-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Monreale</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trasarti</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannotti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Wherenext: a location predictor on trajectory pattern mining</article-title>
          .
          <source>In: Proc. SIGKDD</source>
          . ACM (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zha</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Sun</surname>
          </string-name>
          , G.:
          <article-title>A general boosting method and its application to learning ranking functions for web search</article-title>
          .
          <source>ANIPS 20</source>
          ,
          <issue>1697</issue>
          {
          <fpage>1704</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>