<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Georeferencing Wikipedia pages using language models from Flickr</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chris De Rouck</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Van Laere</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Schockaert</string-name>
          <email>steven.schockaert@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bart Dhoedt</string-name>
          <email>bart.dhoedt@ugent.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Applied Mathematics and Computer Science, Ghent University</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Technology, IBBT, Ghent University</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The task of assigning geographic coordinates to web resources has recently gained in popularity. In particular, several recent initiatives have focused on the use of language models for georeferencing Flickr photos, with promising results. Such techniques, however, require the availability of a large amount of spatially grounded training data. They are therefore not directly applicable for georeferencing other types of resources, such as Wikipedia pages. As an alternative, in this paper we explore the idea of using language models that are trained on Flickr photos for finding the coordinates of Wikipedia pages. Our experimental results show that the resulting method is able to outperform popular methods that are based on gazetteer look-up.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The geographic scope of a web resource plays an increasingly important role for
assessing its relevance in a given context, as can be witnessed by the popularity
of location-based services on mobile devices. When uploading a photo to Flickr,
for instance, users can explicitly add geographical coordinates to indicate where
it has been taken. Similarly, when posting messages on Twitter, information may
be added about the user's location at that time. Nonetheless, such coordinates
are currently only available for a minority of all relevant web resources, and
techniques are being studied to estimate geographic location in an automated
way.</p>
      <p>
        For example, several authors have applied language modeling techniques to
find out where a photo was taken, by only looking at the tags that its owner
has provided [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. The main idea is to train language models for different
areas of the world, using the collection of already georeferenced Flickr photos,
and to subsequently use these language models for determining in which area a
given photo was most likely taken. In this way, implicit geographic information is
automatically derived from Flickr tags, which is potentially much richer than the
information that is found in traditional gazetteers. Indeed, the latter usually do
not contain information about vernacular place names, lesser-known landmarks,
or non-toponym words with a spatial dimension (e.g. names of events), among
others. For the task of assigning coordinates to Flickr photos, this intuition
seems to be confirmed, as language modeling approaches have been found to
outperform gazetteer-based methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        For other types of web resources, spatially grounded training data may not
be (sufficiently) available to derive meaningful language models, in which case
it seems that gazetteers would again be needed. However, as language models
trained on Flickr data have already proven useful for georeferencing photos, we
may wonder whether they could be useful for finding the coordinates of other
web resources. In this paper, we test this hypothesis by considering the task of
assigning geographical coordinates to Wikipedia pages, and show that language
models from Flickr are indeed capable of outperforming popular methods for
georeferencing web pages. The interest of our work is twofold. First, from a practical
point of view, the proposed method paves the way for improving location-based
services in which Wikipedia plays a central role. Second, our results add
further support to the view that georeferenced Flickr photos can provide a valuable
source of geographical information as such, which relates to a recent trend where
traditional geographic data is increasingly replaced or extended by
user-contributed data from Web 2.0 initiatives [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The paper is structured as follows. Section 2 briefly reviews the idea of
georeferencing tagged resources using language models from Flickr. In Section 3 we
then discuss how a similar idea could be applied to Wikipedia pages. Section
4 contains our experimental results, after which we discuss related work and
conclude.</p>
    </sec>
    <sec id="sec-2">
      <title>Language models from Flickr</title>
      <p>In this section, we recall how georeferenced Flickr photos can be used to train
language models, and how these language models subsequently make it possible to find the
area that most likely covers the geographical scope of some resource. Throughout
this section, we will assume that resources are described as sets of tags, while
the next section will discuss how the problem of georeferencing Wikipedia pages
can be cast into this setting.</p>
      <p>
        As training data, we used a collection of around 8.5 million publicly available
photos on Flickr with known coordinates. In addition to these coordinates, the
associated metadata contains the tags attributed to each photo, providing us with
a textual description of its content, as well as an indication of the accuracy
of the coordinates as a number between 1 (world level) and 16 (street level).
As in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we only retrieved photos with a recorded accuracy of at least 12
and we removed photos that did not contain any tags or whose coordinates
were invalid. Also, following [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], photos from bulk uploads were removed. The
resulting dataset contained slightly over 3.25 million photos. In a subsequent
step, the training data was clustered into disjoint areas using the k-medoids
algorithm with geodesic distance. Considering a varying number of clusters k,
this resulted in different sets of areas <inline-formula><tex-math>A_k</tex-math></inline-formula>. For each clustering, a vocabulary <inline-formula><tex-math>V_k</tex-math></inline-formula>
was compiled, using <inline-formula><tex-math>\chi^2</tex-math></inline-formula> feature selection, as the union of the m most important
tags (i.e. the tags with the highest <inline-formula><tex-math>\chi^2</tex-math></inline-formula> value) for each area.
      </p>
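      <p>As an illustration of the vocabulary construction, the following minimal sketch computes a chi-square score for every (area, tag) pair from a 2x2 contingency table of photo counts and takes the union of the m best tags per area. The input format (one (area, tags) pair per photo), the function names, counting tag presence per photo, and scoring only tags that actually occur in an area are assumptions made for illustration; the text does not specify these details.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def chi2_scores(photos):
    """Return {area: {tag: chi-square score}} for a list of (area, tags) pairs."""
    n = len(photos)
    area_count = Counter()             # number of photos per area
    tag_count = Counter()              # number of photos carrying each tag
    joint = defaultdict(Counter)       # joint[area][tag]: photos in area with tag
    for area, tags in photos:
        area_count[area] += 1
        for tag in set(tags):
            tag_count[tag] += 1
            joint[area][tag] += 1

    scores = defaultdict(dict)
    for area, tags_in_area in joint.items():
        for tag in tags_in_area:       # assumption: only tags seen in the area
            a = joint[area][tag]       # in area, with tag
            b = tag_count[tag] - a     # outside area, with tag
            c = area_count[area] - a   # in area, without tag
            d = n - a - b - c          # outside area, without tag
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            scores[area][tag] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def build_vocabulary(photos, m):
    """Union over all areas of the m tags with the highest chi-square value."""
    vocabulary = set()
    for tag_scores in chi2_scores(photos).values():
        best = sorted(tag_scores, key=tag_scores.get, reverse=True)[:m]
        vocabulary.update(best)
    return vocabulary
```
      </preformat>
      <p>In this sketch, a tag that is spread evenly over all areas receives a score of zero, so it only enters the vocabulary if an area has fewer than m better candidates.</p>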
      <p>The problem of georeferencing a resource x, in this setting, consists of
selecting the area a from the set of areas <inline-formula><tex-math>A_k</tex-math></inline-formula> (for a specific number of clusters k) that is most
likely to cover the geographic scope of the resource (e.g. the location where
the photo was taken, when georeferencing photos). Using a standard language
modeling approach, the probability that area a covers resource x can be estimated as</p>
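      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>P(a \mid x) \propto P(a) \cdot \prod_{t \in x} P(t \mid a)</tex-math>
        </disp-formula>
      </p>
      <p>where we identify the resource x with its set of tags. The prior probability <inline-formula><tex-math>P(a)</tex-math></inline-formula>
of area a can be estimated as the percentage of photos in the training data that
belong to that area (i.e. a maximum likelihood estimation). To obtain a reliable
estimate of <inline-formula><tex-math>P(t \mid a)</tex-math></inline-formula>, some form of smoothing is needed, to avoid a zero probability
when encountering a tag t that does not occur with any of the photos in area a.
In this paper, we use Jelinek-Mercer smoothing (<inline-formula><tex-math>\lambda \in [0, 1]</tex-math></inline-formula>):</p>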
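      <p>
        <disp-formula>
          <tex-math>P(t \mid a) = \lambda \, \frac{O_{ta}}{\sum_{t' \in V_k} O_{t'a}} + (1 - \lambda) \, \frac{\sum_{a' \in A_k} O_{ta'}}{\sum_{a' \in A_k} \sum_{t' \in V_k} O_{t'a'}}</tex-math>
        </disp-formula>
      </p>
      <p>
        Here <inline-formula><tex-math>O_{ta}</tex-math></inline-formula> is the number of occurrences of tag t in area a, while <inline-formula><tex-math>V_k</tex-math></inline-formula> is the vocabulary
after feature selection. In the experiments, we used <inline-formula><tex-math>\lambda = 0.7</tex-math></inline-formula>, although we obtained
good results for a wide range of values. The area that is most likely to contain
resource x can then be found by maximizing the right-hand side of (1). To
convert this area into a precise location, the area a can be represented by its
medoid <inline-formula><tex-math>\mathrm{med}(a)</tex-math></inline-formula>:
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathrm{med}(a) = \arg\min_{x \in a} \sum_{y \in a} d(x, y)</tex-math>
        </disp-formula>
        with <inline-formula><tex-math>d(x, y)</tex-math></inline-formula> being the geodesic distance. An alternative, which was
proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] but which we do not consider in this paper, is to assign the location
of the most similar photo from the training data which is known to be located
in a.
      </p>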
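      <p>The estimation and classification steps of this section can be sketched as follows. The nested dictionary counts[a][t] holding the tag occurrence counts, the class and method names, and the use of log-probabilities (to avoid numerical underflow) are assumptions for illustration and are not taken from the text. The sketch further assumes that every area contains at least one tag occurrence, and it skips tags that do not occur in the training data.</p>
      <preformat>
```python
import math
from collections import Counter


class FlickrLanguageModel:
    """Jelinek-Mercer smoothed unigram model per area (illustrative sketch)."""

    def __init__(self, counts, photos_per_area, lam=0.7):
        self.counts = counts                      # counts[a][t] = O_ta
        self.lam = lam
        self.area_total = {a: sum(c.values()) for a, c in counts.items()}
        self.tag_total = Counter()                # occurrences of t over all areas
        for c in counts.values():
            self.tag_total.update(c)
        self.grand_total = sum(self.area_total.values())
        n_photos = sum(photos_per_area.values())
        # Maximum likelihood prior P(a): share of training photos in area a.
        self.prior = {a: k / n_photos for a, k in photos_per_area.items()}

    def p_tag(self, tag, area):
        """P(t|a): area-specific estimate interpolated with the global one."""
        local = self.counts[area].get(tag, 0) / self.area_total[area]
        background = self.tag_total[tag] / self.grand_total
        return self.lam * local + (1 - self.lam) * background

    def log_score(self, tags, area):
        """log P(a) plus the sum of log P(t|a) over the known tags."""
        score = math.log(self.prior[area])
        for tag in tags:
            if self.tag_total[tag] > 0:           # skip out-of-vocabulary tags
                score += math.log(self.p_tag(tag, area))
        return score

    def most_likely_area(self, tags):
        """Area maximizing the right-hand side of (1), computed in log-space."""
        return max(self.counts, key=lambda a: self.log_score(tags, a))
```
      </preformat>
      <p>Because the logarithm is monotone, maximizing the log-score selects the same area as maximizing the product in (1).</p>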
    </sec>
    <sec id="sec-3">
      <title>Wikipedia pages</title>
      <p>The idea of geographic scope can be interpreted in different ways for Wikipedia
pages. A page about a person, for instance, might geographically be related to
the places where this person has lived throughout his or her life, but perhaps also to
those parts of the world that this person's work has influenced (e.g. locations
of buildings that were designed by some architect). In this paper, however, we
exclusively deal with finding the coordinates of a Wikipedia page about a
specific place, such as a landmark or a city. It is then natural to assume that the
geographic scope of the page corresponds to a point.</p>
      <p>While several Wikipedia pages already have geographic coordinates, it does
not seem feasible to train area-specific language models from Wikipedia pages
with a known location, as we did in Section 2 for Flickr photos. The reason is
that typically there is only one Wikipedia page about a given location, so either
its location is already known or its location cannot be found by using other
georeferenced pages. Moreover, the task is further complicated by the smaller number
of georeferenced pages (compared to the millions of Flickr photos) and the large number
of spatially irrelevant terms on a typical Wikipedia page.
One possibility to cope with these issues might be to explicitly look for toponyms
in pages, and link these to gazetteer information. However, as we already have
rich language models from Flickr, in this paper we pursue a different strategy,
and investigate the possibility of using these models to find the locations of
Wikipedia pages.</p>
      <p>The first step consists of representing a Wikipedia page as a list of Flickr tags.
This can be done by scanning the Wikipedia page and identifying occurrences
of Flickr tags. As Flickr tags cannot contain spaces, however, it is important
that concatenations of word sequences in Wikipedia pages are also considered.
Moreover, capitalization should be ignored. For example, an occurrence of "Eiffel
tower" on a page is mapped to the Flickr tags "eiffeltower", "eiffel" and "tower".</p>
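      <p>A minimal sketch of this matching step is given below. The tokenization rule (runs of letters and digits), the maximum sequence length max_words and the function name are assumptions for illustration; the text only states that concatenations of word sequences are considered and that capitalization is ignored.</p>
      <preformat>
```python
import re
from collections import Counter


def count_flickr_tags(text, vocabulary, max_words=3):
    """Count occurrences of known Flickr tags in the text of a page.

    Every run of 1..max_words consecutive words is lowercased and
    concatenated without spaces, then looked up in the tag vocabulary.
    """
    words = re.findall(r"[^\W_]+", text.lower())
    counts = Counter()
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            if start + length > len(words):
                break
            candidate = "".join(words[start:start + length])
            if candidate in vocabulary:
                counts[candidate] += 1
    return counts
```
      </preformat>
      <p>With a vocabulary containing "eiffeltower", "eiffel" and "tower", the phrase "Eiffel tower" contributes one occurrence to each of the three tags, as in the example above.</p>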
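      <p>Let us write <inline-formula><tex-math>n(t, d)</tex-math></inline-formula> for the number of times tag t was thus found in the
Wikipedia page d. We can then assign to d the area a which maximizes</p>
      <p>
        <disp-formula>
          <label>(3)</label>
          <tex-math>P(a \mid d) \propto P(a) \cdot \prod_{t \in V_k} P(t \mid a)^{n(t, d)}</tex-math>
        </disp-formula>
      </p>
      <p>where <inline-formula><tex-math>V_k</tex-math></inline-formula> is defined as before and the probabilities <inline-formula><tex-math>P(a)</tex-math></inline-formula> and <inline-formula><tex-math>P(t \mid a)</tex-math></inline-formula> are estimated
from our Flickr data, as explained in the previous section. Again, (2) can be used
to convert the area a to a precise location.</p>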
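      <p>Combining (3) with (2), a page can be georeferenced as sketched below. The callables prior and p_tag (for instance smoothed estimates obtained from Flickr data), the (latitude, longitude) point format, the haversine formula with a mean Earth radius of 6371 km as the geodesic distance, and all function names are assumptions for illustration. The score is evaluated in log-space, where the exponent n(t, d) becomes a multiplier.</p>
      <preformat>
```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def medoid(points):
    """Equation (2): the point with the smallest summed distance to all points."""
    return min(points, key=lambda x: sum(haversine_km(x, y) for y in points))


def georeference(tag_counts, areas, prior, p_tag):
    """Equation (3) in log-space, followed by the medoid of the best area.

    tag_counts maps each tag t to n(t, d); areas maps an area id to the
    coordinates of its training photos; p_tag(t, a) must be positive.
    """
    def log_score(a):
        return math.log(prior(a)) + sum(
            n * math.log(p_tag(t, a)) for t, n in tag_counts.items())

    best = max(areas, key=log_score)
    return best, medoid(areas[best])
```
      </preformat>
      <p>The medoid computation in this sketch is quadratic in the number of photos of an area, so in practice it would be computed once per area rather than once per page.</p>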
      <p>Some adaptations to this scheme are possible, where the scores <inline-formula><tex-math>n(t, d)</tex-math></inline-formula> are
defined in alternative ways. As Wikipedia pages often contain a lot of context
information, which does not directly describe the location of the main subject,
we propose two techniques for restricting which parts of an article are scanned.
The first idea is to only look at tags that occur in section titles (identified using
HTML tags of the form &lt;h1&gt;), in anchor text (&lt;a&gt;) or in emphasized regions
(&lt;strong&gt; and &lt;b&gt;). This variant is referred to as keywords below. The second
idea is to only look at the abstract of the Wikipedia page, which is defined as
the part of the page before the first section heading. As this abstract is supposed
to summarize the content of the page, it is less likely to contain references to places that are
outside the geographical scope of the page. This second variant is referred to as
abstract. Note that in both variants, the value of <inline-formula><tex-math>n(t, d)</tex-math></inline-formula> is at most as high as,
and typically lower than, in the basic approach.</p>
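      <p>The two restrictions can be sketched with the html.parser module from the Python standard library. The class and function names, the decision to treat all heading levels h1 to h6 as section titles, and the simplified notion of abstract (all text before the first heading element of an article body that does not include the page title) are assumptions for illustration; real Wikipedia markup would need additional clean-up.</p>
      <preformat>
```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
KEYWORD_TAGS = HEADINGS | {"a", "strong", "b"}


class RegionExtractor(HTMLParser):
    """Collect the text used by the 'keywords' and 'abstract' variants."""

    def __init__(self):
        super().__init__()
        self.open_keyword_tags = 0    # nesting depth inside keyword elements
        self.seen_heading = False     # True once the first heading starts
        self.keywords = []
        self.abstract = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.seen_heading = True
        if tag in KEYWORD_TAGS:
            self.open_keyword_tags += 1

    def handle_endtag(self, tag):
        if tag in KEYWORD_TAGS and self.open_keyword_tags > 0:
            self.open_keyword_tags -= 1

    def handle_data(self, data):
        if self.open_keyword_tags > 0:
            self.keywords.append(data)
        if not self.seen_heading:
            self.abstract.append(data)


def extract_regions(html):
    """Return (keywords_text, abstract_text) for the HTML of an article body."""
    parser = RegionExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.keywords), " ".join(parser.abstract)
```
      </preformat>
      <p>Either returned string can then be scanned for Flickr tags in the same way as the full page text, which yields the restricted counts for the corresponding variant.</p>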
    </sec>
    <sec id="sec-4">
      <title>Experimental results</title>
      <p>mentioned as a "spot" in the GeoNames gazetteer. This resulted in a set of 7537
georeferenced Wikipedia pages, whose coordinates we used as our gold standard.</p>
      <p>Using the techniques outlined in the previous section, for each page the most
likely area from <inline-formula><tex-math>A_k</tex-math></inline-formula> is determined (for different values of k). To evaluate the
performance of our method, we calculate the accuracy, defined as the percentage
of the test pages that were classified in the correct area, i.e. the area actually
containing the location of page d. In addition, we also look at how many of the
Wikipedia pages are correctly georeferenced within a 1 km radius, a 5 km radius,
etc.</p>
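      <p>The two evaluation measures can be sketched as follows. The (latitude, longitude) tuples, the haversine approximation of the geodesic distance with a mean Earth radius of 6371 km, the default list of radii and the function names are assumptions for illustration.</p>
      <preformat>
```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def accuracy(predicted_areas, true_areas):
    """Share of test pages assigned to the area containing their location."""
    hits = sum(p == t for p, t in zip(predicted_areas, true_areas))
    return hits / len(true_areas)


def within_radius(predicted, gold, radii_km=(1, 5, 50, 100)):
    """For each radius, the share of pages located within that distance."""
    errors = [haversine_km(p, g) for p, g in zip(predicted, gold)]
    return {r: sum(r >= e for e in errors) / len(errors) for r in radii_km}
```
      </preformat>
      <p>Here predicted holds the medoid of the area chosen for each test page and gold holds the coordinates recorded for that page; both functions return fractions, which can be multiplied by 100 to obtain percentages.</p>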
      <p>Our main interest is in comparing the methods proposed in Section 3 with the
performance of Yahoo! Placemaker, a freely available and popular web service capable
of geoparsing entire documents and web pages. Provided with free-form text,
Placemaker identifies places mentioned in the text, disambiguates those places and
returns the corresponding locations. It is important to note that this approach
uses external geographical knowledge such as gazetteers and other undocumented
sources of information. In contrast, our approach uses only the knowledge derived
from the tags of georeferenced Flickr photos.</p>
      <p>
        In a first experiment, we compare the results of language models trained
at different resolutions, i.e. different numbers of clusters k. Table 1 shows the
results for k varying from 50 to 17,500, where we consider the basic variant
in which the entire Wikipedia page is scanned for tag occurrences. There is a
trade-off to be found, where finer-grained areas lead to more precise locations,
provided that the correct area is found, while coarse-grained areas lead to a
higher accuracy and to an increased likelihood that the found location is within
a certain broad radius. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], it was found that the optimal number of clusters
for georeferencing Flickr photos was somewhere between 2500 and 7500, with the
optimum being higher for photos with more informative tags. In contrast, the
results from Table 1 reveal that in the case of Wikipedia pages, it is beneficial
to further increase the number of clusters. This finding seems to be related to
the intuition that Wikipedia pages contain more informative descriptions than
Flickr photos. Comparing our results with Placemaker, we find a substantial
improvement in all categories, which is most pronounced in the 1 km range,
where the number of correct locations for our language modeling approach is 3
to 4 times higher than for Placemaker.
      </p>
      <p>In a second experiment, we analyzed the effect of only looking at certain
regions of a Wikipedia page, as discussed in Section 3. As the results in Table 2
show, when using the abstract, comparable results are obtained, which is
interesting as this method only uses a small portion of the page. When only looking
at section titles, anchor text and emphasized words (method keywords), the results are
considerably better. Especially for the 50 km and 100 km categories, the improvement is
substantial. This seems to confirm the intuition that tag occurrences in section titles,
anchor text and emphasized words are more likely to be spatially relevant.</p>
    </sec>
    <sec id="sec-5">
      <title>Related work</title>
      <p>
        Techniques to (automatically) determine the geographical scope of web resources
commonly use resources such as gazetteers (geographical indexes), and tables
with locations corresponding to IP addresses, zip codes or telephone prefix codes.
These resources are often handcrafted, which is time-consuming and expensive,
although this results in accurate geographical information. Unfortunately, many
of these sources are not freely available and their coverage varies highly from
country to country. Even when sufficiently accurate resources are available, one of the
main problems in georeferencing web pages is dealing with the high ambiguity
of toponyms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For example, when an occurrence of Paris is encountered, one
first needs to disambiguate between a person and a place, and in case it
refers to a place, between different locations with that name (e.g. Paris, France
and Paris, Texas).
      </p>
      <p>
        It is only recently that alternative ways have been proposed to georeference
resources. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], names of places are extracted from Flickr tags on a subset
of around 50,000 photos. As studied by L. Hollenstein in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], collaborative
tagging-based systems are also useful to acquire information about the location
of vernacular place names. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], methods based on WordNet and Naive Bayes
classification are compared for the automatic detection of toponyms within
articles. To the best of our knowledge, however, approaches for georeferencing
Wikipedia pages, or web pages in general, without using a gazetteer or other
forms of structured geographic knowledge have not yet been proposed in the
literature.
      </p>
      <p>
        An interesting line of work aims at automatically completing the infobox
of a Wikipedia page by analyzing the content of that page [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This work is
related to ours in the sense that semantic information about Wikipedia pages
is made explicit. Such a strategy can be used to improve semantic knowledge
bases, such as YAGO2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which now contains over 10 million entities derived
from Wikipedia, WordNet and GeoNames. Similarly, in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a gazetteer was
constructed based on geotagged Wikipedia pages. In particular, relations between
pages are extracted from available geographical information (e.g. New York is
part of the United States). Increasing the number of georeferenced articles may
thus lead to better informed gazetteers.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper, we investigated the possibility of using language models trained
on georeferenced Flickr photos for finding the coordinates of Wikipedia pages.
Our experiments show that for Wikipedia pages about specific locations, the
proposed approach can substantially outperform Yahoo! Placemaker, a popular
approach for finding the geographic scope of a web page. This is remarkable as
Placemaker crucially depends on gazetteers and other forms of structured
geographic knowledge, and is moreover based on advanced techniques for
dealing with issues such as ambiguity. Our method, on the other hand, only uses
information that was obtained from freely available, user-contributed data, in
the form of georeferenced Flickr photos, and uses standard language modeling
techniques.</p>
      <p>These results suggest that the implicit spatial information that arises from
the tagging behavior of users may have a stronger role to play in the field of
geographic information retrieval, which is currently still dominated by
gazetteer-based approaches. Moreover, as the number of georeferenced Flickr photos is
constantly increasing, the spatial models that could be derived are constantly
improving. Further work is needed to compare the information contained implicitly
in such language models with the explicit information contained in gazetteers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>A comparison of methods for the automatic identification of locations in Wikipedia</article-title>
          .
          <source>In Proceedings of the 4th ACM Workshop on Geographical Information Retrieval</source>
          , pages
          <fpage>89</fpage>
          –
          <lpage>92</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Goodchild</surname>
          </string-name>
          .
          <article-title>Citizens as sensors: the world of volunteered geography</article-title>
          .
          <source>GeoJournal</source>
          ,
          <volume>69</volume>
          :
          <fpage>211</fpage>
          –
          <lpage>221</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Berberich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lewis-Kelham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>de Melo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>YAGO2: exploring and querying world knowledge in time, space, context, and many languages</article-title>
          .
          <source>In Proceedings of the 20th International Conference on World Wide Web</source>
          , pages
          <fpage>229</fpage>
          –
          <lpage>232</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Hollenstein</surname>
          </string-name>
          .
          <article-title>Capturing vernacular geography from georeferenced tags</article-title>
          .
          <source>Master's thesis</source>
          , University of Zurich,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <article-title>Automatic tagging and geotagging in video collections and communities</article-title>
          .
          <source>In Proceedings of the ACM International Conference on Multimedia Retrieval</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Leidner</surname>
          </string-name>
          .
          <article-title>Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names</article-title>
          .
          <source>PhD thesis</source>
          , University of Edinburgh,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          .
          <article-title>Spatiotemporal mapping of Wikipedia concepts</article-title>
          .
          <source>In Proceedings of the 10th Annual Joint Conference on Digital Libraries</source>
          , pages
          <fpage>129</fpage>
          –
          <lpage>138</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>T.</given-names>
            <surname>Rattenbury</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Naaman</surname>
          </string-name>
          .
          <article-title>Methods for extracting place semantics from Flickr tags</article-title>
          .
          <source>ACM Transactions on the Web</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>30</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>van Zwol</surname>
          </string-name>
          .
          <article-title>Placing Flickr photos on a map</article-title>
          .
          <source>In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>484</fpage>
          –
          <lpage>491</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Finding locations of Flickr resources using language models and similarity search</article-title>
          .
          <source>In Proceedings of the ACM International Conference on Multimedia Retrieval</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <article-title>Autonomously semantifying Wikipedia</article-title>
          .
          <source>In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management</source>
          , pages
          <fpage>41</fpage>
          –
          <lpage>50</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>