<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ambiguous Place Names on the Web</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Natural Language Engineering Lab., ELiRF Research Group, Dpto. de Sistemas Informaticos y Computacion (DSIC), Universidad Politecnica de Valencia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Geographical information is becoming increasingly important on the World Wide Web. Every day, the number of users looking for geographically constrained information grows. Map-based services, such as Google Maps or Yahoo! Maps, provide users with a graphical interface that visualizes results on maps. However, most of the geographical information contained in web documents is represented by means of toponyms, which in many cases are ambiguous. It is therefore important to disambiguate toponyms properly in order to improve the accuracy of web searches. The advent of the Semantic Web will make it possible to overcome this issue by labelling documents with geographical IDs. In this paper we discuss the problems of using toponyms in web documents instead of identifying places with tools such as Geonames RDF, focusing on the errors that affect a prototype geographical web search engine, Geooreka!, currently under development.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Users' interest in geographically constrained information on the Web has
increased over the past years, boosted by the availability of services such as
Google Maps (http://maps.google.com). Sanderson and Kohler [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] showed that 18.6% of the queries
submitted to the Excite search engine contained at least one geographic term, while
Gan et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] estimated that 12.94% of the queries submitted to the AOL search
engine expressed a geographically constrained information need. Most of the
geographical information contained in the Web and in unstructured text is expressed
by toponyms, or place names. Two main problems derive from
using toponyms to represent geographical information. The first is the
polysemy of toponyms, or toponym ambiguity: a toponym may be used to represent
more than one place. For example, "Puebla" may indicate the city
at 19°3′N, 98°12′W, the state in which it is contained, a suburb of Mexicali in
the state of Baja California, or three more small towns in Mexico. The second
problem is that the mere inclusion of a toponym in a document does not always
mean that the document is geographically relevant with respect to the region or
area represented by the toponym. The first problem is addressed
by the Toponym Disambiguation (TD) task, also named toponym grounding
or resolution; the second is addressed by Geographic Scope
Resolution, which is itself affected by the problem of toponym ambiguity [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The Geonames ontology (http://www.geonames.org/ontology/) provides users with RDF descriptions of more than
6 million places. The use of this ontology would make it possible to include geospatial
semantic information in the Web, eliminating the need for toponym disambiguation.
Unfortunately, as noted by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], on the Web "references to geographical locations
remain unstructured and typically implicit in nature", resulting in a "lack of
explicit spatial knowledge within the Web" which "makes it difficult to service
user needs for location-specific information". In this paper, with the help of the
Geooreka! system (http://www.geooreka.eu) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a prototype web search engine developed at the
Universidad Politecnica de Valencia in Spain, we will discuss the problems that users interested
in geographically constrained information may encounter because of the ambiguity
of toponyms on the Web.
      </p>
      <p>
        We would like to thank the TIN2009-13391-C04-03 research project for partially
supporting this work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Geooreka!: a Geographical Web Search Engine</title>
      <p>
        Geooreka! is a search engine developed on the basis of our experiences at
GeoCLEF (http://ir.shef.ac.uk/geoclef/) [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ], which suggested to us that term-based queries may not be
the optimal method to express a geographically constrained information need.
For instance, it is common for users to employ vernacular names that have a vague
spatial extent and do not correspond to the official administrative place
name terminology. Another issue is the use of vague geographical constraints that
are difficult to translate automatically from natural language into a precise
query. For instance, the query "Cultivos de tabaco al este de Puebla" ("Tobacco
plantations East of Puebla") presents a double problem: the place name is
ambiguous, and the geographical constraint "East of" is
vague (for instance, it does not specify whether the search should be constrained to
Mexico or extend to other countries).
      </p>
      <p>
        These issues are addressed in Geooreka! by allowing users to specify their
geographical information needs through a map-based interface. The user writes a
natural language query to represent the query theme (e.g., "Cultivos
de tabaco") and selects a rectangular box on a map (Figure 1), representing
the geographical footprint of the query. All toponyms in the box are retrieved from a
PostGIS database, and the Web is then queried in order to find the maximum
Mutual Information (MI) between the thematic part of the query and all the
places retrieved. The complete architecture of the system is shown in
Figure 2. Web counts and MI are used to determine which theme-toponym
combinations are most relevant with respect to the information need expressed
by the user (Selection of Relevant Queries). In order to speed up the process,
web counts are calculated using the static Google Web 1T database
(http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13), indexed
using the jWeb1T interface [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], whereas Yahoo! Search is used to retrieve the
results of the queries composed of a theme and a toponym.
The key issue in the selection of the relevant queries is to obtain a relevance
model that is able to select the theme-toponym pairs that are most likely to
satisfy the user's information need. Following probability theory,
we assume that the two components of a query, a theme T and a place G,
are independent if conditioning on one does not change the probability of the other, i.e.,
<inline-formula><tex-math>p(T \mid G) = p(T)</tex-math></inline-formula> and
<inline-formula><tex-math>p(G \mid T) = p(G)</tex-math></inline-formula>, or, equivalently, if their joint probability is the product
of their individual probabilities:
        <disp-formula id="eq1"><label>(1)</label><tex-math>\hat{p}(T \cap G) = p(G)\,p(T)</tex-math></disp-formula>
      </p>
      <p>If probabilities are calculated using page counts, that is, as the number of
pages in which the term (or phrase) representing the theme or toponym appears,
divided by <inline-formula><tex-math>F_{max} = 2{,}147{,}436{,}244</tex-math></inline-formula>, the maximum term frequency
contained in the Google Web 1T database, then <inline-formula><tex-math>\hat{p}(T \cap G)</tex-math></inline-formula> is the expected probability
of co-occurrence of T and G in the same web page. Clearly, this is only
a rough estimate of whether T actually occurred in G, since the mere inclusion
of G in a page where T is mentioned does not guarantee a semantic relation
between G and T.</p>
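      <p>The estimate described above can be illustrated with a minimal Python sketch. The function names and the example counts are hypothetical and are not taken from the Geooreka! implementation; only the value of F_max comes from the text.</p>
      <preformat>
```python
# Maximum term frequency in the Google Web 1T database, as stated in the text.
F_MAX = 2_147_436_244


def probability(page_count: int) -> float:
    """Estimate p(term) as its page count divided by F_MAX."""
    if page_count > F_MAX or 0 > page_count:
        raise ValueError("page count must lie between 0 and F_MAX")
    return page_count / F_MAX


def expected_joint(theme_count: int, place_count: int) -> float:
    """Expected co-occurrence probability under independence: p(T) * p(G)."""
    return probability(theme_count) * probability(place_count)


# Hypothetical counts, chosen only to illustrate the arithmetic.
theme_count = 1_000_000   # pages containing the theme phrase
place_count = 5_000_000   # pages containing the toponym
p_hat = expected_joint(theme_count, place_count)
print(f"expected p(T and G) = {p_hat:.3e}")
```
      </preformat>
      <p>Because every count is divided by the same constant, the estimates are comparable across terms, but they inherit the limitation noted above: co-occurrence in a page is only a proxy for a semantic relation.</p>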
      <p>
        Considering this model for the independence of theme and place, we can
measure the divergence of the expected probability <inline-formula><tex-math>\hat{p}(T \cap G)</tex-math></inline-formula> from the observed
probability <inline-formula><tex-math>p(T \cap G)</tex-math></inline-formula>: the larger the divergence, the more informative the result
of the query. The Kullback-Leibler measure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is commonly used to
determine the divergence of two probability distributions.
      </p>
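      <p>
        <disp-formula id="eq2"><label>(2)</label><tex-math>D_{KL}\bigl(p(T \cap G) \,\|\, \hat{p}(T \cap G)\bigr) = p(T \cap G) \log \frac{p(T \cap G)}{p(T)\,p(G)}</tex-math></disp-formula>
This formula is exactly one of the formulations of the Mutual Information (MI)
of T and G, usually denoted as I(T; G).</p>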
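      <p>As an illustration, the score in Equation (2) could be computed from page counts and used to rank the toponyms found in the selected box. This is a minimal sketch under stated assumptions: the function names, the convention of returning zero when a count is zero, and all counts below are hypothetical, and the natural logarithm is assumed because the text does not specify a base.</p>
      <preformat>
```python
import math

F_MAX = 2_147_436_244  # maximum term frequency in Google Web 1T (from the text)


def mi_score(theme_count: int, place_count: int, joint_count: int) -> float:
    """Score p(T,G) * log(p(T,G) / (p(T) * p(G))), with counts divided by F_MAX."""
    if joint_count == 0 or theme_count == 0 or place_count == 0:
        return 0.0  # assumed convention: no co-occurrence evidence scores zero
    p_t = theme_count / F_MAX
    p_g = place_count / F_MAX
    p_tg = joint_count / F_MAX
    return p_tg * math.log(p_tg / (p_t * p_g))


def rank_places(theme_count, places):
    """Sort (name, place_count, joint_count) triples by decreasing score."""
    scored = [(name, mi_score(theme_count, pc, jc)) for name, pc, jc in places]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts for one theme and three toponyms inside the box.
places = [
    ("Puebla", 4_000_000, 900),
    ("Place B", 300_000, 400),
    ("Place C", 200_000, 2),
]
for name, score in rank_places(50_000, places):
    print(f"{name}: {score:.3e}")
```
      </preformat>
      <p>A pair that co-occurs exactly as often as independence predicts scores zero, while a pair that co-occurs less often than expected scores below zero, so it falls to the bottom of the ranking.</p>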
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        Geooreka! was evaluated on the GeoCLEF 2005 test set, in order to
compare the results obtained when the geographic footprint is specified by
means of keywords with those obtained when it is defined through a map-based
interface. In this setup, only the topic title
was used as input for the thematic part of Geooreka!, while the area
corresponding to the geographic scope of the topic was selected manually. Probabilities
were calculated using the number of occurrences in the GeoCLEF collection.
Occurrences of toponyms were calculated by taking into account only the geo
index. The results were calculated over the 25 topics of GeoCLEF 2005, minus
the queries in which the geographic footprint was composed of disjoint areas (for
instance, "Europe" and "USA" or "California" and "Australia"), which could
not be processed by Geooreka!. Mean Reciprocal Rank (MRR) was used as a
measure of accuracy. The GIR system GeoWorSE, in which queries are specified
by text, was used as a baseline [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Table 1 displays the obtained results.
      </p>
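      <p>For readers unfamiliar with the measure, the following sketch shows how MRR is commonly computed. It assumes the standard definition (the reciprocal of the rank of the first relevant document, averaged over topics, with zero when no relevant document is retrieved); the text does not detail the exact evaluation procedure, and the document identifiers are made up.</p>
      <preformat>
```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("at least one topic is required")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical topics: document identifiers are invented for illustration.
runs = [
    (["d3", "d1", "d7"], {"d1"}),   # first relevant hit at rank 2
    (["d5", "d2"], {"d5", "d2"}),   # first relevant hit at rank 1
    (["d9", "d8"], {"d4"}),         # no relevant document retrieved
]
print(mean_reciprocal_rank(runs))  # prints 0.5
```
      </preformat>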
      <p>The results show that the web-based results are noticeably worse than those
obtained on the static collection. There are two main reasons for this. First,
the topics were tailored to the GeoCLEF collection; some
topics therefore refer explicitly to events that are particularly prominent in the collection
and are easier to retrieve. For instance, query GC-005 "Japanese Rice Imports"
targets documents regarding the opening of the Japanese rice market to other
countries for the first time; "Japan" and "Rice" appear together in the document
collection only in such documents, so it is easier to retrieve the relevant
documents when searching the GeoCLEF collection.</p>
      <p>The second factor affecting the results of the Web-based system is the
ambiguity of toponyms, which prevents a correct estimation of the probabilities
for places. For instance, in the results obtained for topic GC-008 ("Milk
Consumption in Europe"), the MI obtained for "Turkey" was abnormally high with
respect to the expected value for this country. The reason is that in most
documents the name "turkey" referred to the animal and not to the country.
This kind of ambiguity is one of the most important issues when
estimating the probability of occurrence of places. Ambiguity (or, better, the
polysemy of toponyms) grows together with the size and the scope of the
collection being searched. Moreover, the GeoCLEF collection was semantically tagged
using WordNet and Geonames IDs to identify the places referenced by toponyms,
whereas Web content is rarely tagged with precise IDs, which increases the
chance of error in the estimation of probabilities for places that share the same
name.</p>
      <p>
        Three kinds of toponym ambiguity can be recognised (after the
two main types identified by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]):
        <list list-type="bullet">
          <list-item><p>Geo / Non-Geo ambiguity: a toponym is ambiguous with respect
to another class of name (such as "Turkey", which may be the animal or the
country);</p></list-item>
          <list-item><p>Geo / Geo ambiguity of different class: for instance, "Puebla" the city or the
state;</p></list-item>
          <list-item><p>Same class Geo / Geo ambiguity: for instance, distinct small towns in Mexico
that share the name "Puebla".</p></list-item>
        </list>
      </p>
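      <p>The three kinds of ambiguity can be made concrete with a small sketch that inspects the possible referents of a name. The record layout, the class labels and the function name are hypothetical; a real system would obtain the referents from a gazetteer or ontology rather than from hand-written lists.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Referent:
    """One possible meaning of a name (hypothetical record layout)."""
    description: str
    is_place: bool
    place_class: str = ""  # e.g. "city", "state"; empty for non-places


def ambiguity_types(referents):
    """Return the set of ambiguity types that a name's referents exhibit."""
    found = set()
    places = [r for r in referents if r.is_place]
    if places and len(places) != len(referents):
        found.add("geo/non-geo")
    classes = [r.place_class for r in places]
    if len(set(classes)) > 1:
        found.add("geo/geo, different class")
    if len(classes) > len(set(classes)):
        found.add("geo/geo, same class")
    return found


turkey = [Referent("the country", True, "country"), Referent("the animal", False)]
# Only two of the small towns mentioned in the Introduction are listed here.
puebla = [
    Referent("city of Puebla", True, "city"),
    Referent("state of Puebla", True, "state"),
    Referent("suburb of Mexicali", True, "suburb"),
    Referent("small town (1)", True, "town"),
    Referent("small town (2)", True, "town"),
]
print(ambiguity_types(turkey))
print(ambiguity_types(puebla))
```
      </preformat>
      <p>A single name can exhibit several kinds at once: with the referents listed here, "Puebla" shows both the different-class and the same-class Geo / Geo ambiguity.</p>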
      <p>The solution in all cases would be to use an ontology to precisely identify places
in documents; the only difference is the amount of information that the ontology
should include. For the first type of ambiguity, the only information needed is
whether the name represents a place or not. In the second case, we would also
need to know the class of the place. Finally, in the same class Geo / Geo ambiguity, we may
differentiate places using their coordinates, the entity that contains them,
or both. The Geonames ontology contains all this information and represents
the best option for geographically tagging place names.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The results obtained with Geooreka! over a static, semantically labelled (at least
from a geographical viewpoint) collection, compared to the results obtained on
the Web, showed that the imprecise identification of places is a problem for
search engines intended for users who are interested in searching for geographically
constrained information. The use of precise semantic tagging schemes for
toponyms, such as Geonames RDF, would allow these search engines to produce
more reliable results. Spreading the use of geographical tagging for the Semantic
Web would also allow users to mine information using geographical constraints
in a more effective way. In this sense, we would like to encourage the use of
Geonames in order to produce accurate geographically tagged Web content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohler</surname>
          </string-name>
          , J.:
          <article-title>Analyzing geographic queries</article-title>
          .
          <source>In: Proceedings of Workshop on Geographic Information Retrieval (GIR04)</source>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gan</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Attenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markowetz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Analysis of geographic queries in a search engine log</article-title>
          .
          <source>In: LOCWEB '08: Proceedings of the first international workshop on Location and the web</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>49</fpage>
          –
          <lpage>56</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Andogah</surname>
          </string-name>
          , G.:
          <article-title>Geographically Constrained Information Retrieval</article-title>
          .
          <source>PhD thesis</source>
          , University of Groningen (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Boll</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kansa</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kishor</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naaman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Purves</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scharl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilde</surname>
          </string-name>
          , E.:
          <article-title>Location and the web (LocWeb 2008)</article-title>
          .
          <source>In: Proceeding of the 17th international conference on World Wide Web. WWW '08</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>1261</fpage>
          –
          <lpage>1262</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buscaldi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Geooreka: Enhancing Web Searches with Geographical Information</article-title>
          .
          <source>In: Proc. Italian Symposium on Advanced Database Systems SEBD2009</source>
          , Camogli, Italy (
          <year>2009</year>
          )
          <fpage>205</fpage>
          –
          <lpage>212</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Buscaldi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Using the WordNet Ontology in the GeoCLEF Geographical Information Retrieval Task</article-title>
          . In Peters,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.C.</given-names>
            ,
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.J.</given-names>
            ,
            <surname>Kluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>de Rijke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Giampiccolo</surname>
          </string-name>
          , D., eds.:
          <source>Accessing Multilingual Information Repositories. Volume 4022 of Lecture Notes in Computer Science</source>
          . Springer, Berlin (
          <year>2006</year>
          )
          <fpage>939</fpage>
          –
          <lpage>946</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Buscaldi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On the relative importance of toponyms in geoclef</article-title>
          .
          <source>In: Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers</source>
          , Springer (
          <year>2007</year>
          )
          <fpage>815</fpage>
          –
          <lpage>822</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Giuliano</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>jWeb1T: a library for searching the Web 1T 5-gram corpus</article-title>
          . (
          <year>2007</year>
          ) Software available at http://tcc.itc.it/research/textec/toolsresources/jweb1t.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibler</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>On Information and Sufficiency</article-title>
          .
          <source>Annals of Mathematical Statistics</source>
          <volume>22</volume>
          (
          <issue>1</issue>
          ) (
          <year>1951</year>
          ) pp.
          <fpage>79</fpage>
          –
          <lpage>86</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Buscaldi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Using GeoWordNet for Geographical Information Retrieval</article-title>
          .
          <source>In: Evaluating Systems for Multilingual and Multimodal Information Access, 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers</source>
          . (
          <year>2009</year>
          )
          <fpage>863</fpage>
          –
          <lpage>866</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Amitay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sivan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soffer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Web-a-where: Geotagging web content</article-title>
          .
          <source>In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Sheffield, UK (
          <year>2004</year>
          )
          <fpage>273</fpage>
          –
          <lpage>280</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>