<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The UPV at GeoCLEF 2008: The GeoWorSE System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Buscaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prossog@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natural Language Engineering Lab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Geographical Information Retrieval, Index Term Expansion, Map-based Filtering</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpto. de Sistemas Informáticos y Computación</institution>
          ,
          <addr-line>DSIC</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Politécnica de Valencia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This year our system was complemented with a map-based filter. During the indexing phase, all places are disambiguated and assigned their coordinates on the map. These coordinates are stored in a separate index. The search process is carried out in two phases: in the first one, we search the collection with the same method applied in 2007, which exploits the expansion of index terms by means of WordNet synonyms and holonyms. The second phase re-ranks the results of the first phase according to the distance of document toponyms from the toponyms in the query, or according to whether the document contains toponyms that fall within an area defined by the query. The area is calculated from the toponyms in the query and their meronyms. This is the first attempt to use GeoWordNet, a resource that includes the geographical coordinates of the places listed in WordNet, for the Geographical Information Retrieval task. The results show that map-based filtering improves the results obtained by the base system, which relies only on textual information.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>for information about Spain to also find documents containing Valencia, Madrid or Barcelona,
although the original document does not contain the word "Spain".</p>
      <p>
        For our 2008 participation, we attempted to improve the method by introducing map-based
filtering. The most successful methods in 2006 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and 2007 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] both combined textual retrieval with
geography-based filtering and ranking. This observation prompted us to introduce a similar
feature in our system. The main obstacle was that we use WordNet,
which does not provide geographical coordinates for toponyms. Therefore, we first had to
develop GeoWordNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a georeferenced version of WordNet. By combining this resource with
the WordNet-based toponym disambiguation algorithm in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we are able to assign to the place
names in the collection their actual geographical coordinates and to perform some geographical
reasoning. We called the resulting system GeoWorSE (an acronym for Geographical WordNet
Search Engine).
      </p>
      <p>In the following section, we describe the GeoWorSE system. In Section 3 we describe the
characteristics of our submissions and the results obtained.</p>
    </sec>
    <sec id="sec-2">
      <title>The GeoWorSE System</title>
      <p>
        The core of the system is the Lucene<sup>1</sup> open source search engine, version 2.1.
Named Entity Recognition and classification is carried out by the Stanford NER system, which is based
on Conditional Random Fields [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. WordNet is accessed through the MIT Java WordNet
Interface<sup>2</sup>. The toponym disambiguator is based on the method presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Indexing</title>
        <p>During the indexing phase, the documents are examined in order to find location names (toponyms)
by means of the Stanford NER system. When a toponym is found, the disambiguator determines
the correct reference for the toponym. Then, a modified Lucene indexer adds the toponym
coordinates (retrieved from GeoWordNet) to the geo index; finally, it stores the toponym in the wn index
together with its holonyms and synonyms. All document terms are stored in the text index.
Figure 1 shows the architecture of the indexing module.</p>
        <p><sup>1</sup> http://lucene.apache.org/</p>
        <p><sup>2</sup> http://www.mit.edu/~markaf/projects/wordnet/</p>
        <p>The indices are then used in the search phase. The geo index is not searched directly:
it is used only to retrieve the coordinates of the toponyms in the document.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Searching</title>
        <p>The architecture of the search module is shown in Figure 2.</p>
        <p>The topic text is searched by Lucene in the text index. All the toponyms are extracted by the
Stanford NER and searched for by Lucene in the wn index with a weight of 0.25 with respect to the
content terms. The result of the search is a list of documents ranked using Lucene's weighting
scheme (basically, this is the output that the system presented in 2007 would have returned).
At the same time, the toponyms are passed to the GeoAnalyzer, which creates a geographical
constraint that is used to re-rank the document list. The GeoAnalyzer may return two types of
geographical constraints:</p>
        <list list-type="bullet">
          <list-item>
            <p>a distance constraint, corresponding to a point on the map: the documents that contain
locations closer to this point will be ranked higher;</p>
          </list-item>
          <list-item>
            <p>an area constraint, corresponding to a polygon on the map: the documents that contain
locations included in the polygon will be ranked higher.</p>
          </list-item>
        </list>
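        <p>As an illustration only, the weighting of toponyms against content terms could be expressed with the boost operator of Lucene's classic query syntax. The sketch below is not the code used in GeoWorSE: the field names text and wn, the helper build_query and the idea of treating the two indices as fields of a single query are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def build_query(content_terms, toponyms, wn_weight=0.25):
    """Build a Lucene-style query string (hypothetical helper).

    Content terms go to an assumed `text` field; toponyms go to an
    assumed `wn` field and are down-weighted with the `^` boost operator.
    """
    parts = [f"text:{term}" for term in content_terms]
    for name in toponyms:
        # Multi-word toponyms are quoted so they are treated as a phrase.
        term = f'"{name}"' if " " in name else name
        parts.append(f"wn:{term}^{wn_weight}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["riots", "prisons"], ["south america"]))
```
]]></preformat>
        <p>With the default weight, the toponym clause contributes a quarter of the boost given to each content term.</p>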
        <p>
          For instance, topic 10.2452/58-GC contains a distance constraint: "Travel problems at
major airports near to London". Topic 10.2452/76-GC contains an area constraint: "Riots in
South American prisons". The GeoAnalyzer determines the area using WordNet meronyms: South
America is expanded to its meronyms Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador,
Guyana, Paraguay, Peru, Uruguay and Venezuela. The area is obtained by calculating the convex
hull of the points associated with the meronyms using the Graham algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
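        <p>The convex hull step can be sketched as follows. This is a minimal Graham scan over planar (x, y) points, given for illustration and not taken from the GeoWorSE code; treating longitude and latitude as planar coordinates is an assumption, and the sample points are toy values rather than real meronym coordinates.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from functools import cmp_to_key


def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def graham_scan(points):
    """Return the convex hull in counter-clockwise order, starting at the pivot."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    # Pivot: the lowest point, ties broken by the smallest x.
    pivot = min(pts, key=lambda p: (p[1], p[0]))

    def dist2(p):
        return (p[0] - pivot[0]) ** 2 + (p[1] - pivot[1]) ** 2

    def by_angle(a, b):
        # Sort by polar angle around the pivot; collinear points nearest first.
        c = cross(pivot, a, b)
        if c != 0:
            return -1 if c > 0 else 1
        return (dist2(a) > dist2(b)) - (dist2(a) < dist2(b))

    rest = sorted((p for p in pts if p != pivot), key=cmp_to_key(by_angle))
    hull = [pivot]
    for p in rest:
        # Drop points that would create a clockwise or straight turn.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    toy_points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    print(graham_scan(toy_points))
```
]]></preformat>
        <p>In the toy example the interior point (1, 1) is discarded and only the four corners remain on the hull.</p>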
        <p>The topic narrative makes it possible to increase the precision of the considered area, since the toponyms
in the narrative are also expanded to their meronyms (when possible). Figure 3 shows the convex
hulls of the points corresponding to the meronyms of "South America", using only topic and
description (left) or all the fields, including narrative (right).</p>
        <p>The objective of the GeoFilter module is to re-rank the documents retrieved by Lucene
according to geographical information. If the constraint extracted from the topic is a distance constraint,
the weights of the documents are modified according to the following formula:
<disp-formula><label>(1)</label><tex-math>w(doc) = w_{Lucene}(doc) \cdot \left(1 + \exp\left(-\min_{p \in P} d(q, p)\right)\right)</tex-math></disp-formula></p>
        <p>where w<sub>Lucene</sub> is the weight returned by Lucene for the document doc, P is the set of points
in the document, q is the point extracted from the topic, and d(q, p) is the distance between q and p.</p>
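        <p>A minimal sketch of the distance-based re-ranking is given below for illustration. It assumes that the exponent is negative, so that documents mentioning places closer to q receive a larger boost, as the description of the distance constraint requires. The function name, the use of plain Euclidean distance and the choice to leave documents without points unchanged are assumptions of the example, not details stated in the paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def rerank_by_distance(w_lucene, doc_points, q, dist=math.dist):
    """Re-weight a document as w_lucene * (1 + exp(-min_p d(q, p))).

    `doc_points` is the set P of points found in the document and `q` is
    the point extracted from the topic. `dist` defaults to Euclidean
    distance (an assumption; the unit of distance is not specified).
    """
    if not doc_points:
        # Assumption: a document without georeferenced places keeps its weight.
        return w_lucene
    nearest = min(dist(q, p) for p in doc_points)
    return w_lucene * (1.0 + math.exp(-nearest))


if __name__ == "__main__":
    q = (0.0, 0.0)
    print(rerank_by_distance(2.0, [(0.0, 0.0), (5.0, 5.0)], q))  # nearest distance is 0
    print(rerank_by_distance(2.0, [(3.0, 4.0)], q))              # nearest distance is 5
```
]]></preformat>
        <p>The multiplier lies between 1 and 2: it equals 2 when a document point coincides with q and tends to 1 as the nearest point moves away.</p>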
        <p>If the constraint extracted from the topic is an area constraint, the weights of the documents
are modified according to formula 2:
<disp-formula><label>(2)</label><tex-math>w(doc) = w_{Lucene}(doc) \cdot \left(1 + \frac{|P_q|}{|P|}\right)</tex-math></disp-formula>
where P<sub>q</sub> is the set of points in the document that are contained in the area extracted from
the topic.</p>
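        <p>The area-based re-ranking can be sketched in the same way. The example below is illustrative only: it tests containment with a standard ray-casting routine over a polygon given as a list of vertices, which is an assumption, since the paper does not say how containment is computed. Points lying exactly on the polygon boundary may be classified either way by this test, and documents without points are left unchanged by assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def point_in_polygon(pt, polygon):
    """Ray-casting test: count crossings of a horizontal ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rerank_by_area(w_lucene, doc_points, polygon):
    """Re-weight a document as w_lucene * (1 + |P_q| / |P|)."""
    if not doc_points:
        # Assumption: a document without georeferenced places keeps its weight.
        return w_lucene
    in_area = sum(1 for p in doc_points if point_in_polygon(p, polygon))
    return w_lucene * (1.0 + in_area / len(doc_points))


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (4, 4), (0, 4)]       # toy area
    points = [(1, 1), (2, 3), (9, 9), (5, 1)]       # two inside, two outside
    print(rerank_by_area(2.0, points, square))
```
]]></preformat>
        <p>In the toy example half of the document points fall inside the area, so the Lucene weight is multiplied by 1.5.</p>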
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We submitted a total of 6 runs at GeoCLEF 2008. Two runs were used as "benchmarks": they were
obtained by using the base Lucene system, without index term expansion, in one case considering
only topic title and description, and all fields in the other case. One run was generated with
the system we presented in 2007 (without the re-ranking by the GeoFilter module). For the three
remaining submissions we used the new system with title and description only; with title, description
and narrative; and with a configuration that does not use WordNet information during the search phase.</p>
      <p>In Table 1 we show the results obtained in terms of Mean Average Precision and R-Precision
for all the submitted runs.</p>
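      <p>For reference, the two measures can be computed as in the sketch below, which follows their standard definitions. It is given for illustration and is not the official evaluation tool; the function names and the toy ranking are assumptions of the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document retrieved."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def r_precision(ranked, relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


def mean_average_precision(topics):
    """Average of average_precision over (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(rk, rel) for rk, rel in topics) / len(topics)


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]   # toy ranking for one topic
    relevant = {"d1", "d3"}
    print(average_precision(ranked, relevant))
    print(r_precision(ranked, relevant))
```
]]></preformat>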
      <p>The results show that the runs that used only the information contained in the
Title and Description fields were considerably better than the runs that also included the narrative,
inverting the trend of the past GeoCLEF exercises, where TDN runs were usually better than TD
ones. We analyzed the results topic by topic and compared the performance of runs that used TD
only with that of runs that also used the narrative. The topics that present the greatest difference between
the two types of runs are 10.2452/76-GC, 10.2452/77-GC and 10.2452/91-GC, in which the
use of the narrative makes the results worse. Figure 4 shows in detail the average difference between
the two types of runs.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Further Work</title>
      <p>We introduced a map-based filtering method that allowed us to improve the results obtained with
our WordNet-based method. The best results were obtained with the map-based method, taking
into account only topic title and description. We believe that the topic narrative could be used more
efficiently to improve the map-based filtering rather than using it directly during the search phase.
We plan to carry out some experiments with this configuration in order to verify our hypothesis.
We would like to thank the TIN2006-15265-C06-04 research project for partially supporting this
work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Davide</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>A conceptual density-based approach for the disambiguation of toponyms</article-title>
          .
          <source>International Journal of Geographical Information Systems</source>
          ,
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>301</fpage>
          –
          <lpage>313</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Davide</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Geo-WordNet: Automatic georeferencing of WordNet</article-title>
          . In
          <source>Proc. 5th Int. Conf. on Language Resources and Evaluation, LREC-2008</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Davide</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          , Paolo Rosso, and
          <string-name>
            <given-names>Emilio</given-names>
            <surname>Sanchis</surname>
          </string-name>
          .
          <article-title>A wordnet-based indexing technique for geographical information retrieval</article-title>
          . In Carol Peters, Fredric C. Gey, Julio Gonzalo, Henning Müller,
          <string-name>
            <given-names>Gareth J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          , Michael Kluck, Bernardo Magnini, Maarten de Rijke, and Danilo Giampiccolo, editors,
          <source>Lecture Notes in Computer Sciences</source>
          , volume
          <volume>4730</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>954</fpage>
          –
          <lpage>957</lpage>
          . Springer, Berlin,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Ferrés</surname>
          </string-name>
          and
          <string-name>
            <given-names>Horacio</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          .
          <article-title>TALP at GeoCLEF 2007: Using Terrier with Geographical Knowledge Filtering</article-title>
          . In C. Peters, editor,
          <source>CLEF 2007 Working Notes</source>
          , Budapest, Hungary,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jenny Rose</given-names>
            <surname>Finkel</surname>
          </string-name>
          , Trond Grenager, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          . In
          <source>Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005)</source>
          , pages
          <fpage>363</fpage>
          –
          <lpage>370</lpage>
          , U. of Michigan - Ann Arbor,
          <year>2005</year>
          . ACL.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ronald L.</given-names>
            <surname>Graham</surname>
          </string-name>
          .
          <article-title>An efficient algorithm for determining the convex hull of a finite planar set</article-title>
          .
          <source>Information Processing Letters</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <fpage>132</fpage>
          –
          <lpage>133</lpage>
          ,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Bruno</given-names>
            <surname>Martins</surname>
          </string-name>
          , Nuno Cardoso, Marcirio Silveira Chaves, Leonardo Andrade, and
          <string-name>
            <given-names>Mario J.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>The University of Lisbon at GeoCLEF 2006</article-title>
          . In C. Peters, editor,
          <source>CLEF 2006 Working Notes</source>
          , Alicante, Spain,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>George A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>WordNet: A lexical database for English</article-title>
          .
          <source>Communications of the ACM</source>
          , volume
          <volume>38</volume>
          , pages
          <fpage>39</fpage>
          –
          <lpage>41</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>