<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TALP at GeoCLEF-2006: Experiments Using JIRS and Lucene with the ADL Feature Type Thesaurus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Ferrés</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horacio Rodríguez</string-name>
          <email>horacio@lsi.upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>TALP Research Center</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Design</institution>
          ,
          <addr-line>Performance, Experimentation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NN wine regions rivers european Europe</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Software Department</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>wine european Europe</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our experiments in Geographical Information Retrieval (GIR) in the context of our participation in the GeoCLEF 2006 Monolingual English task. The TALPGeoIR system follows an architecture similar to that of the GeoTALP-IR system presented at GeoCLEF 2005 [2], with some changes in the Retrieval modes and the Geographical Knowledge Base. The system has four phases performed sequentially: i) a Keyword Selection algorithm based on a Linguistic and Geographical Analysis of the topics, ii) a Geographical Document Retrieval with Lucene, iii) a Document Retrieval task with the JIRS Passage Retrieval (PR) software, and iv) a Document Ranking phase. A Geographical Thesaurus (GT) has been built using a set of publicly available Geographical Gazetteers and the Alexandria Digital Library (ADL) Feature Type Thesaurus. In our experiments we have used JIRS, a state-of-the-art PR system for Question Answering (QA), for the GIR task. We have also experimented with an approach using both JIRS and Lucene. In this approach JIRS was used only for Textual Document Retrieval and Lucene was used to detect the geographically relevant documents. These experiments show that using JIRS alone gives better results than combining JIRS and Lucene.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper describes our experiments on Geographical Information Retrieval (GIR) in the context
of our participation in the GeoCLEF 2006 Monolingual English task.</p>
      <p>
        GeoCLEF is a cross-language geographic retrieval task at the CLEF 2006 campaign. Like
the first GIR task in GeoCLEF 2005 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the goal of the GeoCLEF task is to find as many
relevant documents as possible from the document collections, using a topic set. Topics are textual
descriptions with the following fields: title, description, narrative, location (e.g. geographical
places like continents, regions, countries, cities, etc.) and a geographical operator (e.g. spatial
relations like in, near, north of, etc.).
      </p>
      <p>
        Our GIR system is a modified version of the system presented in GeoCLEF 2005 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] with
some changes in the Retrieval modes and the Geographical Knowledge Base. The system has
four phases performed sequentially: i) a Keyword Selection algorithm based on a Linguistic and
Geographical Analysis of the topics, ii) a Geographical Document Retrieval with Lucene, iii) a
Document Retrieval task with the JIRS Passage Retrieval (PR) software, and iv) a Document
Ranking phase. A Geographical Thesaurus (GT) has been built using a set of publicly available
Geographical Gazetteers and the Alexandria Digital Library (ADL) Feature Type Thesaurus.
      </p>
      <p>In this paper we present the overall architecture of our Geographical IR system and briefly
describe its main components. We also present the experiments, results and conclusions in the
context of the GeoCLEF 2006 Monolingual English task.</p>
    </sec>
    <sec id="sec-2">
      <title>System Description</title>
      <sec id="sec-2-1">
        <title>Overview</title>
        <p>The system architecture has two phases that are performed sequentially: Topic Analysis (TA) and
Document Retrieval (DR). Beforehand, a Collection Pre-processing phase is applied to
the textual collections.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Collection Pre-processing</title>
        <p>We pre-processed the entire English collections, Glasgow Herald 1995 (GH95) and Los Angeles
Times 1994 (LAT94) (169,477 documents in total), with linguistic tools (described in the next
subsection) to mark the part-of-speech (POS) tags, lemmas and Named Entities (NE). After this
process the collection is analyzed with a Geographical Thesaurus (described in the next
subsection). This information was used to build two indexes: one with the Geographical information
of the documents and another with the Textual and Geographical information of the documents.
We used two Information Retrieval (IR) systems for indexing: Lucene<sup>1</sup> for the Geographical
Index and JIRS for the Textual and Geographical Index. These indexes are described below:</p>
        <list list-type="bullet">
          <list-item>
            <p>Geographical Index: this index contains the geographical information of the documents
and their Named Entities. It contains the following fields for each document:</p>
            <list list-type="bullet">
              <list-item>
                <p>docid: this field stores the document identifier.</p>
              </list-item>
              <list-item>
                <p>ftt: this field indexes the feature type of each geographical name and the Named Entity
classes of all the NEs appearing in the document.</p>
              </list-item>
              <list-item>
                <p>geo: this field indexes the geographical names and the Named Entities of the
document. It also stores the geographical information (feature type, geo-ontology path
information and coordinates) about the place names. If a place name is ambiguous,
all its possible referents are indexed.</p>
              </list-item>
            </list>
          </list-item>
          <list-item>
            <p>Textual and Geographical Index: this index stores the lemmatized content of the
document and adds geographical information (feature type, geo-ontology path information
and coordinates) about the Geographical Places of the text. If a geographical place is
ambiguous, this information is not added to the indexed content.</p>
          </list-item>
        </list>
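        <p>The difference between the two indexes in their treatment of ambiguous place names can be illustrated with a minimal Python sketch. The Referent record and the function names are hypothetical and only mirror the fields described above; the sketch covers place names only (not the other Named Entities) and does not reproduce the actual Lucene or JIRS indexing code.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Referent:
    """One candidate referent of a place name (hypothetical record layout)."""
    feature_type: Tuple[str, ...]  # feature type path, most specific class last
    path: Tuple[str, ...]          # geo-ontology path, ending in the place itself
    lat: float
    lon: float


def geo_entry(ref: Referent) -> str:
    """Join the ontology path and the coordinates with the '@@' separator."""
    return "@@".join(ref.path) + f"@@{ref.lat} {ref.lon}"


def geographical_record(docid: str, places: Dict[str, List[Referent]]) -> dict:
    """Geographical index: every referent of every place name is indexed."""
    ftt, geo = [], []
    for referents in places.values():
        for ref in referents:
            ftt.append("@@".join(ref.feature_type))
            geo.append(geo_entry(ref))
    return {"docid": docid, "ftt": ftt, "geo": geo}


def annotate_lemma(lemma: str, referents: List[Referent]) -> str:
    """Textual and Geographical index: expand only unambiguous place names."""
    if len(referents) != 1:
        return lemma  # ambiguous or unknown: keep the bare lemma
    ref = referents[0]
    parts = [lemma, ref.feature_type[-1], *ref.path, str(ref.lat), str(ref.lon)]
    return " ".join(parts)
```
]]></preformat>
        <p>With a single referent for Hejaz, annotate_lemma yields a string of the form "Hejaz countries 1st order divisions Asia Western Asia Saudi Arabia Hejaz 24.5 38.5", whereas a name with two or more referents is left unexpanded in the textual index but contributes all of its referents to the geo field.</p>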
        <sec id="sec-2-2-1">
          <title>See below an example of the two indexes:</title>
          <table-wrap>
            <table>
              <thead>
                <tr>
                  <th>IR System</th>
                  <th colspan="2">Indexed Content</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td rowspan="3">Lucene</td>
                  <td>docid</td>
                  <td>GH950102000000</td>
                </tr>
                <tr>
                  <td>ftt</td>
                  <td>regions@@land regions@@continents<break/>administrative areas@@political areas@@countries 1st order divisions<break/>administrative areas@@populated places@@cities<break/>administrative areas@@political areas@@countries<break/>. . .</td>
                </tr>
                <tr>
                  <td>geo</td>
                  <td>Europe<break/>Asia@@Western Asia@@Saudi Arabia@@Hejaz@@24.5 38.5<break/>America@@Northern America@@United States@@South Carolina@@Lodge@@32.9817 -80.952<break/>America@@Northern America@@United States@@38.91 -96.19<break/>. . .</td>
                </tr>
                <tr>
                  <td>JIRS</td>
                  <td colspan="2">. . . the role of the wheel in lamatrekking , and where be the good place to air your
string vest . pity the crew who accompany him on his travel as sayle of
Arabia countries 1st order divisions Asia Western Asia Kuwait Arabia 25.0 45.0
along the Hejaz countries 1st order divisions Asia Western Asia Saudi Arabia
Hejaz 24.5 38.5 railway line from Aleppo countries 1st order divisions
Asia Middle East Syria Aleppo 36.0 37.0 in Northern Syria countries Asia
Middle East Syria 35.0 38.0 to Aqaba cities Asia Western Asia Jordan Maa´n
Aqaba 29.517 35 in Jordan countries Asia Western Asia Jordan 31.0 36.0.
as he journey through the searing heat in an age East German ‘ biscuit tin ‘ , his good
humour be sorely test . . .</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
          <p>
            The Linguistic Analysis extracts lexico-semantic and syntactic information using the following set of Natural
Language Processing tools: i) TnT, a statistical POS tagger [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], ii) the WordNet lemmatizer
(version 2.0), iii) Spear<sup>2</sup> (a modified version of the Collins parser [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]), and iv) a Maximum
Entropy based NERC trained with the CoNLL-2003 shared task English data set.
          </p>
          <p><bold>Geographical Analysis.</bold> The Geographical Analysis is applied to the Named Entities from the title, description and
narrative tags that have been classified as LOCATION or ORGANIZATION by the NERC
module. This analysis has two main components:</p>
          <list list-type="bullet">
            <list-item>
              <p>Geographical Thesaurus: this component has been built by joining four gazetteers that
contain entries with places and their geographical class, coordinates, and other information:</p>
              <list list-type="order">
                <list-item>
                  <p>GEOnet Names Server (GNS)<sup>3</sup>: a gazetteer with worldwide coverage, excluding the United
States and Antarctica, with 5.3 million entries.</p>
                </list-item>
                <list-item>
                  <p>Geographic Names Information System (GNIS)<sup>4</sup>: a gazetteer with 2.0 million entries about
geographic features of the United States and its territories. We used a subset of 39,906
entries of the most important geographical names.</p>
                </list-item>
                <list-item>
                  <p>GeoWorldMap<sup>5</sup> World Gazetteer: a gazetteer with approximately 40,594 entries of the
most important countries, regions and cities of the world.</p>
                </list-item>
                <list-item>
                  <p>World Gazetteer<sup>6</sup>: a gazetteer with approximately 171,021 entries of towns,
administrative divisions and agglomerations with their features and current population. From
this gazetteer we added only the 29,924 cities with more than 5,000 inhabitants.</p>
                </list-item>
              </list>
              <p>Each of these gazetteers has a different set of classes. We have mapped these sets to
the ADL Feature Type Thesaurus.</p>
            </list-item>
            <list-item>
              <p>
                Feature Type Thesaurus: the feature type thesaurus of our Geographical Thesaurus
is the ADL Feature Type Thesaurus (ADLFTT), a
hierarchical set of geographical terms used to type named geographic places in English [
                <xref ref-type="bibr" rid="ref5">5</xref>
                ].
Both the GNIS and GNS gazetteers have been mapped to the ADLFTT, with a resulting set of
575 geographical types. Our GNIS mapping is similar to the one presented in [
                <xref ref-type="bibr" rid="ref5">5</xref>
                ].
              </p>
            </list-item>
          </list>
          <p><sup>2</sup> Spear. http://www.lsi.upc.edu/~surdeanu/spear.html</p>
          <p>
            <bold>Topic Keywords Selection.</bold>
            This algorithm extracts the most relevant keywords of each topic. The algorithm was designed for
GeoCLEF 2005 [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. It is applied after the Linguistic and Geographical Analysis and
has the following steps:
          </p>
          <list list-type="order">
            <list-item>
              <p>All the punctuation symbols and stopwords are removed from the analysis of the title,
description and narrative tags.</p>
            </list-item>
            <list-item>
              <p>All the words from the title tag are obtained.</p>
            </list-item>
            <list-item>
              <p>All the Noun Phrase base chunks from the description and narrative tags that contain a
word whose lemma appears in one or more words from the title are extracted.</p>
            </list-item>
            <list-item>
              <p>The words that belong to the chunks extracted in the previous step and whose lemma does
not appear in the words of the title are extracted.</p>
            </list-item>
          </list>
        </sec>
        <sec id="sec-2-2-2">
          <p>Once the keywords are extracted, three different keyword sets are created:</p>
          <list list-type="bullet">
            <list-item>
              <p>All: all the keywords extracted from the topic tags (title, description, and narrative).</p>
            </list-item>
            <list-item>
              <p>Geo: geographical places or geographical types appearing in the topic tags.</p>
            </list-item>
            <list-item>
              <p>NotGeo: all the keywords extracted from the topic tags that are not geographical place
names or geographical types.</p>
            </list-item>
          </list>
          <p><sup>3</sup> GNS. http://gnswww.nima.mil/geonames/GNS/index.jsp</p>
          <p><sup>4</sup> GNIS. http://geonames.usgs.gov/geonames/stategaz</p>
          <p><sup>5</sup> Geobytes Inc.: Geoworldmap database containing cities, regions and countries of the world with geographical</p>
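          <p>The keyword selection steps and the three keyword sets can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: tokens are assumed to arrive as (word, lemma) pairs with the Noun Phrase base chunks already identified, and the names is_content, select_keywords, keyword_sets and the is_geo predicate are hypothetical.</p>
          <preformat><![CDATA[
```python
def is_content(word, stopwords):
    """Step 1: keep tokens that are neither punctuation nor stopwords."""
    return any(ch.isalnum() for ch in word) and word.lower() not in stopwords


def select_keywords(title, chunks, stopwords):
    """Select topic keywords.

    title:  list of (word, lemma) pairs from the title tag.
    chunks: NP base chunks from the description and narrative tags,
            each given as a list of (word, lemma) pairs.
    """
    title_tokens = [(w, lem) for w, lem in title if is_content(w, stopwords)]
    keywords = [w for w, _ in title_tokens]           # step 2: title words
    title_lemmas = {lem for _, lem in title_tokens}
    for chunk in chunks:
        tokens = [(w, lem) for w, lem in chunk if is_content(w, stopwords)]
        # Step 3: keep only chunks that share a lemma with the title.
        if not any(lem in title_lemmas for _, lem in tokens):
            continue
        # Step 4: add the chunk words whose lemma is not in the title.
        for w, lem in tokens:
            if lem not in title_lemmas and w not in keywords:
                keywords.append(w)
    return keywords


def keyword_sets(keywords, is_geo):
    """Split the keywords into the All, Geo and NotGeo sets."""
    geo = [k for k in keywords if is_geo(k)]
    not_geo = [k for k in keywords if not is_geo(k)]
    return {"All": list(keywords), "Geo": geo, "NotGeo": not_geo}
```
]]></preformat>
          <p>In the real system the is_geo decision would come from the Geographical Thesaurus lookup; here it is simply a caller-supplied function.</p>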
          <table-wrap>
            <table>
              <tbody>
                <tr>
                  <td rowspan="3">Topic</td>
                  <td>EN-title</td>
                  <td>Wine regions around rivers in Europe</td>
                </tr>
                <tr>
                  <td>EN-desc</td>
                  <td>Documents about wine regions along the banks of European rivers.</td>
                </tr>
                <tr>
                  <td>EN-narr</td>
                  <td>Relevant documents describe a wine region along a major river in European
countries. To be relevant the document must name the region and the river.</td>
                </tr>
                <tr>
                  <td>Extracted Keywords</td>
                  <td>Set</td>
                  <td>NotGeo, Geo, All</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Document Retrieval using the JIRS Passage Retriever</title>
        <p>
          The JAVA Information Retrieval System (JIRS) software [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is used to retrieve relevant documents
related to a GIR query. JIRS<sup>7</sup> is a PR system specially designed for Question Answering (QA).
It retrieves passages with a high similarity between the largest n-grams of the question
and those in the passage. It has three modes: simple n-gram model, term weight n-gram model,
and distance n-gram model. We used the distance n-gram model. In this model, the weight of a
passage is computed using the largest n-gram structure of the question that can be found in the
passage itself and the distances among the different n-grams of the question found in the passage.
        </p>
        <p>We used JIRS treating a topic keyword set as a question. We retrieved passages using
the distance n-gram model of JIRS with a length of 11 sentences per passage, obtaining a
maximum of 100,000 passages per topic. Finally, a process selects the relevant documents from
the set of retrieved passages. We used two document scoring strategies to perform the
document selection:</p>
        <list list-type="bullet">
          <list-item>
            <p>Best: the score of a document is the score of its top-ranked passage among
the retrieved passages that belong to it.</p>
          </list-item>
          <list-item>
            <p>Accumulative: the score of a document is the sum of the scores of all its retrieved
passages.</p>
          </list-item>
        </list>
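        <p>The two document scoring strategies can be expressed compactly in Python. The sketch below assumes that the retrieved passages are available as (docid, passage score) pairs; the function name score_documents and this input layout are illustrative assumptions rather than part of the JIRS interface.</p>
        <preformat><![CDATA[
```python
def score_documents(passages, mode="best"):
    """Aggregate passage scores into document scores.

    passages: iterable of (docid, passage_score) pairs.
    mode:     "best" keeps the highest passage score per document,
              "accumulative" sums all passage scores per document.
    Returns a list of (docid, score) sorted by decreasing score.
    """
    if mode not in ("best", "accumulative"):
        raise ValueError("unknown scoring mode: " + mode)
    scores = {}
    for docid, score in passages:
        if docid not in scores:
            scores[docid] = score
        elif mode == "best":
            scores[docid] = max(scores[docid], score)
        else:
            scores[docid] += score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```
]]></preformat>
        <p>The two strategies can order the same documents differently: a document with several moderately scored passages may overtake, under Accumulative, a document whose single passage has the highest score under Best.</p>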
      </sec>
      <sec id="sec-2-4">
        <title>Document Ranking</title>
        <p>This component ranks the documents retrieved by Lucene and JIRS. First, the top-scored
documents retrieved by JIRS that also appear in the document set retrieved by Lucene are selected. Then,
if fewer than 1,000 documents have been selected, the top-scored JIRS documents that do not
appear in the Lucene document set are added with a lower priority than the previous ones.
Finally, the first 1,000 top-scored documents are selected. When the system
uses only JIRS for retrieval, it simply selects the first 1,000 top-scored JIRS documents.</p>
        <p><sup>7</sup> JIRS. http://leto.dsic.upv.es:8080/jirs</p>
        <p>Table 1. Automatic runs: TALPGeoIRTD1, TALPGeoIRTD2, TALPGeoIRTDN1, TALPGeoIRTDN2, TALPGeoIRTDN3.</p>
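        <p>The ranking procedure described in this section can be summarized with a short Python sketch. It assumes that the JIRS result is a list of document identifiers already sorted by decreasing score and that the Lucene result is a set of identifiers; the function name rank_documents and these data layouts are hypothetical.</p>
        <preformat><![CDATA[
```python
def rank_documents(jirs_ranked, lucene_docs=None, limit=1000):
    """Combine JIRS and Lucene results into a final ranking.

    jirs_ranked: document ids ordered by decreasing JIRS score.
    lucene_docs: set of document ids retrieved by Lucene, or None
                 when the run uses only JIRS.
    limit:       maximum number of documents returned.
    """
    if lucene_docs is None:
        return list(jirs_ranked[:limit])
    # Documents found by both engines come first, in JIRS order.
    in_both = [d for d in jirs_ranked if d in lucene_docs]
    # Documents found only by JIRS fill the remaining positions.
    jirs_only = [d for d in jirs_ranked if d not in lucene_docs]
    return (in_both + jirs_only)[:limit]
```
]]></preformat>
        <p>Concatenating the two groups before truncating is equivalent to adding the JIRS-only documents only when fewer than 1,000 documents were found by both engines.</p>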
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We designed a set of five experiments that consist of applying different IR systems, query keyword
sets, and tags to an automatic GIR system (see Table 1). These experiments can be
divided into two groups depending on the retrieval engines used:</p>
      <list list-type="bullet">
        <list-item>
          <p>Only JIRS. Two baseline experiments have been done in this group: the runs
TALPGeoIRTD1 and TALPGeoIRTDN1. These runs differ only in that the second one also uses the Narrative
tag. Both runs use a single retrieval system, JIRS, and all the
keywords to build the query. The experiment TALPGeoIRTDN3 is similar to the previous
experiments but uses the Accumulative scoring strategy to select the documents with JIRS.</p>
        </list-item>
        <list-item>
          <p>JIRS &amp; Lucene. The runs TALPGeoIRTD2 and TALPGeoIRTDN2 use JIRS for Textual
Document Retrieval and Lucene for Geographical Document Retrieval. Both runs use the
Geo keyword set for Lucene and the NotGeo set for JIRS.</p>
        </list-item>
      </list>
      <p>These experiments let us compare the two strategies: JIRS alone for both
Geographical and Textual search, and JIRS combined with Lucene for separate Textual and Geographical
search.</p>
      <p>The results of the TALPGeoIR system at the GeoCLEF 2006 Monolingual English task are
summarized in Table 2. This table has the following IR measures for each run: Average Precision,
R-Precision, and Recall.</p>
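      <p>For reference, the three measures can be computed for a single topic as in the following Python sketch, which uses the standard definitions of these measures. The function name evaluate_topic is illustrative, the ranked list is assumed to contain no duplicate documents, and the figure for a run would then be obtained by averaging over all topics.</p>
      <preformat><![CDATA[
```python
def evaluate_topic(ranked, relevant):
    """Compute Average Precision, R-Precision and Recall for one topic.

    ranked:   retrieved document ids in rank order (no duplicates).
    relevant: collection of document ids judged relevant for the topic.
    """
    relevant = set(relevant)
    total = len(relevant)
    if total == 0:
        return {"average_precision": 0.0, "r_precision": 0.0, "recall": 0.0}
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # R-Precision: precision after retrieving as many documents as
    # there are relevant documents for the topic.
    top_r_hits = len(relevant.intersection(ranked[:total]))
    return {
        "average_precision": precision_sum / total,
        "r_precision": top_r_hits / total,
        "recall": hits / total,
    }
```
]]></preformat>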
      <p>The results show a substantial difference between the two sets of experiments. The runs that
use only JIRS have a better Average Precision, R-Precision, and Recall than the ones that use
JIRS and Lucene. The run with the best Average Precision is TALPGeoIRTD1 with 0.1342.
The best Recall is obtained by the run TALPGeoIRTDN1, with 68.78% of the relevant
documents retrieved. This run has the same configuration as the TALPGeoIRTD1 run but also uses
the Narrative tag. Finally, our results are poor in comparison with the mean average
precision (0.1975) obtained by all the systems that participated in the GeoCLEF 2006 Monolingual
English task.</p>
      <p>Table 2. Results of the runs TALPGeoIRTD1, TALPGeoIRTD2, TALPGeoIRTDN1, TALPGeoIRTDN2, and TALPGeoIRTDN3.</p>
      <p>We have applied JIRS, a state-of-the-art PR system for QA, to the GeoCLEF 2006 Monolingual
English task. We have also experimented with an approach using both JIRS and Lucene. In this
approach JIRS was used only for Textual Document Retrieval and Lucene was used to detect the
geographically relevant documents. The approach with only JIRS performed better than the one
combining JIRS and Lucene.</p>
      <p>Compared with the mean average precision of all runs, our Average Precision is somewhat low.
This may be due to several reasons: i) the JIRS PR system may not have been used appropriately
or may not be suitable for the GIR task, ii) our system does not deal with geographical ambiguities, iii)
our system lacks query expansion methods, iv) our system lacks relevance feedback methods, and v)
errors in the Topic Analysis phase.</p>
      <p>As future work we propose the following improvements to the system: i) resolving
geographical ambiguity problems by applying toponym resolution algorithms, ii) applying
query expansion methods, and iii) studying the effect of blind feedback.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the European Commission (CHIL, IST-2004-506909).
Daniel Ferrés is supported by a UPC-Recerca grant from Universitat Politècnica de Catalunya
(UPC). TALP Research Center is recognized as a Quality Research Group (2001 SGR 00254) by
DURSI, the Research Department of the Catalan Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brants</surname>
          </string-name>
          .
          <article-title>TnT - a statistical part-of-speech tagger</article-title>
          . In
          <source>Proceedings of the 6th Applied NLP Conference (ANLP-2000)</source>
          , Seattle, WA, United States,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ferrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ageno</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          .
          <article-title>The GeoTALP-IR System at GeoCLEF-2005: Experiments Using a QA-based IR System, Linguistic Analysis, and a Geographical Thesaurus</article-title>
          . In Peters et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
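          [3]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Ferrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Samir</given-names>
            <surname>Kanaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edgar</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alicia</given-names>
            <surname>Ageno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Horacio</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Turmo</surname>
          </string-name>
          .
          <article-title>TALP-QA System at TREC 2004: Structural and Hierarchical Relaxation Over Semantic Constraints</article-title>
          . In
          <source>Proceedings of the Text Retrieval Conference (TREC2004)</source>
          ,
          <year>2005</year>
          .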
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Fredric</given-names>
            <surname>Gey</surname>
          </string-name>
          , Ray Larson, Mark Sanderson, Hideo Joho, Paul Clough, and Vivien Petras.
          <article-title>GeoCLEF: the CLEF 2005 Cross-Language Geographic Information Retrieval Track Overview</article-title>
          . In Peters et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Linda L.</given-names>
            <surname>Hill</surname>
          </string-name>
          .
          <article-title>Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints</article-title>
          . In
          <source>ECDL '00: Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries</source>
          , pages
          <fpage>280</fpage>
          -
          <lpage>290</lpage>
          , London, UK,
          <year>2000</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          , editors.
          <source>Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, Revised Selected Papers</source>
          , volume
          <volume>4022</volume>
          of Lecture Notes in Computer Science. Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>José Manuel</given-names>
            <surname>Gómez Soriano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Montes y Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Emilio</given-names>
            <surname>Sanchis Arnal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>A Passage Retrieval System for Multilingual Question Answering</article-title>
          . In Václav Matousek, Pavel Mautner, and Tomás Pavelka, editors,
          <source>TSD</source>
          , volume
          <volume>3658</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>443</fpage>
          -
          <lpage>450</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>