<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TALP at GikiCLEF 2009</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Daniel Ferrés and Horacio Rodríguez, TALP Research Center, Software Department, Universitat Politècnica de Catalunya</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our experiments in Geographical Information Retrieval with the Wikipedia collection in the context of our participation in the GikiCLEF 2009 Multilingual task in English and Spanish. Our system, called gikiTALP, follows a very simple approach that uses standard Information Retrieval with the Sphinx full-text search engine and some Natural Language Processing techniques, without Geographical Knowledge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Wikipedia collections for all GikiCLEF languages are available in three formats: HTML dump,
SQL dump, and XML version. Most of the collections are from June 2008. We used the SQL
dump version of the English and Spanish collections.</p>
      <p>The system architecture has three phases that are performed sequentially: Collection Indexing,
Topic Analysis, and Information Retrieval. Collection Indexing is applied to
the textual collections with MySQL and the full-text engine Sphinx, using the Wikipedia SQL
dumps.</p>
      <p>Sphinx is a full-text search engine that provides fast, size-efficient and relevant full-text
search functions to other applications. No language processing is applied to the indexes
created with Sphinx. Sphinx has two types of weighting functions:
• Phrase rank: based on the length of the longest common subsequence (LCS) of search words
between the document body and the query phrase.
• Statistical rank: based on the classic BM25 function, which takes only word frequencies into
account.</p>
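      <p>The phrase rank idea can be illustrated with a minimal sketch, assuming a textbook longest common subsequence computed over lowercased, whitespace-split words. The function names lcs_length and phrase_rank are hypothetical, and the actual Sphinx implementation may differ in its details.</p>
      <preformat><![CDATA[
```python
def lcs_length(query_words, doc_words):
    """Length of the longest common subsequence of two word lists (classic DP)."""
    rows, cols = len(query_words), len(doc_words)
    # table[i][j] holds the LCS length of query_words[:i] and doc_words[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if query_words[i - 1] == doc_words[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def phrase_rank(query, document):
    """Toy phrase rank (an assumption, not Sphinx's code): LCS length over words."""
    return lcs_length(query.lower().split(), document.lower().split())
```
]]></preformat>
      <p>Under this sketch, a document that contains the query words in the same order as the query receives a higher rank than one that contains them in a different order.</p>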
      <p>We used two types of search modes in Sphinx:
• MATCH ALL: the final weight is a sum of weighted phrase ranks.
• MATCH EXTENDED: the final weight is a sum of weighted phrase ranks and BM25 weight,
multiplied by 1000 and rounded to integer.</p>
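      <p>One possible reading of the two search modes is sketched below. It assumes that per-field phrase ranks and a BM25 value are already available, that fields without an explicit weight count as 1, and that in MATCH EXTENDED the BM25 value is added to the weighted sum before scaling. The function names and the field names in the usage note are hypothetical.</p>
      <preformat><![CDATA[
```python
def match_all_weight(phrase_ranks, field_weights):
    """Sum of phrase ranks, each multiplied by its field weight (default 1)."""
    return sum(rank * field_weights.get(field, 1)
               for field, rank in phrase_ranks.items())


def match_extended_weight(phrase_ranks, field_weights, bm25):
    """Weighted phrase ranks plus BM25, multiplied by 1000 and rounded to an integer.

    Assumption: the BM25 value is added before the scaling step.
    """
    weighted = match_all_weight(phrase_ranks, field_weights)
    return int(round((weighted + bm25) * 1000))
```
]]></preformat>
      <p>For example, with phrase ranks of 2 for a title field and 3 for a body field, and a title weight of 5, the MATCH ALL weight in this sketch is 13.</p>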
      <p>
        The Topic Analysis phase extracts some relevant keywords (with their analysis) from the topics.
These keywords are then used by the Information Retrieval phase. This process extracts
lexico-semantic information using the following set of Natural Language Processing tools: TnT (POS
tagger) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the WordNet lemmatizer (version 2.0) for English, and Freeling [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for Spanish.
      </p>
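      <p>The keyword extraction step can be sketched as follows. This is a toy illustration only: the stopword set and the lemma table are small hypothetical stand-ins for the real tools (TnT, the WordNet lemmatizer, Freeling), no POS tagging is performed, and the names Keyword and analyze_topic are assumptions.</p>
      <preformat><![CDATA[
```python
from typing import NamedTuple


class Keyword(NamedTuple):
    word: str
    lemma: str


# Hypothetical toy resources standing in for a real lemmatizer and stopword list.
STOPWORDS = frozenset({"which", "the", "of", "in", "a", "an", "and", "are", "is"})
LEMMAS = {"rivers": "river", "cities": "city", "mountains": "mountain"}


def analyze_topic(topic):
    """Return the non-stopword tokens of a topic together with their lemmas."""
    keywords = []
    for raw in topic.split():
        word = raw.strip(".,;:?!\"'()").lower()
        if not word or word in STOPWORDS:
            continue
        # Words missing from the toy lemma table are kept unchanged.
        keywords.append(Keyword(word, LEMMAS.get(word, word)))
    return keywords
```
]]></preformat>
      <p>Each returned keyword keeps both the surface word and its lemma, so a later step can choose which form to send to the search engine.</p>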
      <p>The retrieval is done with Sphinx and then the final results are filtered. The Wikipedia entries
without Categories are discarded.
      </p>
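      <p>The final filtering step can be sketched as below, assuming that the retrieved entries are given as a list of titles and that a mapping from title to category list is available. Both data structures and the function name filter_entries are hypothetical.</p>
      <preformat><![CDATA[
```python
def filter_entries(retrieved_titles, categories_by_title):
    """Keep only the retrieved entries that have at least one category.

    Titles missing from the mapping are treated as having no categories.
    """
    return [title for title in retrieved_titles
            if categories_by_title.get(title)]
```
]]></preformat>
      <p>The original ranking order of the retrieved entries is preserved; only entries without categories are removed.</p>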
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>For the GikiCLEF 2009 evaluation we designed a set of three experiments that consist of
applying different baseline configurations (see Table 2) to retrieve Wikipedia entries (answers) for 50
geographically challenging topics.</p>
      <p>The three baseline runs were designed by changing two parameters of the system: the Sphinx
search mode and the Natural Language Processing techniques applied to the query. The first
run (gikiTALP1) does not apply any NLP technique to the query and uses the Sphinx match
mode MATCH ALL. The second run (gikiTALP2) uses stopword filtering and the lemmas
of the remaining words as the query, with the Sphinx match mode MATCH ALL. The third
run (gikiTALP3) uses stopword filtering and the lemmas of the remaining words as the query, with
the Sphinx match mode MATCH EXTENDED.</p>
      <p>The results of the gikiTALP system at the GikiCLEF 2009 Monolingual English and Spanish task
are summarized in Table 3. This table has the following IR measures for each run: number of
correct answers (#Correct Answers), Precision, and Score.</p>
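      <p>The three run configurations described in this section can be summarized in a small sketch. It assumes that each analyzed topic is available as a list of (word, lemma, is_stopword) tuples; the RunConfig class, the RUNS tuple and the build_query function are hypothetical names used only for illustration.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    name: str
    use_nlp: bool      # stopword filtering plus lemmas as query terms
    match_mode: str    # Sphinx search mode named in the text


RUNS = (
    RunConfig("gikiTALP1", use_nlp=False, match_mode="MATCH ALL"),
    RunConfig("gikiTALP2", use_nlp=True, match_mode="MATCH ALL"),
    RunConfig("gikiTALP3", use_nlp=True, match_mode="MATCH EXTENDED"),
)


def build_query(analyzed_topic, config):
    """Build the query string for one run from (word, lemma, is_stopword) tuples."""
    if not config.use_nlp:
        # No NLP: the topic words are used as they are.
        return " ".join(word for word, _, _ in analyzed_topic)
    # NLP: drop stopwords and use the lemmas of the remaining words.
    return " ".join(lemma for _, lemma, is_stop in analyzed_topic if not is_stop)
```
]]></preformat>
      <p>In this sketch gikiTALP2 and gikiTALP3 produce the same query string and differ only in the search mode passed to the engine.</p>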
      <p>The run gikiTALP1 obtained the following scores for English, Spanish and Global: 0.6684,
0.0280, and 0.6964. Due to an unexpected error we did not produce answers for the Spanish
topics in run 2 (gikiTALP2), so the English and Global scores for this run were both 1.3559. The
scores of the run gikiTALP3 for English, Spanish and Global were 1.635, 0.2667, and 1.9018
respectively.</p>
      <p>This is our first approach to a Wikipedia-based retrieval task. We have used the Sphinx full-text
search engine with limited Natural Language Processing and without using
Geographical Knowledge, as baseline algorithms. We obtained the best results when we used all the
NLP techniques (lemmas in the queries and stopwords filtered) and the Sphinx mode MATCH EXTENDED.
As future work we plan to: 1) detect the Expected Answer
Type and use the WordNet synsets to improve the results, 2) use Geographical Knowledge in the
Topic Analysis, 3) increase the use of the Wikipedia links.</p>
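      <p>As a quick consistency check on the scores reported in this section, the sketch below verifies that each Global score is, up to rounding, the sum of the English and Spanish scores. That the Global score is such a sum is an assumption inferred from the reported numbers; the names REPORTED and global_matches_sum and the tolerance value are hypothetical.</p>
      <preformat><![CDATA[
```python
import math

# (English, Spanish, Global) scores as reported in the text for runs 1 and 3.
REPORTED = {
    "gikiTALP1": (0.6684, 0.0280, 0.6964),
    "gikiTALP3": (1.635, 0.2667, 1.9018),
}


def global_matches_sum(english, spanish, reported_global, tol=5e-4):
    """True if the reported Global score equals English + Spanish within tol."""
    return math.isclose(english + spanish, reported_global, abs_tol=tol)


def check_all(reported=REPORTED):
    """Map each run name to the result of its consistency check."""
    return {run: global_matches_sum(*scores) for run, scores in reported.items()}
```
]]></preformat>
      <p>The tolerance allows for the small rounding differences that appear when the reported figures are truncated to three or four decimals.</p>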
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work has been supported by the Spanish Research Dept. (TEXT-MESS,
TIN2006-15265-C06-05). Daniel Ferrés is supported by a UPC-Recerca grant from Universitat Politècnica de
Catalunya (UPC). TALP Research Center is recognized as a Quality Research Group (2001 SGR
00254) by DURSI, the Research Department of the Catalan Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Atserias</surname>
          </string-name>
          , Bernardino Casas, Elisabet Comelles, Meritxell González, Lluis Padró, and Muntsa Padró.
          <article-title>FreeLing 1.3: Syntactic and Semantic Services in an Open-Source NLP Library</article-title>
          .
          <source>In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06)</source>
          , pages
          <fpage>48</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brants</surname>
          </string-name>
          .
          <article-title>TnT - A Statistical Part-Of-Speech Tagger</article-title>
          .
          <source>In Proceedings of the 6th Applied NLP Conference (ANLP-2000)</source>
          , Seattle, WA, United States
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Diana</given-names>
            <surname>Santos</surname>
          </string-name>
          , Nuno Cardoso, Paula Carvalho, Iustin Dornescu, Sven Hartrumpf, Johannes Leveling, and
          <string-name>
            <given-names>Yvonne</given-names>
            <surname>Skalban</surname>
          </string-name>
          .
          <article-title>Getting Geographical Answers from Wikipedia: the GikiP pilot at CLEF</article-title>
          . In Francesca Borri,
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Nardi</surname>
          </string-name>
          , and Carol Peters, editors,
          <source>Working notes for the CLEF 2008 Workshop</source>
          , Aarhus, Denmark, September
          <year>2008</year>
          . CLEF 2008 Organizing Committee.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>