<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Report of MIRACLE team for the Ad-Hoc track in CLEF 2007</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Miguel Goñi-Menoyo</string-name>
          <email>josemiguel.goni@upm.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José C. González-Cristóbal</string-name>
          <email>josecarlos.gonzalez@upm.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Villena-Román</string-name>
          <email>julio.villena@uc3m.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Lana-Serrano</string-name>
          <email>sara.lana@upm.es</email>
        </contrib>
        <aff>Universidad Politécnica de Madrid</aff>
        <aff>Universidad Carlos III de Madrid</aff>
        <aff>DAEDALUS - Data, Decisions and Language, S.A.</aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the MIRACLE team's 2007 approach to the Ad-Hoc Information Retrieval track. The work carried out for this campaign was limited to monolingual experiments, in both the standard and the robust tracks. No new approaches were attempted in this campaign; we followed the procedures established in our participation in previous campaigns. Runs were submitted for the following languages and tracks: monolingual (Bulgarian, Hungarian, and Czech) and robust monolingual (French, English and Portuguese).</p>
      </abstract>
      <kwd-group>
        <kwd>Linguistic Engineering</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Trie Indexing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The MIRACLE team is made up of three university research groups located in Madrid (UPM, UC3M and UAM)
along with DAEDALUS, a company founded in 1998 as a spin-off of two of these groups. DAEDALUS is a
leading company in linguistic technologies in Spain and is the coordinator of the MIRACLE team. This is our
fifth participation in CLEF. As well as monolingual and robust multilingual tasks, the team has participated in
the ImageCLEF, Q&amp;A, and GeoCLEF tracks.</p>
      <p>The MIRACLE Information Retrieval toolbox is made up of basic components: stemming, transformation
(transliteration, elimination of diacritics and conversion to lowercase), filtering (elimination of stopwords and frequent
words), proper noun detection and extraction, and paragraph extraction, among others. Some of these basic
components can be used in different combinations and orders of application for document indexing and for query
processing. As a result of our participation in previous campaigns, the procedure for integrating the different modules
is stable and, to some extent, optimized.</p>
      <p>
        MIRACLE makes use of its own indexing and retrieval engine, which is based on the trie data structure [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Tries
have been used successfully by the MIRACLE team for years for the efficient storage and retrieval of huge
lexical resources, combined with a continuation-based approach to morphological treatment [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For this
campaign, runs were submitted for the following languages and tracks:
      </p>
      <list list-type="bullet">
        <list-item><p>Monolingual: Bulgarian, Hungarian, and Czech.</p></list-item>
        <list-item><p>Robust monolingual: French, English and Portuguese.</p></list-item>
      </list>
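      <p>As a reminder of how a trie supports exact and prefix lookups over a lexicon, a minimal sketch follows. It is not the MIRACLE engine: the class names, the integer payload and the dictionary-based node layout are assumptions made for illustration only, and none of the optimizations discussed in the cited trie literature are attempted.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Optional


class TrieNode:
    """One trie node: child links keyed by character, plus an optional payload."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.value: Optional[int] = None  # set only where a stored key ends


class Trie:
    """Illustrative trie mapping string keys to integer payloads (hypothetical API)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str, value: int) -> None:
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def lookup(self, key: str) -> Optional[int]:
        """Return the payload stored for key, or None if key was never inserted."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

    def keys_with_prefix(self, prefix: str) -> List[str]:
        """Return all stored keys that start with prefix, in sorted order."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found: List[str] = []
        stack = [(node, prefix)]
        while stack:
            current, path = stack.pop()
            if current.value is not None:
                found.append(path)
            for ch, child in current.children.items():
                stack.append((child, path + ch))
        return sorted(found)
```
]]></preformat>
      <p>Because every key sharing a prefix shares the same path from the root, the cost of a lookup depends on the length of the key rather than on the size of the lexicon, which is what makes the structure attractive for large lexical resources.</p>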
    </sec>
    <sec id="sec-2">
      <title>Description of the MIRACLE Toolbox</title>
      <p>
        The MIRACLE toolbox has already been described in papers from previous campaigns [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In short,
document collections and topics were pre-processed before being fed to the indexing and retrieval engine, using
different combinations of elementary processes. We repeat here some relevant facts about these processes:
      </p>
      <p>Extraction: When the narrative field is used, the extraction process applies a special filter to the
topic queries: some patterns that were obtained from the topics of past campaigns are
eliminated, since they are recurrent and misleading in the retrieval process. For English, examples of such
patterns are “… are not relevant.” and “… are to be excluded”. All the sentences that
contain such patterns are filtered out.</p>
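      <p>The narrative filter can be sketched as follows. The two regular expressions are modeled on the English examples quoted above; the full pattern list used by the team is not given in this paper, and the sentence splitter and function names are hypothetical simplifications.</p>
      <preformat><![CDATA[
```python
import re
from typing import List

# Hypothetical exclusion patterns, modeled on the two English examples in the text.
EXCLUSION_PATTERNS = [
    re.compile(r"\bare not relevant\b", re.IGNORECASE),
    re.compile(r"\bare to be excluded\b", re.IGNORECASE),
]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_narrative(narrative: str) -> str:
    """Drop every sentence that contains one of the exclusion patterns."""
    kept = [
        sentence
        for sentence in split_sentences(narrative)
        if not any(p.search(sentence) for p in EXCLUSION_PATTERNS)
    ]
    return " ".join(kept)
```
]]></preformat>
      <p>Removing whole sentences, rather than only the matched phrase, keeps the terms that describe non-relevant material out of the query.</p>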
      <p>Paragraph extraction: We have not used paragraph indexing this year, since the results we have
obtained with it, in this campaign and in past ones, have been disappointing.</p>
      <p>Tokenization: This process extracts basic text components, detecting and isolating punctuation
symbols. Some basic entities are also treated, such as numbers, initials, abbreviations, years, and some
proper nouns (see next item). The outcomes of this process are only single words, years that appear as
numbers in the text (e.g. 1995, 2004, etc.), or entities.</p>
      <p>Entities: We consider that entity detection and normalization play a central role in Information
Retrieval, but they are difficult tasks. This year we have integrated a special module into the tokenization
process that detects and marks some entities that have been previously collected from several sources
into a lexical database for entities. These entities, which can be personal names, place names, initials,
abbreviations, etc., can consist of one or more words and special symbols, and their correct treatment is
integrated into the tokenizer. For now, no entity normalization is done, so the same entity can appear in
different forms, and these are treated as different entities.</p>
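      <p>A minimal sketch of a tokenizer with lexicon-based entity marking follows. The lexicon entries, the regular expressions and the greedy longest-match strategy are assumptions for illustration; the paper does not specify how the actual module matches entities. As in the description above, punctuation is isolated and discarded, and no entity normalization is performed.</p>
      <preformat><![CDATA[
```python
import re
from typing import List, Set, Tuple

# Hypothetical entity lexicon; each entry is a tuple of one or more raw tokens.
ENTITY_LEXICON: Set[Tuple[str, ...]] = {
    ("European", "Union"),
    ("New", "York"),
    ("U.N.",),
}
MAX_ENTITY_LEN = max(len(entry) for entry in ENTITY_LEXICON)

# Initials such as "U.N.", then words or numbers, then single punctuation symbols.
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")
PUNCT_RE = re.compile(r"[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Return words, numbers and entities; isolated punctuation is dropped."""
    raw = TOKEN_RE.findall(text)
    tokens: List[str] = []
    i = 0
    while i < len(raw):
        # Try the longest candidate first so multiword entities win.
        for n in range(min(MAX_ENTITY_LEN, len(raw) - i), 0, -1):
            candidate = tuple(raw[i:i + n])
            if candidate in ENTITY_LEXICON:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            if not PUNCT_RE.fullmatch(raw[i]):
                tokens.append(raw[i])
            i += 1
    return tokens
```
]]></preformat>
      <p>Since surface forms are matched literally, variants of the same entity that are missing from the lexicon remain separate tokens, which mirrors the lack of normalization noted above.</p>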
      <p>
        Filtering: Stopword lists in the target languages were initially obtained from [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], but were extended
using several other sources and our own knowledge and resources. We have also compiled other lists of
words to exclude from the indexing and querying processes, which were obtained from the topics of
past CLEF editions and from our own experience. We consider that such words carry no semantics in
the type of queries used in CLEF. As an example, some items from the English list are: find, appear,
relevant, document, report, etc.
      </p>
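      <p>The filtering step amounts to set membership tests against the two kinds of lists. In the sketch below, the noise-word set contains only the five English examples quoted above, and the stopword set is a tiny illustrative sample; both stand in for the much longer lists actually used, and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import FrozenSet, Iterable, List

# Tiny illustrative sample; real stopword lists are far longer.
STOPWORDS: FrozenSet[str] = frozenset({"the", "of", "on", "in", "a", "and", "that"})

# The five English examples given in the text; the full list is not reproduced here.
QUERY_NOISE_WORDS: FrozenSet[str] = frozenset(
    {"find", "appear", "relevant", "document", "report"}
)


def filter_tokens(tokens: Iterable[str]) -> List[str]:
    """Remove stopwords and CLEF-style query noise words, ignoring case."""
    excluded = STOPWORDS | QUERY_NOISE_WORDS
    return [token for token in tokens if token.lower() not in excluded]
```
]]></preformat>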
      <p>
        Transformation: The items that resulted from tokenization were normalized by converting all
uppercase letters to lowercase and removing accents. This has not been done for Bulgarian.
      </p>
      <p>
        Stemming: We used standard stemmers from Porter [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for English, and from Neuchatel [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for
Hungarian, Bulgarian and Czech.
      </p>
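      <p>The transformation step can be sketched with Unicode decomposition. The text does not make explicit whether Bulgarian tokens skip only accent removal or lowercasing as well; the sketch assumes the former, and the language codes and function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import unicodedata

# Assumption: only accent removal is skipped for Bulgarian ("bg").
LANGS_KEEPING_ACCENTS = frozenset({"bg"})


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transform(token: str, lang: str) -> str:
    """Lowercase the token and, except for Bulgarian, remove its accents."""
    lowered = token.lower()
    if lang in LANGS_KEEPING_ACCENTS:
        return lowered
    return strip_accents(lowered)
```
]]></preformat>
      <p>Skipping accent removal for Cyrillic text is a sensible precaution in such a scheme, because decomposition would otherwise turn a distinct letter such as й into и.</p>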
      <p>
        Indexing: Once the documents have been processed through a combination of the former steps,
they are fed into our trie-based indexing engine to build the document collection index.
      </p>
      <p>
        Retrieval: Once the topic queries have been processed through a combination of the former steps,
they are fed to an ad-hoc front-end of the trie-based retrieval engine to search the previously built document
collection index. In the 2007 experiments, only OR combinations of the search terms were used. The
retrieval model used is the well-known Robertson’s Okapi BM25 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] formula for the probabilistic
retrieval model, without relevance feedback.
      </p>
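      <p>A minimal sketch of OR retrieval with BM25 scoring follows. It uses a plain dictionary-based inverted index instead of a trie, the common parameter values k1 = 1.2 and b = 0.75, and a non-negative variant of the inverse document frequency; all of these, like the class and method names, are assumptions, since the paper does not report the settings used. The query term frequency component of the original Okapi formula is omitted.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class BM25Index:
    """Illustrative inverted index with BM25 ranking (hypothetical API)."""

    def __init__(self, k1: float = 1.2, b: float = 0.75) -> None:
        self.k1 = k1  # term frequency saturation (assumed value)
        self.b = b    # document length normalization (assumed value)
        self.postings: Dict[str, Dict[str, int]] = defaultdict(dict)
        self.doc_len: Dict[str, int] = {}

    def add(self, doc_id: str, tokens: List[str]) -> None:
        """Index one pre-processed document given as a list of tokens."""
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query: List[str]) -> List[Tuple[str, float]]:
        """OR semantics: any document containing at least one query term is scored."""
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores: Dict[str, float] = defaultdict(float)
        for term in set(query):
            docs = self.postings.get(term)
            if not docs:
                continue
            # Non-negative idf variant (an assumption, not taken from the paper).
            idf = math.log(1.0 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = self.k1 * (1.0 - self.b + self.b * self.doc_len[doc_id] / avg_len)
                scores[doc_id] += idf * tf * (self.k1 + 1.0) / (tf + norm)
        # Highest score first; ties broken by document identifier.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```
]]></preformat>
      <p>With OR semantics a document does not need to contain every query term; documents matching more terms, or rarer ones, simply accumulate a higher score.</p>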
    </sec>
    <sec id="sec-3">
      <title>Results for the monolingual and robust tasks</title>
      <p>The following table and graphic representation summarize the performance of our official experiments in the
monolingual tasks (using the topic fields title/description).</p>
      <sec id="sec-3-1">
        <title>Precision figures for monolingual experiments</title>
        <p>[Table: precision figures by language (bg, cz, hu); the values are missing from the source.]</p>
        <p>In the case of the monolingual robust task, only the results for English will be shown, as our PT and FR runs did
not match the interpretation made in the assessment concerning the available collections and topics included in
each experiment. In the results, the mean average precision figures are given.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Precision figures for robust monolingual experiments</title>
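        <table-wrap>
          <table>
            <thead>
              <tr><th>lang</th><th>Average Precision</th></tr>
            </thead>
            <tbody>
              <tr><td>en</td><td>0.3778</td></tr>
            </tbody>
          </table>
        </table-wrap>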
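        <p>For reference, mean average precision can be computed as in the sketch below. This is the usual textbook definition, not the official evaluation software used in the campaign; the function names and the choice to skip topics without relevant documents are assumptions.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at the rank of each relevant document retrieved."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-topic average precision over topics with relevant documents."""
    values = [
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in qrels.items()
        if relevant
    ]
    return sum(values) / len(values) if values else 0.0
```
]]></preformat>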
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>This year we have not changed our previous processing scheme, and we have used the same improvements incorporated last
year regarding the detection and indexing of proper nouns and entities. For this reason, the results obtained are expected to be
quite similar to previous ones. The only elements that make the processing of each language different are
the stemming components and the stopword lists.</p>
      <p>It is clear that the quality of the tokenization step is of paramount importance for precise document processing.
We still think that high-quality entity recognition (proper nouns or acronyms for people, companies, countries,
locations, and so on) could improve the precision and recall figures of the overall retrieval, as could the correct
recognition and normalization of dates, times, numbers, etc. Although we have introduced some improvements
into our processing scheme in recent years, a good multilingual entity recognition and normalization tool is
still missing.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by the Spanish R+D National Plan, by means of the project RIMMEL
(Multilingual and Multimedia Information Retrieval, and its Evaluation), TIN2004-07588-C03-01; and by
Madrid’s R+D Regional Plan, by means of the project MAVIR (Enhancing the Access and the Visibility of
Networked Multilingual Information for Madrid Community), S-0505/TIC/000267.</p>
      <p>Special mention should be made of our other colleagues in the MIRACLE team (in alphabetical order): Ana
María García-Serrano, Ana González-Ledesma, José Mª Guirao-Miras, José Luis Martínez-Fernández, Paloma
Martínez-Fernández, Antonio Moreno-Sandoval and César de Pablo-Sánchez.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
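          <string-name>
            <surname>Aoe</surname>
            ,
            <given-names>Jun-Ichi</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Morimoto</surname>
            ,
            <given-names>Katsushi</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sato</surname>
            ,
            <given-names>Takashi.</given-names>
          </string-name>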
          <article-title>An Efficient Implementation of Trie Structures</article-title>
          .
          <source>Software Practice and Experience</source>
          <volume>22</volume>
          (
          <issue>9</issue>
          ):
          <fpage>695</fpage>
          -
          <lpage>721</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Goñi-Menoyo</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>González-Cristóbal</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Villena-Román</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>MIRACLE at Ad-Hoc CLEF 2005: Merging and Combining without Using a Single Approach</article-title>
          .
          <source>Accessing Multilingual Information Repositories: 6th Workshop of the Cross Language Evaluation Forum 2005, CLEF 2005, Vienna, Austria, Revised Selected Papers</source>
          (Peters, C. et al., Eds.).
          <series>Lecture Notes in Computer Science</series>
          , vol.
          <volume>4022</volume>
          , Springer (to appear).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Goñi-Menoyo</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>González</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Villena-Román</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Miracle's 2005 Approach to Monolingual Information Retrieval</article-title>
          .
          <source>Working Notes for the CLEF 2005 Workshop</source>
          . Vienna, Austria,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Goñi-Menoyo</surname>
            ,
            <given-names>José M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>González</surname>
            ,
            <given-names>José C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Martínez-Fernández</surname>
            ,
            <given-names>José L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Villena</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>MIRACLE's Hybrid Approach to Bilingual and Monolingual Information Retrieval</article-title>
          .
          <source>Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, Bath, UK, September 15-17, 2004, Revised Selected Papers</source>
          (Carol Peters, Paul Clough, Julio Gonzalo, et al., Eds.).
          <series>Lecture Notes in Computer Science</series>
          , vol.
          <volume>3491</volume>
          , pp.
          <fpage>188</fpage>
          -
          <lpage>199</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Goñi-Menoyo</surname>
            ,
            <given-names>José M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>González</surname>
            ,
            <given-names>José C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Martínez-Fernández</surname>
            ,
            <given-names>José L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Villena-Román</surname>
            ,
            <given-names>Julio</given-names>
          </string-name>
          ;
          <string-name>
            <surname>García-Serrano</surname>
            ,
            <given-names>Ana</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Martínez-Fernández</surname>
            ,
            <given-names>Paloma</given-names>
          </string-name>
          ;
          <string-name>
            <surname>de Pablo-Sánchez</surname>
            ,
            <given-names>César</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Alonso-Sánchez</surname>
            ,
            <given-names>Javier.</given-names>
          </string-name>
          <article-title>MIRACLE's hybrid approach to bilingual and monolingual Information Retrieval</article-title>
          .
          <source>Working Notes for the CLEF 2004 Workshop</source>
          (Carol Peters and Francesca Borri, Eds.), pp.
          <fpage>141</fpage>
          -
          <lpage>150</lpage>
          . Bath, United Kingdom,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Goñi-Menoyo</surname>
            ,
            <given-names>José Miguel</given-names>
          </string-name>
          ;
          <string-name>
            <surname>González-Cristóbal</surname>
            ,
            <given-names>José Carlos</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fombella-Mourelle</surname>
            ,
            <given-names>Jorge.</given-names>
          </string-name>
          <article-title>An optimised trie index for natural language processing lexicons</article-title>
          .
          <source>MIRACLE Technical Report</source>
          . Universidad Politécnica de Madrid,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>González</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goñi-Menoyo</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Villena-Román</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Miracle's 2005 Approach to Cross-lingual Information Retrieval</article-title>
          .
          <source>Working Notes for the CLEF 2005 Workshop</source>
          . Vienna, Austria,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>Martin.</given-names>
          </string-name>
          <article-title>Snowball stemmers and resources page</article-title>
          . On line
          <uri>http://www.snowball.tartarus.org</uri>
          [Visited 18/07/2006].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          et al.
          <article-title>Okapi at TREC-3</article-title>
          . In
          <source>Overview of the Third Text REtrieval Conference (TREC-3)</source>
          . D.K. Harman (Ed.). Gaithersburg, MD: NIST, April
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>Jacques.</given-names>
          </string-name>
          <article-title>Report on CLEF-2003 Multilingual Tracks</article-title>
          .
          <source>Comparative Evaluation of Multilingual Information Access Systems</source>
          (Peters, C.; Gonzalo, J.; Braschler, M.; and Kluck, M., Eds.).
          <source>Lecture Notes in Computer Science</source>
          , vol.
          <volume>3237</volume>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>73</lpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] University of Neuchatel.
          <article-title>Page of resources for CLEF (Stopwords, transliteration, stemmers …)</article-title>
          . On line
          <uri>http://www.unine.ch/info/clef</uri>
          [Visited 18/07/2006].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>