<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TPIRS: A System for Document Indexing Reduction on WebCLEF</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Pinto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hector Jimenez-Salazar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilio Sanchis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science</institution>
          ,
          <addr-line>BUAP</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Systems and Computation</institution>
          ,
          <addr-line>UPV</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present the results of the BUAP/UPV universities in WebCLEF, a task of CLEF 2005. Specifically, we evaluate our information retrieval system in the bilingual English to Spanish track. Our system uses a term reduction process based on the Transition Point technique. Our results show that it is possible to reduce the number of terms to index and, in doing so, to improve the performance of our system. We evaluate different percentages of reduction over a subset of EuroGOV in order to determine the best one. After reducing the corpus by 82.55%, we obtained a Mean Reciprocal Rank of 0.0844, compared with 0.0465 for the evaluation with full documents.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
      </kwd-group>
      <kwd-group>
        <title>General Terms</title>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The high volume of information on the Internet has led to the development of novel techniques for managing data,
especially when we deal with information in multiple languages. There are many
scenarios in which users may be interested in information written in a language other than their
native one. A common scenario is that of a user who has some comprehension
of a given language but is not sufficiently proficient to confidently specify a search request
in that language. A search system that can deal with this problem would therefore be of great
benefit. The World Wide Web (WWW) is a natural setting for cross-lingual information retrieval;
the European Union is a typical example of a multilingual scenario, where multiple users have to
deal with information published in at least 20 languages.</p>
      <p>
        In order to reinforce research in this area, CLEF (Cross-Language Evaluation Forum) has been
compiling a set of multi-lingual corpora and promoting the evaluation of multiple multi-lingual
information retrieval systems for diverse kinds of data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A particular track for the evaluation of
such systems that deal with information on the web has been set up this year as part of CLEF.
This track was named WebCLEF, and a detailed description of the task can be found in
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In WebCLEF, three subtasks were defined this year: mixed monolingual, multilingual,
and bilingual English to Spanish.
      </p>
      <p>
        This paper reports results on the evaluation of a Cross-Language Information Retrieval System
(CLIRS) for the bilingual English to Spanish subtask of WebCLEF 2005. A document indexing
reduction is proposed in order to improve the precision of the CLIRS and to reduce the storage space
required by such systems. Our proposal is based on the Transition Point (TP) technique, a method
that selects the important terms of a document from their frequencies. We evaluate different
neighbourhood percentages around TP over a subset of the EuroGOV corpus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and we observed that it is possible to
improve precision by reducing the number of terms indexed for a given corpus.
      </p>
      <p>The next section describes our information retrieval system in detail. Section 3 briefly
introduces the corpus used in our experiments and the results obtained after evaluation. Finally, a
discussion of our experiments is presented.</p>
    </sec>
    <sec id="sec-2">
      <title>Description of TPIRS</title>
      <p>We used a Boolean model with the Jaccard similarity formula for our CLIRS. Our goal was to determine
the behaviour of document indexing reduction in an information retrieval environment. In order
to reduce the terms of every document treated, we applied a technique named Transition Point,
which is described as follows.</p>
      <sec id="sec-2-1">
        <title>Transition Point Technique</title>
        <p>
          The transition point is a frequency value that splits the vocabulary of a document into two sets of
terms (low and high frequency). This technique is based on Zipf's Law of Word Occurrences [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
and refined in studies by Booth [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and, more recently, Urbizagastegui [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. These studies are meant
to demonstrate that terms of medium frequency are closely related to the conceptual content of
the document. Thus, it is possible to form the hypothesis that terms closer to TP can be used as
indexes of a document. A typical formula used to obtain this value is given in equation 1:
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>TP = \frac{\sqrt{8 \cdot I_1 + 1} - 1}{2},</tex-math>
          </disp-formula>
where <italic>I</italic><sub>1</sub> represents the number of words with frequency equal to 1 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
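        <p>As an illustration of equation 1, the following minimal Python sketch counts the terms that occur exactly once in a document and derives TP from that count. The function name <monospace>transition_point</monospace> and the toy document are assumptions made for this example only; they are not taken from the TPIRS implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def transition_point(term_freqs):
    """Return TP = (sqrt(8 * I1 + 1) - 1) / 2 for a term-frequency mapping."""
    # I1 is the number of distinct terms that occur exactly once.
    i1 = sum(1 for f in term_freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


# Toy document (hypothetical): "d", "e" and "f" occur once, so I1 = 3.
tokens = "a a a a b b b c c d e f".split()
freqs = Counter(tokens)
print(transition_point(freqs))  # prints 2.0, since (sqrt(25) - 1) / 2 = 2
```
]]></preformat>
        <p>In this toy case the term with frequency 2 sits exactly at TP, which is the kind of medium-frequency term the technique is meant to favour.</p>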
        <p>
          Alternatively, TP can be located by identifying the lowest frequency, among the highest
frequencies, that is not repeated; this characteristic comes from the properties of Booth's law of
low-frequency words [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
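        <p>Let us consider a vocabulary of a document sorted by decreasing frequency, i.e., <inline-formula><tex-math>V_{TP} = [(t_1, f_1), \ldots, (t_n, f_n)]</tex-math></inline-formula>
with <inline-formula><tex-math>f_i \leq f_{i-1}</tex-math></inline-formula>; then <inline-formula><tex-math>TP = f_{i-1}</tex-math></inline-formula> if and only if <italic>i</italic> is the first index for which <inline-formula><tex-math>f_i = f_{i+1}</tex-math></inline-formula>. The most important words are those whose
frequency values are closest to TP, i.e.,</p>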
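        <p>
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math>TP_{SET} = \{ t_i \mid (t_i, f_i) \in V_{TP},\; U_1 \leq f_i \leq U_2 \},</tex-math>
          </disp-formula>
where <italic>U</italic><sub>1</sub> is a lower threshold obtained from a given neighbourhood percentage of TP (NTP),
that is, <inline-formula><tex-math>U_1 = (1 - NTP) \cdot TP</tex-math></inline-formula>. <italic>U</italic><sub>2</sub> is the upper threshold, calculated in a similar way
(<inline-formula><tex-math>U_2 = (1 + NTP) \cdot TP</tex-math></inline-formula>).</p>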
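        <p>The two steps described above (locating TP as the lowest non-repeated frequency among the highest ones, and keeping the terms whose frequency falls between the two thresholds) can be sketched in Python as follows. This is only a sketch under our reading of the definitions: the function names, the handling of the case where no such frequency exists, and the toy vocabulary are assumptions and do not come from the original system.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def transition_point_by_repetition(term_freqs):
    """Lowest frequency, among the highest ones, held by a single term."""
    # Number of distinct terms that share each frequency value.
    terms_per_freq = Counter(term_freqs.values())
    tp = None
    # Walk the distinct frequencies from highest to lowest and stop at the
    # first frequency that is shared by more than one term.
    for f in sorted(terms_per_freq, reverse=True):
        if terms_per_freq[f] != 1:
            break
        tp = f
    return tp  # None when even the highest frequency is repeated (assumption)


def tp_set(term_freqs, tp, ntp):
    """Terms whose frequency lies in [(1 - ntp) * tp, (1 + ntp) * tp]."""
    u1 = (1 - ntp) * tp  # lower threshold U1
    u2 = (1 + ntp) * tp  # upper threshold U2
    return {t for t, f in term_freqs.items() if u2 >= f >= u1}


# Hypothetical vocabulary: frequencies 9, 8 and 5 are unique, 4 is repeated.
freqs = {"a": 9, "b": 8, "c": 5, "d": 4, "e": 4, "f": 2, "g": 1, "h": 1}
tp = transition_point_by_repetition(freqs)
print(tp)                             # prints 5
print(sorted(tp_set(freqs, tp, 0.4)))  # prints ['c', 'd', 'e']
```
]]></preformat>
        <p>With a neighbourhood of 40% the thresholds are 3 and 7, so only three of the eight terms would be indexed; a narrower neighbourhood of 10% would keep only the term at TP itself.</p>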
        <p>
          We have used the TP technique in diverse areas of natural language processing (NLP), such as
clustering of short texts [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], categorization of texts [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], keyphrase extraction [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], summarization
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and weighting models for information retrieval systems [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Thus, we believe that there exists
enough evidence to use this technique as a term reduction process.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Information Retrieval Model</title>
        <p>Our information retrieval system is based on the Boolean model and, in order to rank the documents
retrieved, we used the Jaccard similarity function, applied to the query and every document of the
corpus used. Beforehand, each document was preprocessed and its index terms were selected (the
preprocessing phase is described in Section 3.1). For this purpose, several values of a neighbourhood
of TP were used as thresholds, as equation 2 indicates.</p>
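        <p>A minimal Python sketch of this ranking step is given below. It treats the query and each document as sets of index terms and orders the matching documents by their Jaccard coefficient. The paper does not state which Boolean operator is used for matching, so retrieving every document that shares at least one term with the query is an assumption, as are the function names and the toy index.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)


def rank(query_terms, index):
    """Rank the documents that share a term with the query (assumed OR match)."""
    scored = [(doc_id, jaccard(query_terms, terms))
              for doc_id, terms in index.items()]
    matching = [(doc_id, score) for doc_id, score in scored if score > 0]
    # Highest score first; ties are broken by document identifier.
    return sorted(matching, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical index of reduced documents (document id -> set of index terms).
index = {
    "d1": {"ministerio", "educacion", "becas"},
    "d2": {"ministerio", "hacienda"},
    "d3": {"turismo"},
}
ranking = rank({"ministerio", "educacion"}, index)
print([doc_id for doc_id, _ in ranking])  # prints ['d1', 'd2']
```
]]></preformat>
        <p>Because the Jaccard coefficient divides by the size of the union, a document with fewer index terms is penalized less for the terms it does not share with the query, which is one way in which index reduction can affect the ranking.</p>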
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <sec id="sec-3-1">
        <title>Corpus</title>
        <p>We used a subset of the EuroGOV corpus for our evaluation. This subset was composed of a set of
Spanish Internet pages, originally obtained from European government-related sites. We named
this corpus BiEnEs.</p>
        <p>
          In order to construct this corpus, for every page compiled in the EuroGOV corpus, we determined
its language by using TextCat [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a widely used language-guessing program. We constructed our
evaluation corpus with those documents identified as being in Spanish.
        </p>
        <p>The preprocessing of the BiEnEs corpus consisted of the elimination of punctuation symbols,
Spanish stopwords, numbers, HTML tags, script code and cascading style sheet code.</p>
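        <p>The preprocessing steps listed above could be sketched as follows, using only the Python standard library. This is an illustrative sketch, not the code used for BiEnEs: the stopword list is a tiny placeholder (the list actually used is not given in this paper), lowercasing is an added assumption, and the class and function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from html.parser import HTMLParser

# Tiny placeholder list; the stopword list used by the authors is not given.
SPANISH_STOPWORDS = {"de", "la", "el", "en", "y", "los", "del", "las"}


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(html_page):
    """Return the candidate index terms of one page, in order of appearance."""
    extractor = TextExtractor()
    extractor.feed(html_page)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    # Runs of word characters only, which discards punctuation symbols.
    tokens = re.findall(r"\w+", text)
    # Drop stopwords and any token that contains a digit.
    return [t for t in tokens
            if t not in SPANISH_STOPWORDS and not any(c.isdigit() for c in t)]


page = ("<html><head><style>p {color: red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>El Ministerio de Educación, 2005.</p></body></html>")
print(preprocess(page))  # prints ['ministerio', 'educación']
```
]]></preformat>
        <p>The resulting token list is what a term-frequency count, and hence the TP computation, would be applied to.</p>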
        <p>For the evaluation of BiEnEs, a set of 134 queries was composed and refined in order to provide
grammatically correct "English" queries. Queries and assessments were created by the participants
in the WebCLEF track; the queries were later reviewed and, in some
cases, corrected in their English translation by the NLP Group at UNED. Queries were distributed
in the following way: 67 home page findings and 67 named page findings.</p>
        <p>We applied a preprocessing phase to this set of queries. First, we used an online translation
system in order to translate every query from English to Spanish. After that,
punctuation symbols, Spanish stopwords and numbers were eliminated.</p>
        <p>We did not apply a rigorous method of translation, because our main goal in our
first participation in WebCLEF was to determine the quality of term reduction in our CLIRS.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Indexing reduction</title>
        <p>In order to determine the behaviour of document indexing reduction on the CLIRS, we submitted
a set of five runs to the contest, which are described as follows.</p>
        <p>First Run: This run used "Full documents" as the evaluation corpus and served as the baseline
for our experiments. We named it the "Full" evaluation.</p>
        <p>Second Run: This run used an evaluation corpus composed of the reduced version of every document,
using the TP technique with a neighbourhood of 10% around TP. We named it the "TP10"
evaluation.</p>
        <p>Third Run: This run used an evaluation corpus composed of the reduced version of every document,
using the TP technique with a neighbourhood of 20% around TP. We named it the "TP20"
evaluation.</p>
        <p>Fourth Run: This run used an evaluation corpus composed of the reduced version of every document,
using the TP technique with a neighbourhood of 40% around TP. We named it the "TP40"
evaluation.</p>
        <p>Fifth Run: This run used an evaluation corpus composed of the reduced version of every document,
using the TP technique with a neighbourhood of 60% around TP. We named it the "TP60"
evaluation.</p>
        <p>We proposed an index reduction method for a cross-lingual information retrieval system. Our
proposal is based on the transition point technique.</p>
        <p>After submitting five runs to the bilingual English to Spanish subtask of WebCLEF, we
observed that it is possible to reduce the terms in the documents that make up the corpus of a CLIRS,
which not only reduces the time needed for indexing but also improves the precision of the
results obtained by the CLIRS.</p>
        <p>Our method is linear in computational time, and therefore it can be used in practical tasks.
So far, the results obtained in terms of MRR are very low, but our findings show that, by applying better
techniques for the English to Spanish translation of queries, results can be dramatically improved.</p>
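        <p>For reference, the Mean Reciprocal Rank (MRR) used above averages, over all queries, the reciprocal of the rank at which the first correct page appears, counting a query as 0 when no correct page is retrieved. The Python sketch below illustrates the computation; the function name, the data layout and the toy rankings are assumptions for this example and do not reproduce the official WebCLEF evaluation script.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the first correct page per query (0 if absent)."""
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        for position, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / position
                break
    return total / len(rankings) if rankings else 0.0


# Hypothetical system output (query id -> ranked document ids) and answers.
rankings = {"q1": ["d3", "d1"], "q2": ["d2"], "q3": ["d5", "d6"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d9"}}
print(mean_reciprocal_rank(rankings, relevant))  # prints 0.5
```
]]></preformat>
        <p>In this toy case the three queries contribute 1/2, 1 and 0, giving an MRR of 0.5.</p>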
        <p>We were concerned with the impact of indexing reduction on the CLIRS; in the future we
hope to improve other components of our CLIRS, for instance by using the vector space model, in
order to improve the MRR.</p>
        <p>The TP technique has been used effectively in diverse areas of NLP. Its best features
for NLP are mainly two: a high content of semantic information, and the sparseness that can
be obtained in the vectors that represent documents in models based on the vector space model.
In addition, its language independence allows this technique to be used in CLIRS, which is the
subject of WebCLEF.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Booth</surname>
          </string-name>
          :
          <article-title>A Law of Occurrences for Words of Low Frequency</article-title>
          ,
          <source>Information and Control</source>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bueno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          , El parrafo virtual en la generacion de extractos,
          <source>Research on Computing Science Journal, ISSN 1665-9899</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cabrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vilariño</surname>
          </string-name>
          ,
          <article-title>Una nueva ponderacion para el modelo de espacio vectorial de recuperacion de informacion</article-title>
          ,
          <source>Research on Computing Science Journal, ISSN 1665-9899</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <article-title>CLEF 2005: Cross-Language Evaluation Forum</article-title>
          , http://www.clef-campaign.org/,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Seleccion de Terminos No Supervisada para Agrupamiento de Resumenes,
          <source>In proceedings of Workshop on Human Language, ENC05</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Moyotl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          ,
          <article-title>An Analysis on Frequency of Terms for Text Categorization</article-title>
          ,
          <source>Proceedings of XX Conference of Spanish Natural Language Processing Society (SEPLN-04)</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perez</surname>
          </string-name>
          :
          <article-title>Una Tecnica para la Identificacion de Terminos Multipalabra</article-title>
          ,
          <source>Proceedings of 2nd. National Conference on Computer Science</source>
          , Mexico,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Reyes-Aguirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moyotl-Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jimenez-Salazar</surname>
          </string-name>
          :
          <article-title>Reduccion de Terminos Indice Usando el Punto de Transicion</article-title>
          ,
          <source>In proceedings of Facultad de Ciencias de Computacion XX Anniversary Conferences, BUAP</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
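          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          :
          <article-title>EuroGOV: Engineering a Multilingual Web Corpus</article-title>
          ,
          <source>In Proceedings of CLEF 2005</source>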
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
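          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          :
          <article-title>WebCLEF 2005: Cross-Lingual Web Retrieval</article-title>
          ,
          <source>In Proceedings of CLEF 2005</source>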
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>TextCat: Language identification tool</article-title>
          , http://odur.let.rug.nl/~vannord/TextCat/,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tovar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carrillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          ,
          <article-title>Combining Keyword Identification Techniques</article-title>
          ,
          <source>Research on Computing Science Journal, ISSN 1665-9899</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Urbizagastegui</surname>
          </string-name>
          : Las posibilidades de la Ley de Zipf en la indizacion automatica,
          <source>Research report of the California Riverside University</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Zipf</surname>
          </string-name>
          :
          <article-title>Human Behavior and the Principle of Least-Effort</article-title>
          , Addison-Wesley, Cambridge MA,
          <year>1949</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>