<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Reduction-Enrichment at WebCLEF∗</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <institution>Faculty of Computer Science</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems and Computation</institution>
          ,
          <addr-line>UPV</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we report the results obtained after submitting one run to the Mixed Monolingual task of WebCLEF 2006. We have used a text reduction process based on the selection of mid-frequency terms. Although our approach enhances precision, its recall must be improved by an enrichment process based on the addition of high co-occurrence terms. We have seen that an improvement of 40% was obtained on the corpus used last year in the BiEnEs task. However, we also observed low Mean Reciprocal Rank (MRR) values compared with those of the mixed monolingual task of WebCLEF 2005. We consider that our low MRR derives from a deficient preprocessing phase, but we must investigate this issue in detail.</p>
      </abstract>
      <kwd-group>
        <kwd>Text reduction</kwd>
        <kwd>Text enrichment</kwd>
        <kwd>Mixed-Monolingual</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>∗This work was partially supported by FCC-BUAP and the BUAP-701 PROMEP/103.5/05/1536 grant.</p>
      <p>Nowadays, WebCLEF has defined one task for the evaluation of search engines: the Mixed Monolingual task. Thus, in this paper we report the results obtained after the submission of one run to this task.</p>
      <p>We have used a text reduction and enrichment process and, therefore, we have organized this document in three sections. The next section describes the components of our search engine. Section 3 presents the evaluation results and, finally, a discussion of our findings is given.</p>
    </sec>
    <sec id="sec-2">
      <title>Description of the search engine</title>
      <p>We used a Boolean model with the Jaccard similarity formula for our system. Our goal was to determine the behaviour of document index reduction in an information retrieval environment. In order to reduce the terms of every document treated, we applied a technique named Transition Point, which is described as follows.</p>
      <sec id="sec-2-1">
        <title>The Transition Point Technique</title>
        <p>
          The Transition Point (TP) is a frequency value that splits the vocabulary of a text into two sets of terms (low and high frequency). This technique is based on Zipf's Law of Word Occurrences [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and also on the refined studies of Booth [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], as well as of Urbizagástegui [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. These studies demonstrate that mid-frequency terms are closely related to the conceptual content of a document. Therefore, it is possible to form the hypothesis that terms whose frequency is close to TP can be used as indexes of a document. A typical formula used to obtain this value is TP = (√(8 · I1 + 1) − 1) / 2, where I1 represents the number of words with frequency equal to 1; see [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
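        <p>As a minimal sketch (assuming the term frequencies of a document are available as a plain Python dictionary; the function name is illustrative), this value can be computed as follows:</p>
        <preformat>
from collections import Counter
from math import sqrt

def transition_point(frequencies):
    """Compute TP = (sqrt(8*I1 + 1) - 1) / 2, where I1 is the number
    of terms that occur exactly once in the document."""
    i1 = sum(1 for f in frequencies.values() if f == 1)
    return (sqrt(8 * i1 + 1) - 1) / 2

# Toy example: term frequencies of a short document
freqs = Counter("the cat sat on the mat the cat slept".split())
print(transition_point(freqs))
</preformat>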
        <p>
          Alternatively, TP can be located by identifying the lowest of the high frequencies that is not repeated in the document; this characteristic follows from the properties of Booth's law of low-frequency words [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In our experiments we have used this approach.
        </p>
        <p>Let us consider the frequency-sorted vocabulary of a document, i.e., V_TP = [(t1, f1), ..., (tn, fn)] with fi ≥ fi+1; then TP = fi−1 iff fi = fi+1. The most important words are those whose frequency is closest to TP, i.e.,</p>
        <p>TP_SET = {ti | (ti, fi) ∈ V_TP, U1 ≤ fi ≤ U2},   (1)
where U1 is a lower threshold obtained from a given neighbourhood percentage of TP (NTP), thus U1 = (1 − NTP) · TP. U2 is the upper threshold and is calculated in a similar way (U2 = (1 + NTP) · TP). Both in WebCLEF 2005 and in the current competition we have used NTP = 0.4, considering that the TP technique is language independent.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Term Enrichment</title>
        <p>Certainly, TP reduction may increase precision, but it also decreases recall. Due to this fact, we enriched the selected terms by adding new terms with characteristics similar to the initial ones. Specifically, given a text T with selected terms TP_SET, y is a new term if it co-occurs with some x ∈ TP_SET, i.e.,</p>
        <p>TP′_SET = TP_SET ∪ {y | x ∈ TP_SET ∧ (fr(xy) &gt; 1 ∨ fr(yx) &gt; 1)}.   (2)
Considering the text length, we only selected a window of size 1 around each term of TP_SET, and a minimum frequency of two for each bigram was required as the condition to include new terms.</p>
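        <p>A sketch of this enrichment step, assuming the document is available as a list of tokens (names are illustrative):</p>
        <preformat>
from collections import Counter

def enrich(tokens, tp_terms):
    """Add every term that forms a bigram with a TP term at least twice
    (window of size 1, minimum bigram frequency of 2), as in Equation 2."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    enriched = set(tp_terms)
    for (x, y), freq in bigrams.items():
        if freq &gt;= 2:
            if x in tp_terms:
                enriched.add(y)
            if y in tp_terms:
                enriched.add(x)
    return enriched
</preformat>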
      </sec>
      <sec id="sec-2-3">
        <title>Information Retrieval Model</title>
        <p>Our information retrieval system is based on the Boolean model and, in order to rank the retrieved documents, we used the Jaccard similarity function applied to both the query and every document of the corpus. Previously, each document was preprocessed and its index terms were selected (the preprocessing phase is described in Section 3.1). As we will see in Section 3.2, we represent each text using the selection given by Equation 1; additionally, after reduction, we carried out an enrichment process based on the identification of terms related to those selected (Equation 2).</p>
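        <p>As an illustration, Boolean retrieval with Jaccard ranking over the reduced and enriched index terms can be sketched as follows (a simplified view, not the exact implementation used):</p>
        <preformat>
def jaccard(query_terms, doc_terms):
    """Jaccard similarity between a query and a document, both
    represented as sets of index terms."""
    q, d = set(query_terms), set(doc_terms)
    if not q or not d:
        return 0.0
    return len(q.intersection(d)) / len(q.union(d))

def rank(query_terms, index):
    """index maps a document id to its set of reduced/enriched terms;
    documents are returned sorted by decreasing similarity."""
    scores = {doc: jaccard(query_terms, terms) for doc, terms in index.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
</preformat>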
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <sec id="sec-3-1">
        <title>Corpus</title>
        <p>
          We used the EuroGOV corpus provided by the WebCLEF forum, which is described in more detail in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], but we indexed only 20 domains: DE, AT, BE, DK, SI, ES, EE, IE, IT, SK, LU, MT, NL, LV, PT, FR, CY, GR, HU, and UK (we did not index the following domains: EU, RU, FI, PL, SE, CZ, LT). Due to this fact, only 1470 out of 1939 topics were evaluated, which is approximately 75.81% of the total number of topics. Although in Section 3.2 we present the MRR over all 1939 topics, the 469 topics related to the non-indexed domains were not evaluated.
        </p>
        <p>The preprocessing phase of the EuroGOV corpus was carried out by writing two scripts for obtaining the terms to be indexed from each document. The first script uses regular expressions to exclude all the information enclosed by the characters &lt; and &gt;. Although this script obtains very good results, it is very slow and, therefore, we decided to use it with only three domains of the EuroGOV collection, namely Spanish (ES), French (FR), and German (DE).</p>
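        <p>In essence, this first script reduces to a single regular expression (a minimal sketch; the actual script handled further cases):</p>
        <preformat>
import re

# Drop everything enclosed by the characters '&lt;' and '&gt;'
TAG_RE = re.compile(r"&lt;[^&gt;]*&gt;")

def strip_markup(page):
    """Return the text of a page with all markup removed, ready for
    term extraction; script and style bodies are not handled here."""
    return TAG_RE.sub(" ", page)
</preformat>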
        <p>On the other hand, we wrote a second script based on the HTML syntax for obtaining all the terms considered interesting for indexing, i.e., those different from script code (JavaScript, VBScript), cascading style sheets, HTML markup, etc. This script speeded up our indexing process, but it did not take into account that some web pages are incorrectly written and, therefore, we missed important information from those documents.</p>
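        <p>In spirit, this second script resembles the following sketch built on Python's html.parser (our original implementation differed in detail):</p>
        <preformat>
from html.parser import HTMLParser

class TermExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of script/style elements."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def extract_text(page):
    """Feed a decoded page and return its visible text."""
    parser = TermExtractor()
    parser.feed(page)
    return " ".join(parser.chunks)
</preformat>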
        <p>
          For every page compiled in the EuroGOV corpus, we also determined its language by using TextCat [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a widely used language identification program. We constructed our evaluation corpus with those documents identified as belonging to a language of the above list.
        </p>
        <p>Another preprocessing problem was the charset encoding, which makes the analysis even more difficult. Although the EuroGOV corpus is distributed in UTF-8, the documents that make up this corpus do not necessarily keep this charset. We have seen that for some domains the charset is declared in the HTML metadata tag, but we also found that this declaration can be wrong, perhaps because it was filled in without the supervision of the creator of the page, who may neither know nor care about character encodings. We consider this the most difficult problem in the preprocessing phase.</p>
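        <p>One way to cope with this (a hedged sketch, not our original code; the helper name is illustrative) is to try the declared encoding first and fall back gracefully:</p>
        <preformat>
def decode_page(raw_bytes, declared=None):
    """Try the charset declared in the HTML metadata first, then UTF-8;
    fall back to Latin-1, which never fails, as a last resort."""
    for charset in filter(None, (declared, "utf-8")):
        try:
            return raw_bytes.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw_bytes.decode("iso-8859-1")
</preformat>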
        <p>Finally, we eliminated stopwords for each language (except for Greek) as well as punctuation symbols. The same process was applied to the queries.</p>
        <p>For the evaluation of this corpus, a set of queries was provided by WebCLEF 2006.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Indexing reduction</title>
        <p>
          After our first participation in WebCLEF [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we carried out more experiments using only the Spanish-language documents of the EuroGOV corpus. We observed that a value of NTP = 0.4 with the reduction process shown in Equation 1 was adequate. Therefore, in this test we carried out one run with that value. This run used the evaluation corpus composed by the reduction of every text with the TP technique and a neighbourhood of 40% around TP, and enriched this set of terms with related terms as described by Equation 2.
        </p>
        <p>Table 1 shows the size of every evaluation corpus used: the vocabulary composed by the representation of all texts, |TP′_SET|, as well as the percentage of reduction obtained with respect to the original text. As we can see, the TP technique obtained a percentage of reduction lower than 5%, which also implies a reduction in indexing time for a search engine.</p>
        <p>We have used an index reduction method for our search engine that includes an enrichment step. Our proposal is based on the Transition Point technique, which allows us to obtain mid-frequency terms from every document to be indexed. Our method is linear in computational time and, therefore, it can be used in a wide spectrum of practical tasks.</p>
        <p>After submitting our run, we observed an enhancement when comparing the results obtained with those of the BiEnEs task in WebCLEF 2005. By using the enrichment, an improvement of more than 40% in MRR was achieved. However, using the Vector Space Model, results similar to those of the Boolean model were obtained.</p>
        <p>The TP technique has shown effective use in diverse areas of NLP, and its best features for NLP are mainly two: a high content of semantic information, and the sparseness of the vectors obtained for document representation in models based on the vector space model. On the other hand, its language independence allows this technique to be used in multilingual environments.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Booth</surname>
          </string-name>
          :
          <article-title>A Law of Occurrences for Words of Low Frequency</article-title>
          ,
          <source>Information and Control</source>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>CLEF</surname>
          </string-name>
          <year>2005</year>
          :
          <article-title>Cross-Language Evaluation Forum</article-title>
          , http://www.clef-campaign.org/,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiménez-Salazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sanchis</surname>
          </string-name>
          :
          <article-title>TPIRS: A System for Document Indexing Reduction on WebCLEF</article-title>
          ,
          <source>Extended abstract in Working Notes of CLEF 2005</source>
          , Vienna,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Reyes-Aguirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moyotl-Hernández</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiménez-Salazar</surname>
          </string-name>
          :
          <article-title>Reducción de Términos Índice Usando el Punto de Transición</article-title>
          ,
          <source>In Proceedings of the Facultad de Ciencias de la Computación XX Anniversary Conferences, BUAP</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          :
          <article-title>EuroGOV: Engineering a Multilingual Web Corpus</article-title>
          ,
          <source>In Proceedings of CLEF 2005</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          :
          <article-title>WebCLEF 2005: Cross-Lingual Web Retrieval</article-title>
          ,
          <source>In Proceedings of CLEF 2005</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>TextCat: Language identification tool</source>
          , http://odur.let.rug.nl/~vannoord/TextCat/,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Urbizagástegui</surname>
          </string-name>
          :
          <article-title>Las posibilidades de la Ley de Zipf en la indización automática</article-title>
          ,
          <source>Research report of the University of California Riverside</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Zipf</surname>
          </string-name>
          :
          <article-title>Human Behavior and the Principle of Least Effort</article-title>
          ,
          <source>Addison-Wesley</source>
          , Cambridge MA,
          <year>1949</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>