<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>REINA at WebCLEF 2006. Mixing Fields to Improve Retrieval</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Carlos G. Figuerola, José L. Alonso Berrocal, Ángel F. Zazo Rodríguez, Emilio Rodríguez; REINA Research Group, University of Salamanca</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the REINA Research Group of the University of Salamanca in WebCLEF 2006. The task in which we participated this year is the Monolingual Mixed Task in Spanish. To select the Spanish web pages of the EuroGov collection, the whole collection was processed with a language guesser, searching for pages in Spanish. All pages in the .es domain were also pre-selected. Our focus this year is to test pre-retrieval ways of mixing fields or elements of information in web pages, as well as to test the retrieval capacity of these fields. In retrieval systems based on the vector space model with a tf × idf weighting scheme, terms from several sources can be mixed in a single index by operating on the term frequency in the document. The BODY field is the most powerful from the point of view of retrieval, but the ANCHORS of backlinks add a considerable improvement. META fields, in contrast, contribute little to the improvement in retrieval.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This article describes the participation of the REINA Research Group in the WebCLEF 2006
track. The task in which we participated this year is the Monolingual Mixed Task in Spanish.
The document collection is the same as in 2005 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], as are part of the topics.
      </p>
      <p>
        Last year, however, we limited our work exclusively to documents or pages belonging
to the .es domain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This time we have chosen to extend the document base to
all pages that, although in other domains, are also in Spanish.
      </p>
      <p>Being within the .es domain, on the other hand, does not mean that a page is necessarily
in Spanish; many pages are in one of the other languages spoken in Spain (Catalan, Basque,
Galician...), and others are international versions, mainly in English but also in French and
German. Conversely, there are pages in Spanish in several of the other European domains
that make up the collection used in WebCLEF. Thus, if we really want to find all the pages
that are in Spanish, it is necessary to process the whole collection and select those pages that are in
Spanish. This is what we have done, adding to them all the pages belonging to the .es
domain.</p>
      <p>
        Unfortunately, the headers of the pages do not provide trustworthy information, either about
the language or about anything else; in many cases the headers are simply empty, whereas in others
their contents do not correspond to reality. Therefore, we had to resort to a
language detector to select the pages in Spanish within the various European domains. The selected
detector is TextCat [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], software based on building n-gram patterns for each
possible language and then categorizing the text whose language is to be guessed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
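      <p>As an illustration only, the n-gram approach of [<xref ref-type="bibr" rid="ref2">2</xref>] can be sketched as follows: each language is represented by a ranked profile of its most frequent character n-grams, and a text is assigned to the language whose profile is closest under an "out-of-place" rank distance. The function names, the profile size of 400 and the two one-sentence training samples are assumptions made for this sketch; they are not TextCat's code or its language models, which are trained on far larger texts.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

def ngram_profile(text, max_n=5, top_n=400):
    """Return {ngram: rank} for the top_n most frequent character n-grams."""
    counts = Counter()
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + token + "_"          # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency, then alphabetical, so that ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_n]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty=400):
    """Sum of rank differences; n-grams absent from the language get max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training samples (assumption: real profiles need much more text).
ES_SAMPLE = "el gobierno de españa publica información para los ciudadanos"
EN_SAMPLE = "the government publishes information for all citizens of the country"
profiles = {"es": ngram_profile(ES_SAMPLE), "en": ngram_profile(EN_SAMPLE)}
```
]]></preformat>
      <p>With profiles of this kind, a call such as guess_language(page_text, profiles) returns the label of the closest profile; a page would be kept when that label is "es" or when its URL is in the .es domain.</p>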
      <p>The selection of the topics, on the other hand, is easy, since their language is clearly
marked. All the topics in Spanish have been processed, including both named page and
home page topics, and both the manually created and the automatically generated ones. All have been
handled in the same way.</p>
      <p>This paper is organized as follows: the next section describes the approach adopted to
solve the task; the following section presents the problems and the options adopted in the
lexical analysis and extraction of terms from the pages. Next, the usable fields or elements
of information in indexing and retrieval are discussed. The following section describes the runs
carried out and discusses the results. Finally, conclusions are provided.</p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <p>
        The basic strategy adopted is first to find the most relevant pages for each topic and then,
depending on the type of topic, to rerank the list of the most relevant documents found. Thus, besides
preselecting the pages that form the collection (those that belong to the .es domain plus those
in other domains that are in Spanish), our work has two basic parts: finding relevant pages and then
reranking the pages found. Finding the relevant pages for a given topic can be approached by means
of a conventional indexing and retrieval system, such as those based on the vector
space model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Nevertheless, a Web page contains more informative elements
than the text that the user sees in the browser window. Even within that
text, we can find certain structures that can help, along with those other informative elements, to
improve retrieval.
      </p>
      <p>
        For example, some of those elements are the TITLE field, some META tags, the
ANCHORS of backlinks, etc. Within the BODY field, which is what is displayed in the browser,
we can distinguish parts that use different typography, for example. Thus, we have different
sources of information that we must mix or fuse. Two basic fusion strategies have been proposed:
pre-retrieval fusion and post-retrieval fusion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The latter strategy is the one we applied
in WebCLEF 2005; basically, it consists of building a separate index for each element of information to
be fused. Whenever a topic is to be solved, it is run against each of the indexes and the
results obtained from each index are then merged.
      </p>
      <p>This time we wanted to test the pre-retrieval strategy, which consists of building a
single index with the terms of all the elements, but weighted differently. Once that single
index is built, the topics are run against it in the usual way. Naturally, in building that single
index we may wish to give more or less weight to the terms coming from a given field or element of
information; this allows us to use a mixture that, if well tuned, should provide good
retrieval results.</p>
      <p>
        Applying a different weight to each component of the mixture is easy in our
case. Since for indexing and retrieval we use our software Karpanta [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], based on the
well-known vector space model, we only need to operate on the frequency of
each term in each of the components of the mixture.
      </p>
      <p>
        In this case we applied the ATU weighting scheme (slope=0.2) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], but the idea is similar for
any weighting scheme based on tf × idf. We can apply a coefficient to tf depending on the
component of the mixture in which the term appears, so that the weight of that term increases or
decreases according to the component in which it appears, while still taking into account the
frequency in the document and the IDF.
      </p>
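      <p>A minimal sketch of this idea, assuming a plain tf × idf weight rather than the ATU scheme actually used, is given below. Each occurrence of a term contributes the factor fd of the field in which it occurs, so the mixed tf of a term is the sum over fields of fd times its frequency in that field. The factors are the fd values listed for RUN 2; the function names, the toy document and the document frequencies are hypothetical and do not reflect Karpanta's internals.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

# fd factors as listed for RUN 2 in this paper.
FIELD_FACTORS = {"BODY": 1.0, "ANCHOR": 1.0, "TITLE": 1.5, "META-DESC": 1.5,
                 "META-TITLE": 1.0, "META-KEY": 0.5, "H1": 0.8, "H2": 0.8}

def mixed_tf(doc_fields, factors=FIELD_FACTORS):
    """Combine per-field term occurrences into one weighted tf per term."""
    tf = Counter()
    for field, terms in doc_fields.items():
        fd = factors.get(field, 0.0)      # fields without a factor are ignored
        for term in terms:
            tf[term] += fd
    return tf

def tfidf_vector(doc_fields, doc_freq, n_docs):
    """Plain tf x idf weights (an assumption; not ATU) from the mixed tf."""
    return {term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in mixed_tf(doc_fields).items() if tf > 0}

# Hypothetical page: terms already extracted and normalized per field.
doc = {"BODY": ["ministerio", "salud", "salud"],
       "TITLE": ["salud"],
       "META-KEY": ["ministerio"]}
weights = tfidf_vector(doc, doc_freq={"ministerio": 10, "salud": 100}, n_docs=1000)
```
]]></preformat>
      <p>In this toy page, "salud" occurs twice in BODY and once in TITLE, so its mixed tf is 2 × 1 + 1 × 1.5 = 3.5, which is then multiplied by its idf as usual.</p>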
      <p>
        As for the second part of our strategy, reranking the list of retrieved documents or pages
according to the type of topic, we have considered only the case of Home Page topics. To
locate home pages, a simple strategy was followed. First, the topics that could be
of type Home Page were detected manually, although this is a strongly subjective judgement.
Then, for the results of those topics, a coefficient was calculated on the basis of two criteria. The first is
a simple heuristic based on the presence in TITLE of certain expressions: main, welcome page, etc.
A similar heuristic was also applied to the file name (for example, main.html, home.html,
etc.). The other criterion is the length of the URL, understood as the number of path levels in the URL
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: the smaller the number of levels, the more probable it is that the page is a Home Page [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>The coefficient obtained is simply multiplied by the similarity of each page retrieved for each
Home Page topic, and the results of each retrieval are reranked accordingly.</p>
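      <p>The exact formula of the coefficient is not given here, so the following is only a sketch under stated assumptions: the coefficient is taken as 1 / (1 + number of path levels), and it is multiplied by a hypothetical bonus of 1.5 when the title or the file name contains one of the expressions mentioned above. The function names, the bonus value and the example URLs are all illustrative.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlparse

TITLE_HINTS = ("main", "welcome")   # expressions of the kind mentioned in the text
FILE_HINTS = ("main", "home")       # e.g. main.html, home.html

def path_depth(url):
    """Number of non-empty path levels in the URL."""
    return len([level for level in urlparse(url).path.split("/") if level])

def home_page_coefficient(url, title, hint_bonus=1.5):
    """Hypothetical coefficient: higher for shallow URLs and home-like names."""
    coefficient = 1.0 / (1 + path_depth(url))
    filename = urlparse(url).path.rsplit("/", 1)[-1].lower()
    title_hit = any(hint in title.lower() for hint in TITLE_HINTS)
    file_hit = any(filename.startswith(hint) for hint in FILE_HINTS)
    if title_hit or file_hit:
        coefficient *= hint_bonus
    return coefficient

def rerank(results):
    """results: (url, title, similarity) triples; returns (url, score) by descending score."""
    scored = [(url, similarity * home_page_coefficient(url, title))
              for url, title, similarity in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical result list for one Home Page topic.
results = [
    ("http://www.example.es/a/b/page.html", "Informe anual", 0.9),
    ("http://www.example.es/", "Welcome page", 0.5),
]
ranking = rerank(results)
```
]]></preformat>
      <p>In this example the root page overtakes the deeper page despite its lower similarity, which is the intended effect of the boosting.</p>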
      <sec id="sec-2-1">
        <title>Lexical Analysis and Extraction of Terms</title>
        <p>The first operation, prior to any other, is the conversion of the Web pages to plain text. This
is necessary even to determine the language of each page and decide whether to select it.
Obtaining the plain text is not trivial, nor is it free of problems. As already said,
one cannot trust that standards are followed. It cannot even be guaranteed
that the content is HTML, even if the page begins with the appropriate tag. One may
come directly across binary code, PDFs and the like. The same happens with META tags; even
when they are present, they do not always offer correct information.</p>
        <p>In fact, many pages do not even contain text, so the first step is to determine the type of content. The old
and well-known file command, properly tuned, can help to determine the real content of each page. In
addition, it gives us another important piece of information: the character encoding used. For
pages in Spanish, this is usually ISO 8859 or UTF-8; this information is needed
to handle special characters properly. The conversion to plain text, when the page turns out to
be HTML, is carried out by means of w3m. There are other converters, but after several tests the
best results were obtained with w3m.</p>
        <p>
          Once the plain text is obtained, we can determine the language with TextCat. The documents
in Spanish, as well as all those belonging to the .es domain, form our collection. From these,
terms must be extracted and normalized in some way. Basically, characters are converted to
lowercase, and accents are removed, as are numbers, punctuation and similar characters.
Stop words are eliminated and the terms are passed through an enhanced s-stemmer [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
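        <p>These normalization steps can be sketched as follows. This is a simplified illustration: the stop word list is a tiny sample, and s_stem is a naive plural-stripping stand-in, not the enhanced s-stemmer of [<xref ref-type="bibr" rid="ref5">5</xref>], whose rules are not reproduced here. All names are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
import unicodedata

# Tiny illustrative stop word list (assumption: the real list is much longer).
STOP_WORDS = {"de", "la", "el", "los", "las", "en", "y", "del", "para"}

def strip_accents(text):
    """Remove combining marks after NFD decomposition (this also maps ñ to n)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def s_stem(term):
    """Naive plural stripping; a stand-in for an s-stemmer, not the one in [5]."""
    if len(term) > 4 and term.endswith("es") and term[-3] not in "aeiou":
        return term[:-2]                  # ciudades -> ciudad
    if len(term) > 3 and term.endswith("s"):
        return term[:-1]                  # ministerios -> ministerio
    return term

def extract_terms(text):
    """Lowercase, strip accents, keep only letter runs, drop stop words, stem."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"[a-z]+", text)  # discards digits and punctuation
    return [s_stem(token) for token in tokens if token not in STOP_WORDS]

terms = extract_terms("Los Ministerios de Sanidad y Educación en 2006")
```
]]></preformat>
        <p>The resulting term list is what would be counted, field by field, when building the index.</p>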
      </sec>
      <sec id="sec-2-2">
        <title>Used Fields</title>
        <p>
          The elements or sources of information that we can consider in a Web page are diverse. The
base is obviously the BODY field, but in addition we can use the TITLE field, which seems clearly
descriptive, as well as several META tags that are usually used for these purposes, in particular META
name="Description" and META name="Keywords". Nevertheless, as already
indicated in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], these fields are not present in a uniform way in all pages. Many pages do not contain
them, and others contain META values generated automatically by the programs that produced
those Web pages.
        </p>
        <p>In other cases, although the values are entered manually by the authors of the pages, they are of little
use. Additionally, we can expect terms that appear in prominent typography to be
more representative. The H1, H2, etc. tags or fields are an example. Unfortunately, the use of
these tags is not uniform either, and it is declining in favour of explicit definitions of fonts and
character sizes, which are more difficult to process.</p>
        <p>Another important element is the ANCHORS of the backlinks that pages receive. More or
less briefly, these ANCHORS describe the page to which they link; this description is
important because it is written by someone other than the author of the page that we want
to index. We can expect this description to add terms different from those used
in the page itself. Nevertheless, many of these ANCHORS may refer to very specific parts
of the target page; also, there are pages with many backlinks and many ANCHORS, and others
with few or none. And, in any case, we have the problem of obtaining these ANCHORS. In our case, we
processed the whole EuroGov collection to obtain all the ANCHORS and links pointing to
pages in Spanish or belonging to the .es domain. Evidently, outside EuroGov there will be
more links to those pages, but we have no way of obtaining them.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Run</th><th>Fields in the index (fd)</th></tr>
            </thead>
            <tbody>
              <tr><td>RUN 1</td><td>BODY (fd=1)</td></tr>
              <tr><td>RUN 2</td><td>BODY (fd=1), ANCHOR (fd=1), TITLE (fd=1.5), META-DESC (fd=1.5), META-TITLE (fd=1), META-KEY (fd=0.5), H1 (fd=0.8), H2 (fd=0.8)</td></tr>
              <tr><td>RUN 3</td><td>Same as RUN 2 + Home Pages boosting</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Finally, we have built indexes with the following elements: BODY, TITLE, meta-title,
meta-description, meta-keywords, H1, H2, ANCHORS.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Runs Executed</title>
      <p>With topics and relevance judgements from WebCLEF 2005, we carried out a training phase that
allowed us to test several mixtures of elements in various proportions. We then made three
official runs and several unofficial ones. The first run was executed against a single index
built from the BODY field only, and serves as our baseline. In the second run, the
index is a mixture of all the fields mentioned above in the proportions indicated in the table.
The third run is based on the same index as the second, but adds boosting of Home Pages, based
on the length of the URLs and the simple heuristic described above.</p>
      <p>The results of these official runs clearly confirm the advantage of using those additional
elements of information. Of them, the most useful seem to be the ANCHORS of backlinks.</p>
      <p>In addition, several unofficial runs confirm the official results. The figure
shows the results. Each run is executed against an index built with the terms of a single field or
element of information (fd=1), without using the
terms that appear in BODY. The H1 and H2 fields are not shown in the figure because they produce
extremely low results. Evidently, many documents or pages lack one or more of these
fields, and so could never be retrieved in this way; but it is a good way to verify separately
the retrieval capacity of these fields.</p>
      <p>As might be expected, each field on its own produces worse results than BODY, which is
normal, since BODY is where the visible text of the page is. But, apart from BODY,
the field with the greatest retrieval power is the ANCHOR of backlinks, although in many cases
those ANCHORS are very short. It seems that, as the saying goes, the descriptions that others
make of a page are quite effective for its retrieval. The TITLE field follows at a short distance,
which was foreseeable. The fields based on META tags, however, offer poor results,
far behind ANCHORS and TITLE, even though they are fields oriented specifically to
retrieval. Although there is little difference between the results of the three META fields observed (title,
description and keywords), it is, curiously, the last one that performs worst of the three.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have described our participation in WebCLEF 2006, the strategy adopted, the experiments
conducted and the results obtained. Using fields or elements of information in addition to
the text or BODY of the pages improves retrieval results. One way to use these
fields is to build a single index with the terms that appear in them, along with the words
that appear in the BODY of the page. This requires weighting the terms of each field differently,
with weights that can be adjusted empirically.</p>
      <p>Of all those fields, the most effective seems to be the one formed by the ANCHORS of the backlinks
that each page receives. The TITLE field also contributes notably to the improvement
of retrieval. The content of the META tags, however, seems to be of limited use from the
point of view of retrieval.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This research has been partially funded by the Government of the Autonomous Community of
Castilla y León as project ref. SA089/04.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Steven M.</given-names> <surname>Beitzel</surname></string-name>,
          <string-name><given-names>Eric C.</given-names> <surname>Jensen</surname></string-name>,
          <string-name><given-names>Abdur</given-names> <surname>Chowdhury</surname></string-name>,
          <string-name><given-names>David</given-names> <surname>Grossman</surname></string-name>,
          <string-name><given-names>Ophir</given-names> <surname>Frieder</surname></string-name>, and
          <string-name><given-names>Nazli</given-names> <surname>Goharian</surname></string-name>.
          <article-title>On fusion of effective retrieval strategies in the same information retrieval system</article-title>.
          <source>Journal of the American Society for Information Science and Technology (JASIST)</source>,
          <volume>55</volume>(<issue>10</issue>):<fpage>859</fpage>-<lpage>868</lpage>,
          <year>2004</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>William B.</given-names> <surname>Cavnar</surname></string-name> and
          <string-name><given-names>John M.</given-names> <surname>Trenkle</surname></string-name>.
          <article-title>N-gram-based text categorization</article-title>.
          In <source>Third Annual Symposium on Document Analysis and Information Retrieval</source>,
          April 11-13, 1994, Las Vegas, Nevada, pages
          <fpage>161</fpage>-<lpage>175</lpage>,
          <year>1994</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Carlos G.</given-names> <surname>Figuerola</surname></string-name>,
          <string-name><given-names>José L.</given-names> <surname>Alonso Berrocal</surname></string-name>,
          <string-name><given-names>Ángel F.</given-names> <surname>Zazo Rodríguez</surname></string-name>, and
          <string-name><given-names>Emilio</given-names> <surname>Rodríguez</surname></string-name>.
          <article-title>REINA at the WebCLEF task: Combining evidences and link analysis</article-title>.
          In Peters [9].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Carlos G.</given-names> <surname>Figuerola</surname></string-name>,
          <string-name><given-names>José Luis A.</given-names> <surname>Alonso Berrocal</surname></string-name>,
          <string-name><given-names>Ángel F.</given-names> <surname>Zazo Rodríguez</surname></string-name>, and
          <string-name><given-names>Emilio</given-names> <surname>Rodríguez Vázquez de Aldana</surname></string-name>.
          <article-title>Herramientas para la investigación en recuperación de información: Karpanta, un motor de búsqueda experimental</article-title>.
          <source>Scire</source>,
          <volume>10</volume>(<issue>2</issue>):<fpage>51</fpage>-<lpage>62</lpage>,
          <year>2004</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Carlos G.</given-names> <surname>Figuerola</surname></string-name>,
          <string-name><given-names>Ángel F.</given-names> <surname>Zazo</surname></string-name>,
          <string-name><given-names>Emilio</given-names> <surname>Rodríguez Vázquez de Aldana</surname></string-name>, and
          <string-name><given-names>José Luis</given-names> <surname>Alonso Berrocal</surname></string-name>.
          <article-title>La recuperación de información en español y la normalización de términos</article-title>.
          <source>Inteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial</source>,
          <volume>8</volume>(<issue>22</issue>):<fpage>135</fpage>-<lpage>145</lpage>,
          <year>2004</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Edward A.</given-names> <surname>Fox</surname></string-name> and
          <string-name><given-names>Joseph A.</given-names> <surname>Shaw</surname></string-name>.
          <article-title>Combination of multiple searches</article-title>.
          In <source>The Second Text REtrieval Conference (TREC-2)</source>.
          NIST Special Publication 500-215,
          <year>1993</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>W.</given-names> <surname>Kraaij</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Westerveld</surname></string-name>, and
          <string-name><given-names>D.</given-names> <surname>Hiemstra</surname></string-name>.
          <article-title>The importance of prior probabilities for entry page search</article-title>.
          In <source>Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>, pages
          <fpage>27</fpage>-<lpage>34</lpage>. ACM Press,
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Gertjan</given-names> <surname>van Noord</surname></string-name>.
          <article-title>TextCat language guesser</article-title>. http://www.let.rug.nl/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Carol</given-names> <surname>Peters</surname></string-name>, editor.
          <source>Results of the CLEF 2005 Cross-Language System Evaluation Campaign. Working notes for the CLEF 2005 Workshop</source>,
          21-23 September, Vienna, Austria,
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Vassilis</given-names> <surname>Plachouras</surname></string-name>,
          <string-name><given-names>Fidel</given-names> <surname>Cacheda</surname></string-name>,
          <string-name><given-names>Iadh</given-names> <surname>Ounis</surname></string-name>, and
          <string-name><given-names>Cornelis Joost</given-names> <surname>van Rijsbergen</surname></string-name>.
          <article-title>University of Glasgow at the Web Track: Dynamic application of hyperlink analysis using the query scope</article-title>.
          In <source>The Twelfth Text REtrieval Conference (TREC 2003)</source>.
          NIST Special Publication 500-255,
          <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Gerard</given-names> <surname>Salton</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Wong</surname></string-name>, and
          <string-name><given-names>C. S.</given-names> <surname>Yang</surname></string-name>.
          <article-title>A vector space model for automatic indexing</article-title>.
          <source>Communications of the ACM</source>,
          <volume>18</volume>:<fpage>613</fpage>-<lpage>620</lpage>,
          <year>1975</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
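          [12]
          <string-name><given-names>Börkur</given-names> <surname>Sigurbjörnsson</surname></string-name>,
          <string-name><given-names>Jaap</given-names> <surname>Kamps</surname></string-name>, and
          <string-name><given-names>Maarten</given-names> <surname>de Rijke</surname></string-name>.
          <article-title>Overview of WebCLEF 2005</article-title>.
          In Peters [9].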
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Amit</given-names> <surname>Singhal</surname></string-name>,
          <string-name><given-names>Chris</given-names> <surname>Buckley</surname></string-name>, and
          <string-name><given-names>Mandar</given-names> <surname>Mitra</surname></string-name>.
          <article-title>Pivoted document length normalization</article-title>.
          In <source>Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>,
          August 18-22, 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pages
          <fpage>21</fpage>-<lpage>29</lpage>. ACM,
          <year>1996</year>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>Stephen</given-names> <surname>Tomlinson</surname></string-name>.
          <article-title>Robust, web and terabyte retrieval with Hummingbird SearchServer at TREC 2004</article-title>.
          In <source>The Thirteenth Text REtrieval Conference (TREC 2004)</source>.
          NIST Special Publication 500-261,
          <year>2004</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>