<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Web Track for CLEF2005 at ALICANTE UNIVERSITY</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Trinitario Martínez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Noguera</string-name>
          <email>elisa@dlsi.ua.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Muñoz</string-name>
          <email>rafael@dlsi.ua.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Llopis</string-name>
          <email>llopis@dlsi.ua.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software and Computing Systems University of Alicante</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our first experiment for the CLEF 2005 Multilingual Web Track. This year we have focused our main effort on the Spanish part of the Mixed Monolingual task, but we have also participated in several other languages and in the Bilingual English-Spanish task. A passage-based IR system is applied at the retrieval phase. A language identifier has also been built in order to obtain a fully automatic system that does not need to know the topic language.</p>
      </abstract>
      <kwd-group>
        <kwd>Information retrieval</kwd>
        <kwd>question answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The Cross-Language Multilingual Web Retrieval (WebCLEF) track consists of evaluating Information
Retrieval systems on noisy multilingual documents. In particular, the WebCLEF document collection consists of
web pages from European governmental sites covering at least ten languages/countries.</p>
      <p>Retrieving in a multilingual/cross-lingual manner is a natural and well-established way of carrying out web
searches. The aim of this specific task is to find the correct document described by the topic. This
paper is structured as follows: the next section describes the collection and topics used; then we explain the corpus
processing and retrieval; afterwards we present the results and conclusions; finally we discuss
future improvements of the system.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Processing phase</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Specifications</title>
      <p>The collection covers the following country/domain codes: at, be, cy, cz, de, dk, ee, es, fi, fr, gr, hu, ie, int, it, lt, lu, lv, mt, nl, pl, pt, ru, se, si, sk, uk.</p>
      <p>The amount of data is impressive: over 20 gigabytes of compressed text files containing diverse governmental
information in multiple formats, such as HTML, ZIP, DOC and PDF. Documents are gathered in a
pseudo-XML format, storing domain, URL, id, MD5 signature, type (html, doc, pdf…) and data (in binary or text form).
This corpus has been rather controversial, and in the end the organizers decided that only HTML documents were to be
retrieved.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Data Preprocessing</title>
      <p>In our first participation in this kind of competition, we have focused our efforts on Spanish monolingual
queries, and have also made some other symbolic attempts. We have divided the corpus by language; this is
required in order to avoid handling the whole amount of data at once.</p>
      <p>Once the HTML files are extracted from the corpus:
1. First, META tags are collected from the files. Specifically, the title and keywords tags are saved for
the retrieval phase.
2. The second step consists of replacing HTML character entities by their equivalent characters, for example "&amp;raquo;" by "»".
3. Third, regular expressions are used to remove the remaining tags, obtaining plain text.
4. At the end of the process, the id, keywords, title and plain text of each document are stored in SGML files in
order to form a correct input for the Information Retrieval system (TREC format).</p>
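      <p>The four preprocessing steps above can be sketched as follows (a minimal illustration in Python; the function names, regular expressions and record layout are our own assumptions, not the system's actual implementation):</p>
      <preformat><![CDATA[
```python
import html
import re

def clean_document(raw_html):
    """Sketch of steps 1-3: collect META data, decode entities, strip tags."""
    # 1. Collect META information: the title and keywords tags.
    title_m = re.search(r"<title[^>]*>(.*?)</title>", raw_html, re.I | re.S)
    kw_m = re.search(r'<meta[^>]+name=["\']keywords["\'][^>]+'
                     r'content=["\']([^"\']*)["\']', raw_html, re.I)
    # 2. Replace HTML character entities by their equivalent characters.
    text = html.unescape(raw_html)
    # 3. Remove the remaining tags with regular expressions, keeping plain text.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return {"title": title_m.group(1).strip() if title_m else "",
            "keywords": kw_m.group(1).strip() if kw_m else "",
            "text": text}

def to_trec(doc_id, doc):
    # 4. Store id, title, keywords and plain text as a TREC-style SGML record.
    return ("<DOC>\n<DOCNO>%s</DOCNO>\n<TITLE>%s</TITLE>\n"
            "<KEYWORDS>%s</KEYWORDS>\n<TEXT>%s</TEXT>\n</DOC>"
            % (doc_id, doc["title"], doc["keywords"], doc["text"]))
```
]]></preformat>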
    </sec>
    <sec id="sec-5">
      <title>WebCLEF Processing</title>
      <p>[Figure: system architecture. The topics and the EuroGOV collection go through preprocessing into SGML files, which feed the search engine, producing a ranked output.]</p>
      <p>We have also developed a language identifier with the purpose of fully automating the Mixed Monolingual
process.</p>
      <p>In addition, we had built a specific module to extract PDF, DOC and ZIP files from EuroGOV, but it
has not been used because the organizers decided not to retrieve these file types.</p>
    </sec>
    <sec id="sec-6">
      <title>2.3 Topic creation</title>
      <p>As this has been the first Multilingual Web Retrieval Track at CLEF, topics have been developed by the participants.
Queries are based on a collection of 547 multilingual topics, classified into two categories:
- Home page finding: a home page is searched for (e.g. www.dlsi.ua.es).
- Named page finding: a specific non-home page is searched for (e.g.
http://www.dlsi.ua.es/cgibin/wwwadm/personal.cgi?id=eng&amp;nom=rafael&amp;tipus=pdi).</p>
      <p>In this phase, we created 30 monolingual known-item topics (15 named-page and 15 home-page topics) in
Spanish.</p>
      <p>We also detected identical or similar pages in the collection, both with search engines and through manual
searches of the corpus, in order to produce consistent and well-formed topics. An English translation
of each topic statement is also provided for use in the multilingual task. For example, given
a topic with this title:</p>
      <sec id="sec-6-1">
        <title>Presidente del gobierno</title>
        <p>In the translation, the adjective "Spanish" is added to make the future search through the whole
corpus more precise:</p>
      </sec>
      <sec id="sec-6-2">
        <title>Spanish government president</title>
        <p>We developed several topics pointing to .PDF and .DOC files, but these were finally discarded by the organizers because
some participants found problems extracting text from these formats.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>2.4 Retrieving phase: IR-n system</title>
      <p>IR-n is a passage retrieval (PR) system. PR systems [6] work on contiguous fragments of text (passages) and
advance the IR field by proposing solutions to common problems of traditional IR systems. One of the main
advantages of these systems is that they determine not only whether a document is relevant, but also
which part of the document is relevant.</p>
      <p>The IR-n system uses sentences as the atoms from which passages are defined. Passages are usually composed
of a fixed number of sentences; this number depends strongly on the target collection. Furthermore,
IR-n uses overlapping passages in order to avoid that a document is missed as relevant when the
query words appear in adjacent passages.</p>
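      <p>The overlapping, sentence-based passages described above can be sketched as follows (the size and overlap values are our own illustrative assumptions, not IR-n's tuned settings):</p>
      <preformat>
```python
def make_passages(sentences, size=8, overlap=4):
    """Build passages of `size` consecutive sentences; consecutive passages
    share `overlap` sentences so query words falling near a passage
    boundary are still seen together in some passage."""
    step = max(1, size - overlap)
    passages = []
    for start in range(0, len(sentences), step):
        chunk = sentences[start:start + size]
        if chunk:
            passages.append(chunk)
        if start + size >= len(sentences):
            break
    return passages
```
      </preformat>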
      <p>For every language, the resources used were those provided by the CLEF organizers (http://www.unine.ch/info/clef):
stemmers and stopword lists for all languages, except that Danish and Dutch stemmers are missing.
IR-n allows the use of different similarity measures (e.g. Okapi [7]), which is an advantage, since
the best similarity measure can be used for each task.</p>
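      <p>For reference, the Okapi similarity measure mentioned above can be sketched as follows (a textbook BM25 formulation; IR-n's exact variant and parameter values may differ):</p>
      <preformat>
```python
import math

def okapi_bm25_term(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single query term (textbook sketch).

    tf: term frequency in the document; df: number of documents containing
    the term; n_docs: collection size; k1, b: the usual free parameters."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)
```
      </preformat>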
      <p>In order to be able to index documents in HTML format, the indexing module has been modified to
consider the title and keywords tags. The words inside these tags are given more weight than the words in the
rest of the document, so that documents containing query words in these tags are ranked higher
than the rest.</p>
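      <p>The weighting of title and keywords terms described above can be sketched like this (the field weights and names are our own illustrative choices, not IR-n's actual values):</p>
      <preformat>
```python
def field_weighted_tf(doc, weights=None):
    """Count term frequencies, giving extra weight to title/keywords terms."""
    weights = weights or {"title": 3, "keywords": 2, "text": 1}
    tf = {}
    for field, w in weights.items():
        for term in doc.get(field, "").lower().split():
            tf[term] = tf.get(term, 0) + w
    return tf
```
      </preformat>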
      <p>Like other IR systems, IR-n uses several query expansion techniques. Previous
research [8] has shown that these approaches obtain better results when they are based both on passages and on the
complete document.</p>
      <p>
        Finally, this year a technique called combined passages [
        <xref ref-type="bibr" rid="ref3">9</xref>
        ] has been implemented for the ad-hoc task. It applies
fusion methods, which are used in multilingual tasks, to combine the results obtained with different passage sizes.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3 WebCLEF Tasks</title>
      <sec id="sec-8-1">
        <title>Mixed Monolingual task:</title>
        <p>Although we have focused our efforts on the Spanish competition, other languages have been taken into account: Spanish, Danish, German, English, Dutch and Portuguese.</p>
      </sec>
      <sec id="sec-8-2">
        <title>Bilingual task:</title>
        <p>English - Spanish</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>3.1 Mixed Monolingual task</title>
      <p>For the monolingual task, topics have been divided by language so that they are processed individually by the
system. The per-language results are finally merged into a single results file.</p>
      <p>Note that topics in other languages, such as Hungarian, Polish, French, Greek, Icelandic and Russian, were not taken
into account because we have no resources for these languages.</p>
    </sec>
    <sec id="sec-10">
      <title>3.1.1 Language identification</title>
      <p>As a baseline run, we have developed a language detector in order to automatically determine the
language of each topic. In particular, our language detector rests on these general bases:</p>
      <p>- Dictionaries (joined dictionaries, with language-specific stopwords)
- Characteristic word-part terminology (e.g. the ending "ção" in Portuguese)
- Specific governmental terminology (e.g. "administration" in English)</p>
      <p>This approach gave us good results in Spanish, English, Portuguese and Danish. Unfortunately, Dutch and
German are too similar to each other, so the system occasionally fails on them; we have no reliable experience
with these languages.</p>
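      <p>A toy sketch of such a detector (the word lists below are tiny illustrative samples, not the real dictionaries and terminology lists used):</p>
      <preformat>
```python
# Minimal stopword/ending-based language detector (illustrative data only).
STOPWORDS = {
    "es": {"el", "la", "de", "que", "y", "del", "gobierno"},
    "en": {"the", "of", "and", "to", "in", "administration"},
    "pt": {"o", "a", "de", "que", "e", "do"},
}
ENDINGS = {"pt": ("ção",), "es": ("ción",)}

def detect_language(topic):
    scores = dict.fromkeys(STOPWORDS, 0)
    for w in topic.lower().split():
        for lang, sw in STOPWORDS.items():
            if w in sw:
                scores[lang] += 2      # dictionary/stopword evidence
        for lang, ends in ENDINGS.items():
            if w.endswith(ends):
                scores[lang] += 3      # characteristic word-ending evidence
    return max(scores, key=scores.get)
```
      </preformat>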
      <p>Once the topic languages were identified, the topics were stored separately in different files and run against the corresponding part
of the EuroGOV corpus. In this way, the system responds faster than when the whole corpus is
used.
As for statistics, the language identifier was unable to determine the
language of seven topics from the Spanish, English, Portuguese, Dutch, German and Danish sets. The remaining
languages (87 topics) were not taken into account because they were not later processed by the IR system.</p>
    </sec>
    <sec id="sec-11">
      <title>3.2 BiEnEs task</title>
      <p>The BiEnEs (Bilingual English-Spanish) task consists of searching the Spanish part of the EuroGOV corpus
using topics written in English. Our automatic approach merges the output of three
different on-line translators; the main idea is that the more common a word is across the translations, the more relevant it is.
The translators used were FreeTranslator1, BabelFish2 and InterTran3. An example of this is illustrated
below:</p>
      <p>[Figure: English topics are translated into Spanish topics through FreeTranslator, BabelFish and InterTran, and the three translations are merged.]</p>
      <p>1 http://www.freetranslation.com/
2 http://world.altavista.com/
3 http://www.tranexp.com/win/itserver.htm</p>
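      <p>The frequency-based merging of the three translations can be sketched as follows (our own minimal formulation of the voting idea, not the exact procedure used):</p>
      <preformat>
```python
from collections import Counter

def merge_translations(translations):
    """Order words by how often they occur across the candidate translations."""
    counts = Counter()
    for t in translations:
        counts.update(t.lower().split())
    # Counter.most_common sorts by frequency; ties keep insertion order.
    return [word for word, _ in counts.most_common()]
```
      </preformat>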
    </sec>
    <sec id="sec-12">
      <title>4. Results</title>
    </sec>
    <sec id="sec-13">
      <title>4.1 Monolingual task results</title>
      <p>In our first experiment at WebCLEF 2005, we have focused on the Spanish topics of the
Mixed Monolingual task. It is worth mentioning that the Spanish set is the largest subset of the topic set, so it is an
important part of the task. We have also experimented with five other languages: Danish, German,
English, Dutch and Portuguese. The next table shows the average success at 1, 5, 10, 20 and 50, as well as the MRR. The
last column shows the difference between our system and the average results.</p>
      <sec id="sec-13-1">
        <title>Results by language</title>
        <p>[Table: average success at 1, 5, 10, 20 and 50 and MRR for ES, DA, DE, EN, NL and PT; the numeric values are not recoverable here.]</p>
        <p>The next table shows the results of applying automatic language detection in the Mixed Monolingual task.
As expected, these results are lower than the previous ones, and give an idea of how a fully mechanized system would
perform. By accident, a run with an erroneous topic numbering was submitted, but another run was made later.
The final results are shown below:</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>4.2 Bilingual English-Spanish results</title>
      <p>Clearly, the results obtained in this task are influenced both by the results of the Spanish Monolingual task and by the
combination of the three translators mentioned above.</p>
      <sec id="sec-14-1">
        <title>Results</title>
        <p>Average success at 1: 0.0299. Average success at 5: 0.0522.</p>
      </sec>
    </sec>
    <sec id="sec-15">
      <title>5 Conclusions</title>
      <p>In this paper we have presented the first version of our system for the Multilingual Web Track at CLEF. We have
targeted the Mixed Monolingual task, specifically the Spanish, Danish, Dutch, German, English and Portuguese
languages. In Spanish we are above the average, while in the other languages the system performs worse
(we had never worked with Danish or Dutch before). More time would be desirable in order to finish and tune the
whole system.</p>
      <p>For the automatic language detection process, we need a better language detector: the one used
here was a quickly developed attempt, far from perfect.</p>
      <p>For the Bilingual English-Spanish task, the conclusion is clear: a general-purpose translator is not a good tool
here, because the retrieval collection is focused on a specific domain such as governmental
processes. Our three-translator combination works better than any single translator on its own, but this is not the ideal
solution, and we consider a specialized translator a must.</p>
      <p>Finally, we have sometimes found that the keywords tags extracted from the EuroGOV corpus added noise to
the system, because a single HTML document can carry keywords from several governmental scopes. This is why they
do not work perfectly and sometimes worsen the results.</p>
    </sec>
    <sec id="sec-16">
      <title>6 Future works</title>
      <p>One way to improve the proposed system in the future would be to extend our Mixed Monolingual task to
include the languages missing from this participation (Hungarian, Polish, French, Greek, Icelandic and Russian). Our
main limitation here is the lack of resources (stemmers, stopword lists and so on).</p>
      <p>Another promising direction would be to experiment with the hyperlinks of the HTML documents in the EuroGOV
collection, storing them and establishing some kind of relation between web pages. Extracting the
anchor text of each link could also add useful information for retrieval.</p>
      <p>A way to advance the automatic language identification phase would be to improve the current
identifier so that it uses n-grams, and to perform some discriminative machine learning of languages
specific to the EuroGOV corpus.</p>
      <p>We aim to extend the system so that the Multilingual task can be fully run in the next WebCLEF participation. This
will require the extraction of language cues by a specific ad-hoc detector.</p>
    </sec>
    <sec id="sec-17">
      <title>Acknowledgments and References</title>
      <p>This work has been partially supported by the Spanish Government (CICYT) with grant TIC2003-07158-C04-01
and also by the Regional Technology Ministry of the Valencia Government through the projects with references
GV04B-276 and GV04B-286.</p>
      <p>[4] WinEdt Dictionaries. http://www.winedt.org/Dict/
[5] Rafael M. Terol, Patricio Martínez-Barco, Fernando Llopis, Trinitario Martínez: An Application of NLP
Rules to Spoken Document Segmentation Task. NLDB 2005: 376-379.
[6] M. Kaszkiel and J. Zobel: Passage retrieval revisited. In Proceedings of the 20th Annual International ACM
SIGIR Conference, pages 178-185, Philadelphia, 1997.
[7] Aitao Chen and Fredric C. Gey: Combining query translation and document translation in cross-language retrieval. In
Carol Peters, Julio Gonzalo, Martin Braschler, et al., editors, 4th Workshop of the Cross-Language Evaluation Forum,
CLEF 2003, Lecture Notes in Computer Science, pages 108-121, Trondheim, Norway, 2003. Springer-Verlag.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Llopis, F., Muñoz, R., Noguera, E., Terol, R. M.: <article-title>IR-n r2: Using normalized passages</article-title>. <source>CLEF</source> <year>2004</year>. [2] Callan, J. P.: <article-title>Passage-Level Evidence in Document Retrieval</article-title>. <source>In Proceedings of the 17th Annual International Conference on Research and Development in Information Retrieval</source>, London, UK. Springer-Verlag (<year>1994</year>) <fpage>302</fpage>-<lpage>310</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[3] WebCLEF: Cross-lingual <source>web retrieval</source>, <year>2005</year>. http://ilps.science.uva.nl/webclef/ [8] Chen, A., Gey, F. C.: <article-title>Combining Query Translation and Document Translation in Cross-Language Retrieval</article-title>. 4th Workshop of the Cross-Language Evaluation Forum, CLEF <year>2003</year>, <fpage>108</fpage>-<lpage>121</lpage>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[9] Llopis, F., Noguera, E.: <article-title>Combining passages in monolingual experiments</article-title>. <source>In Workshop of the Cross-Language Evaluation Forum (CLEF 2005)</source>, in this volume, Vienna, Austria, <year>2005</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>