Web Track for CLEF2005 at ALICANTE UNIVERSITY
                      Trinitario Martínez, Elisa Noguera, Rafael Muñoz, Fernando Llopis
                              Department of Software and Computing Systems
                                          University of Alicante, Spain
                                      {tme, elisa, rafael, llopis}@dlsi.ua.es


                                                    Abstract

           This paper presents the first experiment done for the CLEF2005 Multilingual Web
           Track. At present conference we have focused our main effort in the Spanish part of the
           Mixed Monolingual task, but we have also participated in others several languages and
           in the Bilingual English-Spanish task. A passage based IR system is applied at
           retrieving phase. Also a language identifier has been created in order to build a full
           automatic system without the need of knowing the topic language.


Categories and Subject Descriptors
Information Search and Retrieval

General Terms
Measurement, Experimentation

Keywords
Information retrieval, question answering


1 Introduction
The Cross Language Multilingual Web Retrieval (WebCLEF) track consists of the evaluation of Information
Retrieval systems on noisy multilingual documents. Particularly, the WebCLEF document collection consists of
webpages from European governmental sites for at least 10 languages/countries.
Retrieving in a Multi/Crosslingual manner is a natural and common established way for carrying out web
searches. The aim of this specific task is to find the correct document on which the topic description is. This
paper is structured as follows: next section describes the collection and topics used, later we explain the corpora
processing and retrieving. Afterwards we show the results and conclusions, and finally we make discuss about
future improvements of the system.


2 Processing phase

2.1 Data Specifications

The targeted corpus is a mix of governmental sites in Europe. More concretely, the collection, EuroGOV,
consists of web documents crawled from European governmental sites. Here's a list of (top level) domains from
which pages are included:
              at, be, cy, cz, de, dk, ee, es, fi, fr, gr, hu, ie, int, it, lt, lu, lv, mt, nl, pl, pt, ru, se, si, sk, uk

The amount of data is impressive: over 20 gigas of compressed text files containing diverse governmental
information on multiple types, such us HTML, ZIP, DOC and PDF format. Documents are gathered in a pseudo-
XML format, storing domain, url, id, md5 signature, type (html, doc, pdf…) and data (in binary or text format).
This corpus has been very controversial, and finally just html documents were designed to be retrieved by
organizers.


2.2 Data Preprocessing
At our first participation in this kind of competitions, we have focused our efforts in Spanish monolingual
queries, and have made some others symbolic approaches. We have divided the corpus by language. This is
required in order to not managing the whole amount of data.

Once html files are extracted from the corpuss:
   1. Firstly, META labels are collected from the files. Specifically, title and keywords labels are saved for
        the retrieval phase.
   2. Second step consists of replacing HTML code by its equivalent, as for example “&raquo;”ó“>”.
   3. Thirdly, regular expressions are used in order to remove special tags, obtaining a plane text.
   4. At the end of the process, id, keywords, title and plane text of each document are stored in sgml files in
        order to conform a correct input for the Information Retrieval system (Trec format).


                                                   WebCLEF Processing

                                                                  Topics


 EuroGOV                                                                                     Search                     Ranked
                              Preprocessing                    Sgml files
 collection                                                                                  Engine                     Output


We also have developed a language identifier with the purpose of fully automating the Mixed Monolingual
process.
In addition to this, we had built up one specific module to extract pdf, doc and zip files from EuroGOV, but this
has not been used because organizers decided do not retrieving this files types.


2.3 Topic creation

As this has been the first Multilingual Retrieval Track at CLEF, topics have been developed by participants.

Queries are based on a collection of 547 multilingual topics. These are classified in two categories:
   Ø Home Page finding: a homepage web is searched (i.e. www.dlsi.ua.es).
   Ø Named Page finding: a specific non-homepage is searched in this case (i.e. http://www.dlsi.ua.es/cgi-
         bin/wwwadm/personal.cgi?id=eng&nom=rafael&tipus=pdi).
At this phase, we created 30 monolingual known-item topics (15 named-page and 15 home page topics) in
Spanish.
We also detected identical or similar pages in the collection by the use of search engines, and also by manual
searches through the corpuss in order to produce consistent and well-formed topics. Also an English translation
of the topic statement is provided with the purpose of being used in the multilingual task. For example, if we
have a topic with this title:

                                             Presidente del gobierno

In the traduction, the Spanish adjective is added to make more precise the future search through the whole
corpuss:
                                        Spanish government president


We developed several topics with .PDF and .DOC files which were finally discarded by organizers because of
some participants found some problems with these formats at extraction text task.


2.4 Retrieving phase: IR-n system
IR-n is a passage retrieval system (RP). RP systems [6] locate in contiguous fragment of text (passages) and
boost IR field by proposing a set of solutions to tradicional IR systems common problems. One of the main
advantages of these systems is that they allow us to determine not only if a document is relevant or not, but also
the detection of the relevant part of the document.

IR-n system uses the sentences as atoms with the aim to define the passages. The passages are usually composed
of a fixed number of sentences. This number has a great dependency of the targeted collection. Furthermore, IR-
n system uses overlapping passages in order to avoid that some documents cannot be considered relevant if
words of the question appear in adjacent passages.

For every language, the resources used were provided by the CLEF organizers (http://www.unine.ch/info/clef).
There are stemmers and stopword lists for all languages, with the lack of Danish and Dutch stemmers.

IR-n system allows the use of distinct similarity measures (Ex. Okapi [7]). This involves an advantage, so that, in
each task is used the best similarity measure.

With the aim of being able to indexing the documents in html format, indexing module has been modified to
consider the tags title and keywords. The words which are in these labels have more weight than the words of the
rest of the document in order to increase the value of the documents which have words of the query in the labels
than the rest one.

According to others IR systems, IR-n system uses different techniques of the query expansion. Previous
researches [8] have showed that the approaches get better results when they are based on passages and in the
complete document.

Finally, this year for the adhoc task has been implemented a technique called combined passages [9]. It applies
fusion methods, which are used in multilingual tasks to combine results with the different size of passages.


3 WebCLEF Tasks

Although we have focused our brief in the Spanish competition, others languages have been taken into account.
The targeted languages have been:

Mixed Monolingual task:
    Ø    Danish
    Ø    Spanish
    Ø    Dutch
    Ø    German
    Ø    English
    Ø    Portuguese


Bilingual task:
     Ø English - Spanish


3.1 Mixed Monolingual task
For the monolingual task, topics have been divided by language so that they are individually processed by the
system. The specific results are finally merged in a results file.


Note that other languages topics, like Hungarian, Polish, French, Greek, Icelandic and Russian where not taken
into account because we have not resources of these languages.


3.1.1 Language identification

As a baseline run, we have developed a language detector in order to automatically distinguish the correct
language of the topic. In particular, our language detector has this general bases:

    Ø Dictionary based (joined dictionaries, specific per-language stopwords)
    Ø Characterised part-of-word terminology (i.e. “ção” in the case of Portuguese)
    Ø Specific governmental terminology (i.e. “administration” in the case of English)

This philosophy gave us a good response in Spanish, English, Portuguese and Danish. Lamentably, Dutch and
German are too much similar, and the system becomes occasionally erroneous. We have not reliable experience
with these languages.

Once language topics were identified, they were separately stored in different files and run with the specific part
of the EuroGOV corpus. By this way, a faster response of the system is obtained than when whole corpus is
taken.
                                   Table 1: Language identification stadistics
                                  Language       # Corrects       # Incorrects
                                    Spanish          134                2
                                    English          117                2
                                  Portuguese          56                0
                                     Dutch            52               15
                                   German             44                2
                                    Danish            29                1


As statistical results, just to mention that the language identificator could not was capable of determinate the
language of seven topics from the Spanish, English, Portuguese, Dutch, German and Danish set. The rest of
languages (87 topics) were not taken into account because they were not later processed by the IR system.


3.2 BiEnEs task

The BiEnEs (Bilingual English-Spanish) task consists of carrying out searches in the Spanish corpuss of
EuroGOV from topics written in English. Our automatic approach has been performed by a merging of three
different on-line translators. The main idea is that the more common word is, the higher relevancy has.
The used translators have been Freetranslator1, BabelFish2 and InterTran3. An example of this is shown in next
picture:


                                            FreeTranslator


                                                                                       Translated
           English
                                               BabelFish                               to Spanish
            topics
                                                                                         topics


                                               InterTran


1
  http://www.freetranslation.com/
2
  http://world.altavista.com/
3
  http://www.tranexp.com/win/itserver.htm
4. Results

4.1 Monolingual task results
In the process of our first experiment at WEBClef2005, we have focused on the Spanish Topics part of the
Mixed Monolingual task. Also to mention Spanish Topics is the greater subset of the topic set. So, this is an
important part of the task. We also have been doing experiments with other five languages: Danish, German,
English, Dutch and Portuguese. On next table, averages at 1, 5, 10, 20 and 50 are shown, as the MRR too. The
last column shows the difference between our system and the average results.


                              Table 2: Mixed Monolingual official results per language
   Language      Aver. At 1     Aver. at 5 Aver. at 10 Aver. at 20 Aver. at 50            MRR            Dif
      ES          0.1716          0.3134      0.3433        0 .3731         0.4328       0.2377      +4,4261
     DA           0.0333          0.0667      0.0667        0.0667          0.0667       0.0500        -4,082
     DE           0.1579          0.2105      0.2632        0.3158          0.3158       0.1907       -9,4245
     EN           0.0496          0.0744      0.0826        0.0826          0.0909       0.0614      -15,2636
     NL           0.1356          0.1525      0.1525        0.1695          0.1695       0.1451       -9,4245
      PT          0.0508          0.1695      0.1695        0.2034          0.2712       0.0833       -6,2003


On next table, results of the application of the automatic language detection at the Mixed Monolingual task are
shown. Obviously, results are lower than previous, and give us an idea about how a mechanized system would
response. Accidentally, one erroneous topic numeration run was submitted, but later another run was made.
Finally, results are shown:

             Table 3: Mixed Monolingual with automatic language detection results per language
           Language Aver. at 1 Aver. at 5 Aver. at 10 Aver. at 20 Aver. at 50              MRR
              ES        0.1343     0.2612        0.3134       0.3582         0.4104       0.1995
             DA         0.0333     0.0667        0.0667       0.0667         0.0667       0.0500
             DE         0.0702     0.1053        0.1579       0.2105         0.2105       0.0942
             EN         0.0496     0.0744        0.0826       0.0826         0.0909       0.0614
             NL         0.0847     0.1017        0.1017       0.1186         0.1186       0.0943
              PT        0.0508     0.0847        0.1017       0.1525         0.2203       0.0656


4.2 Bilingual English-Spanish results
Clearly, results obtained at this task are influenced by the results of the Spanish Monolingual task and also by the
association of the three mentioned translators.

                                      Table 4: Bilingual English-Spanish task
        Aver. at 1     Aver. at 5     Aver. at 10 Aver. at 20 Aver. at 50           MRR             Dif
         0.0299         0.0522         0.0597         0.0746        0.0970         0.0395         -2,5028


5 Conclusions
In this paper we have presented the first version of our system at the Multilingual Web Track at CLEF. We have
targeted the Mixed Monolingual Task, concretely Spanish, Danish, Dutch, German, English and Portuguese
languages. At Spanish, we are above the average, whilst at other languages the system has a lower performance
(we have never worked before with Danish nor Dutch). More time would be desirable in order to finish the
whole system, and tuning it.
At the automatic language detection process, we lack of the need of a better language detector. The one used
here has been a fast developed attempt, but not perfect.

At the Bilingual English to Spanish task, the conclusion is clear: general purpose translator is not a good tool to
be used here, due to the fact that the retrieving collection is focused in a determined scope like governmental
processes are. Our 3-translator association works better than one translator in its own, but this is not the ideal
solution, and we consider the requirement of a specialized translator a must.
Finally, sometimes we have found that Keywords tags extracted from EuroGOV corpuss were adding noise to
the system, because some HTML document can have several governmental scope keywords. This is why they
are not working perfectly and getting in worse results.


6 Future works
A way to improve our proposed system in future would be to extend our Mixed Monolingual task in order to
include missing languages at this participation (Hungarian, Polish, French, Greek, Icelandic and Russian). Our
major lack here is the necessity of resources (stemmers, stopwords lists and so on).
Another good advance would be to experiment with hyperlinks of the HTML documents of EuroGOV
Collection, storing them and establishing some kind of relation between web pages. Also a little extraction of the
link text string can add more information to retrieve.
A way to progress in automatic processing with language identification phase would be improving the present
identifier in the way it could use n-grams, and some discriminatory and specific EuroGOV corpuss machine
learning language acquisition would be performed.
We aim to extend the system so that Multilingual task could be fully run on next WebCLEF participation. This
will require the extraction of language cues by a specific ad-hoc detector.


Acknowlegments
This work has been partially supported by the Spanish Goverment (CiCYT) with grant TIC2003-07158-C04-01
and also by the Regional Tecnology Ministry of Valencia Government by means of the projects with reference
GV04B-276 and GV04B-286.


References
[1] Llopis, F., Muñoz, R, Noguera, E., M. Terol, R. IR-n r2: Using normalized passages. CLEF 2004

[2] Callan, J. P.: Pasaje-Level Evidence in Document Retrieval. In Proceedings of the 17th Annual Internacional
Conference on Research and Development in Information Retrieval, London, UK. Springer Verlag (1994) 302-
310.

[3] WebCLEF. Cross-lingual web retrieval, 2005. http://ilps.science.uva.nl/webclef/

[4] WinEdt Dictionaries. http://www.winedt.org/Dict/

[5] Rafael M. Terol, Patricio Martínez-Barco, Fernando Llopis, Trinitario Martínez: An Application of NLP
Rules to Spoken Document Segmentation Task. NLDB 2005: 376-379

[6] M. Kaskziel and J. Zobel. Passage retrieval revisited. In Proceedings of the 20th annual International ACM Philadelphia
SIGIR, pages 178–185, 1997.

[7] Aitao Chen and Fredric C. Gey. Combining query translation and document translation in cross-language retrieval. In
Carol Peters, Julio Gonzalo, Martin Braschler, and et al., editors, 4th Workshop of the Cross-Language Evaluation Forum,
CLEF 2003, Lecture notes in Computer Science, pages 108–121, Trondheim, Norway, 2003. Springer-Verlag.
[8] Aitao Chen and Fredric C. Gey. Combining Query Translation and Document Translation in Cross-Language
Retrieval. 4th Workshop of the Cross-Language Evaluation Forum, CLEF 2003.108-121.

[9] Llopis F., Noguera E. Combining passages in monolingual experiments. In Workshop of Cross-Language Evaluation
Forum (CLEF 2005), In this volume, Vienna, Austria, 2005.