Web Track for CLEF2005 at ALICANTE UNIVERSITY

Trinitario Martínez, Department of Software and Computing Systems, University of Alicante

Spain

Elisa Noguera, Department of Software and Computing Systems, University of Alicante

Spain

Rafael Muñoz (rafael@dlsi.ua.es), Department of Software and Computing Systems, University of Alicante

Spain

Fernando Llopis (llopis@dlsi.ua.es), Department of Software and Computing Systems, University of Alicante

Spain

Keywords: Information Search and Retrieval; Measurement; Experimentation; Information retrieval; question answering

This paper presents our first experiments for the CLEF 2005 Multilingual Web Track. In this edition we focused our main effort on the Spanish part of the Mixed Monolingual task, but we also participated in several other languages and in the Bilingual English-Spanish task. A passage-based IR system is applied in the retrieval phase. In addition, a language identifier was created in order to build a fully automatic system that does not need to know the topic language in advance.

Introduction

The Cross-Language Multilingual Web Retrieval (WebCLEF) track evaluates Information Retrieval systems on noisy multilingual documents. In particular, the WebCLEF document collection consists of web pages from European governmental sites covering at least 10 languages/countries. Multilingual and cross-lingual retrieval is a natural and well-established way of carrying out web searches. The aim of this specific task is to find the correct document that the topic describes. This paper is structured as follows: the next section describes the collection and topics used, and then we explain how the corpus is processed and searched. Afterwards we present the results and conclusions, and finally we discuss future improvements to the system.

Processing phase

Data Specifications

The target corpus is a mix of European governmental sites. More concretely, the collection, EuroGOV, consists of web documents crawled from European governmental sites. Pages are included from the following (top-level) domains: at, be, cy, cz, de, dk, ee, es, fi, fr, gr, hu, ie, int, it, lt, lu, lv, mt, nl, pl, pt, ru, se, si, sk, uk

The amount of data is considerable: over 20 gigabytes of compressed text files containing diverse governmental information in multiple formats, such as HTML, ZIP, DOC and PDF. Documents are gathered in a pseudo-XML format that stores the domain, URL, id, MD5 signature, type (html, doc, pdf…) and data (in binary or text format). The corpus was the subject of considerable discussion, and in the end the organizers decided that only HTML documents would be retrieval targets.

Data Preprocessing

As this is our first participation in this kind of competition, we focused our efforts on Spanish monolingual queries and made only token attempts at other languages. We divided the corpus by language, so that we did not have to handle the whole amount of data at once.

Once the HTML files are extracted from the corpus:

1. First, META tags are collected from the files. Specifically, the title and keywords tags are saved for the retrieval phase.
2. The second step replaces HTML character codes with their equivalent characters, for example "»" or ">".
3. Third, regular expressions are used to remove the remaining tags, yielding plain text.
4. At the end of the process, the id, keywords, title and plain text of each document are stored in SGML files, which form a valid input for the Information Retrieval system (TREC format).
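The four steps above can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the module actually used: the regular expressions, the function names `clean` and `to_trec`, and the SGML field names (`DOCNO`, `TITLE`, `KEYWORDS`, `TEXT`) are hypothetical, and the keywords pattern only handles the simple case where `name` precedes `content`. Tags are stripped before character references are decoded so that an escaped "&lt;" is not mistaken for markup.

```python
import html
import re

# Hypothetical patterns for this sketch; real pages need more robust parsing.
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_KEYWORDS = re.compile(
    r'<meta\s+name=["\']keywords["\']\s+content=["\'](.*?)["\']',
    re.IGNORECASE | re.DOTALL,
)
SCRIPT_STYLE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG = re.compile(r"<[^>]+>")
SPACES = re.compile(r"\s+")


def clean(fragment: str) -> str:
    """Remove tags, decode character references, collapse whitespace."""
    text = TAG.sub(" ", fragment)
    text = html.unescape(text)
    return SPACES.sub(" ", text).strip()


def to_trec(doc_id: str, page: str) -> str:
    """Build one TREC-style SGML record from a raw HTML page."""
    title_match = TITLE.search(page)
    keywords_match = META_KEYWORDS.search(page)
    title = clean(title_match.group(1)) if title_match else ""
    keywords = clean(keywords_match.group(1)) if keywords_match else ""
    # Drop scripts, styles and the title so they do not leak into the body.
    body = TITLE.sub(" ", SCRIPT_STYLE.sub(" ", page))
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        f"<TITLE>{title}</TITLE>\n"
        f"<KEYWORDS>{keywords}</KEYWORDS>\n"
        f"<TEXT>{clean(body)}</TEXT>\n"
        "</DOC>\n"
    )
```

Calling `to_trec` once per extracted page and concatenating the records gives a file that a TREC-style indexer can read.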

WebCLEF Processing

We also developed a language identifier with the purpose of fully automating the Mixed Monolingual process.

In addition, we built a specific module to extract PDF, DOC and ZIP files from EuroGOV, but it was not used because the organizers decided not to retrieve these file types.

Topic creation

At this phase, we created 30 monolingual known-item topics (15 named-page and 15 home-page topics) in Spanish. To produce consistent and well-formed topics, we also detected identical or similar pages in the collection, both with search engines and through manual searches of the corpus. An English translation of each topic statement is also provided for use in the multilingual task. For example, for a topic with this title:

Presidente del gobierno

In the translation, the adjective "Spanish" is added to make the later search through the whole corpus more precise:

Spanish government president

We developed several topics targeting PDF and DOC files, which were finally discarded by the organizers because some participants had problems extracting text from these formats.

Retrieving phase: IR-n system

IR-n is a passage retrieval (PR) system. PR systems [6] work on contiguous fragments of text (passages) and improve IR by offering solutions to common problems of traditional IR systems. One of their main advantages is that they make it possible to determine not only whether a document is relevant, but also which part of the document is relevant. The IR-n system uses sentences as the atomic units from which passages are defined. Passages are usually composed of a fixed number of sentences, and this number depends heavily on the target collection. Furthermore, the IR-n system uses overlapping passages, so that a document is not missed when the words of the question fall into adjacent passages.
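The idea of overlapping, sentence-based passages can be illustrated with a short Python sketch. It is not the IR-n implementation: the naive sentence splitter, the function names and the `size` and `step` parameters are assumptions made only for this example.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def overlapping_passages(
    sentences: List[str], size: int = 3, step: int = 1
) -> List[List[str]]:
    """Slide a window of `size` sentences, moving `step` sentences each time.

    With step < size, consecutive passages share sentences, so query words
    that fall in neighbouring sentences still meet inside some passage.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    if not sentences:
        return []
    if len(sentences) <= size:
        return [list(sentences)]
    last_start = len(sentences) - size
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the final sentences are covered
    return [sentences[i:i + size] for i in starts]
```

Each passage can then be scored against the query, and a document can be ranked by its best-scoring passage.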

For every language, the resources used were provided by the CLEF organizers (http://www.unine.ch/info/clef).

Stemmers and stopword lists are available for all languages, except that there are no Danish or Dutch stemmers. The IR-n system allows the use of different similarity measures (e.g. Okapi [7]). This is an advantage, because the best similarity measure can be used in each task.

To be able to index documents in HTML format, the indexing module was modified to take the title and keywords tags into account. Words that appear in these tags are given more weight than words in the rest of the document, so that documents containing query words in these tags score higher than the rest.

Like other IR systems, the IR-n system uses several query expansion techniques. Previous research [8] has shown that these approaches obtain better results when they are based on passages and on the complete document.

Finally, this year a technique called combined passages [9] has been implemented for the ad hoc task. It applies fusion methods, which are normally used in multilingual tasks, to combine the results obtained with different passage sizes.
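One simple way to give title and keywords words more weight is to count each occurrence in those fields with a multiplier greater than one. The sketch below assumes this scheme; the function name and the weight values are hypothetical and are not the ones used by IR-n.

```python
import re
from collections import Counter
from typing import Dict


def weighted_term_counts(
    title: str,
    keywords: str,
    body: str,
    title_weight: float = 2.0,     # hypothetical value
    keywords_weight: float = 2.0,  # hypothetical value
) -> Dict[str, float]:
    """Term counts where title and keywords occurrences count extra."""
    counts: Counter = Counter()
    fields = ((title, title_weight), (keywords, keywords_weight), (body, 1.0))
    for text, weight in fields:
        for term in re.findall(r"\w+", text.lower()):
            counts[term] += weight
    return dict(counts)
```

The resulting counts can replace raw term frequencies in whatever similarity measure is applied afterwards.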

WebCLEF Tasks

Although we focused our work on the Spanish competition, other languages were also taken into account. The target languages were:

Mixed Monolingual task:

- Danish
- Spanish
- Dutch
- German
- English
- Portuguese

Bilingual task:

- English-Spanish

Mixed Monolingual task

For the monolingual task, topics are divided by language so that they can be processed individually by the system. The per-language results are finally merged into a single results file.

Note that topics in other languages, such as Hungarian, Polish, French, Greek, Icelandic and Russian, were not taken into account because we have no resources for these languages.

Language identification

As a baseline run, we developed a language detector in order to automatically determine the language of each topic. In particular, our language detector rests on these general bases:

- Dictionaries (joined dictionaries, specific per-language stopwords)
- Characteristic parts of words (e.g. "ção" in the case of Portuguese)
- Specific governmental terminology (e.g. "administration" in the case of English)

This approach gave good results for Spanish, English, Portuguese and Danish. Unfortunately, Dutch and German are too similar, and the system is occasionally wrong for them. We have no reliable experience with these languages.

Once the topic languages were identified, the topics were stored in separate files and run against the corresponding part of the EuroGOV corpus. In this way, the system responds faster than when the whole corpus is searched. As for statistics, the language identifier was unable to determine the language of seven topics from the Spanish, English, Portuguese, Dutch, German and Danish set. The topics in the remaining languages (87 topics) were not taken into account because they were not later processed by the IR system.
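A detector built on these three kinds of evidence could look like the following Python sketch. It covers only three languages, and the word lists, substrings, score values and tie rule are illustrative assumptions rather than the resources used in our runs.

```python
import re
from collections import Counter
from typing import Optional

# Tiny illustrative resources (assumptions), not the real dictionaries.
STOPWORDS = {
    "es": {"el", "la", "del", "los", "las", "y", "en"},
    "en": {"the", "of", "and", "in", "to", "for"},
    "pt": {"do", "da", "dos", "das", "em", "uma"},
}
WORD_PARTS = {"es": ("ción",), "en": ("tion",), "pt": ("ção", "ões")}
DOMAIN_TERMS = {
    "es": {"ministerio"},
    "en": {"administration"},
    "pt": {"ministério"},
}


def identify_language(topic: str) -> Optional[str]:
    """Return the best-scoring language code, or None if undecided."""
    words = re.findall(r"\w+", topic.lower())
    scores: Counter = Counter()
    for lang in STOPWORDS:
        for word in words:
            if word in STOPWORDS[lang]:
                scores[lang] += 1
            if word in DOMAIN_TERMS[lang]:
                scores[lang] += 2
            if any(part in word for part in WORD_PARTS[lang]):
                scores[lang] += 2
    ranked = scores.most_common(2)
    if not ranked:
        return None  # no evidence at all
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two best languages
    return ranked[0][0]
```

Returning `None` for ties and for topics without evidence mirrors the case where the language of a topic cannot be determined.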

BiEnEs task

The BiEnEs (Bilingual English-Spanish) task consists of searching the Spanish part of the EuroGOV corpus with topics written in English. Our automatic approach merges the output of three different on-line translators. The main idea is that the more common a word is, the higher its relevance. The translators used were FreeTranslator¹, BabelFish² and InterTran³. An example is shown in the following figure:

[Figure: English topics are sent to FreeTranslator, BabelFish and InterTran, and their outputs are merged into the translated Spanish topics.]
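The merging step can be sketched as a vote among the translations: a word proposed by more translators receives a higher weight. The Python code below is one possible reading of that idea, not the exact procedure used in our runs. The function name is hypothetical, the translations are passed in as plain strings, and no translator service is called.

```python
import re
from collections import Counter
from typing import List, Tuple


def merge_translations(translations: List[str]) -> List[Tuple[str, int]]:
    """Weight each word by the number of translations that contain it.

    Returns (word, weight) pairs sorted by descending weight and then
    alphabetically, so the most widely agreed words come first.
    """
    votes: Counter = Counter()
    for text in translations:
        # A set makes each translator vote at most once per word.
        votes.update(set(re.findall(r"\w+", text.lower())))
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))
```

The weights can then be used to boost the agreed words in the Spanish query that is sent to the retrieval system.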

Results

Monolingual task results

In our first experiment at WebCLEF 2005, we focused on the Spanish topics of the Mixed Monolingual task. Spanish topics are also the largest subset of the topic set, so this is an important part of the task. We also ran experiments with five other languages: Danish, German, English, Dutch and Portuguese. Table 2 shows the averages at 1, 5, 10, 20 and 50, as well as the MRR. The last column shows the difference between our system and the average results.
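For readers unfamiliar with these measures, the Python sketch below shows how they can be computed for known-item topics. It assumes that "average at N" is the fraction of topics whose target page appears in the first N results, and that a topic whose target page is not retrieved contributes zero to the MRR. The function names are ours.

```python
from typing import List, Optional


def success_at(ranks: List[Optional[int]], n: int) -> float:
    """Fraction of topics whose target page is ranked in the top n."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= n)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Mean of 1/rank over all topics; unretrieved targets count as 0."""
    total = sum(1.0 / rank for rank in ranks if rank is not None)
    return total / len(ranks)
```

Here `ranks` holds, for each topic, the 1-based position of the correct page in the result list, or `None` when the page was not retrieved.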

Bilingual English-Spanish results

Clearly, the results obtained in this task are influenced by the results of the Spanish Monolingual task and also by the combination of the three translators mentioned above.

Conclusions

In this paper we have presented the first version of our system for the Multilingual Web Track at CLEF. We targeted the Mixed Monolingual task, specifically the Spanish, Danish, Dutch, German, English and Portuguese languages. For Spanish we are above the average, whilst for the other languages the system performs worse (we had never worked with Danish or Dutch before). More time would be needed to finish and tune the whole system.

For the automatic language detection process, we need a better language detector. The one used here was a quickly developed attempt and is not perfect.

For the Bilingual English-to-Spanish task, the conclusion is clear: a general-purpose translator is not a good tool here, because the retrieval collection is focused on a specific domain, namely governmental processes. Our combination of three translators works better than a single translator on its own, but it is not the ideal solution, and we consider a specialized translator a must.

Finally, we sometimes found that the keywords tags extracted from the EuroGOV corpus added noise to the system, because a single HTML document can carry keywords from several governmental areas. This is why they do not work perfectly and lead to worse results.

Future work

One way to improve the proposed system in the future would be to extend our Mixed Monolingual runs to include the languages missing from this participation (Hungarian, Polish, French, Greek, Icelandic and Russian). Our main obstacle here is the lack of resources (stemmers, stopword lists and so on).

Another useful advance would be to experiment with the hyperlinks of the HTML documents in the EuroGOV collection, storing them and establishing some kind of relation between web pages. Extracting the link text could also add more information for retrieval. Automatic processing in the language identification phase could be improved by making the identifier use n-grams and by applying machine learning to acquire discriminative language cues specific to the EuroGOV corpus. We aim to extend the system so that the Multilingual task can be run in full at the next WebCLEF participation. This will require the extraction of language cues by a specific ad hoc detector.

Table 1: Language identification statistics

Language      # Correct   # Incorrect
Spanish       134         2
English       117         2
Portuguese    56          0
Dutch         52          15
German        44          2
Danish        29          1

Table 2: Mixed Monolingual official results per language

Language   Aver. at 1   Aver. at 5   Aver. at 10   Aver. at 20   Aver. at 50   MRR   Dif

The next table shows the results of applying automatic language detection to the Mixed Monolingual task. As expected, the results are lower than the previous ones, and they give an idea of how a fully automated system would respond. One run was accidentally submitted with erroneous topic numbering, but a corrected run was made later.

Table 3: Mixed Monolingual with automatic language detection, results per language

Language   Aver. at 1   Aver. at 5   Aver. at 10   Aver. at 20   Aver. at 50   MRR

Table 4: Bilingual English-Spanish task

Aver. at 1   Aver. at 5   Aver. at 10   Aver. at 20   Aver. at 50   MRR      Dif
0.0299       0.0522       0.0597        0.0746        0.0970        0.0395   -2,5028

¹ http://www.freetranslation.com/
² http://world.altavista.com/
³ http://www.tranexp.com/win/itserver.htm

Acknowledgments

This work has been partially supported by the Spanish Government (CICYT) with grant TIC2003-07158-C04-01 and also by the Regional Technology Ministry of the Valencia Government by means of the projects with reference GV04B-276 and GV04B-286.

References

[1] F. Llopis, R. Muñoz, E. Noguera, R. M. Terol. IR-n r2: Using normalized passages. CLEF 2004.

[2] J. P. Callan. Passage-Level Evidence in Document Retrieval. Proceedings of the 17th Annual International Conference on Research and Development in Information Retrieval, London, UK. Springer Verlag, 1994.

[3] WebCLEF 2005. Cross-lingual web retrieval.

[4] WinEdt Dictionaries.

[5] R. M. Terol, P. Martínez-Barco, F. Llopis, T. Martínez. An Application of NLP Rules to Spoken Document Segmentation Task. NLDB 2005.

[6] M. Kaszkiel, J. Zobel. Passage retrieval revisited. Proceedings of the 20th Annual International ACM SIGIR, Philadelphia, 1997.

[7] A. Chen, F. C. Gey. Combining query translation and document translation in cross-language retrieval. In C. Peters, J. Gonzalo, M. Braschler (eds.), 4th Workshop of the Cross-Language Evaluation Forum, CLEF 2003, Lecture Notes in Computer Science, Trondheim, Norway. Springer-Verlag, 2003.

[8] A. Chen, F. C. Gey. Combining Query Translation and Document Translation in Cross-Language Retrieval. 4th Workshop of the Cross-Language Evaluation Forum, CLEF 2003.

[9] F. Llopis, E. Noguera. Combining passages in monolingual experiments. Workshop of Cross-Language Evaluation Forum (CLEF 2005), Vienna, Austria, 2005. In this volume.