<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Web Track for CLEF2005 at ALICANTE UNIVERSITY</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Trinitario</forename><surname>Martínez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Software and Computing Systems</orgName>
								<orgName type="institution">University of Alicante</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Elisa</forename><surname>Noguera</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Software and Computing Systems</orgName>
								<orgName type="institution">University of Alicante</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rafael</forename><surname>Muñoz</surname></persName>
							<email>rafael@dlsi.ua.es</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Software and Computing Systems</orgName>
								<orgName type="institution">University of Alicante</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fernando</forename><surname>Llopis</surname></persName>
							<email>llopis@dlsi.ua.es</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Software and Computing Systems</orgName>
								<orgName type="institution">University of Alicante</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Web Track for CLEF2005 at ALICANTE UNIVERSITY</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4E6AAB91A7EF541155024DE040BFA027</idno>
					<note type="submission">Information Search and Retrieval</note>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T00:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Measurement</term>
					<term>Experimentation Information retrieval</term>
					<term>question answering</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents the first experiment done for the CLEF2005 Multilingual Web Track. At present conference we have focused our main effort in the Spanish part of the Mixed Monolingual task, but we have also participated in others several languages and in the Bilingual English-Spanish task. A passage based IR system is applied at retrieving phase. Also a language identifier has been created in order to build a full automatic system without the need of knowing the topic language.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The Cross Language Multilingual Web Retrieval (WebCLEF) track consists of the evaluation of Information Retrieval systems on noisy multilingual documents. Particularly, the WebCLEF document collection consists of webpages from European governmental sites for at least 10 languages/countries. Retrieving in a Multi/Crosslingual manner is a natural and common established way for carrying out web searches. The aim of this specific task is to find the correct document on which the topic description is. This paper is structured as follows: next section describes the collection and topics used, later we explain the corpora processing and retrieving. Afterwards we show the results and conclusions, and finally we make discuss about future improvements of the system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Processing phase</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data Specifications</head><p>The targeted corpus is a mix of governmental sites in Europe. More concretely, the collection, EuroGOV, consists of web documents crawled from European governmental sites. Here's a list of (top level) domains from which pages are included: at, be, cy, cz, de, dk, ee, es, fi, fr, gr, hu, ie, int, it, lt, lu, lv, mt, nl, pl, pt, ru, se, si, sk, uk</p><p>The amount of data is impressive: over 20 gigas of compressed text files containing diverse governmental information on multiple types, such us HTML, ZIP, DOC and PDF format. Documents are gathered in a pseudo-XML format, storing domain, url, id, md5 signature, type (html, doc, pdf…) and data (in binary or text format). This corpus has been very controversial, and finally just html documents were designed to be retrieved by organizers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Data Preprocessing</head><p>At our first participation in this kind of competitions, we have focused our efforts in Spanish monolingual queries, and have made some others symbolic approaches. We have divided the corpus by language. This is required in order to not managing the whole amount of data.</p><p>Once html files are extracted from the corpuss:</p><p>1. Firstly, META labels are collected from the files. Specifically, title and keywords labels are saved for the retrieval phase. 2. Second step consists of replacing HTML code by its equivalent, as for example "&amp;raquo;"ó"&gt;". 3. Thirdly, regular expressions are used in order to remove special tags, obtaining a plane text. 4. At the end of the process, id, keywords, title and plane text of each document are stored in sgml files in order to conform a correct input for the Information Retrieval system (Trec format).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>WebCLEF Processing</head><p>We also have developed a language identifier with the purpose of fully automating the Mixed Monolingual process.</p><p>In addition to this, we had built up one specific module to extract pdf, doc and zip files from EuroGOV, but this has not been used because organizers decided do not retrieving this files types. At this phase, we created 30 monolingual known-item topics (15 named-page and 15 home page topics) in Spanish.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Topic creation</head><p>We also detected identical or similar pages in the collection by the use of search engines, and also by manual searches through the corpuss in order to produce consistent and well-formed topics. Also an English translation of the topic statement is provided with the purpose of being used in the multilingual task. For example, if we have a topic with this title:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Presidente del gobierno</head><p>In the traduction, the Spanish adjective is added to make more precise the future search through the whole corpuss:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Spanish government president</head><p>We developed several topics with .PDF and .DOC files which were finally discarded by organizers because of some participants found some problems with these formats at extraction text task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Retrieving phase: IR-n system</head><p>IR-n is a passage retrieval system (RP). RP systems <ref type="bibr" target="#b5">[6]</ref> locate in contiguous fragment of text (passages) and boost IR field by proposing a set of solutions to tradicional IR systems common problems. One of the main advantages of these systems is that they allow us to determine not only if a document is relevant or not, but also the detection of the relevant part of the document. IR-n system uses the sentences as atoms with the aim to define the passages. The passages are usually composed of a fixed number of sentences. This number has a great dependency of the targeted collection. Furthermore, IRn system uses overlapping passages in order to avoid that some documents cannot be considered relevant if words of the question appear in adjacent passages.</p><p>For every language, the resources used were provided by the CLEF organizers (http://www.unine.ch/info/clef).</p><p>There are stemmers and stopword lists for all languages, with the lack of Danish and Dutch stemmers. IR-n system allows the use of distinct similarity measures (Ex. Okapi <ref type="bibr" target="#b6">[7]</ref>). This involves an advantage, so that, in each task is used the best similarity measure.</p><p>With the aim of being able to indexing the documents in html format, indexing module has been modified to consider the tags title and keywords. The words which are in these labels have more weight than the words of the rest of the document in order to increase the value of the documents which have words of the query in the labels than the rest one.</p><p>According to others IR systems, IR-n system uses different techniques of the query expansion. Previous researches <ref type="bibr" target="#b7">[8]</ref> have showed that the approaches get better results when they are based on passages and in the complete document.</p><p>Finally, this year for the adhoc task has been implemented a technique called combined passages <ref type="bibr" target="#b8">[9]</ref>. It applies fusion methods, which are used in multilingual tasks to combine results with the different size of passages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">WebCLEF Tasks</head><p>Although we have focused our brief in the Spanish competition, others languages have been taken into account. The targeted languages have been:</p><p>Mixed Monolingual task:</p><formula xml:id="formula_0">Ø Danish Ø Spanish Ø Dutch Ø German Ø English Ø Portuguese</formula><p>Bilingual task: Ø English -Spanish</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Mixed Monolingual task</head><p>For the monolingual task, topics have been divided by language so that they are individually processed by the system. The specific results are finally merged in a results file.</p><p>Note that other languages topics, like Hungarian, Polish, French, Greek, Icelandic and Russian where not taken into account because we have not resources of these languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1">Language identification</head><p>As a baseline run, we have developed a language detector in order to automatically distinguish the correct language of the topic. In particular, our language detector has this general bases:</p><p>Ø Dictionary based (joined dictionaries, specific per-language stopwords) Ø Characterised part-of-word terminology (i.e. "ção" in the case of Portuguese) Ø Specific governmental terminology (i.e. "administration" in the case of English)</p><p>This philosophy gave us a good response in Spanish, English, Portuguese and Danish. Lamentably, Dutch and German are too much similar, and the system becomes occasionally erroneous. We have not reliable experience with these languages.</p><p>Once language topics were identified, they were separately stored in different files and run with the specific part of the EuroGOV corpus. By this way, a faster response of the system is obtained than when whole corpus is taken. As statistical results, just to mention that the language identificator could not was capable of determinate the language of seven topics from the Spanish, English, Portuguese, Dutch, German and Danish set. The rest of languages (87 topics) were not taken into account because they were not later processed by the IR system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">BiEnEs task</head><p>The BiEnEs (Bilingual English-Spanish) task consists of carrying out searches in the Spanish corpuss of EuroGOV from topics written in English. Our automatic approach has been performed by a merging of three different on-line translators. The main idea is that the more common word is, the higher relevancy has. The used translators have been Freetranslator<ref type="foot" target="#foot_0">1</ref> , BabelFish<ref type="foot" target="#foot_1">2</ref> and InterTran<ref type="foot" target="#foot_2">3</ref> . An example of this is shown in next picture:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>English topics</head><p>FreeTranslator BabelFish InterTran Translated to Spanish topics</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Monolingual task results</head><p>In the process of our first experiment at WEBClef2005, we have focused on the Spanish Topics part of the Mixed Monolingual task. Also to mention Spanish Topics is the greater subset of the topic set. So, this is an important part of the task. We also have been doing experiments with other five languages: Danish, German, English, Dutch and Portuguese. On next table, averages at 1, 5, 10, 20 and 50 are shown, as the MRR too. The last column shows the difference between our system and the average results. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Bilingual English-Spanish results</head><p>Clearly, results obtained at this task are influenced by the results of the Spanish Monolingual task and also by the association of the three mentioned translators. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>In this paper we have presented the first version of our system at the Multilingual Web Track at CLEF. We have targeted the Mixed Monolingual Task, concretely Spanish, Danish, Dutch, German, English and Portuguese languages. At Spanish, we are above the average, whilst at other languages the system has a lower performance (we have never worked before with Danish nor Dutch). More time would be desirable in order to finish the whole system, and tuning it.</p><p>At the automatic language detection process, we lack of the need of a better language detector. The one used here has been a fast developed attempt, but not perfect.</p><p>At the Bilingual English to Spanish task, the conclusion is clear: general purpose translator is not a good tool to be used here, due to the fact that the retrieving collection is focused in a determined scope like governmental processes are. Our 3-translator association works better than one translator in its own, but this is not the ideal solution, and we consider the requirement of a specialized translator a must.</p><p>Finally, sometimes we have found that Keywords tags extracted from EuroGOV corpuss were adding noise to the system, because some HTML document can have several governmental scope keywords. This is why they are not working perfectly and getting in worse results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Future works</head><p>A way to improve our proposed system in future would be to extend our Mixed Monolingual task in order to include missing languages at this participation (Hungarian, Polish, French, Greek, Icelandic and Russian). Our major lack here is the necessity of resources (stemmers, stopwords lists and so on).</p><p>Another good advance would be to experiment with hyperlinks of the HTML documents of EuroGOV Collection, storing them and establishing some kind of relation between web pages. Also a little extraction of the link text string can add more information to retrieve. A way to progress in automatic processing with language identification phase would be improving the present identifier in the way it could use n-grams, and some discriminatory and specific EuroGOV corpuss machine learning language acquisition would be performed. We aim to extend the system so that Multilingual task could be fully run on next WebCLEF participation. This will require the extraction of language cues by a specific ad-hoc detector.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="4,70.80,280.08,424.56,205.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>Language identification stadistics</figDesc><table><row><cell>Language</cell><cell># Corrects</cell><cell># Incorrects</cell></row><row><cell>Spanish</cell><cell>134</cell><cell>2</cell></row><row><cell>English</cell><cell>117</cell><cell>2</cell></row><row><cell>Portuguese</cell><cell>56</cell><cell>0</cell></row><row><cell>Dutch</cell><cell>52</cell><cell>15</cell></row><row><cell>German</cell><cell>44</cell><cell>2</cell></row><row><cell>Danish</cell><cell>29</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Mixed Monolingual official results per language Language Aver. At 1 Aver. at 5 Aver. at 10 Aver. at 20 Aver.On next table, results of the application of the automatic language detection at the Mixed Monolingual task are shown. Obviously, results are lower than previous, and give us an idea about how a mechanized system would response. Accidentally, one erroneous topic numeration run was submitted, but later another run was made. Finally, results are shown:</figDesc><table><row><cell>at 50</cell><cell>MRR</cell><cell>Dif</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Mixed Monolingual with automatic language detection results per language Language Aver. at 1 Aver. at 5 Aver. at 10 Aver. at 20 Aver.</figDesc><table><row><cell>at 50</cell><cell>MRR</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc></figDesc><table><row><cell></cell><cell></cell><cell cols="3">: Bilingual English-Spanish task</cell><cell></cell><cell></cell></row><row><cell>Aver. at 1</cell><cell>Aver. at 5</cell><cell cols="3">Aver. at 10 Aver. at 20 Aver. at 50</cell><cell>MRR</cell><cell>Dif</cell></row><row><cell>0.0299</cell><cell>0.0522</cell><cell>0.0597</cell><cell>0.0746</cell><cell>0.0970</cell><cell>0.0395</cell><cell>-2,5028</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.freetranslation.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://world.altavista.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.tranexp.com/win/itserver.htm</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowlegments</head><p>This work has been partially supported by the Spanish Goverment (CiCYT) with grant TIC2003-07158-C04-01 and also by the Regional Tecnology Ministry of Valencia Government by means of the projects with reference GV04B-276 and GV04B-286.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">IR-n r2: Using normalized passages</title>
		<author>
			<persName><forename type="first">F</forename><surname>Llopis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Muñoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Noguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Terol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Pasaje-Level Evidence in Document Retrieval</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Callan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Annual Internacional Conference on Research and Development in Information Retrieval</title>
				<meeting>the 17th Annual Internacional Conference on Research and Development in Information Retrieval<address><addrLine>London, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="302" to="310" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Cross-lingual web retrieval</title>
		<author>
			<persName><surname>Webclef</surname></persName>
		</author>
		<ptr target="http://ilps.science.uva.nl/webclef/" />
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<ptr target="http://www.winedt.org/Dict/" />
		<title level="m">WinEdt Dictionaries</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An Application of NLP Rules to Spoken Document Segmentation Task</title>
		<author>
			<persName><forename type="first">M</forename><surname>Rafael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patricio</forename><surname>Terol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><surname>Martínez-Barco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">;</forename><surname>Llopis</surname></persName>
		</author>
		<author>
			<persName><surname>Martínez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NLDB</title>
		<imprint>
			<biblScope unit="volume">2005</biblScope>
			<biblScope unit="page" from="376" to="379" />
		</imprint>
	</monogr>
	<note>Trinitario</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Passage retrieval revisited</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kaskziel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20 th annual International ACM Philadelphia SIGIR</title>
				<meeting>the 20 th annual International ACM Philadelphia SIGIR</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="178" to="185" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Combining query translation and document translation in cross-language retrieval</title>
		<author>
			<persName><forename type="first">Aitao</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fredric</forename><forename type="middle">C</forename><surname>Gey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">4th Workshop of the Cross-Language Evaluation Forum, CLEF 2003</title>
		<title level="s">Lecture notes in Computer Science</title>
		<editor>
			<persName><forename type="first">Carol</forename><surname>Peters</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Julio</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Martin</forename><surname>Braschler</surname></persName>
		</editor>
		<meeting><address><addrLine>Trondheim, Norway</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="108" to="121" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Combining Query Translation and Document Translation in Cross-Language Retrieval</title>
		<author>
			<persName><forename type="first">Aitao</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fredric</forename><forename type="middle">C</forename><surname>Gey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">4th Workshop of the Cross-Language Evaluation Forum</title>
				<meeting><address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="108" to="121" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Combining passages in monolingual experiments</title>
		<author>
			<persName><forename type="first">F</forename><surname>Llopis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Noguera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop of Cross-Language Evaluation Forum (CLEF 2005)</title>
				<meeting><address><addrLine>Vienna, Austria</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
	<note>In this volume</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
