<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Overview of WebCLEF 2005</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Börkur</forename><surname>Sigurbjörnsson</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Informatics Institute</orgName>
								<orgName type="institution">University of Amsterdam</orgName>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Jaap</forename><surname>Kamps</surname></persName>
							<email>kamps@science.uva.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Informatics Institute</orgName>
								<orgName type="institution">University of Amsterdam</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Archives and Information Studies</orgName>
								<orgName type="institution">University of Amsterdam</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maarten</forename><surname>De Rijke</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Informatics Institute</orgName>
								<orgName type="institution">University of Amsterdam</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Overview of WebCLEF 2005</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">DEBD020CB614515C38AA029CFF6D9553</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T00:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing</term>
					<term>H.3.3 Information Search and Retrieval</term>
					<term>H.3.4 Systems and Software</term>
					<term>H.3.7 Digital Libraries</term>
					<term>H.2.3 [Database Management]: Languages-Query Languages</term>
					<term>Measurement, Performance, Experimentation</term>
					<term>Web retrieval, Known-item retrieval, Multilingual retrieval</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We describe WebCLEF, the multilingual web track that was introduced at CLEF 2005. We provide details of the tasks, the topics, and the results of WebCLEF participants. The mixed monolingual task proved an interesting addition to the range of tasks in cross-language information retrieval. Although it may be too early to talk about a solved problem, effective web retrieval techniques seem to carry over to this particular multilingual setting. The multilingual task, in contrast, is still very far from being a solved problem. Remarkably, using non-translated English queries proved more successful than using translations of the English queries.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The world wide web is a natural setting for cross-lingual information retrieval; web content is essentially multilingual, and web searchers are often polyglots. Even though English has emerged as the lingua franca of the web, planning for a business trip or holiday usually involves digesting pages in a foreign language. The same holds for searching for information about European culture, education, sports, economy, or politics. To evaluate systems that address multilingual information needs on the web, a new multilingual web track, called WebCLEF, has been set up as part of CLEF 2005.</p><p>Three tasks were organized within this year's WebCLEF track: mixed monolingual, multilingual, and bilingual English to Spanish, with 242 homepage and 305 named page finding queries for the first two tasks, and 67 homepage and 67 named page finding queries for the third task. All topics, and the accompanying assessments, were created by the participants in the WebCLEF track. In total, 11 teams submitted 61 runs for the three tasks.</p><p>The main findings of the WebCLEF track in 2005 are the following. The mixed monolingual task proved an interesting addition to the range of tasks in cross-language information retrieval. 
Although it may be too early to talk about a solved problem, effective web retrieval techniques seem to carry over to this particular multilingual setting. The multilingual task, in contrast, is still very far from being a solved problem. Remarkably, using non-translated English queries proved more successful than using translations of the English queries.</p><p>The remainder of the paper is organized as follows. In Section 2 we describe the WebCLEF 2005 track in more detail. Section 3 is devoted to a description of the runs submitted by the participants, while the results are presented in Section 4. We conclude in Section 5.</p><p>&lt;topic&gt;
  &lt;num&gt;WC0005&lt;/num&gt;
  &lt;title&gt;Minister van buitenlandse zaken&lt;/title&gt;
  &lt;metadata&gt;
    &lt;topicprofile&gt;
      &lt;language language="NL"/&gt;
      &lt;translation language="EN"&gt;dutch minister of foreign affairs&lt;/translation&gt;
    &lt;/topicprofile&gt;
    &lt;targetprofile&gt;
      &lt;language language="NL"/&gt;
      &lt;domain domain="nl"/&gt;
    &lt;/targetprofile&gt;
    &lt;userprofile&gt;
      &lt;native language="IS"/&gt;
      &lt;active language="EN"/&gt;
      &lt;active language="DA"/&gt;
      &lt;active language="NL"/&gt;
      &lt;passive language="NO"/&gt;
      &lt;passive language="SV"/&gt;
      &lt;passive language="DE"/&gt;
      &lt;passive_other&gt;Faroese&lt;/passive_other&gt;
      &lt;countryofbirth country="IS"/&gt;
      &lt;countryofresidence country="NL"/&gt;
    &lt;/userprofile&gt;
  &lt;/metadata&gt;
&lt;/topic&gt;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">The Retrieval Tasks</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Collection</head><p>For the purposes of the WebCLEF track a new corpus, called EuroGOV, has been developed. EuroGOV is a crawl of European government-related sites, where collection building is less restricted by intellectual property rights. It is a multilingual web corpus, which contains over 3.5 million pages from 27 primary domains, covering over twenty languages. There is no single language that dominates the corpus, and its linguistic diversity provides a natural setting for multilingual web search. We refer to <ref type="bibr" target="#b1">[2]</ref> for further details on EuroGOV.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Topics</head><p>Topic development was in the hands of the participating groups. Each group was expected to create at least 30 monolingual known-item topics: 15 homepage and 15 named page topics. Homepage topics name a site that the user wants to reach, and named page topics concern non-homepages that the user wants to reach. The track organizers assigned languages to groups based on their location and the language expertise available within the group. For each topic, topic creators were instructed to detect identical or similar pages in the collection, both in the language of the target page and in other languages. Many European governmental sites provide translations of (some of) their web pages in a small number of languages, e.g., in additional official languages (if applicable), in languages of some neighboring countries, and/or in English.</p><p>In addition, participants provided English translations of their topics. The topic authors were also asked to fill out a form where they provided various types of metadata, including their language knowledge, birth place and residence. This information was used to augment the topics with additional metadata. Figure <ref type="figure" target="#fig_0">1</ref> provides an example of the topic format used at WebCLEF 2005. The track organizers reviewed the topics, suggested improvements, and selected the final set of topics.</p><p>As few participants had facilities to search the EuroGOV collection during the topic development phase, the organizers provided a Lucene-based search engine for the collection, and the University of Glasgow provided access to the collection through Terrier, for which we are very grateful. 
Both search engines were at a proof-of-concept level only and were not specially adapted for the task.</p><p>Table <ref type="table" target="#tab_0">1</ref>, column 3, shows a summary of the number of topics submitted by each participating team. The WebCLEF 2005 topic set contained 547 topics: 242 homepage topics and 305 named page topics. The target pages were in 11 different languages: Spanish (ES), English (EN), Dutch (NL), Portuguese (PT), German (DE), Hungarian (HU), Danish (DA), Russian (RU), Greek (EL), Icelandic (IS), and French (FR). Since topic development depended on language knowledge within participating groups, the distribution between languages in the test set varies considerably. Table <ref type="table" target="#tab_1">2</ref> provides more detailed statistics of the WebCLEF 2005 topic set.</p><p>During topic development, topic authors were asked to try to identify duplicates and translations of the target page. Table <ref type="table" target="#tab_1">2</ref> shows the number of duplicates/translations available. We list both the number of topics having a duplicate/translation and also the total count of duplicates/translations. The category Readable trans. refers to the number of translations whose language matches the language knowledge identified by the user. The number of translations naturally varies from one domain to another. As an example, the target pages of 78 topics were located in the eu.int domain (14% of the topics), and those pages have 232 translations (60% of identified translations). The identification of translations is a difficult and labor-intensive process. Due to a lack of resources we have not been able to verify the completeness of duplicate/translation identification. This must be taken into account when interpreting results using the duplicate/translation information.</p></div>
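<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how the topic metadata might be consumed, the following Python sketch reads a topic in the format of Figure 1 using the standard xml.etree.ElementTree module. It is a minimal sketch under stated assumptions: the helper name parse_topic and the dictionary keys are hypothetical and not part of any WebCLEF tooling, and the user profile in the example topic is abbreviated.</p>
<p><![CDATA[
```python
import xml.etree.ElementTree as ET

# Example topic in the format of Figure 1 (user profile abbreviated).
TOPIC_XML = """
<topic>
  <num>WC0005</num>
  <title>Minister van buitenlandse zaken</title>
  <metadata>
    <topicprofile>
      <language language="NL"/>
      <translation language="EN">dutch minister of foreign affairs</translation>
    </topicprofile>
    <targetprofile>
      <language language="NL"/>
      <domain domain="nl"/>
    </targetprofile>
    <userprofile>
      <native language="IS"/>
      <active language="EN"/>
      <active language="DA"/>
      <active language="NL"/>
      <passive language="NO"/>
      <passive language="SV"/>
      <passive language="DE"/>
    </userprofile>
  </metadata>
</topic>
"""


def parse_topic(xml_text):
    """Return the fields of one topic as a plain dictionary (hypothetical helper)."""
    topic = ET.fromstring(xml_text.strip())
    meta = topic.find("metadata")
    user = meta.find("userprofile")
    return {
        "id": topic.findtext("num"),
        # The title field is the query in the topic's own language.
        "title": topic.findtext("title"),
        # The English translation of the topic statement.
        "english": meta.findtext("topicprofile/translation"),
        "topic_language": meta.find("topicprofile/language").get("language"),
        "page_language": meta.find("targetprofile/language").get("language"),
        "page_domain": meta.find("targetprofile/domain").get("domain"),
        "native": [e.get("language") for e in user.findall("native")],
        "active": [e.get("language") for e in user.findall("active")],
        "passive": [e.get("language") for e in user.findall("passive")],
    }


topic = parse_topic(TOPIC_XML)
print(topic["id"], topic["page_domain"], topic["active"])
```
]]></p>
<p>Keeping the native, active and passive languages as separate lists makes it straightforward to decide later which translations of a target page count as readable for the topic author.</p></div>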
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Tasks</head><p>Due to limited resources for evaluation, all tasks at WebCLEF 2005 were restricted to known-item searches. The following tasks were organized for WebCLEF 2005.</p><p>• Mixed-Monolingual: The mixed-monolingual task is meant to simulate a user searching for a known-item page in a European language. The mixed-monolingual task used the title field of the topics to create a set of monolingual known-item topics.</p><p>• Multilingual: The multilingual task is meant to simulate a user looking for a certain known-item page in a particular European language. The user, however, uses English to formulate her query. The multilingual task used the English translations of the original topic statements.</p><p>• Bilingual English to Spanish: For this task a special topic set was used. It contained a reviewed translation of the Spanish topics. The reviewed and revised translations were provided by the NLP group at UNED, for which we are very grateful.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Submission</head><p>For each of the tasks, teams were allowed to submit up to 5 runs. Each run could contain 50 results for each topic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5">Evaluation</head><p>Since each named page (NP) and homepage (HP) topic is developed with a URL in mind, the only judging task is to identify URLs of equivalent (near-duplicate or translated) pages. As described previously, this task was carried out during the topic development phase.</p><p>From the assessments obtained during the topic development stage we are able to define a number of qrel sets, including the following.</p><p>• Monolingual: This set of qrels contains, for each topic, the target page and all its duplicates.</p><p>• Multilingual: This set of qrels contains, for each topic, the target page, its duplicates and all its translations.</p><p>• User readable: This set of qrels contains, for each topic, the target, all its duplicates, and all translations which are in a language that the topic author marked as her native/active/passive language.</p><p>Each of these qrel sets can be further divided into subsets based on the language of the topic or the domain of the target page. In this report we will only use the language-based subsets.</p><p>The main metric used for evaluation was mean reciprocal rank (MRR). </p></div>
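<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reciprocal rank of a topic is 1/r, where r is the rank of the first relevant page in the result list (and 0 if no relevant page is returned); MRR is the mean of this value over all topics. The following Python sketch illustrates the computation against a monolingual and a multilingual qrel set. The function names, topic identifiers and URLs are invented for illustration and do not come from the track's evaluation scripts.</p>
<p><![CDATA[
```python
def reciprocal_rank(ranked_urls, relevant_urls):
    """Return 1/rank of the first relevant URL, or 0.0 if none is retrieved."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url in relevant_urls:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(run, qrels):
    """Average the reciprocal rank over all topics in the qrels.

    Topics that are missing from the run contribute 0.
    """
    total = sum(reciprocal_rank(run.get(t, []), rel) for t, rel in qrels.items())
    return total / len(qrels)


# Hypothetical assessments for two topics (all URLs are made up).
target = {"T1": {"nl/a"}, "T2": {"es/b"}}
duplicates = {"T1": {"nl/a-copy"}, "T2": set()}
translations = {"T1": {"en/a"}, "T2": {"en/b"}}

# Monolingual qrels: target page plus duplicates.
monolingual = {t: target[t] | duplicates[t] for t in target}
# Multilingual qrels: monolingual set plus all translations.
multilingual = {t: monolingual[t] | translations[t] for t in target}

# A run maps each topic to a ranked list of result URLs.
run = {"T1": ["en/a", "nl/a-copy", "x"], "T2": ["y", "en/b"]}

mrr_mono = mean_reciprocal_rank(run, monolingual)
mrr_multi = mean_reciprocal_rank(run, multilingual)
print(mrr_mono, mrr_multi)  # 0.25 0.75
```
]]></p>
<p>Because the multilingual set is a superset of the monolingual one, the score of a run under the multilingual qrels can never be lower than its score under the monolingual qrels.</p></div>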
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Submitted Runs</head><p>Table <ref type="table" target="#tab_0">1</ref> shows a summary of the number of runs submitted by each team. The mixed-monolingual task was the most popular task with 34 runs submitted by 9 teams; Table <ref type="table" target="#tab_2">3</ref> provides details of the runs submitted. The multilingual task was the second most popular task with 19 runs submitted by 4 teams; the details are given in Table <ref type="table" target="#tab_3">4</ref>. For the bilingual English to Spanish task, 8 runs were submitted by 3 teams; consult Table <ref type="table" target="#tab_4">5</ref> for details.</p><p>We will now provide an overview of features used by the participating teams. We divide the overview into three parts: web-specific, linguistic, and cross-lingual features.</p><p>The teams used a wide variety of web-based features. Many teams indexed titles separately: Hummingbird, Miracle, U. Alicante, U. Glasgow, U. Indonesia, and U. Salamanca. A few teams also built special indexes for other HTML tags: Hummingbird, Miracle, and UNED. Several teams used a separate index for anchor text: Melange, U. Glasgow, and U. Salamanca. Miracle also built an index for URL text. Hummingbird, U. Glasgow and U. Salamanca used URL length in their ranking. PageRank was used by Melange and U. Salamanca. Neither U. Amsterdam (ILPS) nor U. Hildesheim used any web-specific features.</p><p>The teams also used a wide variety of linguistic features. Language specific stemming was performed by a number of teams: Hummingbird, Melange, U. Alicante, and U. Glasgow. U. Amsterdam (ILPS) limited themselves to simple accent normalization, but did do an ASCII transliteration for Russian. Miracle extracted proper nouns and keywords and indexed those separately. U. Hildesheim experimented with character tri-grams. U. Indonesia did not use any language specific features. U. Salamanca applied a special stemmer for Spanish.</p><p>In the multilingual task, two different techniques were used by participating groups to bridge the gap between the query language (English) and the target page language. Neither U. Hildesheim nor Miracle used any translation; that is, both teams simply used the English version of the topics. 
Both ILPS and Melange used an on-line translator.</p><p>In the bilingual English to Spanish task, two different approaches were used to translate the English queries to Spanish. UNED used an English to Spanish dictionary, whereas BUAP and U. Alicante used on-line translators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Mixed-Monolingual Task</head><p>First we look at each team's best scoring baseline run. Figure <ref type="figure" target="#fig_1">2</ref> (left) shows the scores of the 5 best scoring teams. The left-most point shows the MRR over all topics. The successive points to the right show MRR scores for a subset of the topics: one for each language. The languages are sorted by the number of topics: from Spanish (ES) with the most topics (134) to French (FR) with only one topic. Now, let's look at each team's best scoring run, independent of whether it was a baseline run or used some of the topic metadata. Figure <ref type="figure" target="#fig_1">2</ref> (right) shows the scores of the 5 best scoring teams. Of the top scoring teams, only U. Amsterdam (ILPS) uses no metadata.</p><p>Observe that, for each of the top five scoring runs, there is a considerable amount of variation across languages. For some languages the "hardness" seems independent of systems. Most systems score relatively high for Dutch and relatively low for Russian and Greek, while the score for German is close to their average score. The difference in performance between languages is only partially caused by the "hardness" of the particular language. Since the topics are not the same across languages, the "hardness" of the topics may also play a role.</p><p>Let's turn to the use of metadata now. The highest scoring runs are ones that use metadata. No team used user metadata; information about the domain of the target page proved to be the most popular type of metadata, and using it to restrict retrieval systems' outputs seems to be a sensible strategy, as is witnessed by the fact that it's the only type of metadata that each of the 5 top ranking runs uses.</p><p>Finally, for many runs, there is a clear gap between scores for NPs and HPs, with the named page queries scoring higher than the home page queries. 
For the best scoring runs, the two types of known-item topic perform comparably. This phenomenon is illustrated in Figure <ref type="figure" target="#fig_2">3</ref>, and mirrors a similar phenomenon at TREC's web track in 2003 and 2004 <ref type="bibr" target="#b0">[1]</ref>.</p></div>
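<div xmlns="http://www.tei-c.org/ns/1.0"><p>The participants' papers describe how each team actually exploited the page domain metadata. Purely as an illustration of what restricting a system's output to the target domain could look like, the following Python sketch filters a ranked list by host name suffix. The function names and example URLs are assumptions made for this sketch, not a description of any submitted run.</p>
<p><![CDATA[
```python
from urllib.parse import urlsplit


def in_domain(url, page_domain):
    """True if the URL's host is the given domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = page_domain.lower()
    return host == domain or host.endswith("." + domain)


def restrict_to_domain(ranked_urls, page_domain, limit=50):
    """Drop results outside the target domain, preserving the original order."""
    kept = [u for u in ranked_urls if in_domain(u, page_domain)]
    return kept[:limit]


# Hypothetical ranked list for a topic whose target page domain is "nl".
ranked = [
    "http://www.example.de/a",
    "http://www.example.nl/b",
    "http://europa.eu.int/c",
    "http://www.example.nl/d",
]

print(restrict_to_domain(ranked, "nl"))
print(restrict_to_domain(ranked, "eu.int"))
```
]]></p>
<p>Matching on a host name suffix rather than on the last label alone lets the same filter handle multi-label domains such as eu.int. In this example the first result inside the nl domain moves from rank 2 to rank 1 after filtering, which is the kind of effect that would raise the reciprocal rank of a known-item topic.</p></div>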
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Multilingual Task</head><p>For the multilingual task we can actually look at 3 specific subtasks. The tasks differ w.r.t. the translations being used in the qrels. Figure <ref type="figure" target="#fig_3">4</ref> (top row) shows the results if only the target page and its duplicates are considered relevant. The second row shows the results if all translations are added to the relevant set. The bottom row shows the results if only "user readable" translations are added to the relevant set. From Table <ref type="table" target="#tab_3">4</ref> we see that the overall MRR increases when translations are added to the relevant set. This effect is, obviously, due to an increase in the number of relevant pages. There is little difference between the two sets of translations, which may have several causes. For example, the completeness of the translation identification is not known, and there might be a bias toward identifying "readable" translations rather than "un-readable" translations. Note that the relative ranking of the submitted runs does not change if translations are added to the relevant set. The highest MRR score for the multilingual task is substantially lower than the highest MRR for the mixed monolingual task: 0.1370 vs. 0.5135. The top score of the best scoring team on the multilingual task, U. Hildesheim, is over 14% below their top score on the mixed monolingual task. For the teams that score second and third best on the multilingual task, the corresponding differences are even more dramatic (56% for Miracle, and 95% for U. Amsterdam).</p><p>The success of approaches which did not apply translation is interesting and deserves a closer look. Let's look at the 40 topics which received the highest mean MRR over all submitted runs, using the monolingual result set. Of these, 26 topics are in English. The remaining 14 topics are listed in Table <ref type="table" target="#tab_6">6</ref>. 
For the high scoring non-English topics we see that proper names are common, such as Jan-Peter Balkenende, Henri Muller, Poul Hartling, Europol, etc. For these queries a translation is hardly needed.</p><p>It is difficult to say whether metadata helped in the multilingual task, since we have very few runs to compare. It is tempting, however, to say that the metadata did indeed help Miracle.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Bilingual English to Spanish Task</head><p>The results for the bilingual English to Spanish task can be seen from Table <ref type="table" target="#tab_4">5</ref>. We refer to the individual participants' papers for a more detailed analysis of the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>The mixed monolingual task proved an interesting addition to the range of tasks in cross-language information retrieval. A number of participants built effective systems that cope well with all eleven languages in the topic set. Specific web-centric techniques or additional knowledge from the metadata fields lead to further improvement. Although it may be too early to talk about a solved problem, effective web retrieval techniques seem to carry over to the multilingual setting.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of a WebCLEF 2005 topic.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Scores per-language for the 5 best scoring runs for the Mixed-Monolingual task using MRR. (Left): Best scoring baseline run per team. (Right): Best scoring run per team.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Homepages vs. named pages.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: (Top row): Scores per-language for the best scoring runs for the Multilingual task using MRR and only target pages and duplicates. (Left): Baseline runs. (Right): All runs. (Second row): Scores per-language for the 5 best scoring runs for the Multilingual task using MRR and target pages, duplicates and ALL translations. (Left): Baseline runs. (Right): All runs. (Bottom row): Scores per-language for the best scoring runs for the Multilingual task using MRR and target pages, duplicates and user readable translations. (Left): Baseline runs. (Right): All runs.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Summary of participating teams, the number of topics they developed and the number of runs they submitted.</figDesc><table><row><cell>Subm.</cell><cell>Submitted runs</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Number of topics per language for both homepages (HP) and named pages (NP). The languages are sorted by the number of available topics. The bottom part of the table shows how many duplicates/translations were identified. We list both the number of topics having a duplicate/translation and also the total count of duplicates/translations.</figDesc><table>
<row><cell></cell><cell>Total</cell><cell>ES</cell><cell>EN</cell><cell>NL</cell><cell>PT</cell><cell>DE</cell><cell>HU</cell><cell>DA</cell><cell>RU</cell><cell>EL</cell><cell>IS</cell><cell>FR</cell></row>
<row><cell>Total</cell><cell>547</cell><cell>134</cell><cell>121</cell><cell>59</cell><cell>59</cell><cell>57</cell><cell>35</cell><cell>30</cell><cell>30</cell><cell>16</cell><cell>5</cell><cell>1</cell></row>
<row><cell>HP</cell><cell>242</cell><cell>67</cell><cell>50</cell><cell>25</cell><cell>29</cell><cell>23</cell><cell>16</cell><cell>11</cell><cell>15</cell><cell>5</cell><cell>1</cell><cell>-</cell></row>
<row><cell>NP</cell><cell>305</cell><cell>67</cell><cell>71</cell><cell>34</cell><cell>30</cell><cell>34</cell><cell>19</cell><cell>19</cell><cell>15</cell><cell>11</cell><cell>4</cell><cell>1</cell></row>
<row><cell>Duplicates (topics)</cell><cell>191</cell><cell>37</cell><cell>47</cell><cell>21</cell><cell>15</cell><cell>38</cell><cell>11</cell><cell>12</cell><cell>8</cell><cell>1</cell><cell>1</cell><cell>-</cell></row>
<row><cell>Duplicates (total)</cell><cell>473</cell><cell>82</cell><cell>109</cell><cell>40</cell><cell>95</cell><cell>90</cell><cell>18</cell><cell>26</cell><cell>11</cell><cell>1</cell><cell>1</cell><cell>-</cell></row>
<row><cell>Translations (topics)</cell><cell>114</cell><cell>25</cell><cell>24</cell><cell>9</cell><cell>4</cell><cell>13</cell><cell>6</cell><cell>15</cell><cell>6</cell><cell>7</cell><cell>5</cell><cell>-</cell></row>
<row><cell>Translations (total)</cell><cell>387</cell><cell>100</cell><cell>47</cell><cell>18</cell><cell>7</cell><cell>39</cell><cell>17</cell><cell>101</cell><cell>11</cell><cell>19</cell><cell>28</cell><cell>-</cell></row>
<row><cell>Readable trans. (topics)</cell><cell>72</cell><cell>17</cell><cell>6</cell><cell>9</cell><cell>2</cell><cell>10</cell><cell>6</cell><cell>9</cell><cell>5</cell><cell>7</cell><cell>1</cell><cell>-</cell></row>
<row><cell>Readable trans. (total)</cell><cell>143</cell><cell>29</cell><cell>8</cell><cell>16</cell><cell>3</cell><cell>26</cell><cell>6</cell><cell>30</cell><cell>6</cell><cell>13</cell><cell>6</cell><cell>-</cell></row>
</table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Summary of the runs submitted for the Mixed-Monolingual task. The 'metadata usage' columns indicate usage of topic metadata: topic language (TL), page language (PL), page domain (PD), and user's native or active languages (UN, UA, respectively). For each team, its best scoring non-metadata run is in italics, and its best scoring metadata run is in boldface. This run had an error in topic-result mapping. Corrected run has MRR of 0.0923.</figDesc><table><row><cell>Metadata usage</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>Summary of the runs submitted for the Multilingual task. The 'metadata usage' columns indicate topic metadata usage: topic language (TL), page language (PL), page domain (PD), and the user's native or active languages (UN, UA, respectively). MRR is reported using the monolingual, multilingual, and the user readable assessment sets. For each team, its best scoring non-metadata run is in italics, while its best scoring metadata run is in boldface.</figDesc><table><row><cell>Metadata usage</cell><cell>MRR</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 :</head><label>5</label><figDesc>Summary of the runs submitted for the BiEnEs task. For each team, the score of its best scoring run is in boldface.</figDesc><table>
<row><cell>Group id</cell><cell>Run name</cell><cell>MRR</cell></row>
<row><cell>buap</cell><cell>BUAP Full</cell><cell>0.0465</cell></row>
<row><cell></cell><cell>BUAP PT10</cell><cell>0.0331</cell></row>
<row><cell></cell><cell>BUAP PT40</cell><cell>0.0844</cell></row>
<row><cell></cell><cell>BUAP PT60</cell><cell>0.0771</cell></row>
<row><cell></cell><cell>BUAP PT20</cell><cell>0.0446</cell></row>
<row><cell>ualicante</cell><cell>BiEn2Es</cell><cell>0.0395</cell></row>
<row><cell>uned</cell><cell>UNED bilingual baseline</cell><cell>0.0477</cell></row>
<row><cell></cell><cell>UNED bilingual exp1</cell><cell>0.0930</cell></row>
</table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 6 :</head><label>6</label><figDesc>Non-English queries with the highest mean MRR over all runs submitted to the multilingual track</figDesc><table>
<row><cell>Topic</cell><cell>Lang.</cell><cell>Original query</cell><cell>English query</cell></row>
<row><cell>WC0528</cell><cell>Dutch</cell><cell>cv balkenende</cell><cell>cv balkenende</cell></row>
<row><cell>WC0185</cell><cell>German</cell><cell>Europa Newsletter</cell><cell>Europa Newsletter</cell></row>
<row><cell>WC0070</cell><cell>French</cell><cell>Le professeur Henri Muller nommé Ambassadeur de l'Hellénisme</cell><cell>Prof. Henri Muller named ambassador for Hellenism</cell></row>
<row><cell>WC0232</cell><cell>Danish</cell><cell>Regeringen Poul Hartling</cell><cell>The cabinet of Poul Hartling</cell></row>
<row><cell>WC0456</cell><cell>Icelandic</cell><cell>upplýsingar um europol</cell><cell>europol factsheet</cell></row>
<row><cell>WC0404</cell><cell>Dutch</cell><cell>CV minister-president Jan-Peter Balkenende</cell><cell>CV of the Dutch prime minister Jan-Peter Balkenende</cell></row>
<row><cell>WC0149</cell><cell>German</cell><cell>Ernst Breit 80. Geburtstag</cell><cell>80th birthday of Ernst Breit</cell></row>
<row><cell>WC0536</cell><cell>German</cell><cell>Interviews mit Staatsminister Rolf Schwanitz</cell><cell>Interviews with Minister of State Rolf Schwanitz</cell></row>
<row><cell>WC0025</cell><cell>Greek</cell><cell>-</cell><cell>Historical sources of the Hellenic parliament</cell></row>
<row><cell>WC0198</cell><cell>Spanish</cell><cell>El Palacio de la Moncloa</cell><cell>Moncloa Palace</cell></row>
<row><cell>WC0327</cell><cell>German</cell><cell>Autobahn Südumfahrung Leipzig</cell><cell>Southern Autobahn Ring Road of Leipzig</cell></row>
<row><cell>WC0202</cell><cell>Danish</cell><cell>Dansk Faerøsk kulturfond</cell><cell>danish faroese culture fund</cell></row>
<row><cell>WC0497</cell><cell>Greek</cell><cell>-</cell><cell>Home page of the Hellenic parliament for kids</cell></row>
<row><cell>WC0491</cell><cell>German</cell><cell>Francesca Ferguson Architektur-Biennale 2004</cell><cell>Francesca Ferguson for Germany at achitecture Biennale 2004</cell></row>
</table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Acknowledgments</head><p>We want to thank the participating teams for their valuable input that helped to make this test collection a reality. We are thankful to the University of Glasgow for providing additional search engine access to the collection during the topic development phase. We thank UNED for providing a reviewed set of translations for the bilingual English to Spanish task. We would like to thank Ian Soboroff and TREC for their help with creating the topic development guidelines.</p><p>Jaap Kamps was supported by a grant from the Netherlands Organization for Scientific Research (NWO) under project numbers 612.066.302 and 640.001.501. Maarten de Rijke was supported by grants from NWO under project numbers 017.001.190, 220-80-001, 264-70-050, 354-20-005, 612-13-001, 612.000.106, 612.000.207, 612.066.302, and 612.069.006.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The multilingual task, in contrast, is still very far from being a solved problem. Remarkably, using non-translated English queries proved more successful than using translations of the English queries. A closer look at the best scoring queries revealed that a large portion of them did indeed have an English target. As for the best scoring queries which had a non-English target, a majority contained a proper name which does not require translation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Future of WebCLEF</head><p>WebCLEF 2005 was an important first step toward a cross-lingual web retrieval test collection. There are a number of steps that can be taken to further improve the quality of the current test collection. Here we list a few.</p><p>• User data: More user data was collected during the topic development phase than was used as topic metadata. This serves as an important resource to better understand the challenges of multilingual web retrieval. The data is available to all groups who participated in the topic development process.</p><p>• Duplicates: It is not clear how complete the duplicate detection is. It remains as future work to investigate this aspect. Furthermore, we need to analyze how incomplete duplicate detection affects system ranking.</p><p>• Translations: As with duplicates, the translations are likely to be incomplete. Compiling a complete list of translations is non-trivial. 
It remains as future work to investigate whether the creation of the set of translations can be partly automated.</p><p>If we look a bit further ahead and speculate about future WebCLEF tasks, there are a number of new tasks we can look at.</p><p>• X to English: Non-native English speakers are often more comfortable with posing queries in their native language even if they have no problem with reading English results.</p><p>• Ad-hoc retrieval: If assessment resources are allocated for the WebCLEF task it would be possible, and worthwhile, to do ad-hoc retrieval.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the TREC-2004 Web Track</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hawking</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings TREC 2004</title>
				<meeting>TREC 2004</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">EuroGOV: Engineering a Multilingual Web Corpus</title>
		<author>
			<persName><forename type="first">B</forename><surname>Sigurbjörnsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kamps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Rijke</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
	<note>In This</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
