<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of WebCLEF 2006</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krisztian Balog</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leif Azzopardi</string-name>
          <email>Leif.Azzopardi@cis.strath.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Archive and Information Studies, University of Amsterdam</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer and Information Sciences, University of Strathclyde</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ISLA, University of Amsterdam</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Maarten de Rijke</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Measurement</institution>
          ,
          <addr-line>Performance, Experimentation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We report on the CLEF 2006 WebCLEF track devoted to crosslingual web retrieval. We provide details about the retrieval tasks, the topic set used, and the results of WebCLEF participants. WebCLEF 2006 used a stream of known-item topics consisting of: (i) manual topics (including a selection of WebCLEF 2005 topics, and a set of new topics) and (ii) automatically generated topics (generated using two techniques). Our main findings are the following. First, the results over all topics show that current CLIR systems are quite effective, retrieving on average the target page in the top few ranks. Second, when we break down the scores over the manually constructed and the generated topics, we see that the manually constructed topics result in higher performance. Third, the resulting scores on automatic topics give, at least, a solid indication of performance, and can hence be an attractive alternative in situations where manual topics are not readily available.</p>
      </abstract>
      <kwd-group>
        <kwd>Web retrieval</kwd>
        <kwd>Known-item retrieval</kwd>
        <kwd>Multilingual retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The world wide web presents one of the greatest challenges for cross-language information
retrieval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Content on the world wide web is essentially multilingual, and web users are often
polyglots. The European web space is a case in point: the majority of Europeans speak at least
one language other than their mother tongue, and the Internet is a frequent reason to use a foreign
language [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The challenge of crosslingual web retrieval is addressed, head-on, by WebCLEF [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        The crosslingual web retrieval track uses an extensive collection of spidered web sites of
European governments, baptized EuroGOV [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The retrieval task at WebCLEF 2006 is based on a
stream of known-item topics in a range of languages. This task, labeled mixed-monolingual
retrieval, was pioneered at WebCLEF 2005 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Participants of WebCLEF 2005 expressed the
wish to be able to iron out issues with the systems they built during last year’s campaign, since
for many it was their first attempt at web IR involving many languages, encoding issues, different
formats, and noisy data. The continuation of this known-item retrieval task at WebCLEF 2006
allows veteran participants to take stock and make meaningful comparisons of their results over
years. To facilitate this, we decided to include a selection of WebCLEF 2005 topics in the topic set
(also available for training purposes), as well as a set of new known-item topics. Furthermore, we
decided to experiment with the automatic generation of known-item topics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. By contrasting the
human topics with the automatically generated topics, we hope to gain insight into the validity of the
automatically generated topics, especially in a multilingual environment. Our main findings are
the following. First, the results over all topics show that current CLIR systems are quite effective,
retrieving on average the target page in the top few ranks. Second, when we break down the scores
over the manually constructed and the generated topics, we see that the manually constructed
topics result in higher performance. Third, the resulting scores on automatic topics give, at least,
a solid indication of performance, and can hence be an attractive alternative in situations where
manual topics are not readily available.
      </p>
      <p>The remainder of this paper is structured as follows. Section 2 gives the details of the method
for automatically generating known-item topics. Next, in Section 3, we discuss the details of the
track set-up: the retrieval task, the document collection, and the topics. Section 4 reports
the runs submitted by participants, and Section 5 discusses the results of the official submissions.
Finally, in Section 6 we discuss our findings and draw some initial conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Automatic Topic Construction</title>
      <p>
        This year we experimented with the automatic generation of known-item topics. The main
advantage of automatically generating queries is that for any given test collection numerous queries
can be produced at minimal cost [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the WebCLEF setting this could be especially rewarding,
since manual development of topics in all the different languages would require human resources
that we do not have at our disposal.
      </p>
      <p>To create simulated queries, we model the following behavior of a known-item searcher. We
assume that the user wants to retrieve a particular document that they have seen before in the
collection, because some need has arisen calling for this document. The user then tries to
reconstruct or recall terms, phrases and features that would help identify this document, which they
pose as a query.</p>
      <p>
        The basic algorithm we use for generating queries was introduced by Azzopardi and de Rijke [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
and is based on an abstraction of the actual querying process, as follows:
• Initialize an empty query q = {}.
• Select the document d to be the known-item with probability p(d).
• Select the query length k with probability p(k).
• Repeat k times:
– Select a term t from the document model of d with probability p(t|θd).
– Add t to the query q.
• Record d and q to define the known-item/query pair.</p>
      <p>By repeatedly performing this algorithm we can create many queries. Before doing so, the
probability distributions p(d), p(k) and p(t|θd) need to be defined. By using different probability
distributions we can characterize different types and styles of queries that a user may submit.</p>
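      <p>To make the abstraction above concrete, the following sketch (our own illustrative Python, not the official topic-generation code; the helper names and the toy collection are ours) generates one known-item/query pair using a uniform document prior and a caller-supplied length and term distribution.</p>
      <preformat><![CDATA[
import random

def sample_query(collection, length_dist, term_dist):
    """Generate one known-item/query pair following the abstract algorithm.

    collection  -- list of documents, each a list of terms
    length_dist -- callable returning a query length k drawn from p(k)
    term_dist   -- callable(document) returning one term t drawn from p(t | theta_d)
    """
    query = []                                   # initialize an empty query q = {}
    doc_id = random.randrange(len(collection))   # select d with a uniform prior p(d)
    document = collection[doc_id]
    k = length_dist()                            # select the query length k from p(k)
    for _ in range(k):
        query.append(term_dist(document))        # draw a term from the document model
    return doc_id, query                         # record the known-item/query pair

# Example: uniform sampling over the distinct terms of the document, fixed length 3.
collection = [["information", "retrieval", "agent"], ["european", "government", "portal"]]
print(sample_query(collection, lambda: 3, lambda d: random.choice(sorted(set(d)))))
]]></preformat>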
      <p>
        Azzopardi and de Rijke [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] conducted experiments using various term sampling methods in
order to simulate different styles of queries. In one case, they set the probability of selecting
a term from the document model to a uniform distribution, where p(t|θd) was set to zero for all
terms that did not occur in the document, whilst all other terms were assigned an equal probability.
      </p>
      <sec id="sec-2-1">
        <title>Query Start</title>
        <p>Query
Start
0.9
0.1</p>
      </sec>
      <sec id="sec-2-2">
        <title>Noise</title>
        <p>Compared to other types of queries, they found that using a uniform selection produced queries
which were the most similar to real queries.</p>
        <p>In the construction of a set of queries for the EuroGOV collection, we also use uniform
sampling, but include query noise and then phrase extraction in the process to create more
realistic queries. To introduce some noise into the process of generating a query, our model for sampling
query terms is broken into two parts: sampling from the document (in our case uniformly) and
sampling terms at random (i.e., noise). Figure 1 shows the sampling process: a term is
drawn from the unigram document model with probability 1 − λ, or it is drawn from the noise
model with probability λ. Consequently, as λ tends to zero, we assume that the user has
almost perfect recollection of the original document. Conversely, as λ tends to one, we assume
that the user’s memory of the document degrades to the point that they know the document exists
but they have no idea as to the terms, other than randomly selecting terms from the collection.
We used λ = 0.1 for topic generation. This model was used for our first setting, called auto-uni.</p>
        <p>We further extended the process of sampling terms from a document. Once a term has been
sampled from the document, we assume that there is some probability that the subsequent term
will be drawn. For instance, given the sentence “. . . Information Retrieval Agent . . . ”, if the
first term sampled is “Retrieval”, then the subsequent term selected will be “Agent”. This was
included to provide some notion of phrase extraction to the process of selecting query terms. The
process is depicted in Figure 2. This model was used for our second setting, called auto-bi, where
we either add the subsequent term with p = 0.7, or sample a new term independently from the
document with p = 0.3.</p>
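        <p>As an illustration of the two settings, the sketch below (again our own hedged Python; λ = 0.1 and the 0.7/0.3 continuation probability are taken from the description above, while the noise model is simplified here to a uniform draw over a vocabulary list) shows how query terms could be selected under auto-uni and auto-bi.</p>
        <preformat><![CDATA[
import random

NOISE_LAMBDA = 0.1   # probability of drawing a noise term (lambda in the text)
P_CONTINUE = 0.7     # probability of taking the subsequent document term (auto-bi)

def draw_term(document, vocabulary):
    """auto-uni term selection: noise with probability NOISE_LAMBDA,
    otherwise a uniform draw over the distinct terms of the document."""
    if random.random() < NOISE_LAMBDA:
        return random.choice(vocabulary)
    return random.choice(sorted(set(document)))

def auto_bi_query(document, vocabulary, k):
    """auto-bi: with probability P_CONTINUE the next query term is the term that
    follows the previously sampled term in the document (a crude form of phrase
    extraction); otherwise a new term is sampled independently."""
    query = []
    for _ in range(k):
        prev = query[-1] if query else None
        if (prev in document and document.index(prev) + 1 < len(document)
                and random.random() < P_CONTINUE):
            query.append(document[document.index(prev) + 1])   # e.g. "retrieval" -> "agent"
        else:
            query.append(draw_term(document, vocabulary))
    return query
]]></preformat>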
        <p>
          We indexed each domain within the EuroGOV collection separately, using the Lemur language
modeling toolkit [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We experimented with two different styles of queries, and for each of them we
generated 30 queries per top-level domain. For both settings, the query length k was selected using
a Poisson distribution where the mean was set to 3. Two restrictions were placed on sampled query
terms: (i) the size of a term needed to be greater than 3, and (ii) the terms should not contain
any numeric characters. Finally, the document prior p(d) was also set to a uniform distribution.
        </p>
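        <p>A minimal sketch of these settings (hypothetical Python; the Poisson mean of 3 and the two term restrictions come from the text above, whereas the helper names are ours):</p>
        <preformat><![CDATA[
import math
import random

def poisson_length(mean=3.0):
    """Draw a query length from a Poisson distribution (Knuth's method),
    falling back to 1 so that every generated topic has at least one term."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return max(k - 1, 1)

def acceptable(term):
    """Sampled terms must be longer than 3 characters and contain no digits."""
    return len(term) > 3 and not any(ch.isdigit() for ch in term)
]]></preformat>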
        <p>Our initial results motivate further work with more sophisticated query generators. A natural
next step would be to take structure and document priors into account.</p>
    </sec>
    <sec id="sec-3">
      <title>The WebCLEF 2006 Tasks</title>
      <sec id="sec-3-1">
        <title>Document Collection</title>
        <p>
          For the purposes of the WebCLEF track the EuroGOV corpus was developed [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. EuroGOV
is a crawl of European government-related sites, where collection building is less restricted by
intellectual property rights. It is a multilingual web corpus, which contains over 3.5 million pages
from 27 primary domains, covering over twenty languages. There is no single language that
dominates the corpus, and its linguistic diversity provides a natural setting for multilingual web
search.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Topics</title>
        <p>The topic set for WebCLEF 2006 is a stream of 1,940 known-item topics, comprising both
manual and automatically generated topics. As is shown in Table 1, 195 manual topics were reused
from WebCLEF 2005, and 125 new manual topics were constructed. For the generated topics, we
focused on 27 primary domains and generated 30 topics per domain using the auto-uni query generator, and
another 30 topics per domain using the auto-bi query generator (see Section 2 for details), amounting to 810
automatic topics for each of the methods.</p>
        <p>After the runs had been evaluated, we observed that the performance achieved on the automatic
topics is frequently very poor. We found that in several cases none of the participants found
any relevant page within the top 50 returned results. These are often mixed-language topics, a
result of language diversity within a primary domain, or they proved to be too hard for some other
reason.</p>
        <p>In our post-submission analysis we decided to zoom in on a subset of topics and removed any
topic that did not meet the following criterion: “some participant found the targeted
page within the top 50.” Table 1 presents the number of original, deleted and remaining topics.
820 out of the 1,940 original topics were removed. Most of the removed topics are automatic
(803), but there are also a few manual ones (17). The remaining topic set contains 1,120 topics,
and is referred to as the new topic set.</p>
        <p>We decided to re-evaluate the submitted runs using this new topic set. Since it is a subset of
the original topic collection, no additional effort was required from participants. Submitted runs were
re-evaluated using a restricted version of the (original) qrels that corresponds to the new topic set.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Retrieval Task</title>
        <p>
          WebCLEF 2006 saw the continuation of the Mixed Monolingual task of WebCLEF 2005 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The
mixed-monolingual task is meant to simulate a user searching for a known-item page in a
European language. The mixed-monolingual task uses the title field of the topics to create a set of
monolingual known-item topics.
        </p>
        <p>Our emphasis this year is on the mixed monolingual task. The manual topics in the topic set
contain an English translation of the query. Hence, using only the manual topics, experiments
with a Multilingual task are possible. The multilingual task is meant to simulate a user looking
for a certain known-item page in a particular European language. The user, however, uses English
to formulate her query. The multilingual task used the English translations of the original topic
statements.
</p>
      </sec>
      <sec id="sec-3-4">
        <title>Submission</title>
        <p>For each task, participating teams were allowed to submit up to 5 runs. The results had to
be submitted in TREC format. For each topic a ranked list of no more than 50 results should
be returned, and at least 1 result must be returned. Participants were also asked
to provide a list of the metadata fields they used, and a brief description of the methods and
techniques employed.</p>
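        <p>For reference, a TREC-format run file contains one whitespace-separated line per returned result, schematically (this is the standard TREC run layout; no actual topic or document identifiers are shown here):</p>
        <preformat><![CDATA[
topic-id   Q0   document-id   rank   score   run-tag
]]></preformat>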
      </sec>
      <sec id="sec-3-5">
        <title>Evaluation</title>
        <p>The WebCLEF 2006 topics were known-item topics in which a unique URL is targeted (unless there
are duplicate or near-duplicate pages in the collection). Hence, we opted for a precision measure.
The main metric used for evaluation was mean reciprocal rank (MRR). The reciprocal rank is
calculated as 1 divided by the rank at which the (first) relevant page is found, and the mean
reciprocal rank is obtained by averaging the reciprocal ranks over a set of topics.</p>
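        <p>A minimal sketch of the measure (our own Python; the input is the 1-based rank of the first relevant page per topic, with None for topics where no relevant page was retrieved within the cut-off):</p>
        <preformat><![CDATA[
def mean_reciprocal_rank(first_relevant_ranks):
    """Average the reciprocal ranks over a set of topics; topics whose target page
    was not retrieved (None) contribute a reciprocal rank of 0."""
    rr = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)

# Example: targets found at ranks 1 and 4, and not found for the third topic.
print(mean_reciprocal_rank([1, 4, None]))   # (1 + 0.25 + 0) / 3 = 0.4167
]]></preformat>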
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Submitted Runs</title>
      <p>There were 8 participating teams that managed to submit official runs to WebCLEF 2006: buap;
depok; hildesheim; hummingbird; isla; reina; rafi; and ucm. For details of the respective retrieval
approaches to crosslingual web retrieval, we refer to the participants’ papers.</p>
      <p>Table 2 lists the runs submitted to WebCLEF 2006: 35 for the mixed-monolingual task, and 1
for the multilingual task. We also indicate the use of topic metadata, either the topic’s language (TL),
the targeted page’s language (PL), or the targeted page’s domain (PD). The mean reciprocal rank
(MRR) is reported over both the original and the new topic set. The official results of WebCLEF
2006 were based on the original topic set containing 1,940 topics. As detailed in Section 3.2 above,
we have pruned the topic set by removing topics for which none of the participants retrieved the
target page, resulting in 1,120 topics. In Appendix A, we provide scores for various breakdowns
for both the original topic set and the new topic set.</p>
      <p>The task description stated that for each topic, at least 1 result must be returned. However,
several runs did not fulfill this condition. The best results for each team were achieved using 1
or more metadata fields. Knowledge of the page’s primary domain (shown in the PD column in
Table 2) seemed moderately effective.
</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>This year our focus is on the Mixed-Monolingual task. A large number of topics were made
available, consisting of old manual, new manual, and automatically generated topics. Evaluation
results showed that the performance achieved on the automatic topics is frequently very poor,
and we made a new topic set from which we removed topics for which none of the participants found
any relevant page within the top 50 returned results. All the results presented in this section
correspond to the new topic set consisting of 1,120 topics.</p>
      <p>We look at each team’s best scoring run, independent of whether it was a baseline run or used
some of the topic metadata. Table 3 presents the scores of the participating teams. We report the
results over the whole new qrel set (all), and over the automatic and manual subsets of topics.
What is striking is that the automatic topics proved to be more difficult than the manual ones. This
may be due in part to the fact that the manual topics cover 11 languages, whereas the generated topics
cover all 27 domains in EuroGOV, including the more difficult domains and languages. Another
important factor may be the imperfections in the generated topics. Apart from the lower scores,
the automatic topics also outnumber the manual topics. Therefore we also used the average of
the automatic and manual scores for ranking participants. Defining an overall ranking of teams is not
straightforward, since one team may outperform another on the automatic topics, but perform
worse on the manual ones. Still, we observe that participants can be unambiguously assigned to
one of three bins based on either the all or the average scores: the first bin consisting of
hummingbird and isla; the second bin of depok, hildesheim, rafi, and ucm; and the third bin of
buap and reina.</p>
      <sec id="sec-5-1">
        <title>Evaluation on Automatic Topics</title>
        <p>Automatic topics were generated using two different methods, as described in Section 2 above.
The participating teams’ scores did not show significant differences in topic difficulty
between the two generators. Table 4 provides details of the best runs when evaluation is restricted
to automatically generated topics only.</p>
        <p>Note that the scores included in Table 4 are measured on the new topic set. Note also
that there is very little difference between the number of topics within the new topic set for
the two automatic topic subsets (auto-uni and auto-bi in Table 1).</p>
        <p>In general, the two query generation methods perform very similarly, and it is system-specific
whether one type of automatic topic is preferred over the other. Our initial results with
automatically generated queries are promising, but a large portion of these topics is still not realistic.
This motivates us to work further on more advanced query generation methods.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation on Manual Topics</title>
        <p>The manual topics include 183 old and 120 new queries. Old topics were randomly sampled from
last year’s topics, while new topics were developed by Universidad Complutense de Madrid (UCM)
and the track organizers. The new topics cover only languages for which expertise was available:
Dutch, English, German, Hungarian, and Spanish.</p>
        <p>In the case of the old manual topics we witnessed improvements for all teams that took part in
WebCLEF 2005, compared to their scores from last year. Moreover, we found that most participating
systems performed better on the new manual topics than on the old ones. A possible
explanation is the nature of the topics, namely that the new topics may be more appropriate for known-item
search. Also, the language coverage of the new manual topics could play a role.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Comparing Rankings</title>
        <p>We use Kendall’s tau to determine correlations between the rankings of runs resulting from different
topic sets. First, we find weak (0.2–0.4) to moderate (0.4–0.6) positive correlations between the ranking
of runs resulting from automatic topics and the rankings of runs resulting from all manual topics, only
new manual topics, and only old manual topics; see Table 6. The rankings resulting from the topics
generated with the “auto-bi” method are somewhat more correlated with the manual rankings
than the ranking resulting from the topics generated with the “auto-uni” method. A very strong
positive correlation (0.8–1.0) is found between the ranking of runs obtained using new manual
topics and the ranking of runs resulting from using old manual topics. Note that the new topic
set we introduced does not affect the relative ranking of systems; thus the correlation scores
reported here are exactly the same for the original and for the new topic sets.</p>
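        <p>Such a correlation can be computed, for example, with SciPy (illustrative only; the two orderings below are made-up placeholders, not the actual run rankings):</p>
        <preformat><![CDATA[
from scipy.stats import kendalltau

# Positions of the same runs under two different topic sets,
# e.g. the ranking induced by auto-bi topics vs. by manual topics.
ranking_auto = [1, 2, 3, 4, 5, 6, 7, 8]
ranking_manual = [2, 1, 3, 5, 4, 6, 8, 7]
tau, p_value = kendalltau(ranking_auto, ranking_manual)
print(tau)
]]></preformat>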
        <p>Our main focus this year was on the monolingual task, but we allowed submissions for multilingual
experiments within the mixed-monolingual setup. The manual topics (both old and new ones) are
provided with English titles. The automatically generated topics do not have English translations.</p>
        <p>We received only one multilingual submission, from the University of Hildesheim. The
evaluation of the multilingual run is restricted to the manual topics in the topic set; Table 2 summarizes
the results of that run. A detailed breakdown over the different topic types is provided in
Appendix A (Tables 7 and 8).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        The world wide web is a natural reflection of the language diversity in the world, both in terms
of web content and in terms of web users. Effective cross-language information retrieval
(CLIR) techniques have clear potential for improving the search experience of such users. The
WebCLEF track at CLEF 2006 attempts to realize some of this potential by investigating
known-item retrieval in a multilingual setting. Known-item retrieval is a typical search task on the web [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
This year’s track focused on mixed monolingual search, in which the topic set is a stream of
known-item topics in various languages. This task was pioneered at WebCLEF 2005 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The collection is
based on the spidered content of web sites of European governments. This year’s topic set covered
all 27 primary domains in the collection, and contained both manually constructed search topics
and automatically generated topics. Our main findings for the mixed-monolingual task are the
following. First, the results over all topics show that current CLIR systems are quite effective.
These systems retrieve, on average, the target page in the top few ranks. This is particularly
impressive when considering that the topics of WebCLEF 2006 covered no less than 27 European
primary domains. Second, when we break down the scores over the manually constructed and the
generated topics, we see that the manually constructed topics result in higher performance. The
manual topics consisted of both a set of newly constructed topics, and a selection of WebCLEF
2005 topics. For veteran participants, we can compare the scores over the years, and we see progress
for the old manual topics. The new manual topics (which were not available for training) seem to
confirm this progress.
      </p>
      <p>
        Building a cross-lingual test collection is a complex endeavor. Information retrieval evaluation
requires substantial manual effort by topic authors and relevance assessors. In a cross-lingual
setting this is particularly difficult, since the language capabilities of topic authors should sufficiently
reflect the linguistic diversity of the document collection used. Alternative proposals to traditional
topics and relevance assessments, such as term relevance sets, still require human effort (albeit only
a fraction) and linguistic capacities from the topic author. (Recall that term relevance sets (T-rels) consist
of a set of terms likely to occur in relevant documents, and a set of irrelevant terms, especially
disambiguation terms that avoid false positives [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].) This prompted us to experiment with
techniques for automatically generating known-item search requests. The automatic construction
of known-item topics has been applied earlier in a monolingual setting [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. At WebCLEF 2006, two
refined versions of these techniques were applied in a mixed-language setting. The general set-up
of the WebCLEF 2006 track can be viewed as an experiment with automatically constructing
topics. Recall that the topic set contained both manual and automatic topics. This allows us to
critically compare the performance on the automatic topics with that on the manual topics, although the
comparison is not necessarily fair given that the manual and automatic subsets of topics differ
both in number and in the domains they cover. Our general conclusion on the automatic topics is
a mixed one: on the one hand, our results show that there are still some substantial differences
between the automatic topics and manual topics, and it is clear that automatic topics cannot
simply substitute for manual topics. Yet, on the other hand, the resulting scores on automatic topics
give, at least, a solid indication of performance, and can hence be an attractive alternative in
situations where manual topics are not readily available.
      </p>
      <p>Acknowledgments Thanks to Universidad Complutense de Madrid (UCM) for providing
additional Spanish topics.</p>
      <p>Krisztian Balog was supported by the Netherlands Organisation for Scientific Research (NWO)
under project numbers 220-80-001, 600.065.120 and 612.000.106. Jaap Kamps was supported by
NWO under project numbers 612.066.302, 612.066.513, 639.072.601, and 640.001.501; and by the
E.U. IST programme of the 6th FP for RTD under project MultiMATCH contract IST-033104.
Maarten de Rijke was supported by NWO under project numbers 017.001.190, 220-80-001,
264-70050, 354-20-005, 600.065.120, 612-13-001, 612.000.106, 612.066.302, 612.069.006, 640.001.501,
640.002.501, and by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH
contract IST-033104.</p>
    </sec>
    <sec id="sec-7">
      <title>Breakdown of Scores over Topic Types</title>
      <p>We provide a breakdown of scores over the different topic types, both for the original topic set in
Table 7 and for the new topic set in Table 8.</p>
      <p>[Tables 7 and 8 (per-run MRR scores broken down over topic types for the original and the new topic set, respectively) are not reproduced here.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amitay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lempel</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Soffer</surname>
          </string-name>
          .
          <article-title>Scaling IR-system evaluation using term relevance sets</article-title>
          .
          <source>In Proceedings of the 27th annual international ACM SIGIR conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>10</fpage>
          -
          <lpage>17</lpage>
          . ACM Press, New York USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          .
          <article-title>Automatic construction of known-item finding test beds</article-title>
          .
          <source>In Proceedings of the 29th annual international ACM SIGIR conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>603</fpage>
          -
          <lpage>604</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          .
          <article-title>A taxonomy of web search</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Eurobarometer</surname>
          </string-name>
          .
          <article-title>Europeans and their languages</article-title>
          .
          <source>Special Eurobarometer</source>
          <volume>243</volume>
          ,
          European Commission
          ,
          <year>2006</year>
          . URL: http://ec.europa.eu/public_opinion/archives/ebs/ebs_243_en.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          .
          <article-title>Cross language information retrieval: a research roadmap</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>72</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Lemur</surname>
          </string-name>
          .
          <article-title>The Lemur toolkit for language modeling</article-title>
          and information retrieval,
          <year>2005</year>
          . URL: http://www.lemurproject.org/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          , J. Kamps, and M. de Rijke.
          <article-title>EuroGOV: Engineering a multilingual Web corpus</article-title>
          . In C. Peters,
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          , H. Müller, and M. de Rijke, editors,
          <source>Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum (CLEF</source>
          <year>2005</year>
          ), volume
          <volume>4022</volume>
          of Lecture Notes in Computer Science. Springer Verlag, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          , J. Kamps, and M. de Rijke.
          <article-title>Overview of WebCLEF 2005</article-title>
          . In C. Peters,
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          , H. Müller, and M. de Rijke, editors,
          <source>Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum (CLEF</source>
          <year>2005</year>
          ), volume
          <volume>4022</volume>
          of Lecture Notes in Computer Science. Springer Verlag, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          WebCLEF.
          <source>Cross-lingual web retrieval</source>
          ,
          <year>2006</year>
          . URL: http://ilps.science.uva.nl/WebCLEF/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>