             CLEF 2006: Ad Hoc Track Overview

    Giorgio M. Di Nunzio1 , Nicola Ferro1 , Thomas Mandl2 , and Carol Peters3
         1
             Department of Information Engineering, University of Padua, Italy
                           {dinunzio, ferro}@dei.unipd.it
               2
                 Information Science, University of Hildesheim – Germany
                               mandl@uni-hildesheim.de
                    3
                      ISTI-CNR, Area di Ricerca – 56124 Pisa – Italy
                              carol.peters@isti.cnr.it



       Abstract. We describe the objectives and organization of the CLEF
       2006 ad hoc track and discuss the main characteristics of the tasks of-
       fered to test monolingual, bilingual, and multilingual textual document
       retrieval systems. The track was divided into two streams. The main
       stream offered mono- and bilingual tasks using the same collections as
       CLEF 2005: Bulgarian, English, French, Hungarian and Portuguese. The
       second stream, designed for more experienced participants, offered the
       so-called “robust task” which used test collections from previous years
       in six languages (Dutch, English, French, German, Italian and Spanish)
       with the objective of privileging experiments which achieve good stable
       performance over all queries rather than high average performance. The
       document collections used were taken from the CLEF multilingual com-
       parable corpus of news documents. The performance achieved for each
       task is presented and a statistical analysis of results is given.


Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Index-
ing; H.3.3 Information Search and Retrieval; H.3.4 [Systems and Software]:
Performance evaluation.

General Terms
Experimentation, Performance, Measurement, Algorithms.

Additional Keywords and Phrases
Multilingual Information Access, Cross-Language Information Retrieval


1     Introduction

The ad hoc retrieval track is generally considered to be the core track in the
Cross-Language Evaluation Forum (CLEF). The aim of this track is to promote
the development of monolingual and cross-language textual document retrieval
systems. The CLEF 2006 ad hoc track was structured in two streams. The main
stream offered monolingual tasks (querying and finding documents in one lan-
guage) and bilingual tasks (querying in one language and finding documents in
another language) using the same collections as CLEF 2005. The second stream,
designed for more experienced participants, was the “robust task”, aimed at
finding documents for very difficult queries. It used test collections developed in
previous years.
    The Monolingual and Bilingual tasks were principally offered for Bulgar-
ian, French, Hungarian and Portuguese target collections. Additionally, in the
bilingual task only, newcomers (i.e. groups that had not previously participated
in a CLEF cross-language task) or groups using a “new-to-CLEF” query lan-
guage could choose to search the English document collection. The aim in all
cases was to retrieve relevant documents from the chosen target collection and
submit the results in a ranked list.
    The Robust task offered monolingual, bilingual and multilingual tasks using
the test collections built over three years: CLEF 2001 - 2003, for six languages:
Dutch, English, French, German, Italian and Spanish. Using topics from three
years meant that more extensive experiments and a better analysis of the results
were possible. The aim of this task was to study and achieve good performance
on queries that had proved difficult in the past rather than obtain a high average
performance when calculated over all queries.
    In this paper we describe the track setup, the evaluation methodology and the
participation in the different tasks (Section 2), present the main characteristics
of the experiments and show the results (Sections 3 - 5). Statistical testing
is discussed in Section 6 and the final section provides a brief summing up.
For information on the various approaches and resources used by the groups
participating in this track and the issues they focused on, we refer the reader to
the other papers in the Ad Hoc section of the Working Notes.


2   Track Setup

The ad hoc track in CLEF adopts a corpus-based, automatic scoring method
for the assessment of system performance, based on ideas first introduced in the
Cranfield experiments in the late 1960s. The test collection used consists of a
set of “topics” describing information needs and a collection of documents to be
searched to find those documents that satisfy these information needs. Evalu-
ation of system performance is then done by judging the documents retrieved
in response to a topic with respect to their relevance, and computing the recall
and precision measures. The distinguishing feature of CLEF is that it applies
this evaluation paradigm in a multilingual setting. This means that the criteria
normally adopted to create a test collection, consisting of suitable documents,
sample queries and relevance assessments, have been adapted to satisfy the par-
ticular requirements of the multilingual context. All language dependent tasks
such as topic creation and relevance judgment are performed in a distributed
setting by native speakers. Rules are established and a tight central coordina-
             Table 1. Test collections for the main stream Ad Hoc tasks.

                    Language                         Collections
                    Bulgarian           Sega 2002, Standart 2002
                    English      LA Times 94, Glasgow Herald 95
                    French     ATS (SDA) 94/95, Le Monde 94/95
                    Hungarian                 Magyar Hirlap 2002
                    Portuguese        Público 94/95; Folha 94/95



tion is maintained in order to ensure consistency and coherency of topic and
relevance judgment sets over the different collections, languages and tracks.

2.1     Test Collections
Different test collections were used in the ad hoc task this year. The main (i.e.
non-robust) monolingual and bilingual tasks used the same document collections
as in Ad Hoc last year but new topics were created and new relevance assessments
made. As has already been stated, the test collection used for the robust task
was derived from the test collections previously developed at CLEF. No new
relevance assessments were performed for this task.

Documents. The document collections used for the CLEF 2006 ad hoc tasks are
part of the CLEF multilingual corpus of newspaper and news agency documents
described in the Introduction to these Proceedings.
    In the main stream monolingual and bilingual tasks, the English, French and
Portuguese collections consisted of national newspapers and news agencies for
the period 1994 and 1995. Different variants were used for each language. Thus,
for English we had both US and British newspapers, for French we had a national
newspaper of France plus Swiss French news agencies, and for Portuguese we
had national newspapers from both Portugal and Brazil. This means that, for
each language, there were significant differences in orthography and lexicon over
the sub-collections. This is a real world situation and system components, i.e.
stemmers, translation resources, etc., should be sufficiently flexible to handle
such variants. The Bulgarian and Hungarian collections used in these tasks were
new in CLEF 2005 and consist of national newspapers for the year 2002.¹ This
has meant using collections of different time periods for the ad-hoc mono- and
bilingual tasks. This had important consequences on topic creation. Table 1
summarizes the collections used for each language.
    The robust task used test collections containing data in six languages (Dutch,
English, German, French, Italian and Spanish) used at CLEF 2001, CLEF 2002
and CLEF 2003. There are approximately 1.35 million documents and 3.6 gi-
gabytes of text in the CLEF 2006 “robust” collection. Table 2 summarizes the
collections used for each language.
¹ It proved impossible to find national newspapers in electronic form for 1994 and/or
  1995 in these languages.
                   Table 2. Test collections for the Robust task.

         Language                                        Collections
         English                     LA Times 94, Glasgow Herald 95
         French                       ATS (SDA) 94/95, Le Monde 94
         Italian                    La Stampa 94, AGZ (SDA) 94/95
         Dutch      NRC Handelsblad 94/95, Algemeen Dagblad 94/95
         German   Frankfurter Rundschau 94/95, Spiegel 94/95, SDA 94
         Spanish                                          EFE 94/95



Topics. Topics in the CLEF ad hoc track are structured statements representing
information needs; the systems use the topics to derive their queries. Each topic
consists of three parts: a brief “title” statement; a one-sentence “description”; a
more complex “narrative” specifying the relevance assessment criteria.
    Sets of 50 topics were created for the CLEF 2006 ad hoc mono- and bilingual
tasks. One of the decisions taken early on in the organization of the CLEF ad
hoc tracks was that the same set of topics would be used to query all collec-
tions, whatever the task. There were a number of reasons for this: it makes it
easier to compare results over different collections, it means that there is a single
master set that is rendered in all query languages, and a single set of relevance
assessments for each language is sufficient for all tasks. However, in CLEF 2005
the assessors found that the fact that the collections used in the ad hoc mono-
and bilingual tasks were from two different time periods (1994-1995 and 2002)
made topic creation particularly difficult. It was not possible to create
time-dependent topics that referred to particular date-specific events as all top-
ics had to refer to events that could have been reported in any of the collections,
regardless of the dates. This meant that the CLEF 2005 topic set is somewhat
different from the sets of previous years as the topics all tend to be of broad
coverage. In fact, it was difficult to construct topics that would find a limited
number of relevant documents in each collection, and consequently a - probably
excessive - number of topics used for the 2005 mono- and bilingual tasks have a
very large number of relevant documents.
    For this reason, we decided to create separate topic sets for the two different
time-periods for the CLEF 2006 ad hoc mono- and bilingual tasks. We thus
created two overlapping topic sets, with a common set of time independent
topics and sets of time-specific topics. 25 topics were common to both sets while
25 topics were collection-specific, as follows:
    – Topics C301 - C325 were used for all target collections;
    – Topics C326 - C350 were created specifically for the English, French and
      Portuguese collections (1994/1995);
    – Topics C351 - C375 were created specifically for the Bulgarian and Hungarian
      collections (2002).
    This meant that a total of 75 topics were prepared in many different languages
(European and non-European): Bulgarian, English, French, German, Hungar-
ian, Italian, Portuguese, and Spanish plus Amharic, Chinese, Hindi, Indonesian,
Oromo and Telugu. Participants had to select the necessary topic set according
to the target collection to be used.
    Below we give an example of the English version of a typical CLEF topic:

<top>
<num> C302 </num>
<EN-title> Consumer Boycotts </EN-title>
<EN-desc> Find documents that describe or discuss the impact of consumer
boycotts. </EN-desc>
<EN-narr> Relevant documents will report discussions or points of view on
the efficacy of consumer boycotts. The moral issues involved in such
boycotts are also of relevance. Only consumer boycotts are relevant,
political boycotts must be ignored. </EN-narr>
</top>

    For the robust task, the topic sets used in CLEF 2001, CLEF 2002 and
CLEF 2003 were used for evaluation. A total of 160 topics were collected and
split into two sets: 60 topics used to train the system, and 100 topics used for
the evaluation. Topics were available in the languages of the target collections:
English, German, French, Spanish, Italian, Dutch.


2.2   Participation Guidelines

To carry out the retrieval tasks of the CLEF campaign, systems have to build
supporting data structures. Allowable data structures include any new structures
built automatically (such as inverted files, thesauri, conceptual networks, etc.)
or manually (such as thesauri, synonym lists, knowledge bases, rules, etc.) from
the documents. They may not, however, be modified in response to the topics,
e.g. by adding topic words that are not already in the dictionaries used by their
systems in order to extend coverage.
    Some CLEF data collections contain manually assigned, controlled or uncon-
trolled index terms. The use of such terms has been limited to specific experi-
ments that have to be declared as “manual” runs.
    Topics can be converted into queries that a system can execute in many dif-
ferent ways. CLEF strongly encourages groups to determine what constitutes
a base run for their experiments and to include these runs (officially or unof-
ficially) to allow useful interpretations of the results. Unofficial runs are those
not submitted to CLEF but evaluated using the trec_eval package. This year
we have used the new package written by Chris Buckley for the Text REtrieval
Conference (TREC) (trec_eval 7.3), available from the TREC website.
    As a consequence of limited evaluation resources, a maximum of 12 runs each
for the mono- and bilingual tasks was allowed (no more than 4 runs for any one
language combination - we try to encourage diversity). We accepted a maximum
of 4 runs per group and topic language for the multilingual robust task. For bi-
and mono-lingual robust tasks, 4 runs were allowed per language or language
pair.
2.3   Relevance Assessment
The number of documents in large test collections such as CLEF makes it imprac-
tical to judge every document for relevance. Instead approximate recall values
are calculated using pooling techniques. The results submitted by the groups
participating in the ad hoc tasks are used to form a pool of documents for each
topic and language by collecting the highly ranked documents from all submis-
sions. This pool is then used for subsequent relevance judgments. The stability
of pools constructed in this way and their reliability for post-campaign experi-
ments is discussed in [1] with respect to the CLEF 2003 pools. After calculating
the effectiveness measures, the results are analyzed and run statistics produced
and distributed. New pools were formed in CLEF 2006 for the runs submitted
for the main stream mono- and bilingual tasks and the relevance assessments
were performed by native speakers. In contrast, the robust tasks used the original
pools and relevance assessments from CLEF 2003.
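    To make the pooling step concrete, here is a minimal sketch of how such pools
can be built from run files in the standard TREC result format; the pool depth of
60 and the helper name build_pools are illustrative assumptions, not the official
CLEF settings.

    from collections import defaultdict

    def build_pools(run_files, depth=60):
        # One pool of document ids per topic; depth is an assumed per-run cut-off.
        pools = defaultdict(set)
        for path in run_files:
            seen_per_topic = defaultdict(int)
            with open(path) as f:
                for line in f:
                    # TREC run format: topic Q0 docno rank score run_tag
                    topic, _q0, docno, _rank, _score, _tag = line.split()
                    if seen_per_topic[topic] < depth:  # assumes lines sorted by rank
                        pools[topic].add(docno)
                        seen_per_topic[topic] += 1
        return pools

    # Every document in pools[topic] is then judged for relevance by an assessor.
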
    The individual results for all official ad hoc experiments in CLEF 2006 are
given in the Appendix at the end of the on-line Working Notes prepared for the
Workshop [2].

2.4   Result Calculation
Evaluation campaigns such as TREC and CLEF are based on the belief that
the effectiveness of Information Retrieval Systems (IRSs) can be objectively
evaluated by an analysis of a representative set of sample search results. For
this, effectiveness measures are calculated based on the results submitted by the
participant and the relevance assessments. Popular measures usually adopted for
exercises of this type are Recall and Precision. Details on how they are calculated
for CLEF are given in [3]. For the robust task, we used different measures; see
Section 5 below.
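For reference, the standard definitions behind these measures (generic textbook
formulations, not necessarily the exact CLEF computation detailed in [3]) are, for
a topic with relevant set R and a ranked result list d_1, d_2, ...:

    \[
    P@k = \frac{|\{d_1,\dots,d_k\} \cap R|}{k}, \qquad
    R@k = \frac{|\{d_1,\dots,d_k\} \cap R|}{|R|},
    \]
    \[
    \mathrm{AP} = \frac{1}{|R|} \sum_{k \,:\, d_k \in R} P@k, \qquad
    \mathrm{MAP} = \frac{1}{|T|} \sum_{t \in T} \mathrm{AP}_t .
    \]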

2.5   Participants and Experiments
As shown in Table 3, a total of 25 groups from 15 different countries submitted
results for one or more of the ad hoc tasks - a slight increase on the 23 participants
of last year. Table 4 provides a breakdown of the number of participants by
country.
    A total of 296 experiments were submitted, an increase of 16% on the
254 experiments of 2005. On the other hand, the average number of submitted
runs per participant is nearly the same: from 11 runs per participant in 2005 to
11.7 runs per participant this year.
    Participants were required to submit at least one title+description (“TD”)
run per task in order to increase comparability between experiments. The large
majority of runs (172 out of 296, 58.11%) used this combination of topic fields,
78 (26.35%) used all fields, 41 (13.85%) used the title field, and only 5 (1.69%)
used the description field. The majority of experiments were conducted using au-
tomatic query construction (287 out of 296, 96.96%) and only in a small fraction
 Table 3. CLEF 2006 ad hoc participants – new groups are indicated by *

Participant                    Institution                   Country
alicante      U. Alicante                                Spain
celi          CELI, Torino *                             Italy
colesir       U.Coruna and U.Sunderland                  Spain
daedalus      Daedalus Consortium                        Spain
dcu           Dublin City U.                             Ireland
depok         U.Indonesia                                Indonesia
dsv           U.Stockholm                                Sweden
erss-toulouse U.Toulouse/CNRS *                          France
hildesheim    U.Hildesheim                               Germany
hummingbird Hummingbird Core Technology Group            Canada
indianstat    Indian Statistical Institute *             India
jaen          U.Jaen                                     Spain
ltrc          Int. Inst. IT *                            India
mokk          Budapest U.Tech and Economics              Hungary
nilc-usp      U.Sao Paulo - Comp.Ling. *                 Brazil
pucrs         U.Catolica Rio Grande do Sul *             Brazil
queenmary Queen Mary, U.London *                         United Kingdom
reina         U.Salamanca                                Spain
rim           EMSE - Ecole Sup. des Mines                France
rsi-jhu       Johns Hopkins U. - APL                     United States
saocarlos     U.Fed Sao Carlos - Comp.Sci *              Brazil
u.buffalo     SUNY at Buffalo                            United States
ufrgs-usp     U.Sao Paulo and U.Fed. Rio Grande do Sul * Brazil
unine         U.Neuchatel - Informatics                  Switzerland
xldb          U.Lisbon - Informatics                     Portugal


           Table 4. CLEF 2006 ad hoc participants by country.

                   Country        # Participants
                   Brazil                      4
                   Canada                      1
                   France                      2
                   Germany                     1
                   Hungary                     1
                   India                       2
                   Indonesia                   1
                   Ireland                     1
                   Italy                       1
                   Portugal                    1
                   Spain                       5
                   Sweden                      1
                   Switzerland                 1
                   United Kingdom              1
                   United States               2
                   Total                      25
of the experiments (9 out of 296, 3.04%) were queries manually constructed
from topics. A breakdown into the separate tasks is shown in Table 5(a).
    Fourteen different topic languages were used in the ad hoc experiments. As
always, the most popular language for queries was English, with French second.
The number of runs per topic language is shown in Table 5(b).


3     Main Stream Monolingual Experiments

Monolingual retrieval was offered for Bulgarian, French, Hungarian, and Por-
tuguese. As can be seen from Table 5(a), the number of participants and runs
for each language was quite similar, with the exception of Bulgarian, which had
a slightly smaller participation. This year just 6 groups out of 16 (37.5%) sub-
mitted monolingual runs only (down from ten groups last year), and 5 of these
groups were first time participants in CLEF. This year, most of the groups sub-
mitting monolingual runs were doing this as part of their bilingual or multilingual
system testing activity. Details on the different approaches used can be found
in the papers in this section of the working notes. There was a lot of detailed
work with Portuguese language processing; not surprising as we had four new
groups from Brazil in Ad Hoc this year. As usual, there was a lot of work on the
development of stemmers and morphological analysers ([4], for instance, applies
a very deep morphological analysis for Hungarian) and comparisons of the pros
and cons of so-called “light” and “heavy” stemming approaches (e.g. [5]). In
contrast to previous years, we note that a number of groups experimented with
NLP techniques (see, for example, papers by [6], and [7]).


3.1   Results

Table 6 shows the top five groups for each target collection, ordered by mean
average precision. The table reports: the short name of the participating group;
the mean average precision achieved by the run; the run identifier; and the
performance difference between the first and the last participant. Table 6 includes
only runs using the title + description fields (the mandatory run type).
    Figures 1 to 4 compare the performances of the top participants in the
Monolingual tasks.


4     Main Stream Bilingual Experiments

The bilingual task was structured in four subtasks (X → BG, FR, HU or PT
target collection) plus, as usual, an additional subtask with English as target
language restricted to newcomers in a CLEF cross-language task. This year, in
this subtask, we focussed in particular on non-European topic languages and in
particular languages for which there are still few processing tools or resources
were in existence. We thus offered two Ethiopian languages: Amharic and Oromo;
two Indian languages: Hindi and Telugu; and Indonesian. Although, as was to
        Table 5. Breakdown of experiments into tracks and topic languages.

 (a) Number of experiments per track, participant.

            Track              # Part. # Runs
  Monolingual-BG                  4     11
  Monolingual-FR                  8     27
  Monolingual-HU                  6     17
  Monolingual-PT                 12     37
  Bilingual-X2BG                  1      2
  Bilingual-X2EN                  5     33
  Bilingual-X2FR                  4     12
  Bilingual-X2HU                  1      2
  Bilingual-X2PT                  6     22
  Robust-Mono-DE                  3      7
  Robust-Mono-EN                  6     13
  Robust-Mono-ES                  5     11
  Robust-Mono-FR                  7     18
  Robust-Mono-IT                  5     11
  Robust-Mono-NL                  3      7
  Robust-Bili-X2DE                2      5
  Robust-Bili-X2ES                3      8
  Robust-Bili-X2NL                1      4
  Robust-Multi                    4     10
  Robust-Training-Mono-DE         2      3
  Robust-Training-Mono-EN         4      7
  Robust-Training-Mono-ES         3      5
  Robust-Training-Mono-FR         5     10
  Robust-Training-Mono-IT         3      5
  Robust-Training-Mono-NL         2      3
  Robust-Training-Bili-X2DE       1      1
  Robust-Training-Bili-X2ES       1      2
  Robust-Training-Multi           2      3
                 Total                 296

 (b) List of experiments by topic language.

            Topic Lang.   # Runs
            English           65
            French            60
            Italian           38
            Portuguese        37
            Spanish           25
            Hungarian         17
            German            12
            Bulgarian         11
            Indonesian        10
            Dutch             10
            Amharic            4
            Oromo              3
            Hindi              2
            Telugu             2
            Total            296
   [Plot: Ad-Hoc Monolingual Bulgarian track, top participants, interpolated recall vs. average precision.
    Legend: unine UniNEbg2 (MAP 33.14%, pooled); rsi-jhu 02aplmobgtd4 (MAP 31.98%, pooled);
    hummingbird humBG06tde (MAP 30.47%, pooled); daedalus bgFSbg2S (MAP 27.87%, pooled).]

                                 Fig. 1. Monolingual Bulgarian

   [Plot: Ad-Hoc Monolingual French track, top participants, interpolated recall vs. average precision.
    Legend: unine UniNEfr3 (MAP 44.68%, pooled); rsi-jhu 95aplmofrtd5s (MAP 40.96%, pooled);
    hummingbird humFR06tde (MAP 40.77%, pooled); alicante 8dfrexp (MAP 38.28%, pooled);
    daedalus frFSfr2S (MAP 37.94%, pooled).]

                                  Fig. 2. Monolingual French
   [Plot: Ad-Hoc Monolingual Hungarian track, top participants, interpolated recall vs. average precision.
    Legend: unine UniNEhu2 (MAP 41.35%, pooled); rsi-jhu 02aplmohutd4 (MAP 39.11%, pooled);
    alicante 30dfrexp (MAP 35.32%, pooled); mokk plain2 (MAP 34.95%, pooled);
    hummingbird humHU06tde (MAP 32.24%, pooled).]

                                 Fig. 3. Monolingual Hungarian

   [Plot: Ad-Hoc Monolingual Portuguese track, top participants, interpolated recall vs. average precision.
    Legend: unine UniNEpt1 (MAP 45.52%, pooled); hummingbird humPT06tde (MAP 45.07%, not pooled);
    alicante 30okapiexp (MAP 43.08%, not pooled); rsi-jhu 95aplmopttd5 (MAP 42.42%, not pooled);
    u.buffalo UBptTDrf1 (MAP 40.53%, pooled).]

                                Fig. 4. Monolingual Portuguese
                    Table 6. Best entries for the monolingual track.

  Track                         Participant Rank
              1st        2nd           3rd        4th        5th       Diff.
Bulgarian    unine      rsi-jhu    hummingbird daedalus              1st vs 4th
      MAP 33.14%        31.98%        30.47%     27.87%               20.90%
       Run UniNEbg2 02aplmobgtd4 humBG06tde bgFSbg2S
  French     unine      rsi-jhu    hummingbird  alicante   daedalus  1st vs 5th
      MAP 44.68%        40.96%        40.77%     38.28%     37.94%    17.76%
       Run UniNEfr3 95aplmofrtd5s1 humFR06tde   8dfrexp    frFSfr2S
Hungarian    unine      rsi-jhu      alicante     mokk   hummingbird 1st vs 5th
      MAP 41.35%        39.11%        35.32%     34.95%     32.24%    28.26%
       Run UniNEhu2 02aplmohutd4     30dfrexp    plain2  humHU06tde
Portuguese   unine   hummingbird     alicante    rsi-jhu   u.buffalo 1st vs 5th
      MAP 45.52%        45.07%        43.08%     42.42%     40.53%    12.31%
       Run UniNEpt1 humPT06tde 30okapiexp 95aplmopttd5 UBptTDrf1



  be expected, the results are not particularly good, we feel that experiments of
  this type with lesser-studied languages are very important (see papers by [8], [9],
  [10]).


  4.1     Results

  Table 7 shows the best results for this task for runs using the title+description
  topic fields. The performance difference between the best and the last (up to 5)
  placed group is given (in terms of average precision). Again, both pooled and not
  pooled runs are included in the best entries for each track, with the exception
  of Bilingual X → EN.
      For bilingual retrieval evaluation, a common method to evaluate performance
  is to compare results against monolingual baselines. For the best bilingual sys-
  tems, we have the following results for CLEF 2006:

   – X → BG: 52.49% of best monolingual Bulgarian IR system;
   – X → FR: 93.82% of best monolingual French IR system;
 – X → HU: 53.13% of best monolingual Hungarian IR system;
 – X → PT: 90.91% of best monolingual Portuguese IR system.
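      These relative figures correspond to the ratio between the best bilingual MAP
  (Table 7) and the best monolingual MAP (Table 6) for the same target collection,
  as can be verified from the tables; for example, for X → FR:

    \[
    \frac{\mathrm{MAP}_{\text{best bilingual FR}}}{\mathrm{MAP}_{\text{best monolingual FR}}}
      = \frac{0.4192}{0.4468} \approx 0.9382 \;\;(93.82\%).
    \]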

        We can compare these to those for CLEF 2005:

   – X → BG: 85% of best monolingual Bulgarian IR system;
   – X → FR: 85% of best monolingual French IR system;
   – X → HU: 73% of best monolingual Hungarian IR system;
   – X → PT: 88% of best monolingual Portuguese IR system.

     While these results are very good for the languages that are well established in
  CLEF, and can be read as state-of-the-art for this kind of retrieval system, at first
  glance they appear very disappointing for Bulgarian and Hungarian. However,
                              Table 7. Best entries for the bilingual task.

  Track                             Participant Rank
                1st       2nd           3rd                4th              5th             Diff.
Bulgarian    daedalus
      MAP     17.39%
       Run bgFSbgWen2S
  French       unine   queenmary       rsi-jhu         daedalus                           1st vs 4th
      MAP     41.92%     33.96%        33.60%            33.20%                            26.27%
       Run UniNEBifr1 QMUL06e2f10b   aplbienfrd      frFSfrSen2S
Hungarian    daedalus
      MAP     21.97%
       Run huFShuMen2S
Portuguese     unine     rsi-jhu     queenmary         u.buffalo          daedalus        1st vs 5th
      MAP     41.38%     35.49%        35.26%            29.08%            26.50%          55.85%
       Run UniNEBipt2  aplbiesptd  QMUL06e2p10b UBen2ptTDrf2            ptFSptSen2S
 English      rsi-jhu    depok           ltrc              celi              dsv          1st vs 5th
      MAP     32.57%     26.71%        25.04%            23.97%            22.78%          42.98%
       Run aplbiinen5   UI td mt      OMTD       CELItitleNOEXPANSION DsvAmhEngFullNofuzz




           we have to point out that, unfortunately, this year only one group submitted
           cross-language runs for Bulgarian and Hungarian and thus it does not make
           much sense to draw any conclusions from these, apparently poor, results for
           these languages. It is interesting to note that when Cross Language Information
           Retrieval (CLIR) system evaluation began in 1997 at TREC-6 the best CLIR
           systems had the following results:
             – EN → FR: 49% of best monolingual French IR system;
             – EN → DE: 64% of best monolingual German IR system.
                Figures 5 to 9 compare the performances of the top participants in the
            Bilingual tasks with the following target languages: Bulgarian, French, Hungar-
            ian, Portuguese, and English. Although, as usual, English was by far the most
            popular language for queries, some less common and interesting query-to-target
            language pairs were tried, e.g. Amharic, Spanish and German to French, and
           French to Portuguese.


           5    Robust Experiments
           The robust task was organized for the first time at CLEF 2006. The evaluation
           of robustness emphasizes stable performance over all topics instead of high aver-
           age performance [11]. The perspective of each individual user of an information
           retrieval system is different from the perspective taken by an evaluation initia-
           tive. The user will be disappointed by systems which deliver poor results for
           some topics whereas an evaluation initiative rewards systems which deliver good
           average results. A system delivering poor results for hard topics is likely to be
           considered of low quality by a user although it may reach high average results.
   [Plot: Ad-Hoc Bilingual track, Bulgarian target collection(s), top participants, interpolated recall vs. average precision.
    Legend: daedalus bgFSbgWen2S (MAP 17.39%, pooled).]

                                   Fig. 5. Bilingual Bulgarian

   [Plot: Ad-Hoc Bilingual track, French target collection(s), top participants, interpolated recall vs. average precision.
    Legend: unine UniNEBifr1 (MAP 41.92%, pooled); queenmary QMUL06e2f10b (MAP 33.96%, pooled);
    rsi-jhu aplbienfrd (MAP 33.60%, pooled); daedalus frFSfrSen2S (MAP 33.20%, pooled).]

                                    Fig. 6. Bilingual French
   [Plot: Ad-Hoc Bilingual track, Hungarian target collection(s), top participants, interpolated recall vs. average precision.
    Legend: daedalus huFShuMen2S (MAP 21.97%, pooled).]

                                  Fig. 7. Bilingual Hungarian

   [Plot: Ad-Hoc Bilingual track, Portuguese target collection(s), top participants, interpolated recall vs. average precision.
    Legend: unine UniNEBipt2 (MAP 41.38%, pooled); rsi-jhu aplbiesptd (MAP 35.49%, not pooled);
    queenmary QMUL06e2p10b (MAP 35.26%, pooled); u.buffalo UBen2ptTDrf2 (MAP 29.08%, not pooled);
    daedalus ptFSptSen2S (MAP 26.50%, not pooled).]

                                  Fig. 8. Bilingual Portuguese
   [Plot: Ad-Hoc Bilingual track, English target collection(s), top participants, interpolated recall vs. average precision.
    Legend: rsi-jhu aplbiinen5 (MAP 32.57%, pooled); depok UI_td_mt (MAP 26.71%, not pooled);
    ltrc OMTD (MAP 25.04%, pooled); celi CELItitleNOEXPANSION (MAP 23.97%, not pooled);
    dsv DsvAmhEngFullNofuzz (MAP 22.78%, not pooled).]

                                    Fig. 9. Bilingual English



The robust task was inspired by the robust track at TREC, where it ran in 2003,
2004 and 2005. A robust evaluation stresses performance for weak
topics. This can be done by using the Geometric Average Precision (GMAP) as
a main indicator for performance instead of the Mean Average Precision (MAP)
of all topics. Geometric average has proven to be a stable measure for robustness
at TREC [11]. The robust task at CLEF 2006 is concerned with the multilingual
aspects of robustness. It is essentially an ad-hoc task which offers mono-lingual
and cross-lingual subtasks.
    During CLEF 2001, CLEF 2002 and CLEF 2003 a set of 160 topics (Topics
#41 - #200) was developed for these collections and relevance assessments were
made. No additional relevance judgements were made this year for the robust
task. However, the data collection was not completely constant over all three
CLEF campaigns which led to an inconsistency between relevance judgements
and documents. The SDA 95 collection has no relevance judgements for most
topics (#41 - #140). This inconsistency was accepted in order to increase the
size of the collection. One participant reported that exploiting this knowledge
would have resulted in an increase of approximately 10% in MAP [12]. However,
participants were not allowed to use this knowledge. The results of the original
submissions for the data sets were analyzed in order to identify the most diffi-
cult topics. This turned out to be an impossible task. The difficulty of a topic
varies greatly among languages, target collections and tasks. This confirms the
finding of the TREC 2005 robust task where the topic difficulty differed greatly
even for two different English collections. It was found that topics are not in-
herently difficult but only in combination with a specific collection [13]. Topic
difficulty is usually defined by low MAP values for a topic. We also considered
a low number of relevant documents and high variation between systems as in-
dicators for difficulty. Consequently, the topic set for the robust task at CLEF
2006 was arbitrarily split into two sets. Participants were allowed to use the
available relevance assessments for the set of 60 training topics. The remaining
100 topics formed the test set for which results are reported. The participants
were encouraged to submit results for training topics as well. These runs will be
used to further analyze topic difficulty. The robust task received a total of 133
runs from eight groups listed in Table 5(a).
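    Purely as an illustrative sketch of the difficulty indicators mentioned above
(low per-topic MAP, few relevant documents, high variation between systems); the
function name and the example values are hypothetical and this is not the procedure
actually used to split the topic set:

    from statistics import mean, stdev

    def difficulty_indicators(ap_per_system, num_relevant):
        # ap_per_system: average precision of each system on one topic
        # num_relevant:  number of documents judged relevant for that topic
        return {
            "mean_ap": mean(ap_per_system),         # low value suggests a hard topic
            "relevant_docs": num_relevant,          # few relevant documents
            "system_spread": stdev(ap_per_system),  # high variation across systems
        }

    # Hypothetical topic on which systems disagree strongly and score low overall:
    print(difficulty_indicators([0.02, 0.31, 0.05, 0.48], num_relevant=3))
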
    Most popular among the participants were the mono-lingual French and En-
glish tasks. For the multi-lingual task, four groups submitted ten runs. The bi-
lingual tasks received fewer runs. A run using title and description was manda-
tory for each group. Participants were encouraged to run their systems with the
same setup for all robust tasks in which they participated (except for language
specific resources). This way, the robustness of a system across languages could
be explored.
    Effectiveness scores for the submissions were calculated with the GMAP,
which is the n-th root of the product of the n per-topic average precision values.
GMAP was computed using version 8.0 of the trec_eval² program. In order to
avoid undefined results, all average precision scores lower than 0.00001 are set
to 0.00001.
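    A minimal sketch of this computation (the 0.00001 floor follows the value given
above; the function name and the example scores are illustrative only):

    import math

    def gmap(average_precisions, floor=0.00001):
        # Geometric mean of per-topic average precision; scores below the floor
        # are raised to it so that a zero does not make the product collapse.
        clamped = [max(ap, floor) for ap in average_precisions]
        # n-th root of the product, computed in log space for numerical stability
        return math.exp(sum(math.log(ap) for ap in clamped) / len(clamped))

    # A system that fails badly on a single topic is penalised far more by GMAP
    # than by MAP:
    aps = [0.45, 0.38, 0.0002, 0.51]
    print(sum(aps) / len(aps))   # MAP  ~ 0.335
    print(gmap(aps))             # GMAP ~ 0.065
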


5.1    Robust Monolingual Results

Table 8 shows the best results for this task for runs using the title+description
topic fields. The performance difference between the best and the last (up to 5)
placed group is given (in terms of average precision).
   Figures 10 to 15 compare the performances of the top participants in
the Robust Monolingual tasks.


5.2    Robust Bilingual Results

Table 9 shows the best results for this task for runs using the title+description
topic fields. The performance difference between the best and the last (up to 5)
placed group is given (in terms of average precision).
   For bilingual retrieval evaluation, a common method is to compare results
against monolingual baselines. We have the following results for CLEF 2006:

 – X → DE: 60.37% of best monolingual German IR system;
² http://trec.nist.gov/trec_eval/trec_eval.8.0.tar.gz
                  Table 8. Best entries for the robust monolingual task.

 Track                           Participant Rank
             1st         2nd             3rd          4th        5th         Diff.
 Dutch hummingbird     daedalus        colesir                            1st vs 3rd
   MAP     51.06%       42.39%         41.60%                               22.74%
 GMAP      25.76%       17.57%         16.40%                               57.13%
    Run humNL06Rtde nlFSnlR2S      CoLesIRnlTst
English hummingbird      reina           dcu        daedalus    colesir   1st vs 5th
   MAP     47.63%       43.66%         43.48%        39.69%    37.64%       26.54%
 GMAP      11.69%       10.53%         10.11%        8.93%      8.41%       39.00%
    Run humEN06Rtde reinaENtdtest dcudesceng12075 enFSenR2S CoLesIRenTst
French      unine   hummingbird         reina          dcu      colesir   1st vs 5th
   MAP     47.57%       45.43%         44.58%        41.08%    39.51%       20.40%
 GMAP      15.02%       14.90%         14.32%        12.00%    11.91%       26.11%
    Run UniNEfrr1 humFR06Rtde reinaFRtdtest dcudescfr12075 CoLesIRfrTst
German hummingbird      colesir       daedalus                            1st vs 3rd
   MAP     48.30%       37.21%         34.06%                               41.81%
 GMAP      22.53%       14.80%         10.61%                              112.35%
    Run humDE06Rtde CoLesIRdeTst    deFSdeR2S
Italian hummingbird      reina           dcu        daedalus    colesir   1st vs 5th
   MAP     41.94%       38.45%         37.73%        35.11%    32.23%       30.13%
 GMAP      11.47%       10.55%          9.19%        10.50%     8.23%       39.37%
    Run humIT06Rtde reinaITtdtest dcudescit1005    itFSitR2S CoLesIRitTst
Spanish hummingbird      reina           dcu        daedalus    colesir   1st vs 5th
   MAP     45.66%       44.01%         42.14%        40.40%    40.17%       13.67%
 GMAP      23.61%       22.65%         21.32%        19.64%    18.84%       25.32%
    Run humES06Rtde reinaEStdtest dcudescsp12075  esFSesR2S CoLesIResTst
   [Plot: Ad-Hoc Robust Monolingual Dutch track, top participants, interpolated recall vs. average precision.
    Legend: hummingbird humNL06Rtde (MAP 51.06%); daedalus nlFSnlR2S (MAP 42.39%);
    colesir CoLesIRnlTst (MAP 41.60%); all runs not pooled.]

                               Fig. 10. Robust Monolingual Dutch.

   [Plot: Ad-Hoc Robust Monolingual English track, top participants, interpolated recall vs. average precision.
    Legend: hummingbird humEN06Rtde (MAP 47.63%); reina reinaENtdtest (MAP 43.66%);
    dcu dcudesceng12075 (MAP 43.48%); daedalus enFSenR2S (MAP 39.69%);
    colesir CoLesIRenTst (MAP 37.64%); all runs not pooled.]

                              Fig. 11. Robust Monolingual English.
   [Plot: Ad-Hoc Robust Monolingual French track, top participants, interpolated recall vs. average precision.
    Legend: unine UniNEfrr1 (MAP 47.57%); hummingbird humFR06Rtde (MAP 45.43%);
    reina reinaFRtdtest (MAP 44.58%); dcu dcudescfr12075 (MAP 41.08%);
    colesir CoLesIRfrTst (MAP 39.51%); all runs not pooled.]

                              Fig. 12. Robust Monolingual French.

   [Plot: Ad-Hoc Robust Monolingual German track, top participants, interpolated recall vs. average precision.
    Legend: hummingbird humDE06Rtde (MAP 48.30%); colesir CoLesIRdeTst (MAP 37.21%);
    daedalus deFSdeR2S (MAP 34.06%); all runs not pooled.]

                              Fig. 13. Robust Monolingual German.
   [Plot: Ad-Hoc Robust Monolingual Italian track, top participants, interpolated recall vs. average precision.
    Legend: hummingbird humIT06Rtde (MAP 41.94%); reina reinaITtdtest (MAP 38.45%);
    dcu dcudescit1005 (MAP 37.73%); daedalus itFSitR2S (MAP 35.11%);
    colesir CoLesIRitTst (MAP 32.23%); all runs not pooled.]

                              Fig. 14. Robust Monolingual Italian.

   [Plot: Ad-Hoc Robust Monolingual Spanish track, top participants, interpolated recall vs. average precision.
    Legend: hummingbird humES06Rtde (MAP 45.66%); reina reinaEStdtest (MAP 44.01%);
    dcu dcudescsp12075 (MAP 42.14%); daedalus esFSesR2S (MAP 40.40%);
    colesir CoLesIResTst (MAP 40.17%); all runs not pooled.]

                              Fig. 15. Robust Monolingual Spanish.
                  Table 9. Best entries for the robust bilingual task.

    Track                     Participant Rank
                  1st               2nd            3rd      4th 5th    Diff.
    Dutch      daedalus
      MAP       35.37%
     GMAP        9.75%
       Run nlFSnlRLfr2S
   German      daedalus           colesir                           1st vs 2nd
      MAP       29.16%            25.24%                              15.53%
     GMAP        5.18%             4.31%                              20.19%
       Run deFSdeRSen2S CoLesIRendeTst
   Spanish       reina              dcu          daedalus           1st vs 3rd
      MAP       36.93%            33.22%          26.89%              37.34%
     GMAP       13.42%            10.44%          6.19%              116.80%
       Run reinaIT2EStdtest dcuitqydescsp12075 esFSesRLit2S

                Table 10. Best entries for the robust multilingual task.

   Track                         Participant Rank
                 1st      2nd            3rd            4th      5th   Diff.
Multilingual    jaen    daedalus        colesir        reina         1st vs 4th
        MAP 27.85%       22.67%         22.63%        19.96%          39.53%
      GMAP 15.69%        11.04%         11.24%        13.25%          18.42%
         Run ujamlrsv2 mlRSFSen2S CoLesIRmultTst reinaES2mtdtest


   – X → ES: 80.88% of best monolingual Spanish IR system;
   – X → NL: 69.27% of best monolingual Dutch IR system.
     Figures 16 to 18 compare the performances of the top participants in
  the Robust Bilingual tasks.

  5.3   Robust Multilingual Results
  Table 10 shows the best results for this task for runs using the title+description
  topic fields. The performance difference between the best and the last (up to 5)
  placed group is given (in terms of average precision).
     Figure 19 compares the performances of the top participants of the Robust
  Multilingual task.

  5.4   Comments on Robust Cross Language Experiments
  Some participants relied on the high correlation between the two measures (MAP
  and GMAP) and optimized their systems as in previous campaigns. However, several
  groups worked
  specifically at optimizing for robustness. The SINAI system took an approach
  which has proved successful at the TREC robust task, expansion with terms
  gathered from a web search engine [14]. The REINA system from the Univer-
  sity of Salamanca used a heuristic to determine hard topics during training.
[Figure: Ad-Hoc Robust Bilingual track, Dutch target collection(s), top participants, interpolated recall vs. average precision: daedalus (nlFSnlRLfr2S, MAP 35.37%).]

                     Fig. 16. Robust Bilingual Dutch.
[Figure: Ad-Hoc Robust Bilingual track, German target collection(s), top participants, interpolated recall vs. average precision: daedalus (deFSdeRSen2S, MAP 29.16%), colesir (CoLesIRendeTst, MAP 25.24%).]

                     Fig. 17. Robust Bilingual German.
[Figure: Ad-Hoc Robust Bilingual track, Spanish target collection(s), top participants, interpolated recall vs. average precision: reina (reinaIT2EStdtest, MAP 36.93%), dcu (dcuitqydescsp12075, MAP 33.22%), daedalus (esFSesRLit2S, MAP 26.89%).]

                     Fig. 18. Robust Bilingual Spanish.

[Figure: Ad-Hoc Robust Multilingual track, top participants, interpolated recall vs. average precision: jaen (ujamlrsv2, MAP 27.85%), daedalus (mlRSFSen2S, MAP 22.67%), colesir (CoLesIRmultTst, MAP 22.63%), reina (reinaES2mtdtest, MAP 19.96%).]

                     Fig. 19. Robust Multilingual.
Subsequently, different expansion techniques were applied [15]. Hummingbird
experimented with evaluation measures other than those used in the track [16].
The MIRACLE system tried to find a fusion scheme that had a positive effect
on the robust measure [17].


6   Statistical Testing

When the goal is to validate how well results can be expected to hold beyond a
particular set of queries, statistical testing can help to determine what differences
between runs appear to be real as opposed to differences that are due to sampling
issues. We aim to identify runs with results that are significantly different from
the results of other runs. “Significantly different” in this context means that
the difference between the performance scores for the runs in question appears
greater than what might be expected by pure chance. As with all statistical
testing, conclusions will be qualified by an error probability, which was chosen
to be 0.05 in the following. We have designed our analysis to follow closely the
methodology used by similar analyses carried out for TREC [18].
    We used the MATLAB Statistics Toolbox, which provides the necessary func-
tionality plus some additional functions and utilities, and applied the ANalysis Of
VAriance (ANOVA) test. ANOVA makes some assumptions concerning the data
which need to be checked. Hull [18] provides details of these; in particular, the scores
in question should be approximately normally distributed and their variance should
be approximately the same for all runs. Two tests for goodness of fit to a normal
distribution were chosen from the MATLAB Statistics Toolbox: the Lilliefors
test [19] and the Jarque-Bera test [20]. In the case of the CLEF tasks under
analysis, both tests indicate that the assumption of normality is violated for
most of the data samples (in this case, the runs of each participant).
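    For illustration, a minimal sketch of such a normality check on a single run
(in Python, with scipy and statsmodels standing in for the MATLAB toolbox functions
named above; the per-topic average precision values are a hypothetical input, not
data from the track):

    import numpy as np
    from scipy.stats import jarque_bera
    from statsmodels.stats.diagnostic import lilliefors

    def normality_check(ap_per_topic, alpha=0.05):
        # Test whether a run's per-topic AP scores can be considered Gaussian.
        # A p-value below alpha rejects the normality assumption.
        scores = np.asarray(ap_per_topic, dtype=float)
        _, lf_p = lilliefors(scores, dist="norm")   # Lilliefors test [19]
        _, jb_p = jarque_bera(scores)               # Jarque-Bera test [20]
        return {"lilliefors_ok": lf_p > alpha, "jarque_bera_ok": jb_p > alpha}

    # AP distributions skewed towards zero, as is typical for hard topics,
    # usually fail both tests.
    rng = np.random.default_rng(0)
    print(normality_check(rng.beta(0.5, 5.0, size=100)))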
    In such cases, a transformation of the data should be performed. The transfor-
mation for measures that range from 0 to 1 is the arcsin-root transformation,
arcsin(√x), which Tague-Sutcliffe [21] recommends for use with precision/recall
measures.
    Table 11 shows the results of both the Lilliefors and Jarque-Bera tests before
and after applying the Tague-Sutcliffe transformation. After the transformation,
the normality of the sample distributions improves considerably, with some
exceptions. The difficulty in transforming the data into normally distributed
samples stems from the original distribution of run performances, which tends
towards zero within the interval [0, 1].
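    The before/after counts in Table 11 can be obtained by re-running the same
tests on the transformed scores, roughly as in this sketch (Python; `runs`, a mapping
from run identifiers to per-topic AP arrays, is a hypothetical input, not the actual
track data):

    import numpy as np
    from scipy.stats import jarque_bera
    from statsmodels.stats.diagnostic import lilliefors

    def ts_transform(ap_per_topic):
        # Tague-Sutcliffe arcsin-root transformation for [0, 1]-valued measures [21].
        return np.arcsin(np.sqrt(np.asarray(ap_per_topic, dtype=float)))

    def count_gaussian_runs(runs, transform=False, alpha=0.05):
        # Count, for one track, how many runs pass each normality test,
        # with or without the arcsin-root transformation (cf. Table 11).
        n_lf = n_jb = 0
        for scores in runs.values():
            data = ts_transform(scores) if transform else np.asarray(scores, dtype=float)
            n_lf += lilliefors(data, dist="norm")[1] > alpha
            n_jb += jarque_bera(data)[1] > alpha
        return n_lf, n_jb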
    In the following, two different graphs are presented to summarize the results
of this test. All experiments, regardless of topic language or topic fields, are
included. Results are therefore only valid for comparisons between individual
pairs of runs, and not in terms of absolute performance. For both the ad hoc
and robust tasks, only runs for which significant differences exist are shown; the
remainder of the graphs can be found in the Appendices [2].
Table 11. Lilliefors (LF) and Jarque-Bera (JB) tests for each Ad-Hoc track, with and
without the Tague-Sutcliffe (TS) arcsin-root transformation. Each entry is the number
of experiments whose performance distribution can be considered drawn from a Gaussian
distribution, out of the total number of experiments in the track. The value of alpha
for these tests was set to 5%.

        Track                          LF   LF & TS   JB   JB & TS
        Monolingual Bulgarian           1      6       0      4
        Monolingual French             12     25      26     26
        Monolingual Hungarian           5     11       8      9
        Monolingual Portuguese         13     34      35     37
        Bilingual English               0      9       2      2
        Bilingual Bulgarian             0      2       0      2
        Bilingual French                8     12      12     12
        Bilingual Hungarian             0      1       0      0
        Bilingual Portuguese            4     12      15     19
        Robust Monolingual German       0      5       0      7
        Robust Monolingual English      3      9       4     11
        Robust Monolingual Spanish      1      9       0     11
        Robust Monolingual French       4      3       2     15
        Robust Monolingual Italian      6     11       8     10
        Robust Monolingual Dutch        0      7       0      7
        Robust Bilingual German         0      0       0      4
        Robust Bilingual Spanish        0      5       0      4
        Robust Bilingual Dutch          0      3       0      4
        Robust Multilingual             0      5       0      6


   The first graph shows the participants' runs (y axis) and the performance obtained
(x axis). The circle indicates the average performance (in terms of average precision),
while the segment shows the interval within which differences in performance are
not statistically significant.
   The second graph shows the overall results, where all runs included in the same
group do not have significantly different performance. All runs scoring below a
certain group perform significantly worse than at least the top entry of that group.
Likewise, all runs scoring above a certain group perform significantly better than at
least the bottom entry of that group. To determine all runs that perform significantly
worse than a given run, find the rightmost group that includes the run: all runs
scoring below the bottom entry of that group are significantly worse. Conversely, to
determine all runs that perform significantly better than a given run, find the
leftmost group that includes the run: all runs that score better than the top entry
of that group perform significantly better.
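   For illustration, a simplified sketch of how such groupings can be derived (Python,
scipy/statsmodels): a plain one-way ANOVA followed by pairwise Tukey comparisons on
the transformed scores. This is a simplification of the blocked design described by
Hull [18], and the run names and score arrays below are hypothetical:

    import numpy as np
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Hypothetical per-topic transformed AP scores for three runs on the same topics.
    rng = np.random.default_rng(1)
    runs = {
        "runA": np.arcsin(np.sqrt(rng.beta(2.0, 5.0, size=100))),
        "runB": np.arcsin(np.sqrt(rng.beta(2.0, 5.0, size=100))),
        "runC": np.arcsin(np.sqrt(rng.beta(5.0, 5.0, size=100))),
    }

    # One-way ANOVA: is there any significant difference among the runs at all?
    print(f_oneway(*runs.values()))

    # Pairwise Tukey comparisons: the 'reject' column marks pairs whose difference
    # is significant; runs never separated by a rejection fall into the same group.
    scores = np.concatenate(list(runs.values()))
    labels = np.repeat(list(runs.keys()), [len(v) for v in runs.values()])
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05))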

7    Conclusions
We have reported the results of the ad hoc cross-language textual document
retrieval track at CLEF 2006.
[Figure: Ad-Hoc Monolingual French track, Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]

                                                     Run ID         Groups
                                             UniNEfr3            X
                                             UniNEfr1            X
                                             UniNEfr2            X X
                                             95aplmofrtdn5s      X X
                                             MercwCombzSqrtAll   X XX
                                             humFR06tde          X XX
                                             95aplmofrtd5s       X XXX
                                             frx101frFS3FS6      X XXXX
                                             frFSfr4S            X XXXX
                                             frFSfr2S            X XXXX
                                             humFR06td           X XXXX
                                             MercDTree5reduced   X XXXX
                                             8dfrexp             X XXXX
                                             30okapiexp          X XXXX
                                             MercBaseStemNoaccTD X X X X X
                                             30dfrexp            X XXXX X
                                             9okapiexp           X XXXX X
                                             95aplmofrtd4          XXXX X
                                             Cd61.5                XXXX X
                                             Cd62.0                XXXX X
                                             humFR06t              XXXX XX
                                             Cld61.5                 XXX XX
                                             RIMAM06TDMLRef            XX XX
                                             RIMAM06TDML                 X XX
                                             Cbaseline                   X XX
                                             RIMAM06TL                     XX
                                             RIMAM06TDNL                    X


Fig. 20. Ad-Hoc Monolingual French. Experiments grouped according to the
Tukey T Test.
[Figure: Ad-Hoc Monolingual Hungarian track, Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]

                                                  Run ID       Groups
                                              UniNEhu2       X
                                              UniNEhu1       XX
                                              02aplmohutd4   XXX
                                              02aplmohutdn4 X X X
                                              UniNEhu3       XXXX
                                              30dfrexp       XXXX X
                                              plain2         XXXX X
                                              plain          XXXX X
                                              8dfrexp          XXX X
                                              humHU06tde        XX X
                                              hux101huFS3FS4      X XX
                                              huFShu4S            X XX
                                              30dfrnexp           X XX
                                              8dfrnexp              XX
                                              humHU06td             XX
                                              huFShu2S              XX
                                              humHU06t                X


Fig. 21. Ad-Hoc Monolingual Hungarian. Experiments grouped according to the
Tukey T Test.
[Figure: Ad-Hoc Bilingual track, English target collection(s), Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]


                                                                   Run ID              Groups
                                                      aplbiinen5                   X
                                                      aplbiinen4                   X X
                                                      UI td mt                     X X X
                                                      OMTD                         X X X X
                                                      OMTDN                        X X X X
                                                      CELItitleCwnCascadeExpansion X X X X X
                                                      CELItitleNOEXPANSION         X X X X X
                                                      CELIdescNOEXPANSION          X X X X X
                                                      UI title mt                  X X X X X
                                                      OMT                          X X X X X
                                                      CELItitleLisaExpansion       X X X X X
                                                      DsvAmhEngFullNofuzz          X X X X X
                                                      UI td dic                    X X X X X
                                                      CELItitleCwnExpansion        X X X X X X
                                                      UI title dic                 X X X X X X
                                                      CELItitleNOEXPANSIONboost      X X X X X
                                                      CELIdescLisaExpansion            X X X X
                                                      CELIdescLisaExpansionboost       X X X X
                                                      CELIdescCwnCascadeExpansion      X X X X
                                                      CELIdescNOEXPANSIONboost         X X X X
                                                      CELItitleLisaExpansionboost      X X X X
                                                      CELIdescCwnExpansion             X X X X
                                                      DsvAmhEngFull                    X X X X
                                                      UI td dicExp                     X X X X
                                                      DsvAmhEngFullWeighted              X X X X
                                                      DsvAmhEngTitle                       X X X
                                                      HNTD                                 X X X
                                                      HNT                                  X X X
                                                      UI title dicExp                        X X
                                                      TETD                                     X X
                                                      TET                                      X X
                                                      UI td prl                                  X
                                                      UI title prl                               X



Fig. 22. Ad-Hoc Bilingual English. Experiments grouped according to the Tukey
T Test.
[Figure: Ad-Hoc Bilingual track, Portuguese target collection(s), Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]

                                                        Run ID       Groups
                                                  UniNEBipt2       X
                                                  UniNEBipt1       XX
                                                  UniNEBipt3       XX X
                                                  ptFSptSfr3S      XX XX
                                                  aplbiesptn       XX XXX
                                                  aplbiesptd       XX XXX
                                                  aplbienptn       XX XXX
                                                  aplbienptd       XX XXX
                                                  QMUL06e2p10b     XX XXX
                                                  UBen2ptTDNrf3      X XXXX
                                                  UBen2ptTDNrf2        XXXX
                                                  UBen2ptTDNrf1        XXXX
                                                  ptFSptSen3S           XXX
                                                  UBen2ptTDrf3          XXX
                                                  UBen2ptTDrf2          XXX
                                                  ptFSptLes3S           XXX
                                                  UBen2ptTDrf1          XXX
                                                  ptFSptSen2S             XX
                                                  XLDBBiRel16qe20k          X
                                                  XLDBBiRel32qe10k          X
                                                  XLDBBiRel32qe20k          X
                                                  XLDBBiRel16qe10k          X


Fig. 23. Ad-Hoc Bilingual Portuguese. Experiments grouped according to the
Tukey T Test.
[Figure: Ad-Hoc Robust Monolingual German track, Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]

                                                   Run ID     Groups
                                              humDE06Rtde     X
                                              humDE06Rtd        X
                                              dex011deRFW3FS3   X X
                                              dex021deRFW3FS3   X X
                                              deFSdeR3S           X
                                              CoLesIRdeTst        X
                                              deFSdeR2S           X


Fig. 24. Robust Monolingual German. Experiments grouped according to the
Tukey T Test.
[Figure: Ad-Hoc Robust Monolingual Dutch track, Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]

                                                      Run ID     Groups
                                                 humNL06Rtde     X
                                                 humNL06Rtd        X
                                                 nlFSnlR4S         X X
                                                 nlynlRFS3456      X X
                                                 nlx011nlRFW4FS4   X X
                                                 CoLesIRnlTst        X
                                                 nlFSnlR2S           X


Fig. 25. Robust Monolingual Dutch. Experiments grouped according to the Tukey
T Test.
[Figure: Ad-Hoc Robust Bilingual track, Spanish target collection(s), Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]

                                                         Run ID      Groups
                                                  reinaIT2EStdtest   X
                                                  dcuitqynarrsp1005  XX
                                                  dcuitqydescsp12075 XX
                                                  esx011esRLitFW3FS3   X X
                                                  esx021esRLitFW3FS3   X X
                                                  esFSesRLit3S         X X
                                                  reinaIT2ESttest      X X
                                                  esFSesRLit2S           X


Fig. 26. Robust Bilingual Spanish. Experiments grouped according to the Tukey
T Test.
[Figure: Ad-Hoc Robust Multilingual track (only experiments with TEST topics), Tukey T test with the "top group" highlighted. Experiments (y axis) are plotted against arcsin(sqrt(mean average precision)) (x axis); the resulting grouping is given in the table below.]

                                                           Run ID      Groups
                                                       ujamlrsv2       X
                                                       ujamllr         X
                                                       ujamlblr        X
                                                       ml5XRSFSen4S X X
                                                       ml6XRSFSen4S X X
                                                       ml4XRSFSen4S X X
                                                       mlRSFSen2S        X X
                                                       CoLesIRmultTst    X XX
                                                       reinaES2mtdtest     XX
                                                       reinaES2mttest        X


Fig. 27. Robust Multilingual. Experiments grouped according to the Tukey T Test.
This track is considered to be central to CLEF as for many groups it is the first
track in which they participate, providing them with an opportunity to test their
systems and compare performance between monolingual and cross-language runs before
perhaps moving on to more complex system development and subsequent evaluation.
However, the track is certainly not aimed only at beginners: it also gives groups
the possibility to measure advances in system performance over time. In addition,
each year we include a task aimed at examining particular aspects of cross-language
text retrieval. This year, the focus was on examining the impact of "hard" topics on
performance in the "robust" task.
    Thus, although the ad hoc track in CLEF 2006 offered the same target lan-
guages for the main mono- and bilingual tasks as in 2005, it also had two new
focuses. Groups were encouraged to use non-European languages as topic lan-
guages in the bilingual task. We were particularly interested in languages for
which few processing tools were readily available, such as Amharic, Oromo and
Telugu. In addition, we set up the ”robust task” with the objective of providing
the more expert groups with the chance to do in-depth failure analysis.
    Finally, it should be remembered that, although over the years we vary the
topic and target languages offered in the track, all participating groups also
have the possibility of accessing and using the test collections that have been
created in previous years for all of the twelve languages included in the CLEF
multilingual test collection. The test collections for CLEF 2000 - CLEF 2003 are
about to be made publicly available on the Evaluations and Language resources
Distribution Agency (ELDA) catalog3 .


References

 1. Braschler, M.: CLEF 2003 - Overview of results. In Peters, C., Braschler, M.,
    Gonzalo, J., Kluck, M., eds.: Comparative Evaluation of Multilingual Informa-
    tion Access Systems: Fourth Workshop of the Cross–Language Evaluation Forum
    (CLEF 2003) Revised Selected Papers, Lecture Notes in Computer Science (LNCS)
    3237, Springer, Heidelberg, Germany (2004) 44–63
 2. Di Nunzio, G.M., Ferro, N.: Appendix A. Results of the Core Tracks. In Nardi,
    A., Peters, C., Vicedo, J.L., eds.: Working Notes for the CLEF 2006 Workshop,
    Published Online (2006)
 3. Braschler, M., Peters, C.: CLEF 2003 Methodology and Metrics. In Peters, C.,
    Braschler, M., Gonzalo, J., Kluck, M., eds.: Comparative Evaluation of Multilingual
    Information Access Systems: Fourth Workshop of the Cross–Language Evaluation
    Forum (CLEF 2003) Revised Selected Papers, Lecture Notes in Computer Science
    (LNCS) 3237, Springer, Heidelberg, Germany (2004) 7–20
 4. Halácsy, P.: Benefits of Deep NLP-based Lemmatization for Information Retrieval.
    In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes for the CLEF 2006
    Workshop, Published Online (2006)
 5. Moreira Orengo, V.: A Study on the use of Stemming for Monolingual Ad-Hoc
    Portuguese. Information Retrieval (2006)

 6. Azevedo Arcoverde, J.M., das Graças Volpe Nunes, M., Scardua, W.: Using Noun
    Phrases for Local Analysis in Automatic Query Expansion. In Nardi, A., Peters,
    C., Vicedo, J.L., eds.: Working Notes for the CLEF 2006 Workshop, Published
    Online (2006)
 7. Gonzalez, M., de Lima, V.L.S.: The PUCRS-PLN Group participation at CLEF
    2006. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes for the CLEF
    2006 Workshop, Published Online (2006)
 8. Tune, K.K., Varma, V.: Oromo-English Information Retrieval Experiments at
    CLEF 2006. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes for the
    CLEF 2006 Workshop, Published Online (2006)
 9. Pingali, P., Varma, V.: Hindi and Telugu to English Cross Language Information
    Retrieval at CLEF 2006. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes
    for the CLEF 2006 Workshop, Published Online (2006)
10. Hayurani, H., Sari, S., Adriani, M.: Evaluating Language Resources for English-
    Indonesian CLIR. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes for
    the CLEF 2006 Workshop, Published Online (2006)
11. Voorhees, E.M.: The TREC Robust Retrieval Track. SIGIR Forum 39 (2005)
    11–20
12. Savoy, J., Abdou, S.: UniNE at CLEF 2006: Experiments with Monolingual, Bilin-
    gual, Domain-Specific and Robust Retrieval. In Nardi, A., Peters, C., Vicedo, J.L.,
    eds.: Working Notes for the CLEF 2006 Workshop, Published Online (2006)
13. Voorhees, E.M.: Overview of the TREC 2005 Robust Retrieval Track. In Voorhees,
    E.M., Buckland, L.P., eds.: The Fourteenth Text REtrieval Conference Proceedings
    (TREC 2005), http://trec.nist.gov/pubs/trec14/t14_proceedings.html [last
    visited 2006, August 4] (2005)
14. Martinez-Santiago, F., Montejo-Ráez, A., Garcia-Cumbreras, M., Ureña-Lopez, A.:
    SINAI at CLEF 2006 Ad-hoc Robust Multilingual Track: Query Expansion using
    the Google Search Engine. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working
    Notes for the CLEF 2006 Workshop, Published Online (2006)
15. Zazo, A., Figuerola, C., Berrocal, J.: REINA at CLEF 2006 Robust Task: Local
    Query Expansion Using Term Windows for Robust Retrieval. In Nardi, A., Peters,
    C., Vicedo, J.L., eds.: Working Notes for the CLEF 2006 Workshop, Published
    Online (2006)
16. Tomlinson, S.: Comparing the Robustness of Expansion Techniques and Retrieval
    Measures. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes for the CLEF
    2006 Workshop, Published Online (2006)
17. Goñi-Menoyo, J., González-Cristóbal, J., Villena-Román, J.: Report of the MIRA-
    CLE team for the Ad-hoc track in CLEF 2006. In Nardi, A., Peters, C., Vicedo,
    J.L., eds.: Working Notes for the CLEF 2006 Workshop, Published Online (2006)
18. Hull, D.: Using Statistical Testing in the Evaluation of Retrieval Experiments. In
    Korfhage, R., Rasmussen, E., Willett, P., eds.: Proc. 16th Annual International
    ACM SIGIR Conference on Research and Development in Information Retrieval
    (SIGIR 1993), ACM Press, New York, USA (1993) 329–338
19. Conover, W.J.: Practical Nonparametric Statistics. 1st edn. John Wiley and Sons,
    New York, USA (1971)
20. Judge, G.G., Hill, R.C., Griffiths, W.E., Lütkepohl, H., Lee, T.C.: Introduction
    to the Theory and Practice of Econometrics. 2nd edn. John Wiley and Sons, New
    York, USA (1988)
21. Tague-Sutcliffe, J.: The Pragmatics of Information Retrieval Experimentation,
    Revisited. In Sparck Jones, K., Willett, P., eds.: Readings in Information Retrieval,
    Morgan Kaufmann Publishers, Inc., San Francisco, California, USA (1997) 205–216