           CLEF 2009 Ad Hoc Track Overview:
                 TEL & Persian Tasks

                         Nicola Ferro1 and Carol Peters2
       1
           Department of Information Engineering, University of Padua, Italy
                               ferro@dei.unipd.it
                     2
                       ISTI-CNR, Area di Ricerca, Pisa, Italy
                            carol.peters@isti.cnr.it



      Abstract. The 2009 Ad Hoc track was to a large extent a repetition of
      last year’s track, with the same three tasks: Tel@CLEF, Persian@CLEF,
      and Robust-WSD. In this first of the two track overviews, we describe
      the objectives and results of the TEL and Persian tasks and provide some
      statistical analyses.


Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Index-
ing; H.3.3 Information Search and Retrieval; H.3.4 [Systems and Software]:
Performance evaluation.

General Terms
Experimentation, Performance, Measurement, Algorithms.

Additional Keywords and Phrases
Multilingual Information Access, Cross-Language Information Retrieval, Word
Sense Disambiguation


1   Introduction
From 2000 to 2007, the ad hoc track at CLEF used exclusively collections of
European newspaper and news agency documents1 . In 2008 it was decided to
change the focus and to introduce document collections in a different genre
(bibliographic records from The European Library - TEL2 ), a non-European
language (Persian), and an IR task that would appeal to the NLP community
(robust retrieval on word-sense disambiguated data). The 2009 Ad Hoc track
has been to a large extent a repetition of last year’s track, with the same three
tasks: Tel@CLEF, Persian@CLEF, and Robust-WSD. An important objective
has been to ensure that for each task a good reusable test collection is created.
1
  Over the years, this track has built up test collections for monolingual and cross
  language system evaluation in 14 European languages.
2
  See http://www.theeuropeanlibrary.org/
In this first of the two track overviews we describe the activities of the TEL and
Persian tasks3 .
    TEL@CLEF: This task offered monolingual and cross-language search on
library catalogs. It was organized in collaboration with The European Library
and used three collections derived from the catalogs of the British Library, the
Bibliothèque Nationale de France and the Austrian National Library. The under-
lying aim was to identify the most effective retrieval technologies for searching
this type of very sparse multilingual data. In fact, the collections contained
records in many languages in addition to English, French or German. The task
presumed a user with a working knowledge of these three languages who wants
to find documents that can be useful for them in one of the three target catalogs.
    Persian@CLEF: This activity was coordinated again this year in collabora-
tion with the Database Research Group (DBRG) of Tehran University. We chose
Persian as the first non-European language target collection for several reasons:
its challenging script (a modified version of the Arabic alphabet with elision of
short vowels) written from right to left; its complex morphology (extensive use
of suffixes and compounding); its political and cultural importance. The task
used the Hamshahri corpus of 1996-2002 newspapers as the target collection
and was organised as a traditional ad hoc document retrieval task. Monolingual
and cross-language (English to Persian) tasks were offered.
    In the rest of this paper we present the task setup, the evaluation method-
ology and the participation in the two tasks (Section 2). We then describe the
main features of each task and show the results (Sections 3 and 4). The final
section provides a brief summing up. For information on the various approaches
and resources used by the groups participating in the two tasks and the issues
they focused on, we refer the reader to the papers in the relevant Ad Hoc sections
of these Working Notes.


2     Track Setup

As is customary in the CLEF ad hoc track, again this year we adopted a corpus-
based, automatic scoring method for the assessment of the performance of the
participating systems, based on ideas first introduced in the Cranfield experi-
ments in the late 1960s [5]. The tasks offered are designed to measure textual
document retrieval effectively under specific conditions. The test collec-
tions are made up of documents, topics and relevance assessments. The topics
consist of a set of statements simulating information needs from which the sys-
tems derive the queries to search the document collections. Evaluation of system
performance is then done by judging the documents retrieved in response to
a topic with respect to their relevance, and computing the recall and precision
measures. The pooling methodology is used in order to limit the number of man-
ual relevance assessments that have to be made. As always, the distinguishing
3
    As the task design was the same as last year, much of the task set-up section is a
    repetition of a similar section in our CLEF 2008 working notes paper.
feature of CLEF is that it applies this evaluation paradigm in a multilingual set-
ting. This means that the criteria normally adopted to create a test collection,
consisting of suitable documents, sample queries and relevance assessments, have
been adapted to satisfy the particular requirements of the multilingual context.
All language dependent tasks such as topic creation and relevance judgment are
performed in a distributed setting by native speakers. Rules are established and
a tight central coordination is maintained in order to ensure consistency and
coherency of topic and relevance judgment sets over the different collections,
languages and tracks.


2.1   The Documents

As mentioned in the Introduction, the two tasks used different sets of documents.
   The TEL task used three collections:

 – British Library (BL); 1,000,100 documents, 1.2 GB;
 – Bibliothèque Nationale de France (BNF); 1,000,100 documents, 1.3 GB;
 – Austrian National Library (ONB); 869,353 documents, 1.3 GB.

    We refer to the three collections (BL, BNF, ONB) as English, French and
German because in each case this is the main and expected language of the
collection. However, each of these collections is to some extent multilingual and
contains documents (catalog records) in many additional languages.
    The TEL data is very different from the newspaper articles and news agency
dispatches previously used in the CLEF ad hoc track. The data tends to be very
sparse. Many records contain only title, author and subject heading information;
other records provide more detail. The title and (if existing) an abstract or de-
scription may be in a different language to that understood as the language of
the collection. The subject heading information is normally in the main language
of the collection. About 66% of the documents in the English and German collections
have textual subject headings, but only 37% of those in the French collection. Dewey
Decimal Classification (DDC) information is not available in the French collection,
is negligible (<0.3%) in the German collection, but occurs in about half of the
English documents (456,408 documents, to be exact).
    Whereas in the traditional ad hoc task, the user searches directly for a doc-
ument containing information of interest, here the user tries to identify which
publications are of potential interest according to the information provided by
the catalog card. When we designed the task, the question the user was presumed
to be asking was “Is the publication described by the bibliographic record rele-
vant to my information need?”
    The Persian task used the Hamshahri corpus of 1996-2002 newspapers as
the target collection. This corpus was made available to CLEF by the Data
Base Research Group (DBRG) of the University of Tehran. Hamshahri is one
of the most popular daily newspapers in Iran. The Hamshahri corpus consists
of 345 MB of news texts for the years 1996 to 2002 (corpus size with tags is
564 MB). This corpus contains more than 160,000 news articles about a variety
of subjects and includes nearly 417,000 different words. Hamshahri articles vary
between 1 KB and 140 KB in size4 .


2.2    Topics

Topics in the CLEF ad hoc track are structured statements representing informa-
tion needs. Each topic typically consists of three parts: a brief “title” statement; a
one-sentence “description”; a more complex “narrative” specifying the relevance
assessment criteria. Topics are prepared in XML format and uniquely identified
by means of a Digital Object Identifier (DOI)5 .
    For the TEL task, a common set of 50 topics was prepared in each of the 3
main collection languages (English, French and German) plus this year also in
Chinese, Italian and Greek in response to specific requests. Only the Title and
Description fields were released to the participants. The narrative was employed
to provide information for the assessors on how the topics should be judged. The
topic sets were prepared on the basis of the contents of the collections.
    In ad hoc, when a task uses data collections in more than one language,
we consider it important to be able to use versions of the same core topic set
to query all collections. This makes it easier to compare results over different
collections and also facilitates the preparation of extra topic sets in additional
languages. However, it is never easy to find topics that are effective for several
different collections and the topic preparation stage requires considerable dis-
cussion between the coordinators for each collection in order to identify suitable
common candidates. The sparseness of the data made this particularly difficult
for the TEL task and led to the formulation of topics that were quite broad in
scope so that at least some relevant documents could be found in each collection.
A result of this strategy is that there tends to be a considerable lack of evenness
of distribution in relevant documents. For each topic, the results expected from
the separate collections can vary considerably. An example of a TEL topic is
given in Figure 1.
    For the Persian task, 50 topics were created in Persian by the Data Base
Research group of the University of Tehran, and then translated into English.
The rule in CLEF when creating topics in additional languages is not to produce
literal translations but to attempt to render them as naturally as possible. This
was a particularly difficult task when going from Persian to English as cultural
differences had to be catered for. An example of a CLEF 2009 Persian topic is
given in Figure 2.


2.3    Relevance Assessment

The number of documents in large test collections such as CLEF makes it imprac-
tical to judge every document for relevance. Instead, approximate recall values
are calculated using pooling techniques. The results submitted by the groups
4
    For more information, see http://ece.ut.ac.ir/dbrg/hamshahri/
5
    http://www.doi.org/
  
  
     Identifier: 10.2452/711-AH

     Title (ZH):  深海生物
     Title (EN):  Deep Sea Creatures
     Title (FR):  Créatures des fonds océaniques
     Title (DE):  Kreaturen der Tiefsee
     Title (IT):  Creature delle profondità oceaniche

     Description (ZH): 找有关世界上任何深海生物的出版物。
     Description (EN): Find publications about any kind of life in the depths
                       of any of the world's oceans.
     Description (FR): Trouver des ouvrages sur toute forme de vie dans les
                       profondeurs des mers et des océans.
     Description (DE): Finden Sie Veröffentlichungen über Leben und
                       Lebensformen in den Tiefen der Ozeane der Welt.
     Description (IT): Trova pubblicazioni su qualsiasi forma di vita nelle
                       profondità degli oceani del mondo.

  Fig. 1. Example of TEL topic http://direct.dei.unipd.it/10.2452/711-AH.



participating in the ad hoc tasks are used to form a pool of documents for each
topic and language by collecting the highly ranked documents from selected runs
according to a set of predefined criteria. One important limitation when forming
the pools is the number of documents to be assessed. Traditionally, the top 100
ranked documents from each of the runs selected are included in the pool; in
such a case we say that the pool is of depth 100. This pool is then used for
subsequent relevance judgments. After calculating the effectiveness measures,
the results are analyzed and run statistics produced and distributed. The sta-
bility of pools constructed in this way and their reliability for post-campaign
experiments is discussed in [3] with respect to the CLEF 2003 pools.
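    To make the pooling procedure concrete, the sketch below shows how a depth-N
pool could be assembled from a set of selected runs. It is a minimal illustration
of the general idea, assuming TREC-style run files (one whitespace-separated line
per retrieved document: topic, Q0, document id, rank, score, run tag); it is not
the implementation actually used by the track coordinators.

  # Minimal pooling sketch (illustrative only): the pool for each topic is the
  # union of the top `depth` documents contributed by each selected run.
  from collections import defaultdict

  def build_pool(run_files, depth=100):
      pool = defaultdict(set)                 # topic id -> doc ids to be judged
      for path in run_files:
          ranked = defaultdict(list)          # topic id -> [(rank, doc id), ...]
          with open(path, encoding="utf-8") as f:
              for line in f:
                  topic, _q0, doc, rank, _score, _tag = line.split()
                  ranked[topic].append((int(rank), doc))
          for topic, docs in ranked.items():
              for _rank, doc in sorted(docs)[:depth]:
                  pool[topic].add(doc)
      return pool

  # e.g. depth 60 was used for the 2009 TEL pools and depth 80 for the Persian pool:
  # tel_en_pool = build_pool(selected_tel_english_runs, depth=60)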

   The main criteria used when constructing the pools in CLEF are:
    
    
     Identifier: 10.2452/641-AH

     Title (EN):       Pollution in the Persian Gulf
     Description (EN): Find information about pollution in the Persian Gulf and
                       the causes.
     Narrative (EN):   Find information about conditions of the Persian Gulf with
                       respect to pollution; also of interest is information on
                       the causes of pollution and comparisons of the level of
                       pollution in this sea against that of other seas.

    Fig. 2. Example of Persian topic http://direct.dei.unipd.it/10.2452/641-AH.


 – favour diversity among approaches adopted by participants, according to the
   descriptions of the experiments provided by the participants;
 – choose at least one experiment for each participant in each task, chosen
   among the experiments with highest priority as indicated by the participant;
 – add mandatory title+description experiments, even though they do not have
   high priority;
 – add manual experiments, when provided;
 – for bilingual tasks, ensure that each source topic language is represented.
    From our experience in CLEF, using the tools provided by the DIRECT
system [1], we find that for newspaper documents, assessors can normally judge
from 60 to 100 documents per hour, providing binary judgments: relevant / not
relevant. Our estimate for the TEL catalog records is higher as these records
are much shorter than the average newspaper article (100 to 120 documents
per hour). In both cases, it is clear how time-consuming and resource-expensive
human relevance assessment is. This limitation impacts strongly on the application
of the criteria above and implies that we are obliged to be flexible in the number
of documents judged per selected run for individual pools.
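    As a rough, back-of-the-envelope indication of the effort involved (using the
TEL English pool size reported in Table 1 and the 100 to 120 documents/hour
estimate above):

  \[
  \frac{26{,}190 \text{ pooled documents}}{100\text{--}120 \text{ documents/hour}}
  \approx 220\text{--}260 \text{ assessor-hours},
  \]

i.e. roughly 55 to 65 hours of judging for each of the four TEL English assessors.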
    This year, in order to create pools of more-or-less equivalent size, the depth
of the TEL English, French, and German pools was 60⁶. For each collection, we
included in the pool two monolingual and one bilingual experiment from each
participant, plus any documents assessed as relevant during topic creation.
6
    Tests made on NTCIR pools in previous years have suggested that a depth of 60
    is normally adequate to create stable pools, presuming that a sufficient number of
    runs from different systems have been included.
    As we only had a relatively small number of runs submitted for Persian, we
were able to include documents from all experiments, and the pool was created
with a depth of 80.
    These pool depths were the same as those used last year. Given the resources
available, it was not possible to manually assess more documents. For the CLEF
2008 ad hoc test collections, Stephen Tomlinson reported some sampling exper-
iments aimed at estimating the judging coverage [10]. He found that this tended
to be lower than the estimates he produced for the CLEF 2007 ad hoc collec-
tions. With respect to the TEL collections, he estimated that at best 50% to
70% of the relevant documents were included in the pools - and that most of
the unjudged relevant documents were for the 10 or more queries that had the
most known answers. Tomlinson has repeated these experiments for the 2009
TEL and Persian data [9]. Although for two of the four languages concerned
(German and Persian), his findings were similar to last year’s estimates, for the
other two languages (English and French) this year’s estimates are substantially
lower. These findings need further investigation. They suggest that if we are to
continue to use the pooling technique, we would perhaps be wise to do some
more exhaustive manual searches in order to boost the pools with respect to rel-
evant documents. We also need to consider more carefully other techniques for
relevance assessment in the future such as, for example, the method suggested
by Sanderson and Joho [8] or Mechanical Turk [2].
    Table 1 reports summary information on the 2009 ad hoc pools used to
calculate the results for the main monolingual and bilingual experiments. In
particular, for each pool, we show the number of topics, the number of runs
submitted, the number of runs included in the pool, the number of documents
in the pool (relevant and non-relevant), and the number of assessors.
    The box plot of Figure 3 compares the distributions of the relevant documents
across the topics of each pool for the different ad hoc pools; the boxes are ordered
by decreasing mean number of relevant documents per topic.
    As can be noted, TEL French and German distributions appear similar and
are slightly asymmetric towards topics with a greater number of relevant docu-
ments while the TEL English distribution is slightly asymmetric towards topics
with a lower number of relevant documents. All the distributions show some
upper outliers, i.e. topics with a greater number of relevant documents with re-
spect to the behaviour of the other topics in the distribution. These outliers are
probably due to the fact that CLEF topics have to be able to retrieve relevant
documents in all the collections; therefore, they may be considerably broader in
one collection compared with others depending on the contents of the separate
datasets.
    For the TEL documents, we judged for relevance only those documents that
are written totally or partially in English, French and German, e.g. a catalog
record written entirely in Hungarian was counted as not relevant as it was of no
use to our hypothetical user; however, a catalog record with perhaps the title and
a brief description in Hungarian, but with subject descriptors in French, German
or English was judged for relevance as it could be potentially useful. Our assessors
            Table 1. Summary information about CLEF 2009 pools.

TEL English Pool (DOI 10.2454/AH-TEL-ENGLISH-CLEF2009)
  Pool size           26,190 pooled documents
                        – 23,663 not relevant documents
                        – 2,527 relevant documents
                      50 topics
  Pooled Experiments  31 out of 89 submitted experiments
                        – monolingual: 22 out of 43 submitted experiments
                        – bilingual: 9 out of 46 submitted experiments
  Assessors           4 assessors

TEL French Pool (DOI 10.2454/AH-TEL-FRENCH-CLEF2009)
  Pool size           21,971 pooled documents
                        – 20,118 not relevant documents
                        – 1,853 relevant documents
                      50 topics
  Pooled Experiments  21 out of 61 submitted experiments
                        – monolingual: 16 out of 35 submitted experiments
                        – bilingual: 5 out of 26 submitted experiments
  Assessors           1 assessor

TEL German Pool (DOI 10.2454/AH-TEL-GERMAN-CLEF2009)
  Pool size           25,541 pooled documents
                        – 23,882 not relevant documents
                        – 1,559 relevant documents
                      50 topics
  Pooled Experiments  21 out of 61 submitted experiments
                        – monolingual: 16 out of 35 submitted experiments
                        – bilingual: 5 out of 26 submitted experiments
  Assessors           2 assessors

Persian Pool (DOI 10.2454/AH-PERSIAN-CLEF2009)
  Pool size           23,536 pooled documents
                        – 19,072 not relevant documents
                        – 4,464 relevant documents
                      50 topics
  Pooled Experiments  20 out of 20 submitted experiments
                        – monolingual: 17 out of 17 submitted experiments
                        – bilingual: 3 out of 3 submitted experiments
  Assessors           23 assessors
Fig. 3. Distribution of the relevant documents across the ad-hoc pools (box plots
of the number of relevant documents per topic for the Persian, TEL English, TEL
French, and TEL German pools; horizontal axis: number of relevant documents).
had no additional knowledge of the documents referred to by the catalog records
(or surrogates) contained in the collection. They judged relevance solely on the
information contained in the records made available to the systems. This was
a non-trivial task due to the lack of information present in the documents.
During the relevance assessment activity there was much consultation between
the assessors for the three TEL collections in order to ensure that the same
assessment criteria were adopted by everyone.
    As shown in the box plot of Figure 3, the Persian distribution presents a
greater number of relevant documents per topic with respect to the other dis-
tributions and is slightly asymmetric. In addition, as can be seen from Table 1,
it has been possible to sam-
ple all the experiments submitted for the Persian tasks. This means that there
were fewer unique documents per run and this fact, together with the greater
number of relevant documents per topic, suggests either that all the systems were
using similar approaches and retrieval algorithms or that the systems found the
Persian topics quite easy.
    The relevance assessment for the Persian results was done by the DBRG
group in Tehran. Again, assessment was performed on a binary basis and the
standard CLEF assessment rules were applied.

2.4   Result Calculation
Evaluation campaigns such as TREC and CLEF are based on the belief that
the effectiveness of Information Retrieval Systems (IRSs) can be objectively
evaluated by an analysis of a representative set of sample search results. For
this, effectiveness measures are calculated based on the results submitted by the
participants and the relevance assessments. Popular measures usually adopted
for exercises of this type are Recall and Precision. Details on how they are
calculated for CLEF are given in [4].
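    As a concrete illustration of how these measures are computed, the following
minimal sketch calculates average precision (averaged over topics to give the mean
average precision, MAP, reported in Tables 4 to 6) and precision interpolated at
the 11 standard recall levels plotted in Figures 4 to 11. The simplified run and
qrels formats (ranked lists of document ids and sets of relevant ids) are
assumptions made for illustration; the official results were computed with the
standard evaluation software, not with this code.

  # Illustrative sketch of the effectiveness measures used in this paper.
  def average_precision(ranked_docs, relevant):
      hits, precisions = 0, []
      for rank, doc in enumerate(ranked_docs, start=1):
          if doc in relevant:
              hits += 1
              precisions.append(hits / rank)   # precision at each relevant document
      return sum(precisions) / len(relevant) if relevant else 0.0

  def interpolated_precision(ranked_docs, relevant, levels=11):
      """Precision interpolated at recall 0.0, 0.1, ..., 1.0."""
      if not relevant:
          return [0.0] * levels
      points, hits = [], 0
      for rank, doc in enumerate(ranked_docs, start=1):
          if doc in relevant:
              hits += 1
          points.append((hits / len(relevant), hits / rank))   # (recall, precision)
      return [max((p for r, p in points if r >= i / (levels - 1)), default=0.0)
              for i in range(levels)]

  def mean_average_precision(run, qrels):
      """run: {topic: ranked doc ids}; qrels: {topic: set of relevant doc ids}."""
      aps = [average_precision(docs, qrels.get(t, set())) for t, docs in run.items()]
      return sum(aps) / len(aps) if aps else 0.0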
    The individual results for all official Ad-hoc TEL and Persian experiments in
CLEF 2009 are given in the Appendices of the CLEF 2009 Working Notes [6,7].
You can also access them online at:
 – Ad-hoc TEL:
    • monolingual English: http://direct.dei.unipd.it/DOIResolver.do?
      type=task&id=AH-TEL-MONO-EN-CLEF2009
    • bilingual English: http://direct.dei.unipd.it/DOIResolver.do?type=
      task&id=AH-TEL-BILI-X2EN-CLEF2009
    • monolingual French: http://direct.dei.unipd.it/DOIResolver.do?
      type=task&id=AH-TEL-MONO-FR-CLEF2009
    • bilingual French: http://direct.dei.unipd.it/DOIResolver.do?type=
      task&id=AH-TEL-BILI-X2FR-CLEF2009
    • monolingual German: http://direct.dei.unipd.it/DOIResolver.do?
      type=task&id=AH-TEL-MONO-DE-CLEF2009
    • bilingual German: http://direct.dei.unipd.it/DOIResolver.do?type=
      task&id=AH-TEL-BILI-X2DE-CLEF2009
                      Table 2. CLEF 2009 Ad hoc participants.

                           Ad hoc TEL participants
           Participant              Institution            Country
           aeb         Athens Univ. Economics & Business Greece
           celi        CELI Research srl                 Italy
           chemnitz    Chemnitz University of Technology Germany
           cheshire    U.C.Berkeley                      United States
           cuza        Alexandru Ioan Cuza University    Romania
           hit         HIT2Lab, Heilongjiang Inst. Tech. China
           inesc       Tech. Univ. Lisbon                Portugal
           karlsruhe   Univ. Karlsruhe                   Germany
           opentext    OpenText Corp.                    Canada
           qazviniau   Islamic Azad Univ. Qazvin         Iran
           trinity     Trinity Coll. Dublin              Ireland
           trinity-dcu Trinity Coll. & DCU               Ireland
           weimar      Bauhaus Univ. Weimar              Germany
                          Ad hoc Persian participants
           Participant              Institution            Country
           jhu-apl     Johns Hopkins Univ.               USA
           opentext    OpenText Corp.                    Canada
           qazviniau   Islamic Azad Univ. Qazvin         Iran
           unine       U.Neuchatel-Informatics           Switzerland



 – Ad-hoc Persian:
    • monolingual Farsi: http://direct.dei.unipd.it/DOIResolver.do?type=
      task&id=AH-PERSIAN-MONO-FA-CLEF2009
    • bilingual Farsi: http://direct.dei.unipd.it/DOIResolver.do?type=
      task&id=AH-PERSIAN-BILI-X2FA-CLEF2009


2.5    Participants and Experiments

As shown in Table 2, a total of 13 groups from 10 countries submitted official
results for the TEL task, while just four groups participated in the Persian task.
    A total of 231 runs were submitted, with an average of 13.5 runs per
participant.
    Participants were required to submit at least one title+description (“TD”)
run per task in order to increase comparability between experiments. The large
majority of runs (216 out of 231, 93.50%) used this combination of topic fields, 2
(0.80%) used all fields7 , and 13 (5.6%) used only the title field. All the experiments were
conducted using automatic query construction. A breakdown into the separate
tasks and topic languages is shown in Table 3.
    Seven different topic languages were used in the ad hoc experiments. As
always, the most popular language for queries was English, with German second.
However, it must be noted that English topics were provided for both the TEL
7
    The narrative field was only offered for the Persian task.
      Table 3. Number of experiments by task and topic language and number of partici-
      pants per task.

        Task     Chinese English Farsi French German Greek Italian Total Participants
TEL Mono English       –      46     –      –      –     –       –   46            12
TEL Mono French        –       –     –     35      –     –       –   35             9
TEL Mono German        –       –     –      –     35     –       –   35             9
TEL Bili English       3       0     0     15     19     5       1   43            10
TEL Bili French        0      12     0      0     12     0       2   26             6
TEL Bili German        1      12     0     12      0     0       1   26             6
Mono Persian           –       –    17      –      –     –       –   17             4
Bili Persian           –       3     –      –      –     –       –    3             1
Total                  4      73   17      62     66     5       4 231              –


      and the Persian tasks. It is thus hardly surprising that English is the most used
      language in which to formulate queries. On the other hand, if we look only at
      the bilingual tasks, the most used source languages were German and French.

      3     TEL@CLEF
      The objective of this activity was to search and retrieve relevant items from
      collections of library catalog cards. The underlying aim was to identify the most
      effective retrieval technologies for searching this type of very sparse data.

      3.1   Tasks
      Two subtasks were offered: Monolingual and Bilingual. In both tasks, the aim
      was to retrieve documents relevant to the query. By monolingual we mean that
      the query is in the same language as the expected language of the collection.
      By bilingual we mean that the query is in a different language to the expected
      language of the collection. For example, in an EN → FR run, relevant documents
      (bibliographic records) could be any document in the BNF collection (referred
      to as the French collection) in whatever language they are written. The same
      is true for a monolingual FR → FR run - relevant documents from the BNF
      collection could actually also be in English or German, not just French.
          Ten of the thirteen participating groups attempted a cross-language task, the
      most popular choice being the British Library as the target collection. Six groups
      submitted experiments for all six possible official cross-language combinations.
      In addition, we had runs submitted to the English target with queries in Greek,
      Chinese and Italian.

      3.2   Results.
      Monolingual Results
      Table 4 shows the top five groups for each target collection, ordered by mean
      average precision. The table reports: the short name of the participating group;
                   Table 4. Best entries for the monolingual TEL tasks.

 Track    Rank  Participant   Experiment DOI                                                         MAP
 English  1st   inesc         10.2415/AH-TEL-MONO-EN-CLEF2009.INESC.RUN11                            40.84%
          2nd   chemnitz      10.2415/AH-TEL-MONO-EN-CLEF2009.CHEMNITZ.CUT_11_MONO_MERGED_EN_9_10    40.71%
          3rd   trinity       10.2415/AH-TEL-MONO-EN-CLEF2009.TRINITY.TCDENRUN2                      40.35%
          4th   hit           10.2415/AH-TEL-MONO-EN-CLEF2009.HIT.MTDD10T40                          39.36%
          5th   trinity-dcu   10.2415/AH-TEL-MONO-EN-CLEF2009.TRINITY-DCU.TCDDCUEN3                  36.96%
          Difference                                                                                 10.50%
 French   1st   karlsruhe     10.2415/AH-TEL-MONO-FR-CLEF2009.KARLSRUHE.INDEXBL                      27.20%
          2nd   chemnitz      10.2415/AH-TEL-MONO-FR-CLEF2009.CHEMNITZ.CUT_19_MONO_MERGED_FR_17_18   25.83%
          3rd   inesc         10.2415/AH-TEL-MONO-FR-CLEF2009.INESC.RUN12                            25.11%
          4th   opentext      10.2415/AH-TEL-MONO-FR-CLEF2009.OPENTEXT.OTFR09TDE                     24.12%
          5th   celi          10.2415/AH-TEL-MONO-FR-CLEF2009.CELI.CACAO_FRBNF_ML                    23.61%
          Difference                                                                                 15.20%
 German   1st   opentext      10.2415/AH-TEL-MONO-DE-CLEF2009.OPENTEXT.OTDE09TDE                     28.68%
          2nd   chemnitz      10.2415/AH-TEL-MONO-DE-CLEF2009.CHEMNITZ.CUT_3_MONO_MERGED_DE_1_2      27.89%
          3rd   inesc         10.2415/AH-TEL-MONO-DE-CLEF2009.INESC.RUN12                            27.85%
          4th   trinity-dcu   10.2415/AH-TEL-MONO-DE-CLEF2009.TRINITY-DCU.TCDDCUDE3                  26.86%
          5th   trinity       10.2415/AH-TEL-MONO-DE-CLEF2009.TRINITY.TCDDERUN1                      25.77%
          Difference                                                                                 11.30%



the mean average precision achieved by the experiment; the DOI of the experi-
ment; and the performance difference between the first and the last participant.
Figures 4, 6, and 8 compare the performances of the top participants of the TEL
Monolingual tasks.

Bilingual Results
Table 5 shows the top five groups for each target collection, ordered by mean
average precision. The table reports: the short name of the participating group;
the mean average precision achieved by the experiment; the DOI of the experi-
ment; and the performance difference between the first and the last participant.
Figures 5, 7, and 9 compare the performances of the top participants of the TEL
Bilingual tasks.
    For bilingual retrieval evaluation, a common method is to compare results
against monolingual baselines. We have the following results for CLEF 2009:
 – X → EN: 99.07% of best monolingual English IR system;
 – X → FR: 94.00% of best monolingual French IR system;
 – X → DE: 90.06% of best monolingual German IR system.
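    These percentages are simply the ratio between the best bilingual MAP and the
best monolingual MAP for the same target collection; for example, for the English
target (cf. Tables 4 and 5):

  \[
  \frac{\mathrm{MAP}_{\text{best X} \rightarrow \text{EN}}}{\mathrm{MAP}_{\text{best EN} \rightarrow \text{EN}}}
  = \frac{40.46\%}{40.84\%} \approx 99.07\%.
  \]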
    These figures are very encouraging, especially when compared with the results
for last year for the same TEL tasks:

 – X → EN: 90.99% of best monolingual English IR system;
 – X → FR: 56.63% of best monolingual French IR system;
Fig. 4. Monolingual English (top 5 participants, standard recall levels vs. mean
interpolated precision): inesc [RUN11; MAP 40.84%; not pooled]; chemnitz
[CUT_11_MONO_MERGED_EN_9_10; MAP 40.71%; not pooled]; trinity [TCDENRUN2; MAP
40.35%; pooled]; hit [MTDD10T40; MAP 39.36%; pooled]; trinity-dcu [TCDDCUEN3;
MAP 36.96%; not pooled].

Fig. 5. Bilingual English (top 5 participants, standard recall levels vs. mean
interpolated precision): chemnitz [CUT_13_BILI_MERGED_DE2EN_9_10; MAP 40.46%;
pooled]; hit [XTDD10T40; MAP 35.27%; not pooled]; trinity [TCDDEENRUN3; MAP
35.05%; not pooled]; trinity-dcu [TCDDCUDEEN1; MAP 33.33%; not pooled];
karlsruhe [DE_INDEXBL; MAP 32.70%; not pooled].
Fig. 6. Monolingual French (top 5 participants, standard recall levels vs. mean
interpolated precision): karlsruhe [INDEXBL; MAP 27.20%; not pooled]; chemnitz
[CUT_19_MONO_MERGED_FR_17_18; MAP 25.83%; not pooled]; inesc [RUN12; MAP 25.11%;
not pooled]; opentext [OTFR09TDE; MAP 24.12%; not pooled]; celi [CACAO_FRBNF_ML;
MAP 23.61%; not pooled].

Fig. 7. Bilingual French (top 5 participants, standard recall levels vs. mean
interpolated precision): chemnitz [CUT_24_BILI_EN2FR_MERGED_LANG_SPEC_REF_CUT_17;
MAP 25.57%; not pooled]; karlsruhe [EN_INDEXBL; MAP 24.62%; not pooled]; cheshire
[BIENFRT2FB; MAP 16.77%; not pooled]; trinity [TCDDEFRRUN2; MAP 16.33%; not
pooled]; weimar [CLESA169283ENINFR; MAP 14.51%; pooled].
Fig. 8. Monolingual German (top 5 participants, standard recall levels vs. mean
interpolated precision): opentext [OTDE09TDE; MAP 28.68%; not pooled]; chemnitz
[CUT_3_MONO_MERGED_DE_1_2; MAP 27.89%; not pooled]; inesc [RUN12; MAP 27.85%;
not pooled]; trinity-dcu [TCDDCUDE3; MAP 26.86%; not pooled]; trinity [TCDDERUN1;
MAP 25.77%; not pooled].

Fig. 9. Bilingual German (top 5 participants, standard recall levels vs. mean
interpolated precision): chemnitz [CUT_5_BILI_MERGED_EN2DE_1_2; MAP 25.83%;
pooled]; trinity [TCDENDERUN3; MAP 19.35%; not pooled]; karlsruhe [EN_INDEXBL;
MAP 16.46%; not pooled]; weimar [COMBINEDFRINDE; MAP 15.75%; not pooled];
cheshire [BIENDET2FBX; MAP 11.50%; not pooled].
                    Table 5. Best entries for the bilingual TEL tasks.

 Track    Rank  Participant   Experiment DOI                                                                           MAP
 English  1st   chemnitz      10.2415/AH-TEL-BILI-X2EN-CLEF2009.CHEMNITZ.CUT_13_BILI_MERGED_DE2EN_9_10                 40.46%
          2nd   hit           10.2415/AH-TEL-BILI-X2EN-CLEF2009.HIT.XTDD10T40                                          35.27%
          3rd   trinity       10.2415/AH-TEL-BILI-X2EN-CLEF2009.TRINITY.TCDDEENRUN3                                    35.05%
          4th   trinity-dcu   10.2415/AH-TEL-BILI-X2EN-CLEF2009.TRINITY-DCU.TCDDCUDEEN1                                33.33%
          5th   karlsruhe     10.2415/AH-TEL-BILI-X2EN-CLEF2009.KARLSRUHE.DE_INDEXBL                                   32.70%
          Difference                                                                                                   23.73%
 French   1st   chemnitz      10.2415/AH-TEL-BILI-X2FR-CLEF2009.CHEMNITZ.CUT_24_BILI_EN2FR_MERGED_LANG_SPEC_REF_CUT_17 25.57%
          2nd   karlsruhe     10.2415/AH-TEL-BILI-X2FR-CLEF2009.KARLSRUHE.EN_INDEXBL                                   24.62%
          3rd   cheshire      10.2415/AH-TEL-BILI-X2FR-CLEF2009.CHESHIRE.BIENFRT2FB                                    16.77%
          4th   trinity       10.2415/AH-TEL-BILI-X2FR-CLEF2009.TRINITY.TCDDEFRRUN2                                    16.33%
          5th   weimar        10.2415/AH-TEL-BILI-X2FR-CLEF2009.WEIMAR.CLESA169283ENINFR                               14.51%
          Difference                                                                                                   69.67%
 German   1st   chemnitz      10.2415/AH-TEL-BILI-X2DE-CLEF2009.CHEMNITZ.CUT_5_BILI_MERGED_EN2DE_1_2                   25.83%
          2nd   trinity       10.2415/AH-TEL-BILI-X2DE-CLEF2009.TRINITY.TCDENDERUN3                                    19.35%
          3rd   karlsruhe     10.2415/AH-TEL-BILI-X2DE-CLEF2009.KARLSRUHE.EN_INDEXBL                                   16.46%
          4th   weimar        10.2415/AH-TEL-BILI-X2DE-CLEF2009.WEIMAR.COMBINEDFRINDE                                  15.75%
          5th   cheshire      10.2415/AH-TEL-BILI-X2DE-CLEF2009.CHESHIRE.BIENDET2FBX                                   11.50%
          Difference                                                                                                  124.60%



     – X → DE: 53.15% of best monolingual German IR system.
       In particular, it can be seen that there is a considerable improvement in
   performance for French and German. This will be commented on in the following
   section.
       The monolingual performance figures for all three tasks are quite similar to
   those of last year but, as these figures are not absolute values and depend on the
   topic set and pool, no real conclusion can be drawn from this.

   3.3     Approaches
   As stated in the Introduction, the TEL task this year was a repetition of the task set
   last year. A main reason for this was to create a good reusable test collection with
   a sufficient number of topics; another reason was to see whether the experience
   gained and reported in the literature last year, and the opportunity to use last
   year’s test collection as training data, would lead to differences in approaches
   and/or improvements in performance this year. Although we have exactly the
   same number of participants this year as last year, only five of the thirteen 2009
   participants also participated in 2008. These are the groups tagged as Chemnitz,
   Cheshire, Karlsruhe, INESC-ID and Opentext. The last two of these groups only
   tackled monolingual tasks. These groups all tend to appear in the top five for
   the various tasks. In the following we attempt to examine briefly the approaches
   adopted this year, focusing mainly on the cross-language experiments.
       In the TEL task in CLEF 2008, we noted that all the traditional approaches
   to monolingual and cross language retrieval were attempted by the different
groups. Retrieval methods included language models, vector-space and proba-
bilistic approaches, and translation resources ranged from bilingual dictionaries,
parallel and comparable corpora to on-line MT systems and Wikipedia. Groups
often used a combination of more than one resource. What is immediately no-
ticeable in 2009 is that, although similarly to last year a number of different
retrieval models were tested, there is a far more uniform approach to the trans-
lation problem.
    Five of the ten groups that attempted cross-language tasks used the Google
Translate functionality, while a sixth used the LEC Power Translator [13]. An-
other group also used an MT system combining it with concept-based techniques
but did not disclose the name of the MT system used [16]. The remaining three
groups used a bilingual term list [17], a combination of resources including on-line
and in house developed dictionaries [19], and Wikipedia translation links [18]. It
is important to note that four out of the five groups in the bilingual to English
and bilingual to French tasks and three out of five for the bilingual to German
task used Google Translate, either on its own or in combination with another
technique. One group noted that topic translation using a statistical MT sys-
tem resulted in about 70% of the mean average precision (MAP) achieved when
using Google Translate [20]. Another group [11] found that the results obtained
by simply translating the query into all the target languages via Google gave
results that were comparable to a far more complex strategy known as Cross-
Language Explicit Semantic Analysis, CL-ESA, where the library catalog records
and the queries are represented in a multilingual concept space that is spanned
by aligned Wikipedia articles. As this year’s results were significantly better
than last year’s, can we take this as meaning that Google is going to solve the
cross-language translation resource quandary?
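    To give an idea of what the CL-ESA representation involves, the sketch below
maps a text to a vector of its similarities against a fixed set of Wikipedia
"concept" articles; because the concept dimensions are aligned across languages
through the interlanguage links, a query and a catalog record written in different
languages become directly comparable. This is a simplified illustration of the
general technique, not the system evaluated in [11]; the tf-idf weighting and
cosine comparison used here are assumptions.

  # Simplified CL-ESA sketch (illustrative only). Assumption: wiki_en[i] and
  # wiki_fr[i] are the texts of the i-th pair of interlanguage-linked articles.
  import numpy as np
  from sklearn.feature_extraction.text import TfidfVectorizer

  def build_concept_space(articles):
      """One language's concept space: a tf-idf matrix of shape (concepts, terms)."""
      vectorizer = TfidfVectorizer(sublinear_tf=True)
      return vectorizer, vectorizer.fit_transform(articles)

  def esa_vector(text, vectorizer, concept_matrix):
      """Represent a text by its similarity to every concept (Wikipedia article)."""
      t = vectorizer.transform([text])                  # shape (1, terms)
      return (concept_matrix @ t.T).toarray().ravel()   # one value per concept

  def cl_esa_score(query, document, query_space, document_space):
      """Cosine similarity of the two texts in the shared, aligned concept space."""
      q = esa_vector(query, *query_space)
      d = esa_vector(document, *document_space)
      denom = np.linalg.norm(q) * np.linalg.norm(d)
      return float(q @ d / denom) if denom else 0.0

  # en_space = build_concept_space(wiki_en); fr_space = build_concept_space(wiki_fr)
  # cl_esa_score("deep sea creatures", "créatures des fonds océaniques",
  #              en_space, fr_space)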
   Taking a closer look at three groups that did consistently well in the cross-
language tasks we find the following. The group that had the top result for
each of the three tasks was Chemnitz [15]. They also had consistently good
monolingual results. Not surprisingly, they appear to have a very strong IR
engine, which uses various retrieval models and combines the results. They used
Snowball stemmers for English and French and an n-gram stemmer for German.
They were one of the few groups that tried to address the multilinguality of the
target collections. They used the Google service to translate the topic from the
source language to the four most common languages in the target collections,
queried the four indexes and combined the results in a multilingual result set.
They found that their approach combining multiple indexed collections worked
quite well for French and German but was disappointing for English.
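    A minimal sketch of the merging step described above is given below: each
per-language index contributes its own ranked list and the lists are fused into a
single multilingual result set. Min-max score normalisation followed by CombSUM
is an assumption made here for illustration and is not necessarily the combination
method the Chemnitz group actually used.

  # Fuse the ranked lists returned by several language-specific indexes.
  from collections import defaultdict

  def merge_rankings(per_language_results, top_k=1000):
      """per_language_results: {language: [(doc_id, score), ...]}."""
      fused = defaultdict(float)
      for language, results in per_language_results.items():
          if not results:
              continue
          scores = [s for _, s in results]
          lo, hi = min(scores), max(scores)
          for doc_id, score in results:
              norm = (score - lo) / (hi - lo) if hi > lo else 1.0
              fused[doc_id] += norm          # CombSUM over the normalised scores
      return sorted(fused.items(), key=lambda item: -item[1])[:top_k]

  # A record retrieved by both the English and the German index is promoted:
  merged = merge_rankings({
      "en": [("rec_1", 12.3), ("rec_2", 9.1)],
      "de": [("rec_1", 7.4), ("rec_3", 6.8)],
  })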
    Another group with good performance, Karlsruhe [16], also attempted to
tackle the multilinguality of the collections. Their approach was again based on
multiple indexes for different languages with rank aggregation to combine the
different partial results. They ran language detectors on the collections to iden-
tify the different languages contained and translated the topics to the languages
recognized. They used Snowball stemmers to stem terms in ten main languages;
fields in other languages were not preprocessed. Disappointingly, a baseline con-
sisting of a single index without language classification and a topic translated
only to the index language achieved similar or even better results. For the trans-
lation step, they combined MT with a concept-based retrieval strategy based on
Explicit Semantic Analysis and using the Wikipedia database in English, French
and German as concept space.
    A third group that had quite good cross-language results for all three col-
lections was Trinity [12]. However, their monolingual results were not so strong.
They used a language modelling retrieval paradigm together with a document
re-ranking method which they tried experimentally in the cross-language con-
text. Significantly, they also used Google Translate. Judging from the fact that
they did not do so well in the monolingual tasks, this seems to be the probable
secret of their success in the cross-language tasks.


4     Persian@CLEF

This activity was again coordinated in collaboration with the Data Base Research
Group (DBRG) of Tehran University.


4.1   Tasks

The activity was organised as a typical ad hoc text retrieval task on newspa-
per collections. Two tasks were offered: monolingual retrieval and cross-language
retrieval (English queries to a Persian target); 50 topics were prepared (see
Section 2.2). For each topic, participants had to find relevant documents in the
collection and submit the results in a ranked list.
    Table 3 provides a breakdown of the number of participants and submitted
runs by task and topic language.


4.2   Results

Table 6 shows the top groups for the two Persian tasks, ordered by mean av-
erage precision. The table reports: the short name of the participating group; the
mean average precision achieved by the experiment; the DOI of the experiment;
and the performance difference between the first and the last participant.
   Figures 10 and 11 compare the performances of the top participants of the
Persian tasks.
   For bilingual retrieval evaluation, a common method is to compare results
against monolingual baselines. We have the following results for CLEF 2009:

 – X → FA: 5.50% of best monolingual Farsi IR system.

    This appears to be a very clear indication that something went wrong with
the bilingual system that has been developed. These results should probably be
discounted.
Fig. 10. Monolingual Persian (top participants, standard recall levels vs. mean
interpolated precision): jhu-apl [JHUFASK41R400TD; MAP 49.38%; pooled]; unine
[UNINEPE4; MAP 49.37%; pooled]; opentext [OTFA09TDE; MAP 39.53%; pooled];
qazviniau [IAUPERFA3; MAP 37.62%; pooled].

Fig. 11. Bilingual Persian (standard recall levels vs. mean interpolated
precision): qazviniau [IAUPEREN3; MAP 2.72%; pooled].
                        Table 6. Best entries for the Persian tasks.

 Track        Rank  Participant  Experiment DOI                                                  MAP
 Monolingual  1st   jhu-apl      10.2415/AH-PERSIAN-MONO-FA-CLEF2009.JHU-APL.JHUFASK41R400TD     49.38%
              2nd   unine        10.2415/AH-PERSIAN-MONO-FA-CLEF2009.UNINE.UNINEPE4              49.37%
              3rd   opentext     10.2415/AH-PERSIAN-MONO-FA-CLEF2009.OPENTEXT.OTFA09TDE          39.53%
              4th   qazviniau    10.2415/AH-PERSIAN-MONO-FA-CLEF2009.QAZVINIAU.IAUPERFA3         37.62%
              Difference                                                                         31.25%
 Bilingual    1st   qazviniau    10.2415/AH-PERSIAN-BILI-X2FA-CLEF2009.QAZVINIAU.IAUPEREN3        2.72%
              Difference                                                                              –
 (Only four groups submitted monolingual runs and one group submitted bilingual runs.)



4.3    Approaches

We were very disappointed that, although 14 groups registered for the Persian
task this year, only four actually submitted results, and only one of these groups
was from Iran. We suspect that one of the reasons for this was
that the date for submission of results was not very convenient for the Iranian
groups. Furthermore, only one group [18] attempted the bilingual task with the
very poor results cited above. The technique they used was the same as that
adopted for their bilingual to English experiments, exploiting Wikipedia trans-
lation links, and the reason they give for the very poor performance here is that
the coverage of Farsi in Wikipedia is still very scarce compared to that of many
other languages.
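    As a toy illustration of query translation via Wikipedia translation
(interlanguage) links, the sketch below maps English Wikipedia article titles found
in a query to the titles of the linked Farsi articles. The two-entry link table and
the greedy longest-match heuristic are purely illustrative assumptions; the actual
system in [18] mined the full Wikipedia link structure.

  # Toy query "wikification" and translation via interlanguage links.
  LANGLINKS_EN_FA = {            # English article title -> linked Farsi article title
      "Persian Gulf": "خلیج فارس",
      "Pollution": "آلودگی",
  }

  def wikify_and_translate(query_en):
      """Greedily match the longest English titles in the query and replace them
      with the titles of the linked Farsi articles; unmatched terms are dropped."""
      translated, words = [], query_en.split()
      i = 0
      while i < len(words):
          for j in range(len(words), i, -1):           # longest match first
              phrase = " ".join(words[i:j]).strip(",.").title()
              if phrase in LANGLINKS_EN_FA:
                  translated.append(LANGLINKS_EN_FA[phrase])
                  i = j
                  break
          else:
              i += 1                                    # no matching concept
      return " ".join(translated)

  print(wikify_and_translate("Pollution in the Persian Gulf"))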
    In the monolingual Persian task, the top two groups had very similar per-
formance figures. The UniNE group [21] found that they obtained their best results
using a light suffix-stripping algorithm and by combining different indexing and
searching strategies. Inter-
estingly, their results this year do not confirm their findings for the same task
last year when the use of stemming did not prove very effective. The other
group [14] tested variants of character n-gram tokenization; 4-grams, 5-grams,
and skipgrams all provided about a 10% relative gain over plain words.
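    The following small sketch shows the kind of character n-gram and skipgram
tokenisation referred to above; the word-boundary padding and the particular
skipgram definition used here (an (n+1)-gram with one interior character removed)
are assumptions for illustration, not a reproduction of the JHU-APL tokeniser.

  # Character n-gram and skipgram tokenisation of a single word.
  def char_ngrams(word, n):
      """Overlapping character n-grams, e.g. the 4-grams or 5-grams tested in [14]."""
      padded = f"_{word}_"                     # mark the word boundaries
      return [padded[i:i + n] for i in range(len(padded) - n + 1)]

  def skipgrams(word, n=4):
      """n-grams obtained by deleting one interior character from each (n+1)-gram."""
      grams = set()
      for g in char_ngrams(word, n + 1):
          for skip in range(1, n):             # position of the deleted character
              grams.add(g[:skip] + g[skip + 1:])
      return sorted(grams)

  print(char_ngrams("کتابخانه", 4))            # "library" in Persian
  print(skipgrams("کتابخانه", 4))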


5     Conclusions

In CLEF 2009 we deliberately repeated the TEL and Persian tasks offered in 2008
in order to build up our test collections. Although we have not yet had sufficient
time to assess them in depth, we are reasonably happy with the results for the
TEL task: several groups worked on tackling the particular features of the TEL
collections with varying success; evidence has been acquired on the effectiveness
of a number of different IR strategies; and there is a very strong indication of
the effectiveness of the Google Translate functionality.
   On the other hand, the results for the Persian task were quite disappointing:
very few groups participated; the results obtained are either in contradiction to
those obtained previously and thus need further investigation [21] or tend to be
a very straightforward repetition and confirmation of last year’s results [14].


6   Acknowledgements

The TEL task was designed to provide useful input to The European Li-
brary (TEL); we express our gratitude in particular to Jill Cousins, Programme
Director, and Sjoerd Siebinga, Technical Developer of TEL. Vivien Petras, Hum-
boldt University, Germany, and Nicolas Moreau, Evaluation and Language Re-
sources Distribution Agency, France, were responsible for the creation of the
topics and the supervision of the relevance assessment work for the ONB and
BNF data respectively. We thank them for their valuable assistance.
    We should also like to acknowledge the enormous contribution to the co-
ordination of the Persian task made by the Data Base Research group of the
University of Tehran and in particular to Abolfazl AleAhmad and Hadi Amiri.
They were responsible for the preparation of the set of topics for the Hamshahri
collection in Farsi and English and for the subsequent relevance assessments.
    Last but not least, we warmly thank Giorgio Maria Di Nunzio for all the
contributions he made in carrying out the TEL and Persian tasks.


References
 1. M. Agosti, G. M. Di Nunzio, and N. Ferro. The Importance of Scientific Data
    Curation for Evaluation Campaigns. In C. Thanos and F. Borri, editors, DELOS
    Conference 2007 Working Notes, pages 185–193. ISTI-CNR, Gruppo ALI, Pisa,
    Italy, February 2007.
 2. O. Alonso and S. Mizzaro. Can we get rid of TREC assessors? Using Mechan-
    ical Turk for relevance assessment. In S. Geva, J. Kamps, C. Peters, T. Sakai,
    A. Trotman, and E. Voorhees, editors, Proc. SIGIR 2009 Workshop on The Future
    of IR Evaluation. http://staff.science.uva.nl/~ kamps/ireval/papers/paper_
    22.pdf, 2009.
 3. M. Braschler. CLEF 2002 – Overview of Results. In C. Peters, M. Braschler,
    J. Gonzalo, and M. Kluck, editors, Advances in Cross-Language Information Re-
    trieval: Third Workshop of the Cross–Language Evaluation Forum (CLEF 2002)
    Revised Papers, pages 9–27. Lecture Notes in Computer Science (LNCS) 2785,
    Springer, Heidelberg, Germany, 2003.
 4. M. Braschler and C. Peters. CLEF 2003 Methodology and Metrics. In C. Peters,
    M. Braschler, J. Gonzalo, and M. Kluck, editors, Comparative Evaluation of Multi-
    lingual Information Access Systems: Fourth Workshop of the Cross–Language Eval-
    uation Forum (CLEF 2003) Revised Selected Papers, pages 7–20. Lecture Notes in
    Computer Science (LNCS) 3237, Springer, Heidelberg, Germany, 2004.
 5. C. W. Cleverdon. The Cranfield Tests on Index Languages Devices. In
    K. Spärck Jones and P. Willett, editors, Readings in Information Retrieval, pages
    47–60. Morgan Kaufmann Publisher, Inc., San Francisco, CA, USA, 1997.
 6. G. M. Di Nunzio and N. Ferro. Appendix A: Results of the TEL@CLEF Task. In
    this volume.
 7. G. M. Di Nunzio and N. Ferro. Appendix B: Results of the Persian@CLEF Task.
    In this volume.
 8. M. Sanderson and H. Joho. Forming Test Collections with No System Pooling. In
    M. Sanderson, K. Järvelin, J. Allan, and P. Bruza, editors, Proc. 27th Annual In-
    ternational ACM SIGIR Conference on Research and Development in Information
    Retrieval (SIGIR 2004), pages 33–40. ACM Press, New York, USA, 2004.
 9. S. Tomlinson. German, French, English and Persian Retrieval Experiments at
    CLEF 2009. In this volume.
10. S. Tomlinson. Sampling Precision to Depth 10000 at CLEF 2008. In Systems for Mul-
    tilingual and Multimodal Information Access: 9th Workshop of the Cross-Language
    Evaluation Forum (CLEF 2008) Revised Selected Papers. Lecture Notes in Computer
    Science (LNCS) 5706, Springer, Heidelberg, Germany, 2009.
11. M. Anderka, N. Lipka, and B. Stein. Evaluating Cross-Language Explicit Semantic
    Analysis and Cross Querying at TEL@CLEF 2009. In this volume.
12. D. Zhou and V. Wade. Language Modeling and Document Re-Ranking: Trinity Ex-
    periments at TEL@CLEF-2009. In this volume.
13. R. R. Larson. Multilingual Query Expansion for CLEF Adhoc-TEL. In this volume.
14. P. McNamee. JHU Experiments in Monolingual Farsi Document Retrieval at
    CLEF 2009. In this volume.
15. J. Kuersten. Chemnitz at CLEF 2009 Ad-Hoc TEL Task: Combining Different
    Retrieval Models and Addressing the Multilinguality. In this volume.
16. P. Sorg, M. Braun, D. Nicolay, and P. Cimiano. Cross-lingual Information Retrieval
    based on Multiple Indexes. In this volume.
17. P. Katsiouli and T. Kalamboukis. An Evaluation of Greek-English Cross Language
    Retrieval within the CLEF Ad-Hoc Bilingual Task. In this volume.
18. A. H. Jadidinejad and F. Mahmoudi. Query Wikification: Mining Structured Queries
    From Unstructured Information Needs using Wikipedia-based Semantic Analysis.
    In this volume.
19. A. Bosca and L. Dini. CACAO Project at the TEL@CLEF 2009 Task. In this volume.
20. J. Leveling, D. Zhou, G. F. Jones, and V. Wade. TCD-DCU at TEL@CLEF 2009:
    Document Expansion, Query Translation and Language Modeling. In this volume.
21. L. Dolamic, C. Fautsch, and J. Savoy. UniNE at CLEF 2009: Persian Ad Hoc Re-
    trieval and IP. In this volume.