
Introduction to the Working Notes

Carol Peters, ISTI-CNR, Italy (carol.peters@isti.cnr.it)

2007

The objective of the Cross-Language Evaluation Forum (CLEF) is to promote research in the field of multilingual system development. This is done by organising annual evaluation campaigns that offer a series of tracks, each designed to test a different aspect of mono- and cross-language information retrieval (IR). The intention is to encourage experimentation with all kinds of multilingual information access, from the development of monolingual retrieval systems operating on many languages to the implementation of complete multilingual multimedia search services. This has been achieved by offering an increasingly complex and varied set of evaluation tasks over the years. The aim is not only to meet but also to anticipate the emerging needs of the R&D community, and to encourage the development of the next generation of multilingual IR systems.

These Working Notes contain descriptions of the experiments conducted within CLEF 2007, the eighth in a series of annual system evaluation campaigns. The results of the experiments will be presented and discussed at the CLEF 2007 Workshop, 19-21 September, Budapest, Hungary. The final papers, revised and extended as a result of the discussions at the Workshop, will appear together with a comparative analysis of the results in the CLEF 2007 Proceedings, to be published by Springer in the Lecture Notes in Computer Science series. Since CLEF 2005, the Working Notes have been published in electronic format only; they are distributed to participants at the Workshop on CD-ROM, together with the Book of Abstracts in printed form. All reports included in the Working Notes will also be inserted in the DELOS Digital Library, accessible at http://delos-dl.isti.cnr.it. Both the Working Notes and the Book of Abstracts are divided into eight sections, corresponding to the CLEF 2007 evaluation tracks, plus an additional section describing other evaluation initiatives that used CLEF data: MorphoChallenge 2007 and SemEval 2007.
In addition, appendices are included containing run statistics for the Ad Hoc, Domain-Specific, GeoCLEF and CL-SR tracks, plus a list of all participating groups showing the tracks in which they took part. The main features of the 2007 campaign are briefly outlined below in order to provide the necessary background to the experiments reported in the rest of the Working Notes.

Cross-Language Text Retrieval (Ad Hoc): This year, the track offered mono- and bilingual tasks on target collections for central European languages (Bulgarian, Czech and Hungarian; Czech was new this year). As last year, a bilingual task encouraging system testing with non-European languages against English documents was also offered, with topics available in Amharic, Chinese, Oromo and Indonesian. A special sub-task on Indian-language search against an English target collection was organised with the assistance of a number of Indian research institutes, which were responsible for preparing the topics. The languages offered were Hindi, Bengali, Tamil, Telugu and Marathi. In order to establish benchmarks for this sub-task, all participating groups had to submit:
- one monolingual English-to-English run (mandatory);
- at least one Hindi-to-English run (mandatory);
- runs from other Indian languages to English (optional).

A "robust" task was again offered, emphasising the importance of reaching a minimal level of performance on all topics rather than a high average performance. Robustness is a key issue for the transfer of CLEF research into applications. The 2007 robust task involved three languages often used in previous CLEF campaigns (English, French and Portuguese). The track was coordinated jointly by ISTI-CNR and U. Padua (Italy) and U. Hildesheim (Germany).

Cross-Language Scientific Data Retrieval (Domain-Specific): Mono- and cross-language domain-specific retrieval was studied in the domain of social sciences, using structured data (e.g. bibliographic data, keywords and abstracts) from scientific reference databases. The target collections provided were GIRT-4 for German/English, INION for Russian and Cambridge Sociological Abstracts for English. A multilingual controlled vocabulary (German, English, Russian) suitable for use with GIRT-4 and INION was provided, together with a bi-directional mapping between this vocabulary and the one used for indexing the Sociological Abstracts (English). Topics were offered in English, German and Russian. This track was coordinated by IZ Bonn (Germany).

Multilingual Question Answering (QA@CLEF): QA@CLEF 2007 proposed both main and pilot tasks. The main task scenario was topic-related QA, in which questions are grouped by topic and may contain anaphoric references to one another. Answers were retrieved from heterogeneous document collections, namely news articles and Wikipedia. Many sub-tasks were set up: monolingual, where the questions and the target collections searched for answers are in the same language, and bilingual, where the source and target languages differ. Bulgarian, Dutch, English, French, German, Italian, Portuguese, Romanian and Spanish were offered as target languages; the query languages used in the bilingual tasks depended on demand (see the track overview for details).
Following the positive response at QA@CLEF 2006, the Answer Validation Exercise (AVE) was offered again. A new pilot task was also introduced: Question Answering on Speech Transcripts (QAst), in which the answers to factual questions have to be extracted from transcriptions (manual and automatic) of spontaneous speech coming from different human interaction scenarios. The track was organized by several institutions (one for each source language) and jointly coordinated by CELCT, Trento (Italy), and by LSI-UNED, Madrid, and UPC, Barcelona (Spain).

Cross-Language Retrieval in Image Collections (ImageCLEF): This track evaluated the retrieval of images described by text captions in several languages; both text and image retrieval techniques could be exploited. Four challenging tasks were offered: (i) multilingual ad hoc retrieval (a collection with mixed English/German/Spanish annotations, and queries in additional languages); (ii) medical image retrieval (case notes in English/French/German; visual, mixed and semantic queries in the same languages); (iii) hierarchical automatic image annotation for medical images (fully categorized in English and German; a purely visual task); (iv) photographic annotation through the detection of objects in images (using the same collection as (i) with a restricted number of objects; a purely visual task). Image retrieval was not required for all tasks, and a default visual and textual retrieval system was made available to participants. The track coordinators were U. Sheffield (UK) and the University and University Hospitals of Geneva (Switzerland). Oregon Health and Science U. (USA), Victoria U., Melbourne (Australia), RWTH Aachen (Germany) and Vienna Univ. of Technology (Austria) collaborated in organizing the tasks.

Cross-Language Speech Retrieval (CL-SR): The focus is on searching spontaneous speech from oral history interviews rather than news broadcasts. The test collection created for the track is a subset of a large archive of videotaped oral histories from survivors, liberators, rescuers and witnesses of the Holocaust, created by the Survivors of the Shoah Visual History Foundation (VHF). Automatic Speech Recognition (ASR) transcripts and both automatically and manually assigned thesaurus terms were available as part of the collection. In 2006 the CL-SR track included search collections of conversational English and Czech speech, with topics in six languages (Czech, Dutch, English, French, German and Spanish). In CLEF 2007 additional topics were added for the Czech speech collection. Speech content is described by automatic speech transcriptions, manually and automatically assigned controlled vocabulary descriptors for concepts, dates and locations, manually assigned person names, and hand-written segment summaries. The track was coordinated by U. Maryland (USA), Dublin City U. (Ireland) and Charles U. (Czech Republic).

Multilingual Web Retrieval (WebCLEF): The WebCLEF 2007 task combined insights gained from the 2005 and 2006 editions of WebCLEF and from the WiQA 2006 pilot, and went beyond the navigational queries considered at WebCLEF 2005 and 2006. At WebCLEF 2007, so-called undirected informational search goals were considered in a web setting: “I want to learn anything/everything about my topic.” The track was coordinated by U. Amsterdam (The Netherlands).

Cross-Language Geographical Retrieval (GeoCLEF): The purpose of GeoCLEF is to test and evaluate cross-language geographic information retrieval (GIR), i.e. retrieval for topics with a geographic specification. GeoCLEF 2007 consisted of two subtasks: a search task, run for the third time, and a query classification task, organized for the first time. For the search task, twenty-five search topics were defined by the organizing groups for searching English, German, Portuguese and Spanish document collections. Topics were translated into English, German and Spanish. For the classification task, a query log from a search engine was provided, and groups had to identify the queries with a geographic scope and the geographic components within these local queries. The track was coordinated jointly by UC Berkeley (USA), U. Sheffield (UK), U. Hildesheim (Germany), Linguateca SINTEF (Norway) and Microsoft Asia (China).

Details of the technical infrastructure and the organisation of these tracks can be found in the track overview reports in this volume, located at the beginning of the relevant sections.

Test Collections

A number of different document collections were used in CLEF 2007 to build the test collections:
• The CLEF multilingual comparable corpus of more than 3 million news documents in 13 languages; new data was added this year for Czech, Bulgarian and English (see Table 1). Parts of this collection were used in the Ad Hoc, Question Answering and GeoCLEF tracks.
• The GIRT-4 social science database in English and German (over 300,000 documents) and two Russian databases: the Russian Social Science Corpus (approx. 95,000 documents) and the Russian ISISS collection for sociology and economics (approx. 150,000 documents). The RSSC corpus was not used this year.
• Cambridge Sociological Abstracts in English. These collections were used in the Domain-Specific track.
• The ImageCLEF track used collections for both general photographic and medical image retrieval:
  - the IAPR TC-12 photo database of 25,000 photographs with captions in English, German and Spanish;
  - the PASCAL VOC 2006 training data (new this year);
  - the ImageCLEFmed radiological database, consisting of 6 distinct datasets (2 more than last year);
  - the IRMA collection in English and German of 12,000 classified images for automatic medical image annotation.
• The Malach collection of spontaneous conversational speech derived from the Shoah archives in English (more than 750 hours) and Czech (approx. 500 hours). This collection was used in the speech retrieval track.
• EuroGOV, a multilingual collection of about 3.5M web pages containing documents in many languages crawled from European governmental sites, used in the WebCLEF track.

Technical Infrastructure

The CLEF technical infrastructure is managed by the DIRECT system. DIRECT manages the test data, result submission and analyses for the Ad Hoc, Question Answering and geographic IR tracks. It has been designed not only to facilitate data management tasks but also to support the production, maintenance, enrichment and interpretation of the scientific data for subsequent in-depth evaluation studies.

The technical infrastructure is thus responsible for:
• the track set-up, harvesting of documents, and management of the registration of participants to tracks;
• the submission of experiments, collection of metadata about experiments, and their validation;
• the creation of document pools and the management of relevance assessment;
• the provision of common statistical analysis tools for both organizers and participants, in order to allow the comparison of experiments;
• the provision of common tools for summarizing, producing reports and graphs on the measured performances and conducted analyses.

DIRECT is designed and implemented by Giorgio Di Nunzio and Nicola Ferro.

4 The number of tokens extracted from each document can vary slightly across systems, depending on the respective definition of what constitutes a token. Consequently, the number of tokens and features given in this table are approximations and may differ from actual implemented systems.

[Figure 1: Participating groups per year, CLEF 2000-2007, broken down by region (Europe, North America, Asia, South America, Oceania).]

[Figure 2: CLEF 2000-2007 tracks: participation per track over the years (AdHoc, DomSpec, iCLEF, CL-SR, QA@CLEF, ImageCLEF, WebCLEF, GeoCLEF).]

Participation

A total of 81 groups submitted runs in CLEF 2007, slightly down from the 90 groups of CLEF 2006 (last year's figures are given in brackets): 51 (59.5) from Europe, 14 (14.5) from North America, 14 (10) from Asia, 1 (4) from South America and 1 (1) from Australia. The breakdown of participating groups per track is as follows: Ad Hoc 22 (25); Domain-Specific 5 (4); QA@CLEF 28 (37); ImageCLEF 35 (25); CL-SR 8 (6); WebCLEF 4 (8); GeoCLEF 13 (17). A list of groups, indicating the tracks in which they participated, is given in the Appendix to these Working Notes. Figure 1 shows the variation in participation over the years and Figure 2 shows the shift in focus as new tracks have been added.

In particular, these figures show that while interest in the ImageCLEF track is constantly increasing, the question answering and web tracks are steadily declining in popularity. Although the fluctuation in QA does not seem to be of great significance (this is a very difficult task), the apparent lack of interest in WebCLEF is surprising: given the importance of the Internet and web search engines, greater participation in this task might have been expected. The large numbers for ImageCLEF also give rise to some discussion. The defining feature of CLEF is its multilinguality, yet ImageCLEF is perhaps the least multilingual of the CLEF tracks, as much of the work is done in a language-independent context. These questions will be the subject of debate at the Workshop. At the same time, it should be noted that these Working Notes also include reports from two separate evaluation initiatives that used CLEF data for certain tasks; the impact of CLEF thus extends well beyond the boundaries of the CLEF evaluation campaigns.

Workshop

CLEF aims at creating a strong CLIR/MLIR research and development community. The Workshop plays an important role by giving all the groups that have participated in the evaluation campaign the opportunity to get together to compare approaches and exchange ideas. The work of the groups participating in this year's campaign will be presented in plenary paper and poster sessions. There will also be break-out sessions for more in-depth discussion of the results of individual tracks and of plans for the future. The final sessions will include discussions of ideas for new tracks in future campaigns. Overall, the Workshop should provide a broad overview of the current state of the art and the latest research directions in the multilingual information retrieval area. I very much hope that it will prove an interesting, worthwhile and enjoyable experience for all those who participate.

The final programme and the presentations at the Workshop will be posted on the CLEF website at http://www.clef-campaign.org.

Acknowledgements

It would be impossible to run the CLEF evaluation initiative and organize the annual workshops without considerable assistance from many groups. CLEF is organized on a distributed basis, with different research groups being responsible for running the various tracks. My gratitude goes to all those who have been involved in the coordination of the 2007 campaign. A list of the main institutions involved is given on the following page. Below, I would like to thank the people mainly responsible for the coordination of the different tracks:
• Giorgio Di Nunzio, Nicola Ferro and Thomas Mandl for the Ad Hoc track
• Vivien Petras, Stefan Baerisch and Maximilian Stempfhuber for the Domain-Specific track
• Bernardo Magnini, Danilo Giampiccolo, Pamela Forner, Anselmo Peñas, Christelle Ayache, Corina Forăscu, Valentin Jijkoun, Petya Osenova, Paulo Rocha, Bogdan Sacaleanu and Richard Sutcliffe for QA@CLEF
• Allan Hanbury, Paul Clough, Henning Müller, Thomas Deselaers, Michael Grubinger, Jayashree Kalpathy-Cramer and William Hersh for ImageCLEF
• Douglas W. Oard, Gareth J. F. Jones and Pavel Pecina for CL-SR
• Valentin Jijkoun and Maarten de Rijke for WebCLEF
• Thomas Mandl, Fredric Gey, Giorgio Di Nunzio, Nicola Ferro, Ray Larson, Mark Sanderson, Diana Santos, Christa Womser-Hacker and Xing Xie for GeoCLEF

I also thank all those colleagues who have helped us by preparing topic sets in different languages, and in particular the NLP Lab, Dept. of Computer Science and Information Engineering, National Taiwan University, for their work on Chinese.

I should also like to thank the members of the CLEF Steering Committee who have assisted me with their advice and suggestions throughout this campaign.

Furthermore, I gratefully acknowledge the support of all the data providers and copyright holders, and in particular:
• The Los Angeles Times, for the American-English data collection
• SMG Newspapers (The Herald), for the British-English data collection
• Le Monde S.A. and ELDA: Evaluations and Language resources Distribution Agency, for the French data
• Frankfurter Rundschau, Druck und Verlagshaus Frankfurt am Main; Der Spiegel, Spiegel Verlag, Hamburg, for the German newspaper collections
• InformationsZentrum Sozialwissenschaften, Bonn, for the GIRT database
• SocioNet system, for the Russian Social Science Corpora
• Hypersystems Srl, Torino, and La Stampa, for the Italian data
• Agencia EFE S.A., for the Spanish data
• NRC Handelsblad, Algemeen Dagblad and PCM Landelijke dagbladen/Het Parool, for the Dutch newspaper data
• Aamulehti Oyj and Sanoma Osakeyhtiö, for the Finnish newspaper data
• Russika-Izvestia, for the Russian newspaper data
• Público, Portugal, and Linguateca, for the Portuguese (PT) newspaper collection
• Folha, Brazil, and Linguateca, for the Portuguese (BR) newspaper collection
• Tidningarnas Telegrambyrå (TT), SE-105 12 Stockholm, Sweden, for the Swedish newspaper data
• Schweizerische Depeschenagentur, Switzerland, for the French, German and Italian Swiss news agency data
• Ringier Kiadoi Rt. [Ringier Publishing Inc.] and the Research Institute for Linguistics, Hungarian Acad. Sci., for the Hungarian newspaper documents
• Sega AD, Sofia; Standart Nyuz AD, Novinar OD Sofia; and the BulTreeBank Project, Linguistic Modelling Laboratory, IPP, Bulgarian Acad. Sci., for the Bulgarian newspaper documents
• Mafra a.s. and Lidové Noviny a.s., for the Czech newspaper data
• St Andrews University Library, for the historic photographic archive
• University and University Hospitals, Geneva, Switzerland, and Oregon Health and Science University, for the ImageCLEFmed Radiological Medical Database
• The Radiology Dept. of the University Hospitals of Geneva, for the Casimage database; PEIR (Pathology Education Image Resource), for the images; and HEAL (Health Education Assets Library), for the annotation of the PEIR dataset
• Aachen University of Technology (RWTH), Germany, for the IRMA database of annotated medical images
• Mallinckrodt Institute of Radiology, for permission to use their nuclear medicine teaching file
• University of Basel's Pathopic project, for their pathology teaching file
• Michael Grubinger, administrator of the IAPR Image Benchmark, Clement Leung, who initiated and supervised the IAPR Image Benchmark Project, and André Kiwitz, Managing Director of Viventura, for granting access to the image database and the raw image annotations of the tour guides
• The Survivors of the Shoah Visual History Foundation and IBM, for the Malach spoken document collection

Without their contribution, this evaluation activity would be impossible.

Last but not least, I should like to express my gratitude to Alessandro Nardi and Valeria Quochi for their assistance in the organisation of the CLEF 2007 Workshop.

CLEF Steering Committee
Maristella Agosti, University of Padova, Italy
Martin Braschler, Zurich University of Applied Sciences Winterthur, Switzerland
Amedeo Cappelli, ISTI-CNR & CELCT, Italy
Hsin-Hsi Chen, National Taiwan University, Taipei, Taiwan
Khalid Choukri, Evaluations and Language resources Distribution Agency, Paris, France
Paul Clough, University of Sheffield, UK
Thomas Deselaers, RWTH Aachen University, Germany
David A. Evans, Clairvoyance Corporation, USA
Marcello Federico, ITC-irst, Trento, Italy
Christian Fluhr, CEA-LIST, Fontenay-aux-Roses, France
Norbert Fuhr, University of Duisburg, Germany
Fredric C. Gey, U.C. Berkeley, USA
Julio Gonzalo, LSI-UNED, Madrid, Spain
Donna Harman, National Institute of Standards and Technology, USA
Gareth Jones, Dublin City University, Ireland
Franciska de Jong, University of Twente, Netherlands
Noriko Kando, National Institute of Informatics, Tokyo, Japan
Jussi Karlgren, Swedish Institute of Computer Science, Sweden
Michael Kluck, German Institute for International and Security Affairs, Berlin, Germany
Natalia Loukachevitch, Moscow State University, Russia
Bernardo Magnini, ITC-irst, Trento, Italy
Paul McNamee, Johns Hopkins University, USA
Henning Müller, University & University Hospitals of Geneva, Switzerland
Douglas W. Oard, University of Maryland, USA
Maarten de Rijke, University of Amsterdam, Netherlands
Diana Santos, Linguateca, Sintef, Oslo, Norway
Jacques Savoy, University of Neuchatel, Switzerland
Peter Schäuble, Eurospider Information Technologies, Switzerland
Richard Sutcliffe, University of Limerick, Ireland
Max Stempfhuber, Informationszentrum Sozialwissenschaften Bonn, Germany
Hans Uszkoreit, German Research Center for Artificial Intelligence (DFKI), Germany
Felisa Verdejo, LSI-UNED, Madrid, Spain
José Luis Vicedo, University of Alicante, Spain
Ellen Voorhees, National Institute of Standards and Technology, USA
Christa Womser-Hacker, University of Hildesheim, Germany