Cross-language Retrieval at Twente and TNO

Dennis Reidsma1, Djoerd Hiemstra1, Franciska de Jong1,2, Wessel Kraaij2
1 University of Twente, Dept. of Computer Science, P.O. Box 217, 7500 AE Enschede, The Netherlands
2 TNO TPD, P.O. Box 155, 2600 AD Delft, The Netherlands
{reidsma,hiemstra,fdejong}@cs.utwente.nl, kraaij@tpd.tno.nl

Abstract

This paper describes the official runs of the Twenty-One group for CLEF-2002. The Twenty-One group participated in the Dutch and Finnish monolingual tasks and the Dutch bilingual task. The paper also reports on an experiment that was carried out during the assessment work, designed to examine possible influences on the assessments caused by the use of highlighting in the assessment program.

1 Introduction

This paper describes the CLEF participation of the Twenty-One group.1 Section 2 provides the context in which research on multilingual information retrieval is carried out at TNO TPD and the University of Twente. Section 3 discusses the Dutch and Finnish runs that the Twenty-One group submitted to CLEF 2002: first the retrieval model is described (section 3.1), after which our submissions are presented. Section 4 describes an experiment that was carried out on some aspects of the assessment protocol and discusses its results.

1 Twenty-One was an information retrieval project funded under the TAP programme of the EU. Though the Twenty-One project was completed in June 1999, TNO TPD and the University of Twente still participate in the CLEF events under that name.

2 CLIR as an aspect of multimedia retrieval

The work on cross-language information retrieval (CLIR) that has been carried out by a joint research group from TNO and the University of Twente since 1997 (TREC-6) has been part of a larger research area that can be described as content-based multimedia retrieval. CLIR is just one of the themes in a series of collaborative projects on multimedia retrieval, of which Twenty-One provided the name for the search engine that has been developed and used for the participation in TREC and, later on, CLEF. Though the focus on CLIR aspects is not as strong in all projects as it used to be in Twenty-One, the possibility to search digital multimedia archives with queries in different languages and to identify relevant material in languages other than the query language has always been part of the envisaged functionality. Where the early projects mainly exploited the textual material available in multimedia archives (production scripts, cut lists, etc.), the use of time-coded textual information (subtitles, transcripts generated by automatic speech recognition tools, etc.) has become more dominant in the currently running projects, for which video and audio retrieval are the major goals, e.g. DRUID and the IST projects ECHO and MUMIS.2

2 For details, cf. http://parlevink.cs.utwente.nl/projects, http://www.tpd.tno.nl/, and [5].

In some projects the CLIR functionality is made available by allowing the users of the demonstrator systems to select query terms from a closed list which is tuned to the domain of the media archive to be searched. Translation to other languages is then simply a matter of mapping these query terms to their translation equivalents. Ambiguity resolution and other problems inherent to CLIR tasks are circumvented in this concept-search-like approach. However, there is always the additional user requirement to be able to search for terms that are not in the controlled list.
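As an illustration of this lookup-based "concept search" translation and of its main limitation, the following minimal sketch (our own, in Python; the term list and translations are invented and not taken from any of the projects mentioned) maps controlled query terms to their translation equivalents and reports terms that fall outside the closed list:

    # Minimal sketch of concept-search style query translation: query terms come
    # from a closed, domain-tuned list, so translation is a lookup of fixed
    # equivalents. The entries below are invented for illustration.
    CONTROLLED_TERMS = {
        # Dutch term    -> translation equivalents per target language
        "kroning":      {"en": ["coronation"], "de": ["Krönung"]},
        "verkiezingen": {"en": ["elections"],  "de": ["Wahlen"]},
    }

    def translate_query(terms, target_lang):
        """Map controlled query terms to their equivalents; terms outside the
        closed list cannot be translated this way and are reported back."""
        translated, unknown = [], []
        for term in terms:
            entry = CONTROLLED_TERMS.get(term.lower())
            if entry and target_lang in entry:
                translated.extend(entry[target_lang])
            else:
                unknown.append(term)   # needs the full CLIR machinery instead
        return translated, unknown

    print(translate_query(["kroning", "paleis"], "en"))
    # (['coronation'], ['paleis'])  -- 'paleis' is not in the controlled list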
Therefore, even in ontology-driven projects such as MUMIS, the type of CLIR functionality that is central to the current CLEF campaign remains relevant in the multimedia domain.

3 Retrieval experiments on the Dutch and Finnish document set

The Twenty-One group participated in the Dutch and Finnish monolingual tasks and the Dutch bilingual task. In this section we present the retrieval model (section 3.1) and discuss the scores for the different tasks.

3.1 The retrieval model

Runs were carried out with an information retrieval system based on a simple unigram language model. The basic idea is that documents can be represented by simple statistical language models. If a query is more probable given a language model based on document d1 than given, e.g., a language model based on document d2, then we hypothesise that document d1 is more likely to be relevant to the query than document d2. Thus the probability of generating a certain query given a document-based language model can serve as a score to rank the documents.

P(T_1, T_2, \ldots, T_n \mid D) \, P(D) = P(D) \prod_{i=1}^{n} \bigl( (1 - \lambda) P(T_i) + \lambda P(T_i \mid D) \bigr)    (1)

Formula 1 shows the basic idea of this approach to information retrieval, where the document-based language model P(T_i|D) is interpolated with a background language model P(T_i) to compensate for data sparseness. In the formula, T_i is a random variable for the query term on position i in the query (1 ≤ i ≤ n, where n is the query length), whose sample space is the set of all terms in the collection. The probability measure P(T_i) defines the probability of drawing a term at random from the collection, P(T_i|D) defines the probability of drawing a term at random from the document, and λ is the smoothing parameter, which is set to λ = 0.15. The marginal probability of relevance P(D) is assumed to be uniformly distributed over the documents, in which case it may be ignored in the above formula. For a description of the embedding of statistical word-by-word translation into our retrieval model, we refer to [1].

3.2 The Dutch runs

For Dutch, three separate runs were submitted. First there was the manual run, in which we had a special interest because of our role in the assessment of all the runs submitted for Dutch (cf. section 4). The expected effect of submitting a run for which the queries were manually created from the topics was to increase the size and quality of the pool of documents to be assessed. The engine applied was a slightly modified version of the NIST Z/Prise 2.0 system. The Dutch bilingual run is an automatic run done with the TNO retrieval system (also referred to as the Twenty-One engine) as developed and used for previous CLEF participations [1, 2]. Furthermore, we used the VLIS lexical database developed by Van Dale Lexicography and the morphological analyzers developed by Xerox Research Centre Grenoble. For completeness we did a post-evaluation automatic monolingual Dutch run. Mean average precision figures for these runs, and for the Finnish run discussed below, are given in Table 1.

run label   m.a.p.   description
tnoutn1     0.4471   manual monolingual
tnoen1      0.3369   EN-NL dictionary based
tnofifi1    0.4056   automatic monolingual (Finnish)
tnonn1      0.4247   automatic monolingual

Table 1: Mean average precision of the runs on the Dutch and Finnish datasets.
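To make the ranking model of section 3.1 concrete for runs such as the ones above, the following minimal sketch (our own illustration in Python, not code from the Twenty-One/TNO engine) ranks a toy collection by query log-likelihood according to formula 1, interpolating the collection model and the document model with λ = 0.15; the toy documents and all identifiers are invented for this example.

    import math
    from collections import Counter

    LAMBDA = 0.15  # smoothing parameter of formula (1)

    def score(query_terms, doc_tf, doc_len, coll_tf, coll_len):
        """Query log-likelihood under formula (1): for every query term,
        interpolate the collection model P(t) with the document model P(t|D).
        P(D) is uniform and therefore omitted from the ranking score."""
        s = 0.0
        for t in query_terms:
            p_coll = coll_tf.get(t, 0) / coll_len   # P(t): term drawn from the collection
            p_doc = doc_tf.get(t, 0) / doc_len      # P(t|D): term drawn from the document
            p = (1 - LAMBDA) * p_coll + LAMBDA * p_doc
            if p == 0.0:
                return float("-inf")                # term unseen in the whole collection
            s += math.log(p)
        return s

    # toy collection, for illustration only
    docs = {
        "d1": "de kroning van de nieuwe koning".split(),
        "d2": "verkiezingen voor het europees parlement".split(),
    }
    coll_tf = Counter(t for words in docs.values() for t in words)
    coll_len = sum(coll_tf.values())

    query = "europees parlement".split()
    ranking = sorted(
        docs,
        key=lambda d: score(query, Counter(docs[d]), len(docs[d]), coll_tf, coll_len),
        reverse=True,
    )
    print(ranking)   # ['d2', 'd1'] -- d2 contains both query terms

Taking logarithms does not change the ranking produced by formula 1; it merely avoids numerical underflow for longer queries.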
3.3 The Finnish run

Since we did not have a Finnish morphological analyzer or stemmer, we decided to apply an N-gram approach, which has been advocated as a language-independent, knowledge-poor approach by McNamee and Mayfield [3]. After applying a stoplist and lowercasing, documents and queries were indexed by character 5-grams. Unlike the JHU approach, the 5-grams did not span word boundaries. This extremely simple approach turned out to be very effective: for almost all topics the score of this run was at least as high as the median score.

4 Assessment of the Dutch results

The University of Twente was responsible for assessing the results for the Dutch newspaper collections (articles from the newspapers 'NRC Handelsblad' and 'Algemeen Dagblad'). Besides assessing all topics in the standard way for the official ranking of the submitted runs, we also repeated some assessments without allowing highlighting of search terms. This section discusses the motivation for this additional experiment and reports on the findings.

4.1 Introduction

The program used to do the assessments was developed at NIST and offers the possibility to highlight terms in the documents. Highlighting words and phrases for which a search engine has detected a relation to the query terms might make it easier for the assessor to decide on the relevance of a document. Usually the assessor is told explicitly that the presence or absence of highlighted terms in a document is not decisive in marking a document relevant. The assumption is that using or not using highlighting will not influence the assessment results, or more specifically the ranking of the search engines that follows from those results. We think, however, that this assumption can be questioned. The following subsection explains how highlighting can affect the assessments and thereby the ranking of search engines, and describes a simple experiment that we carried out to detect such differences. If the assessment process were indeed seriously influenced by the use of highlighting, the implications would be large: not only would the assessment protocol have to change, but the validity of the assessments of previous years would also have to be reconsidered.

4.2 Possible influences of highlighting on assessment results

We wanted to investigate two different aspects of the assessment results which might be affected by the use of highlighting. The first is the number of documents that are marked as relevant; the second is the scores of the participating search engines. Given the size of the test data, we did not expect to find hard statistical evidence for the presence or absence of either influence, but rather expected some trend to show up, which would warrant further investigation.

The number of relevant documents
Using highlighting might result in more (or fewer) documents being marked as relevant. Although the assessors are explicitly told not to let the highlighting affect their judgement, it is still possible that this happens unintentionally. For example, assessors might read the documents in which terms are highlighted less thoroughly, missing the relevant parts of those documents that do not contain highlighted terms. Or the assessors might simply be biased in favor of documents containing highlighted terms.

The scores of search engines
If the assessors are indeed biased towards documents containing highlighted terms, this might influence the scores of the search engines. After all, many search engines rely on detecting the presence of query words when deciding which documents to return. In that case, those engines would perform better with the biased assessments than with assessments produced without highlighting.

4.3 The experiment

The experiment was simple: 18 topics were each assessed at least twice, once with and once without highlighting. These assessments were assigned randomly over 10 people, in such a way that every assessor did some assessments with and some without highlighting, and no one assessed the same topic twice. The assessors were not allowed to talk to each other about these assessments until all assessments were finished.
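The paper does not spell out how the double assessments were distributed over the assessors, so the following sketch (in Python, with invented topic and assessor names) is only an illustration of one possible assignment that satisfies the constraints described above: every topic is judged once with and once without highlighting, the two judgements of a topic go to different assessors, and every assessor works in both conditions.

    import random

    def assign_assessments(topics, assessors, seed=0):
        """Distribute the double assessments: each topic gets a judgement with
        and a judgement without highlighting, assigned to two different
        assessors, and every assessor receives work in both conditions."""
        rng = random.Random(seed)
        order = assessors[:]
        rng.shuffle(order)
        n = len(order)
        offset = rng.randrange(1, n)   # non-zero shift: two different assessors per topic
        plan = []                      # (assessor, topic, highlighting on/off)
        for i, topic in enumerate(topics):
            plan.append((order[i % n], topic, True))
            plan.append((order[(i + offset) % n], topic, False))
        return plan

    topics = ["topic-%03d" % i for i in range(1, 19)]              # 18 topics (numbers invented)
    assessors = ["assessor-" + chr(ord("A") + i) for i in range(10)]
    plan = assign_assessments(topics, assessors)

    # sanity checks mirroring the constraints described above
    for t in topics:
        judges = {a for a, topic, _ in plan if topic == t}
        assert len(judges) == 2        # no one assesses the same topic twice
    for a in assessors:
        modes = {h for who, _, h in plan if who == a}
        assert modes == {True, False}  # everyone judges with and without highlighting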
4.4 The results

The results of this experiment were not conclusive. For half of the topics, the assessments with highlighting resulted in more relevant documents than the assessments without highlighting; for the other topics it was the other way around. Viewed from the perspective of the individual assessors, using highlighting also did not result in significantly more or fewer relevant documents relative to the other assessors working on the same topic.

4.5 Conclusion

No trend was discernible that confirmed our expectations. However, we could only test the first aspect described above; we did not have the necessary data to test the effect of highlighting on the scores of the search engines. This second aspect, however, is where we expected the most interesting results, and we therefore recommend testing it as well. If the amount of data is too small to get reliable results, more data should be collected. If the results show a significant change in the scores of the search engines when highlighting is turned off, the assessment protocol should be reconsidered. It is possible that the benefits of highlighting do not outweigh the adverse effects on the quality of the assessments, in which case highlighting should no longer be used.

References

[1] D. Hiemstra, W. Kraaij, R. Pohlmann and T. Westerveld. Translation resources, merging strategies and relevance feedback for cross-language information retrieval. In Cross-Language Information Retrieval and Evaluation, Lecture Notes in Computer Science (LNCS 2069), Springer-Verlag, pages 102–115, 2000.
[2] W. Kraaij. TNO at CLEF-2001: Comparing Translation Resources. In Working Notes of the CLEF 2001 Workshop, 2001.
[3] P. McNamee and J. Mayfield. A Language-Independent Approach to European Text Retrieval. In Cross-Language Information Retrieval and Evaluation, Lecture Notes in Computer Science (LNCS 2069), Springer-Verlag, pages 102–115, 2000.
[4] D. Hiemstra. Using Language Models for Information Retrieval. Ph.D. Thesis, Centre for Telematics and Information Technology, University of Twente, January 2001.
[5] F. de Jong, J.-L. Gauvain, D. Hiemstra and K. Netter. Language-Based Multimedia Information Retrieval. In Content-Based Multimedia Information Access, RIAO 2000 Conference Proceedings, C.I.D.-C.A.S.I.S., Paris, pages 713–722, 2000. ISBN 2-905450-07-X.