iCLEF at Sheffield

Zoë Bathie, Mark Sanderson (m.sanderson@shef.ac.uk)
Department of Information Studies, University of Sheffield, Western Bank, Sheffield, S10 2TN, UK

Abstract

Sheffield's contribution to the interactive cross language information retrieval track took the approach of comparing users' ability to judge the relevance of machine-translated French documents against their ability to judge documents written in their native language: English. Conducting such an experiment is challenging, and the issues surrounding the experimental design are discussed. The experimental results strongly suggest that users are just as capable of judging relevance on the native language documents as they are on the translated ones.

Introduction

An important and relatively little studied aspect of cross language information retrieval (CLIR) research is user interaction with such a system. Even the most fundamental aspects of retrieval, such as user ability to formulate effective queries or to judge retrieved documents, have hardly been examined in a cross language context. As a consequence, the interactive cross language evaluation forum (iCLEF) was set up. Starting this year, the track studied one aspect of the interactive process: the user's ability to judge the relevance of retrieved foreign (i.e. target) language documents translated in some manner into the user's native (i.e. source) language. Using a test collection, user relevance judgements were compared to the judgements previously made by relevance assessors. The aim of the track was to compare different translation methods. However, at Sheffield, a different approach was taken: comparing user ability to judge the relevance of translated news articles against the ability to judge articles written in the user's native language. The rest of this article describes the Sheffield work: a short literature review precedes the experimental design, followed by a discussion of issues arising from that design. Next, the results are presented, and possible future work is outlined.

The experiment

It might seem reasonable to assume that if a user is presented with a clearly written document that another person has judged for relevance, the user will agree with the judgement. However, as is well known, relevance assessments are subjective, depending on the user's interpretation of the query and document, which in turn is based on prior knowledge of the subject. Consequently, there can be a reasonable level of disagreement between judges. Voorhees (1998), amongst others, studied this issue. The iCLEF experiments compare the relevance judgements of users against those made previously by assessors, although in this case the assessors were reading the judged documents in their original language while the versions examined by the users were translations of some type. Although the aim of iCLEF was to assess the extent to which translation had impaired users' ability to judge, any such measurement would also include disagreements between users and assessors on what constitutes relevance. Others conducting the iCLEF experiment chose to rely on past work on levels of disagreement to provide an indication of how important this factor is. Sheffield opted instead to attempt to separate out these factors by conducting a form of control experiment: comparing the ability to judge the relevance of translated documents against judgements made on native language documents.
Specifically, users were presented with documents retrieved in response to iCLEF test collection queries: either French newspaper articles automatically translated using Systran software, or English language articles from the LA Times. Both newspaper collections covered the same time frame. Specifics about the queries used, the collections searched and the forms of relevance judgement made are outlined in the iCLEF overview paper elsewhere in these working notes (Oard & Gonzalo Arroyo 2001).

Designing an experiment to effectively compare user ability to judge relevance in native and translated documents is problematic. In the design chosen here, any difference in relevance judgements across the two sets of documents may be attributable to factors other than the quality of English in the texts. Writing styles or assumptions of prior cultural knowledge may differ between Le Monde and the LA Times, and such factors may affect user relevance judgement. In addition, the assessors (to whom user judgements are compared) are different for the two collections, as are the conditions under which they performed their assessment; again this might be an influencing factor. Even the retrieval system may have behaved differently on the two collections, and this may influence the type of retrieved relevant documents presented to the user. Despite these issues, it was judged that continuing with the experiment as described was sensible, as there appears to be no simple experimental design that can accurately measure user ability to judge relevance against translation quality without being confounded by other factors [1]. We therefore take the position of assuming that the additional factors in this experiment do not contribute significantly to our experimental results.

Following the iCLEF design, eight subjects were presented with retrieved documents from four queries in two different situations: half of the queries were retrieved on the French collection and the other half on the English collection. A Latin square design was used to ensure that query order and presentation of system did not confound the experiment. Users were given twenty minutes to judge the documents retrieved for each query. The subjects were Sheffield University students who were native English speakers. They spent three hours in total on the experiment and were paid £20 for their participation.

Results

Results from the initial data returned by iCLEF are shown in the table below. The effectiveness of users was determined using Van Rijsbergen's F measure (Van Rijsbergen 1979), where user judgements were compared to those made previously by the assessors. The variable a was set to values of 0.2 and 0.8 to bias F towards recall and precision respectively.
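For reference, the form of the measure assumed here follows Van Rijsbergen (1979), with P and R denoting the precision and recall of a user's judgements measured against the assessors' (Van Rijsbergen's own parameterisation uses α where we write a):

    F = 1 / ( a/P + (1-a)/R )

so a value of a close to 1 weights F towards precision, and a value close to 0 weights it towards recall.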
System      F (a=0.2)   F (a=0.8)
Le Monde    0.49        0.60
LA Times    0.40        0.46

As can be seen, for both settings of a users judged relevance better on the translated French documents than on the documents originally written in English; however, a t-test indicated that the differences were not significant. We believe that, despite the potential problems with the experimental design, we have shown with some degree of confidence that users reading the retrieved machine-translated documents are more than able to judge the relevance of the retrieved text.

As described in the main iCLEF paper (Oard & Gonzalo Arroyo 2001), users were asked to judge documents as relevant, not relevant, or somewhat relevant. The table above shows the results of the user judgements focussing only on documents marked as relevant. The table below shows the results re-calculated when documents marked as somewhat relevant are also included. As can be seen, the difference in F values between Le Monde and the LA Times is somewhat smaller, particularly for the precision oriented F measure, indicating that users are relatively more accurate in judging marginally relevant native language documents than marginally relevant translations.

System      F (a=0.2)   F (a=0.8)
Le Monde    0.65        0.59
LA Times    0.58        0.52

Finally, the degree of overlap between the set of relevant documents judged by the assessors and the set judged by the experimental subjects was measured. Overlap is defined as the size of the intersection of the two sets divided by the size of their union. In Voorhees's work (1998), the overlap between pairs of assessors was found to range between 0.42 and 0.49. Taking the documents that users judged either relevant or somewhat relevant, the overlap here ranged between 0.39 and 0.47, a similar range.

Conclusions and future work

In this report, we have described an experiment that compares user ability to judge the relevance of documents written in different languages. The difficulty of designing such an experiment was discussed and the results of the experiment presented. The conclusion from the results was that, for the documents tested here, French documents automatically translated into English using a good machine translation system are sufficiently readable to allow users to make accurate relevance judgements. Extensions of this work would involve conducting further experiments to expand the number of users in an attempt to find statistical significance in the data. In addition, exploring other experimental designs will also be a priority.

[1] One could design an experiment where native English speakers judge the translated French documents and native French speakers judge the same (un-translated) French documents. However, the two groups of users are likely to have different cultural backgrounds, which may influence the results. Whether this difference would influence experimental results more or less than the design chosen here can only be determined through further experimentation. Note that there was also a very pragmatic reason for not pursuing this design: finding a sufficient number of French speakers would have been difficult.

Acknowledgements

The first author conducted this work as part of her Master's course at Sheffield. Financial support for the work in this report was provided jointly by the MIND (IST-2000-26061, mind.cs.strath.ac.uk) and Clarity (IST-2000-25310, clarity.shef.ac.uk) projects.

References

Oard, D.W. & Gonzalo Arroyo, J. (2001): The CLEF 2001 Interactive Track. In Working Notes of the Cross Language Evaluation Forum.

Van Rijsbergen, C.J. (1979): Information Retrieval (second edition). Butterworths, London.

Voorhees, E. (1998): Variations in Relevance Judgements and the Measurement of Retrieval Effectiveness. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 315-323.