Overview of the CLEF 2006 Multilingual Question Answering Track

Bernardo Magnini1, Danilo Giampiccolo2, Pamela Forner2, Christelle Ayache3, Valentin Jijkoun4, Petya Osenova5, Anselmo Peñas6, Paulo Rocha7, Bogdan Sacaleanu8, and Richard Sutcliffe9

1 ITC-irst, Trento, Italy (magnini@itc.it)
2 CELCT, Trento, Italy ({giampiccolo, forner}@celct.it)
3 ELDA/ELRA, Paris, France (ayache@elda.fr)
4 Informatics Institute, University of Amsterdam, The Netherlands (jijkoun@science.uva.nl)
5 BTB, Bulgaria (petya@bultreebank.org)
6 Departamento de Lenguajes y Sistemas Informáticos, UNED, Madrid, Spain (anselmo@lsi.uned.es)
7 Linguateca, SINTEF ICT, Norway and Portugal (Paulo.Rocha@alfa.di.uminho.pt)
8 DFKI, Germany (Bogdan.Sacaleanu@dfki.de)
9 DLTG, University of Limerick, Ireland (richard.sutcliffe@ul.ie)

Abstract. Proposed for the fourth time, the QA at CLEF track confirmed the steadily rising interest of the research community, recording a further increase both in the number of participants and in the number of submissions. In 2006, two pilot tasks, WiQA and AVE, were proposed beside the main task, representing two promising experiments for the future of QA. Some significant innovations were also introduced in the main task, namely List questions and the requirement to return text snippet(s) supporting the exact answers. Although this increased the workload of the organizers, both in preparing the question sets and especially in evaluating the submitted runs, it had no significant influence on the performance of the systems, which registered a higher Best accuracy than in the previous campaign, in both the monolingual and the bilingual tasks. In this paper the preparation of the test set and the evaluation process are described, together with a detailed presentation of the results for each of the languages. The pilot tasks WiQA and AVE are presented in dedicated articles.

1 Introduction

Inspired by previous TREC evaluation campaigns, QA tracks have been proposed at CLEF since 2003. During these years, the effort of the organisers has been focused on two main issues. One issue was to offer an evaluation exercise characterised by cross-linguality, covering as many languages as possible. From this perspective, major attention has been given to European languages, adding at least one new language each year, but keeping the offer open to languages from all over the world, as the use of Indonesian shows. The other important issue was to maintain a balance between the established procedure inherited from the TREC campaigns and innovation. This allowed newcomers to join the competition and, at the same time, offered "veterans" new challenges. Following these principles, in QA@CLEF 2006 two pilot tasks, namely WiQA and the Answer Validation Exercise (AVE), were proposed together with a main task. As far as the latter is concerned, the most significant innovation was the introduction of List questions, which had been considered for previous competitions but had been avoided because of the problems that their selection and assessment implied. Other important innovations were the possibility to return more than one answer per question, and the requirement to provide text snippets together with the docid to support the exact answer. All these changes made it necessary to introduce new evaluation measures, which also account for List questions and multiple answers.
Nevertheless, the evaluation process proved to be more complicated than expected, partly because of the excessive workload that multiple answers represented for groups already in charge of a large number of runs. As a consequence, some groups, such as the Spanish and the English ones, could only assess one answer per question, which limited the possibility of comparing runs. As a general remark, it can be said that the positive trend in participation registered in the previous campaigns was confirmed, and 10 new participants joined the competition from Europe, Asia and America. As reflected in the results, systems' performance improved considerably, with the Best accuracy increasing from 64% to 68% in the monolingual tasks and, more significantly, from 39% to 49% in the bilingual ones. This paper describes the preparation process and presents the results of the QA track at CLEF 2006. In section 2, the task is described in detail. The different phases of the Gold Standard preparation are presented in section 3. After a quick presentation of the participants in section 4, the evaluation procedure and the results are reported in sections 5 and 6 respectively. In section 7, some final considerations are given about this campaign and the future of QA@CLEF.

2 Tasks

In the 2006 campaign, the procedure consolidated in previous competitions was used. Accordingly, there was a main task (which comprised a monolingual task and several cross-language sub-tasks) and two pilot tasks described below:
1. WiQA: developed by Maarten de Rijke. The purpose of the WiQA pilot is to see how IR and NLP techniques can be effectively used to help readers and authors of Wikipages access information spread throughout Wikipedia rather than stored locally on the pages [2].
2. Answer Validation Exercise (AVE): a voluntary exercise to promote the development and evaluation of subsystems aimed at validating the correctness of the answers given by a QA system. The basic idea is that once a pair [answer + snippet] is returned by a QA system, a hypothesis is built by turning the pair [question + answer] into the affirmative form. If the related text (a snippet or a document) semantically entails this hypothesis, then the answer is expected to be correct [3] (see the sketch after this list).
Two specific papers in the present Working Notes are dedicated to these pilot tasks. More detailed information, together with the results, can be found there.
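To illustrate the validation idea (not the actual AVE implementation described in [3]), a minimal sketch is given below; the naive hypothesis template and the `entails` callback are hypothetical placeholders.

```python
def build_hypothesis(question: str, answer: str) -> str:
    """Turn a [question + answer] pair into an affirmative statement, e.g.
    ("What is the capital of Japan?", "Tokyo") -> "The capital of Japan is Tokyo."
    A real system rewrites the question syntactically; the crude fallback below
    merely joins the two strings and is for illustration only."""
    return f"{question.rstrip('?').strip()}: {answer}"

def validate_answer(question: str, answer: str, snippet: str, entails) -> bool:
    """Accept an [answer + snippet] pair only if the snippet entails the hypothesis.
    `entails(text, hypothesis)` stands for any textual-entailment component."""
    return entails(snippet, build_hypothesis(question, answer))
```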
In addition to the tasks proposed during the actual competition, a "time-constrained" QA exercise will be proposed by the University of Alicante during the CLEF 2006 Workshop. In order to evaluate the ability of QA systems to retrieve answers in real time, the participants will be given a time limit (e.g. one or two hours) in which to answer a set of questions. These question sets are different from, and smaller than, those provided in the main task (e.g. 15-25 questions). The initiative is aimed at providing a more realistic scenario for a QA exercise.

The main task was basically the same as in previous campaigns, with some new ideas implemented in order to make the competition more challenging. The participating systems were fed a set of 200 questions, which could be about:
· facts or events (Factoid questions, F);
· definitions of people, things or organisations (Definition questions, D);
· lists of people, objects or data (List questions, L).
The systems were then asked to return from one to ten exact answers, where "exact" means that neither more nor less than the information required is given. The answer had to be supported by the docid of the document(s) in which the exact answer was found, and by one to ten text snippets giving its actual context. The text snippets were to be placed one next to the other, separated by a tab. The snippets were substrings of the specified documents and had to provide enough context to justify the suggested exact answer. The snippets for a given response had to be a set of sentences of not more than 500 bytes in total (although, for example, the Portuguese group accepted, and actually preferred, the length to be specified in sentences). There were no particular restrictions on the length of an answer string, but unnecessary pieces of information were penalized, since such answers were marked as ineXact. Since Definition questions may have long strings as answers, they were (subjectively) assessed mainly on their informativity and usefulness, and not on exactness. The tasks were of two kinds:
· monolingual, where the language of the question (source language) and the language of the news collection (target language) were the same;
· cross-lingual, where the questions were formulated in a language different from that of the news collection.

Table 1: Tasks activated in 2006. Source languages (questions): BG, DE, EN, ES, FR, IN, IT, NL, PT, PL, RO. Target languages (corpus and answers): BG, DE, EN, ES, FR, IT, NL, PT.

Eleven source languages were considered, namely Bulgarian, Dutch, English, French, German, Indonesian, Italian, Polish, Portuguese, Romanian and Spanish. Note the loss of Finnish and the introduction of Polish and Romanian with respect to last year. All these languages were also considered as target languages, except for Indonesian, Polish and Romanian, since these three languages had no news collection available for the queries. As was done for Indonesian in the previous two campaigns, the English question set was translated into Indonesian (IN), Polish (PL) and Romanian (RO), and the German question set into Romanian (RO). Only the bilingual tasks IN-EN, PL-EN, RO-EN and RO-DE were activated. In the case of IN-EN, PL-EN and RO-EN, the questions were posed in the respective language (i.e. IN, PL, RO), while the answers were retrieved from the English collection. In the RO-DE case, the questions were posed in Romanian, while the answers were retrieved from the German collection. As shown in Table 1, 24 tasks were proposed, divided into:
· 7 monolingual, i.e. Bulgarian (BG), German (DE), Spanish (ES), French (FR), Italian (IT), Dutch (NL) and Portuguese (PT);
· 17 cross-lingual.

Table 2: Tasks chosen by at least 1 participant in QA@CLEF campaigns
Campaign | Monolingual | Cross-lingual
CLEF 2003 | - | -
CLEF 2004 | 6 | 13
CLEF 2005 | 8 | 15
CLEF 2006 | 7 | 17

As customary in recent campaigns, a monolingual English (EN) task was not available, as it has already been thoroughly investigated in the TREC campaigns, even though English was both source and target language in the cross-language tasks. Although the task was not radically changed with respect to previous campaigns, some new elements were introduced. The most important one was the addition of List questions to the question sets, which raised some significant issues. For this first year of List questions at QA@CLEF, we were not too strict on the definition of lists, using both questions asking for a specific finite number of answers (which could be called "closed lists"), e.g.:
Q: What are the names of the two lovers from Verona separated by family issues in one of Shakespeare's plays?
A: Romeo and Juliet.
and open lists, where as many correct answers as found could be returned, e.g. "Q: Name books by Jules Verne.", and let the organizing groups decide how to assess the answers to these different kinds of questions. Other innovations were:
· the input format, where the type of question (F, D, L) was no longer indicated;
· the result format, where up to a maximum of ten answers per question was allowed, with one to ten text snippets supporting each exact answer.

Table 3: Document collections used in CLEF 2006
Target language | Collection | Period | Size
Bulgarian (BG) | Sega | 2002 | 120 MB (33,356 docs)
Bulgarian (BG) | Standart | 2002 | 93 MB (35,839 docs)
German (DE) | Frankfurter Rundschau | 1994 | 320 MB (139,715 docs)
German (DE) | Der Spiegel | 1994/1995 | 63 MB (13,979 docs)
German (DE) | German SDA | 1994 | 144 MB (71,677 docs)
German (DE) | German SDA | 1995 | 141 MB (69,438 docs)
English (EN) | Los Angeles Times | 1994 | 425 MB (113,005 docs)
English (EN) | Glasgow Herald | 1995 | 154 MB (56,472 docs)
Spanish (ES) | EFE | 1994 | 509 MB (215,738 docs)
Spanish (ES) | EFE | 1995 | 577 MB (238,307 docs)
French (FR) | Le Monde | 1994 | 157 MB (44,013 docs)
French (FR) | Le Monde | 1995 | 156 MB (47,646 docs)
French (FR) | French SDA | 1994 | 86 MB (43,178 docs)
French (FR) | French SDA | 1995 | 88 MB (42,615 docs)
Italian (IT) | La Stampa | 1994 | 193 MB (58,051 docs)
Italian (IT) | Italian SDA | 1994 | 85 MB (50,527 docs)
Italian (IT) | Italian SDA | 1995 | 85 MB (50,527 docs)
Dutch (NL) | NRC Handelsblad | 1994/1995 | 299 MB (84,121 docs)
Dutch (NL) | Algemeen Dagblad | 1994/1995 | 241 MB (106,483 docs)
Portuguese (PT) | Público | 1994 | 164 MB (51,751 docs)
Portuguese (PT) | Público | 1995 | 176 MB (55,070 docs)
Portuguese (PT) | Folha de São Paulo | 1994 | 108 MB (51,875 docs)
Portuguese (PT) | Folha de São Paulo | 1995 | 116 MB (52,038 docs)

3 Test Set Preparation

Following the procedure established in previous campaigns, initially each organising group (one for each target language) was assigned a number of topics taken from the CLEF IR track, on which candidate questions were based. This choice was originally made to reduce the number of duplicates in the multilingual question set. As the number of new topics introduced in 2006 was small, old topics were simply reassigned to different groups. Some groups questioned this methodology, preferring to produce questions by other methods instead of following particular topics. The topics, and hence the questions, were aimed at data collections composed of news articles provided by ELRA/ELDA dating back to 1994/1995, with the exception of Bulgarian, which dated back to 2002 (see Table 3). The choice of a different collection was the subject of long discussion, with copyright issues remaining a major obstacle. A step towards a possible solution was nevertheless made by the proposal of the WiQA pilot task, which represents a first attempt to set the QA competitions in their natural context, i.e. the Internet.

Table 4: Test set breakdown according to question type
Language | F (150) | D (40) | L (10) | T (40) | NIL (20)
BG | 145 | 43 | 12 | 26 | 17
DE | 152 | 38 | 10 | 39 | 20
EN | 150 | 40 | 10 | 40 | 18
ES | 148 | 42 | 10 | 40 | 21
FR | 148 | 42 | 10 | 40 | 20
IT | 147 | 41 | 12 | 38 | 20
NL | 147 | 40 | 13 | 30 | 20
PT | 143 | 47 | 9 | 23 | 18

Initially, 100 questions were selected in each of the source languages, distributed among Factoid, Definition and List questions. Factoid questions are fact-based questions, asking for the name of a person, a location, the extent of something, the day on which something happened, etc. The following six answer types were considered for Factoids:
- PERSON (e.g. "Who was Lisa Marie Presley's father?")
- TIME (e.g. "What year did the Second World War finish?")
- LOCATION (e.g. "What is the capital of Japan?")
- ORGANIZATION (e.g. "What party did Hitler belong to?")
- MEASURE (e.g. "How many monotheistic religions are there in the world?")
") - OTHER, i.e. everything else that does not fit into the other five categories (e.g. "What is the most­read Italian daily newspaper? ") Definition questions, i.e. questions like "What/Who is X?", were divided into the following categories: · PERSON ­i.e. questions asking for the role, job, and/or important information about someone (e.g. "Who is Lisa Marie Presley? "); · ORGANIZATION ­i.e. questions asking for the mission, full name, and/or important information about an organization (e.g. "What is Amnesty International? " or "What is the FDA?"); · OBJECT ­i.e. questions asking for the description or function of objects (e.g. “What is a Swiss army knife? ”, “What is a router? ”); · OTHER ­i.e. question asking for the description of natural phenomena, technologies, legal procedures etc. (e.g. “What is a tsunami? ”, “What is DSL? ”, “What is impeachment? ”). The last two categories were especially added to reduce the numbers of definition questions which may be answered very easily (such as acronyms concerning organizations, which are usually answered rendering the abbreviation in full, and people’s job­description, which are usually found as appositions of proper names in news text). As mentioned above, questions that require a list of items as answers, were introduced for the first time. (e.g. Name works by Tolstoy). Among these three categories, a number of NIL questions, i.e. questions that do not have any known answer in the target document collection, were distributed. They are important because a good QA system should identify them, instead of returning wrong answers. Three different types of temporal restriction – a temporal specification that provides important information for the retrieval of the correct answer, were associated to a certain number of F, D, L, more specifically: - restriction by DATE (e.g. "Who was the US president in 1962?"; “Who was Berlusconi in 1994?”) - restriction by PERIOD (e.g. "How many cars were sold in Spain between 1980 and 1995?") - restriction by EVENT (e.g. "Where did Michael Milken study before enrolling in the university of Pennsylvania? ") The distribution of the questions among these categories is described in Table 4. Each of the question sets was then translated into English, so that each group could choose additional 100 questions from those proposed by the others and translate them in their own languages. At the end, each source language had 200 questions, which were collected in an XML document. Unlike in the previous campaigns, the questions were not translated in all the languages due to time constraints, and the Gold Standard contained questions in multiple languages only for activated tasks. Since Indonesian, Polish and Romanian did not have a data collection of their own, the English question set was translated, so that the cross­lingual subtasks IN­EN, PL­EN and RO­EN were made available. As not all questions had been previously translated, a translation of the target language question sets into the source languages was needed for cross­language sub­tasks which had at least one registered participant. 4 Participants The number of participants has constantly grown over the years [see Table 5]. In fact, about ten new groups have joined the competition each year, and in 2006 a total of 30 participants was reached. 
4 Participants

The number of participants has grown constantly over the years (see Table 5). About ten new groups have joined the competition each year, and in 2006 a total of 30 participants was reached.

Table 5: Number of participating groups
Campaign | Registered | New groups | Veterans | Absent veterans | America | Europe | Asia | Australia | TOTAL
CLEF 2003 | - | - | - | - | 3 | 5 | - | - | 8
CLEF 2004 | - | - | - | - | 1 | 17 | - | - | 18 (+125%)
CLEF 2005 | 27 | 9 | 15 | 4 | 1 | 22 | 1 | - | 24 (+33%)
CLEF 2006 | 36 | 10 | 20 | 4 | 4 | 24 | 2 | - | 30 (+25%)

For the record, the number of groups which registered for the competition but did not actually participate was six, while four groups which took part in QA 2005 did not show up in 2006. From a geographical perspective, most groups came from Europe, but in 2006 there was an increase in participants from both Asia and America (see Table 5).

Table 6: Number of submitted runs
Campaign | Submitted runs | Monolingual | Cross-lingual
CLEF 2003 | 17 | 6 | 11
CLEF 2004 | 48 | 20 | 28
CLEF 2005 | 67 | 43 | 24
CLEF 2006 | 77 | 42 | 35

The number of submitted runs increased together with the number of participants. More significant is the slight decrease registered in the monolingual sub-tasks in favour of the bilingual ones: it indicates that QA@CLEF is becoming increasingly cross-lingual, as it was originally intended to be.

5 Evaluation

The introduction of List questions, the possibility to return multiple answers, and the requirement to support the answers with snippets of text from the relevant documents made the evaluation process more difficult. Moreover, in some languages the large amount of data requiring assessment made it impossible for the judging panels to assess more than one answer per question. Therefore, only the first answers were evaluated in the runs that had English or Spanish as a target; in all other cases at least the first three answers were evaluated. Considering these issues, it was decided to follow the procedure used during the previous campaign. The files submitted by the participants in all tasks were manually judged by native speakers. Each language coordination group guaranteed the evaluation of at least one answer per question. If a group decided to assess more than one answer per question, the answers were assessed in the order in which they occurred in the submission file, and the same number was applied to all questions and to all the runs assessed by that group. The exact answer (i.e. the shortest string of words which is supposed to provide the exact amount of information to answer the question) was assessed as:
· R (Right) if correct;
· W (Wrong) if incorrect;
· X (ineXact) if it contained less or more information than required by the query;
· U (Unsupported) if either the docid was missing or wrong, or the supporting snippet did not contain the exact answer.
Most assessor groups managed to guarantee a second judgement of all the runs, with a good average inter-assessor agreement. As far as the evaluation measures are concerned, the List questions had to be scored separately, and different groups returned a different number of answers for questions originally meant as Factoid and Definition questions. As a consequence, we decided to provide the following measures:
· accuracy, the main evaluation score, defined as the average of SCORE(q) over all 200 questions q;
· the mean reciprocal rank (MRR) over the N assessed answers per question, that is, the mean over all questions of the reciprocal of the rank of the first correct answer;
· the K1 measure used in earlier QA@CLEF campaigns [2];
· the Confidence Weighted Score (CWS), designed for systems that give only one answer per question: answers are sorted in decreasing order of confidence, and CWS rewards systems that place correct answers at the top of the ranking [2].
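For illustration, the main measures can be sketched as follows. This is a minimal sketch, not the official scoring code: it assumes that SCORE(q) is 1 when the first answer to q is judged R and 0 otherwise, that MRR uses the rank of the first R answer among the assessed answers, and that CWS and K1 follow the definitions of [2] as summarised above (self-confidence values in [0, 1], +1/-1 judgements for K1).

```python
def accuracy(first_judgements):
    """SCORE(q) averaged over all questions: 1 if the first answer is R, else 0."""
    return sum(j == "R" for j in first_judgements) / len(first_judgements)

def mrr(assessed_answers_per_question):
    """Mean reciprocal rank of the first R answer (0 when no assessed answer is R)."""
    total = 0.0
    for judgements in assessed_answers_per_question:
        for rank, j in enumerate(judgements, start=1):
            if j == "R":
                total += 1.0 / rank
                break
    return total / len(assessed_answers_per_question)

def cws(judgements_sorted_by_confidence):
    """Confidence Weighted Score: one answer per question, sorted by decreasing
    self-confidence; rewards correct answers placed near the top of the ranking."""
    n, correct_so_far, score = len(judgements_sorted_by_confidence), 0, 0.0
    for i, j in enumerate(judgements_sorted_by_confidence, start=1):
        correct_so_far += (j == "R")
        score += correct_so_far / i
    return score / n

def k1(judgements, confidences):
    """K1: self-confidence-weighted sum of +1 (right) / -1 (not right) judgements,
    normalised by the number of questions; ranges over [-1, 1]."""
    return sum(c if j == "R" else -c
               for j, c in zip(judgements, confidences)) / len(judgements)
```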
Although some other kinds of measures were proposed and used in CLEF 2005, such as the more detailed analysis and breakdown of bad answers by the Portuguese group [7], they were not considered this year. Also, issues such as providing a more accurate description of what X means (too much or too little information) were only distinguished by the Portuguese assessors, as argued for i.a. in Rocha and Santos [4].

6 Results

Figure 1: Best and average scores in CLEF QA campaigns*
*Note: the figure does not consider the run utjp061plen.

As far as accuracy is concerned, a general improvement was observed, as Figure 1 shows. In detail, Best accuracy in the monolingual task improved from last year's 64.5% to 68.95%, a relative increase of 6.9%, while Best accuracy in the cross-language tasks went from 39.5% to 49.47%, a relative increase of 25.2%. As far as average performance is concerned, a slight decrease was recorded in the monolingual tasks, which went from 29.36% to 27.94%; this was probably due to the number of newcomers testing their systems for the first time. As a general remark, best performances have been quite stable, with most languages registering scores similar to or better than those of the previous campaigns (see Figure 2).

Table 7: Best accuracy scores compared with K1, MRR and CWS
File name | Overall accuracy | K1 | MRR | CWS
syna061frfr.txt (BEST) | 68.95% | 0.2832 | 0.6895 | 0.56724
inao061eses.txt | 52.63% | 0.0716 | 0.5263 | 0.43387
ulia061frfr.txt | 46.32% | 0.0684 | 0.4632 | 0.46075
ulia062frfr.txt | 45.79% | 0.0579 | 0.4579 | 0.45546
vein061eses.txt | 42.11% | -0.0657 | 0.4211 | 0.33582
alia061eses.txt | 37.89% | -0.1232 | 0.3763 | 0.23630
upv_061eses.txt | 36.84% | 0.0014 | 0.3684 | 0.22530
ulia061enfr.txt | 35.26% | -0.1684 | 0.3526 | 0.34017

Although in the 2006 campaign, too, the self-confidence score was not returned by all systems, data about confidence were plentiful and allowed the additional evaluation measures, i.e. K1, CWS and MRR, to be considered. Generally speaking, systems with high accuracy also scored well on these measures, suggesting that the best systems provide reliable self-confidence scores, as Table 7 shows.

Figure 2: Best results in 2004, 2005 and 2006 (per target language: Bulgarian, German, English, Spanish, Finnish, French, Italian, Dutch, Portuguese).

A more detailed analysis of the results for each language follows below, giving more specific information on the performance of the systems in the individual sub-tasks and on the different types of questions, together with the relevant statistics and comments.

6.1 Bulgarian as Target

At CLEF 2006 Bulgarian was addressed as a target language for the second time. This year there was no change in the number of participants: again two groups took part in the monolingual evaluation task with Bulgarian as target language, BTB at the Linguistic Modelling Laboratory, Sofia, and the Joint Research Centre (JRC), Ispra. Three runs altogether were submitted, one by the first group and two by the second group, with an insignificant difference between them. The 2006 results are presented in Table 8 below.
First, the correct answers are given per run, in numbers and percentage (Right). Then the wrong (W), inexact (X) and unsupported (U) answers are shown in numbers. Further, the accuracy over factoids (F), temporally restricted questions (T), definitions (D) and list questions (L) is given. NIL questions are presented as the number of correctly and wrongly returned NIL answers. It is obvious that the systems also returned a NIL answer when they themselves could not detect a possibly existing answer in the corpus. In our opinion, the present NIL marking might be divided into two labels: NIL = no answer exists in the corpus, and CANNOT = the system itself cannot find an answer. In this way the evaluation would be more realistic. The mean reciprocal rank score is provided in the last column of the table.

As can be seen, this year the first system performs better. However, its overall accuracy is slightly worse than the 2005 best accuracy result, achieved then by IRST, Trento: it is now 26.60%, while in 2005 it was 27.50%. On the other hand, the BTB 2005 result was significantly improved. Both systems 'crashed' on temporally restricted questions, with not a single match (see the empty slots in the table). This is a step back from 2005, when both systems had some hits, the best of which scored 17.65%. List questions are also answered very poorly (1 correct answer per run).

Table 8: Results for Bulgarian as target, monolingual
Run | Right # | Right % | W # | X # | U # | % F [119] | % T [26] | % D [43] | % L [12] | NIL [16] right | NIL wrong | MRR
btb061 | 50 | 26.60 | 132 | 4 | 1 | 17.93 | - | 55.81 | 0.0833 | 11 | 120 | 0.2660
jrc061 | 22 | 11.70 | 162 | 4 | 0 | 6.90 | - | 27.91 | 0.0833 | 12 | 155 | 0.1170
jrc062 | 22 | 11.70 | 160 | 6 | 0 | 6.90 | - | 27.91 | 0.0833 | 13 | 154 | 0.1170

The only improvements with respect to last year are the following: the better accuracy on definition questions (from 42% to 55.81%) and the rise of the mean reciprocal rank score (from 0.160 to 0.2660). The introduction of the snippet support proved to be a good idea: there was only 1 unsupported answer in all three runs. The inter-annotator agreement was very high, for two reasons: first, the number of answered questions was not very high, and second, there were strict guidelines for the interpretation of the answers, based on our last year's experience. In spite of the somewhat controversial results of the participating systems this year, there is a lot of potential in the task of Bulgarian as a target language in several respects: investing in the development of the present systems and creating new ones. We hope that Bulgarian will become even more attractive as an EU language.

6.2 Dutch as Target

This year three teams took part in the CLEF QA track using Dutch as the target language: the University of Amsterdam, the University of Groningen and the University of Roma 3, with six runs submitted in total: three Dutch monolingual and three cross-lingual (English to Dutch). All runs were assessed by two assessors, with an overall inter-assessor agreement of 0.96. For creating the gold standard for Dutch, the assessments were automatically reconciled in favour of the more lenient assessment: for example, in case the same answer was assessed as W (incorrect) by one assessor and as X (inexact) by another, the X judgement was included in the gold standard.
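A minimal sketch of this lenient reconciliation follows; the full leniency ordering used below (R over X over U over W) is an assumption extrapolated from the W/X example just given.

```python
# Assumed leniency ordering: higher value = more favourable to the system.
LENIENCY = {"W": 0, "U": 1, "X": 2, "R": 3}

def reconcile(judgement_a: str, judgement_b: str) -> str:
    """Keep the more lenient of two assessors' judgements for the gold standard."""
    return max(judgement_a, judgement_b, key=LENIENCY.get)

assert reconcile("W", "X") == "X"   # the example from the text
```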
The results of the evaluation of the six runs are provided in Tables 9 and 10. The columns labelled Right, W, X and U give the results for factoid, definition and temporally restricted questions.

Table 9: Results for Dutch as target, monolingual
Run | Right # | Accuracy % [187] | W # | X # | U # | % F [146] | % T [0] | % D [40] | P@N (lists) [13] | NIL [10] | MRR
Gron061nlnl | 58 | 31.02 | 115 | 11 | 3 | 27.40 | 0.00 | 45.00 | 23.08 | 0 | 0.3460
Isla061nlnl | 40 | 21.39 | 141 | 4 | 2 | 21.23 | 0.00 | 22.50 | 0.00 | 0.1346 | 0.2341
Isla062nlnl | 41 | 21.93 | 139 | 4 | 3 | 21.92 | 0.00 | 22.50 | 0.00 | 0.1346 | 0.2357

An interesting thing to notice about this year's task is that the overall scores of the systems are lower than last year's numbers (44% and 50% of correct answers to factoid questions last year). This year's questions were created by annotators who were explicitly instructed to think of "harder" questions, that is, questions involving paraphrases and some limited general-knowledge reasoning. It would be interesting to compare the performance of this year's systems on last year's questions with the previous results of the campaign.

Table 10: Results for Dutch as target, cross-lingual (English to Dutch)
Run | Right # | Accuracy % [187] | W # | X # | U # | % F [146] | % T [0] | % D [40] | P@N (lists) [13] | NIL [10] | MRR
Gron061ennl | 38 | 20.32 | 139 | 7 | 3 | 18.37 | 0.00 | 28.21 | 6.15 | 0.1481 | 0.2239
Roma061ennl | 25 | 13.37 | 150 | 6 | 3 | 11.56 | 0.00 | 20.51 | 17.95 | 0.0769 | 0.1430
Roma062ennl | 25 | 13.37 | 149 | 7 | 3 | 11.56 | 0.00 | 20.51 | 15.38 | 0.0769 | 0.1529

6.3 English as Target

Creation of Questions. The procedure for creating the questions was very similar to last year's and is now well understood. This year it was required to store supporting snippets for the reference answers, but this was not difficult and is well worth the trouble. As previously, we were requested to set Temporally Restricted questions and to distribute them in a prescribed way over the various Factoid question types (PERSON, LOCATION, etc.). We achieved our quotas, but this was extremely difficult to accomplish and we do not feel the time spent is worthwhile, as the addition of temporal restrictions more than doubles the time taken to generate the questions. On the other hand, as the restrictions are frequently synthetic in nature, our knowledge of how to solve these important questions does not necessarily advance from year to year. Searching for Definition questions (or indeed any questions beyond Factoids) is always very interesting work, but the method of evaluation was not clarified this year. So, while the topics we selected do follow the guidelines, we were not required (or indeed able) to state at generation time exactly what a complete and correct answer should look like. In consequence we cannot conclude much from an analysis of the answers returned by systems to such questions.

Summary Statistics for all the Runs. Overall, thirteen cross-lingual runs with English as a target were submitted. The results are shown in Table 11. Ten groups participated, with seven source languages: French, German, Indonesian, Italian, Romanian, Polish and Spanish. There were three groups for French, two for Spanish and one for each of the rest.

Results Analysis. There were three main types of question this year, Factoids, Definitions and Lists, and we consider the results over these types as well as the best scores overall. The most indicative measure overall is a simple count of correct answers, and this is what we have used. For the 150 Factoids the best system was utjp061plen (Polish-English) with 132 correct. This is by far the best result and is vastly higher than last year.
By comparison, the top systems are utjp061plen (132), lire062fren (39), lire061fren (33), dltg061fren (32) and aliv061esen (29). The other results are not greatly different from last year's. The top result of 132/150 amounts to 88%; the next best result of 39/150 is 26%. For the 40 Definitions the picture is similar. The top results are utjp061plen (32), aliv062esen (11), lire061fren (10), aliv061esen (9), lire062fren (9) and dfki061deen (8). Again, the top result, 32/40 i.e. 80%, is far higher than the rest, the next being 11/40 i.e. 28%.

Table 11: Results of the English runs
Run | Right # | W # | X # | U # | % F [150] | % D [40] | P@N for L [10] | Overall accuracy %
aliv061esen | 38 | 142 | 4 | 6 | 19.33 | 22.50 | 0.0411 | 20.00
aliv062esen | 29 | 156 | 3 | 2 | 12.00 | 27.50 | 0.0200 | 15.26
aske061esen | 10 | 134 | 11 | 34 | 6.67 | 0.00 | 0 | 5.26
aske061fren | 7 | 135 | 10 | 37 | 3.33 | 5.00 | 0.0100 | 3.68
dfki061deen | 34 | 147 | 9 | 0 | 17.33 | 20.00 | 0.2000 | 17.89
dltg061fren | 36 | 138 | 14 | 2 | 21.33 | 10.00 | 0.2000 | 18.95
irst061iten | 24 | 152 | 3 | 11 | 16.00 | 0.00 | 0.1600 | 12.63
lire061fren | 43 | 138 | 2 | 7 | 22.00 | 25.00 | 0.0900 | 22.63
lire062fren | 48 | 130 | 2 | 10 | 26.00 | 22.50 | 0.0800 | 25.26
uaic061roen | 25 | 150 | 7 | 8 | 15.33 | 5.00 | 0.1131 | 13.16
uaic062roen | 18 | 171 | 1 | 0 | 12.00 | 0.00 | 0.0800 | 9.47
uind061inen | 14 | 159 | 4 | 13 | 9.33 | 0.00 | 0 | 7.37
(utjp061plen | 164 | 14 | 5 | 7 | 88.00 | 80.00 | 0.6500 | 86.32)*
* This result is still under verification.

For each of the ten List questions, a system could return up to ten candidate answers. Considering both a simple count of correct answers and the P@N score achieved, the top results by count are utjp061plen (18, 0.65), uaic061roen (10, 0.11), lire061fren (9, 0.09), irst061iten (8, 0.16), lire062fren (8, 0.08) and dfki061deen (6, 0.2). By either score utjp061plen is the best, while the ordering of the rest differs for the P@N score: utjp061plen (18, 0.65), dfki061deen (6, 0.2), irst061iten (8, 0.16), uaic061roen (10, 0.11), lire061fren (9, 0.09) and lire062fren (8, 0.08).

Assessment Procedure. The approach to assessment was broadly similar to that of last year. However, as the format of the runs had changed, we decided not to use the NIST software but to work with the bare text files instead. It had been intended to double-judge all the questions, but unexpectedly and at the last moment this proved not to be possible due to the absence of an assessor. There were 200 questions in all. One assessor judged all answers to questions 1-100, while the other two judged all answers to questions 101-200. There were considerable practical problems with the assessment of runs this year. Firstly, several runs used invalid run tags. Secondly, two of the runs answered the questions in a completely different order. Thirdly, one question in these two runs was different from the question being answered by the other systems in that position. Fourthly, one run had the fields in the wrong order. Fifthly, one run used NULL instead of NIL, while another run used nil. Luckily we spotted problems 2 and 3 and were able to correct them, and indeed all the others, but this was extremely time-consuming and difficult. As in all previous years, the runs were anonymised by a third party, so none of the assessors knew either the origin of a run or the original source language. This year it had been decided to allow multiple answers to Factoid and Definition questions (up to ten per question). The rationale for this was never quite clear, since the whole objective of Question Answering (as opposed to Information Retrieval) is to return only the right answer.
Even in cases where there are genuinely several right answers (a rare situation in our carefully designed question sets), a system should still return a correct answer in the first place. For this reason, and due to our limited time and resources, we only judged the first answer returned to Factoid and Definition questions. For List questions, all candidate answers were judged, as is normal at TREC. For the questions that were double-judged, we measured the agreement level. There were 149 differences over thirteen runs of 100 questions. This amounts to 149/1300, i.e. 11% disagreement or 89% agreement. The overall figure for last year was 93%.

Concerning the judgement process itself, Factoids and Lists did not present a problem, as we were very familiar with them. On the other hand, Definitions were in the same state as last year, in that they had been included in the task without a suitable evaluation procedure having been defined. In consequence we used the same approach as last year: if an answer contained information relevant to the question and no irrelevant information, it was judged R if supported, and U otherwise. If both relevant and irrelevant information was present, it was judged X. Finally, if no relevant information was present, the answer was judged W.

Comment and Conclusions. The number of runs judged (13) was similar to last year (12). However, three source languages were introduced: Indonesian, Polish and Romanian. The results themselves were also broadly similar, with the exception of the Polish run, which was vastly higher on all question types. Definition questions remained in the same unspecified state as previously. This means that we have not been successful in stretching the boundaries of question answering beyond Factoids, which are now very well understood. This is a great pity, as the extraction of useful 'definition type' information on a topic is a very useful task for groups to study, but it is one that needs to be carefully quantified. The introduction of snippets was very helpful at question generation time and also invaluable for judging the answers. Snippets are a great step forward for CLEF and are the most significant development for the QA Track this year.
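The rule used for judging Definition questions, described above, can be written as a small decision function; this is a sketch for illustration only, with hypothetical boolean inputs.

```python
def judge_definition(has_relevant: bool, has_irrelevant: bool, supported: bool) -> str:
    """Encode the Definition-question judging rule applied to the English runs."""
    if not has_relevant:
        return "W"                      # no relevant information at all
    if has_irrelevant:
        return "X"                      # relevant and irrelevant information mixed
    return "R" if supported else "U"    # relevant only: Right if supported, else Unsupported
```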
6.4 French as Target

This year, as last year, seven groups took part in the evaluation tasks using French as target language: four French groups, Laboratoire d'Informatique d'Avignon (LIA), CEA-List, Université de Nantes (LINA) and Synapse Développement; one Spanish group, the Universidad Politécnica de Valencia; one Japanese group; and one American group, LCC. In total, 15 runs were returned by the participants: eight monolingual runs (FR-to-FR) and seven bilingual runs (6 EN-to-FR, 1 PT-to-FR). The number of participants for the French task is thus the same as last year, but it is the first time that there are non-European participants, which shows a new, major interest in French as a target language. Two groups submitted four runs, two other groups submitted two runs and three groups submitted only one run. This year, for the first time, the participants could return up to 10 answers per question. Most participants returned only one answer per question; only three groups returned more than one. For these three groups, ELDA (Evaluation and Language resources Distribution Agency) assessed the first three answers to Factoid, Definition and Temporally restricted questions.

Table 12: Results of the monolingual and bilingual French runs
Run | Assessed answers (#) | Right (#) | Wrong (#) | ineXact (#) | U (#) | Overall accuracy (%) | Accuracy over F (%) | Accuracy over D (%) | MRR (F, D, T) | CWS | K1 | P@N (L)
aske061frfr | 635 | 27 | 138 | 12 | 12 | 14.21 | 16.89 | 4.76 | 0.1974 | 0.14211 | - | 0.0900
lcea061frfr | 589 | 30 | 151 | 6 | 3 | 15.79 | 10.14 | 35.71 | 0.1907 | 0.15789 | - | 0.1633
lina061frfr | 207 | 56 | 114 | 18 | 2 | 29.47 | 27.70 | 35.71 | 0.2947 | 0.25517 | -0.3777 | 0.3651
syna061frfr | 200 | 129 | 50 | 9 | 2 | 67.89 | 63.51 | 83.33 | 0.6789 | 0.55685 | 0.2729 | 0.5000
ulia061frfr | 200 | 88 | 93 | 7 | 2 | 46.32 | 37.84 | 76.19 | 0.4632 | 0.46075 | 0.0684 | 0.5000
ulia062frfr | 200 | 86 | 89 | 9 | 6 | 45.26 | 36.49 | 76.16 | 0.4501 | 0.45016 | 0.0474 | 0.2000
upv061frfr | 200 | 60 | 119 | 10 | 1 | 31.58 | 31.08 | 33.33 | 0.3158 | 0.16389 | -0.0047 | 0.3000
upv062frfr | 200 | 47 | 124 | 18 | 1 | 24.74 | 26.35 | 19.05 | 0.2474 | 0.10883 | -0.0931 | 0.2000
aske061enfr | 640 | 19 | 157 | 6 | 8 | 10.00 | 12.16 | 2.38 | 0.1445 | 0.01662 | -0.2797 | 0.0633
lcc061enfr | 578 | 40 | 125 | 23 | 2 | 21.05 | 25.00 | 7.14 | 0.2623 | 0.04856 | 0.1816 | 0.3967
syna061enfr | 200 | 86 | 97 | 6 | 1 | 45.26 | 37.16 | 73.81 | 0.4526 | 0.45263 | - | 0.2000
syna062enfr | 200 | 63 | 120 | 6 | 1 | 33.16 | 25.68 | 59.52 | 0.3316 | 0.33158 | - | 0.1000
ulia061enfr | 200 | 66 | 114 | 7 | 3 | 34.74 | 26.35 | 64.29 | 0.3470 | 0.33478 | - | 0
ulia062enfr | 200 | 66 | 111 | 9 | 4 | 34.74 | 26.35 | 64.29 | 0.3474 | 0.3474 | -0.1789 | 0.1000
syna061ptfr | 200 | 94 | 90 | 4 | 2 | 49.47 | 41.50 | 76.74 | 0.4947 | 0.49474 | - | 0

Table 12 shows the results of the assessment of each run for each participant and for the two tasks.

Figure 3: Best and average scores for systems using French as target in the CLEF QA campaigns 2004-2006 (best monolingual, average monolingual, best bilingual, average bilingual).

Figure 3 shows the best and the average scores for systems using French as target in the last three CLEF QA campaigns. For both the monolingual and the bilingual tasks, the best results were obtained by a French group, Synapse Développement. Another French group, LIA, reached the second position in both tasks. For the monolingual task, the systems returned between 27 and 129 correct answers at the first rank; for the bilingual task, between 19 and 86 correct answers. The test set was composed of 190 Factoid (F), Definition (D) and Temporally restricted (T) questions, and 10 List questions. The accuracy was calculated over all the first answers to F, D and T questions, together with the Confidence Weighted Score (CWS), the Mean Reciprocal Rank (MRR) and the K1 measure. For the List questions, the P@N was calculated. For the monolingual task, the best system returned 67.89% correct answers (overall accuracy at the first rank). We can observe that this system obtained better results for Definition questions (83.33%) than for Factoid questions (63.51%). The LIA system, which reached the second position in this task, returned 46.32% correct answers (overall accuracy at the first rank). We can also observe the difference between the results for the Factoid and the Definition questions: 37.84% correct answers for the Factoid and 76.19% for the Definition questions. For the bilingual task, the best system obtained 45.26% correct answers, as opposed to 34.74% for the LIA system. Note that the best system for the bilingual task (EN-to-FR) obtained worse results than the second system for the monolingual task. This year, before the assessment, the French assessors defined some rules to address problems encountered the previous year.
Concerning Temporally restricted questions, for example, to assess an answer as Correct the date, the period or the event had to be present in the document returned by the system. They also decided to check separately, at the end of the assessment, some questions which seemed difficult to them, to make sure that each answer had received the same treatment during the evaluation. The main problem encountered this year was related to the assessment of the List questions. This was a new kind of question this year, and participants answered these questions in different ways: some systems returned a list of answers on a single line, while others returned one answer per line. ELDA evaluated these answers according to each run: if a line contained one of the correct answers, or all the correct answers, it was assessed as Correct. The best system obtained 5 correct answers out of the 10 List questions in total. We can observe that the results for the List questions are not very meaningful, because there were few questions and few rules. In conclusion, this year one system obtained excellent results: Synapse Développement obtained 129 correct answers out of 200 (as opposed to 128 last year). This system is the best system for the French language and, this year again, the dominant one. In addition, we can observe the same great interest in Question Answering from the European (and now non-European) research community for the tasks using French as target language.

6.5 German as Target

Three research groups submitted runs for evaluation in the track with German as target language: the German Research Center for Artificial Intelligence (DFKI), FernUniversität Hagen (FUHA) and the Institute for Natural Language Processing in Stuttgart (IMS). All of them provided runs for the monolingual scenario, and just one group (DFKI) submitted runs for the cross-language English-German scenario. Two assessors with different profiles conducted the evaluation: a native German speaker with little knowledge of QA systems, and a researcher with advanced knowledge of QA systems and a good command of German. Compared to the previous editions of the evaluation forum, this year an increase in the performance of an aggregated virtual system was registered for both the monolingual and the cross-language tasks, as well as for the best cross-language system's result (Figure 4). Given the increased complexity of the task (no question type provided, supporting snippets required) and of the questions (definition and list), the stability of the best monolingual result can also be considered a gain in terms of performance.

Figure 4: Results evolution 2004-2006 (best and aggregated monolingual, best and aggregated cross-language).

Except for FUHA, the other two groups provided more than one possible answer per question, of which only the first three were manually evaluated. In order to obtain a measure of performance for systems providing several answers per question, the Mean Reciprocal Rank (MRR) over right answers was used.

Table 13: Best and Aggregated Mono; Best and Aggregated Cross
Year | Best Mono | Aggregated Mono | Best Cross | Aggregated Cross
2006 | 42.33 | 64.02 | 32.98 | 33.86
2005 | 43.5 | 58.5 | 23 | 28
2004 | 34.01 | 43.65 | 0 | 0

Table 14 summarizes the distribution of the right, inexact and unsupported answers over the first three ranked positions as delivered by the systems, as well as the accuracy and MRR for each of the runs.
Table 14: Performance of evaluated systems
Run ID | Right 1st/2nd/3rd | ineXact 1st/2nd/3rd | Unsupported 1st/2nd/3rd | Accuracy | MRR
dfki061dedeM | 80 / 8 / 7 | 6 / 6 / 1 | 8 / 4 / 1 | 42.32 | 45.67
dfki062dedeM | 63 / 15 / 3 | 4 / 5 / 3 | 8 / 0 / 2 | 33.33 | 37.83
fuha061dedeM | 61 / 0 / 0 | 0 / 0 / 0 | 4 / 0 / 0 | 32.27 | 32.27
fuha062dedeM | 64 / 0 / 0 | 1 / 0 / 0 | 4 / 0 / 0 | 33.86 | 33.86
ims061dedeM | 25 / 2 / 3 | 0 / 1 / 0 | 8 / 2 / 0 | 13.22 | 14.28
ims062dedeM | 23 / 3 / 2 | 0 / 1 / 0 | 8 / 2 / 0 | 12.16 | 13.31
dfki061endeC | 62 / 5 / 7 | 3 / 4 / 2 | 6 / 3 / 0 | 32.8 | 35.36
dfki062endeC | 50 / 10 / 2 | 5 / 4 / 2 | 3 / 2 / 1 | 26.45 | 29.45

Two things can be concluded from the answer distribution in Table 14: first, there is a fair number of inexact and unsupported answers, which shows that performance could be improved with better answer extraction; second, the fair number of right answers in the second and third ranked positions indicates that there is still room for improvement with a more focused answer selection.

Table 15: System performance (details)
Run ID | Right # | Right % | W | X | U | % F [113] | % T [39] | % D [37] | P@N L [9] | NIL [20] F | NIL P | NIL R | CWS | K1
dfki061dedeM | 80 | 42.32 | 95 | 6 | 8 | 37.82 | 0 | 63.64 | 25.93 | 0.35 | 0.28 | 0.45 | 0 | 0
dfki062dedeM | 63 | 33.33 | 114 | 4 | 8 | 30.13 | 0 | 48.48 | 33.33 | 0.32 | 0.27 | 0.4 | 0 | 0
fuha061dedeM | 61 | 32.27 | 124 | 0 | 4 | 31.41 | 0 | 36.36 | 11.11 | 0.23 | 0.13 | 0.95 | 0.3 | 0.18
fuha062dedeM | 64 | 33.86 | 120 | 1 | 4 | 32.69 | 0 | 39.39 | 11.11 | 0.24 | 0.14 | 0.95 | 0.32 | 0.19
ims061dedeM | 25 | 13.22 | 156 | 0 | 8 | 14.1 | 0 | 9.09 | 25.42 | 0.2 | 0.12 | 0.55 | 0.07 | -0.33
ims062dedeM | 23 | 12.16 | 158 | 0 | 8 | 12.82 | 0 | 9.09 | 26.43 | 0.19 | 0.12 | 0.5 | 0.06 | -0.33
dfki061endeC | 62 | 32.8 | 117 | 3 | 6 | 28.21 | 0 | 56.25 | 10 | 0.31 | 0.21 | 0.6 | 0 | 0
dfki062endeC | 50 | 26.45 | 130 | 5 | 3 | 21.79 | 0 | 50 | 10 | 0.33 | 0.22 | 0.65 | 0 | 0

The details of the systems' results can be seen in Table 15, in which the performance measures were computed only for the first ranked answer to each question, except for the list questions. It is interesting to observe that none of the systems managed to answer any temporally restricted question correctly.

Table 16 describes the inter-rater disagreement on the assessment of answers in terms of question and answer disagreement. Question disagreement reflects the number of questions on which the assessors delivered different judgments, and answer disagreement is the total number of answers disagreed on. Alongside the total figures for both types of disagreement, a breakdown by question type (Factoid, Definition, List) and by assessment value (ineXact, Unsupported, Wrong/Right) is listed. The answer disagreements of type Wrong/Right are trivial errors made during the assessment process, when a right answer was considered wrong by mistake or the other way around, while those of type X or U reflect genuinely different judgments, whereby one assessor considered an answer inexact or unsupported while the other marked it as right or wrong.

Table 16: Inter-assessor agreement/disagreement (breakdown)
Run ID | # Questions | # Answers | Q-disagreements: Total / F / D / L | A-disagreements: Total / X / U / W-R
dfki061dedeM | 198 | 437 | 35 / 28 / 7 / 0 | 44 / 20 / 16 / 8
dfki062dedeM | 198 | 476 | 28 / 19 / 6 / 3 | 40 / 13 / 19 / 8
fuha061dedeM | 198 | 198 | 12 / 8 / 4 / 0 | 11 / 3 / 2 / 6
fuha062dedeM | 198 | 198 | 13 / 8 / 5 / 0 | 12 / 4 / 2 / 6
ims061dedeM | 198 | 432 | 15 / 13 / 0 / 2 | 30 / 13 / 9 / 8
ims062dedeM | 198 | 436 | 17 / 15 / 0 / 2 | 28 / 5 / 14 / 9
dfki061endeC | 198 | 405 | 26 / 20 / 5 / 1 | 33 / 12 / 16 / 5
dfki062endeC | 198 | 402 | 27 / 21 / 6 / 0 | 35 / 21 / 10 / 4
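The Accuracy and MRR columns of Table 14 above can be reproduced from the per-rank counts of right answers. A minimal check, assuming the 189 assessed non-list questions of Table 15 as the denominator:

```python
def rank_based_scores(right_per_rank, n_questions=189):
    """right_per_rank[i] = number of questions whose first right answer is at rank i+1."""
    acc = right_per_rank[0] / n_questions
    mrr = sum(r / rank for rank, r in enumerate(right_per_rank, start=1)) / n_questions
    return round(100 * acc, 2), round(100 * mrr, 2)

# dfki061dedeM had 80, 8 and 7 right answers at ranks 1, 2 and 3:
print(rank_based_scores([80, 8, 7]))   # -> (42.33, 45.68), i.e. Table 14's 42.32 / 45.67 up to rounding
```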
6.6 Italian as Target

Two groups participated in the Italian monolingual task, ITC-irst and the Universidad Politécnica de Valencia (UPV), while one group, the Università La Sapienza di Roma, participated in the cross-language EN-IT task. In total, five runs were submitted.

Figure 4: Best and average performance in the monolingual and bilingual tasks (CLEF05 and CLEF06).

For the first time a cross-language task with Italian as target was chosen to test a participating system. The best performance in the monolingual task was obtained by the UPV, which achieved an accuracy of 28.19%; almost the same result was recorded last year (see Figure 4). The average accuracy in the monolingual task was 26.41%, an improvement of more than 2% with respect to last year's results. The accuracy in the bilingual task was 17.02%, achieved by both submitted runs. Over the years the overall accuracy has steadily decreased: starting from 25.17% in 2004, it reached 24.08% in 2005 and 22.06% this year. This could be partly due to newcomers, who usually get lower scores, and to the first experiments with bilingual tasks.

Table 17: Results of the monolingual and bilingual Italian runs
Run | Right (#) | Wrong (#) | ineXact (#) | Unsupported (#) | Overall accuracy (%) | Accuracy over F (%) | Accuracy over D (%) | P@N for L | CWS
irst06itit | 43 | 121 | 10 | 13 | 22.87 | 25.00 | 17.07 | 0.1528 | 0.19602
upv_061itit | 53 | 124 | 6 | 5 | 28.19 | 28.47 | 26.83 | 0.0833 | 0.12330
upv_062itit | 53 | 127 | 4 | 3 | 28.19 | 27.78 | 29.27 | 0.1667 | 0.13209
Roma061enit | 32 | 141 | 4 | 11 | 17.02 | 15.28 | 24.39 | 0.1000 | 0.08433
Roma062enit | 32 | 141 | 4 | 11 | 17.02 | 15.28 | 24.39 | 0.1500 | 0.08433

From the results shown in Table 17 it can be seen that the Universidad Politécnica de Valencia (UPV) submitted two runs in the monolingual task and achieved the best overall performance; its accuracy over Definition and Factoid questions ranged from 26.83% to 29.27%. ITC-irst submitted one run and achieved much better accuracy over Factoid questions (25.00%) than over Definition questions (17.07%). As previously mentioned, the Università La Sapienza di Roma submitted two runs in the cross-language EN-IT task, performing much better on the Definition questions (24.39%) than on the Factoid questions (15.28%). As far as List questions are concerned, all participating systems performed rather poorly, with a P@N ranging from 0.08 to 0.17. This implies that more in-depth research on these questions and on the measures for their evaluation is still needed.

Table 18: Temporally restricted questions: Right, Unsupported and Wrong answers, and accuracy
Run | R | U | W | Accuracy %
irst06itit | 6 | 6 | 26 | 15.79
Roma061enit | 2 | 4 | 32 | 5.26
Roma062enit | 2 | 4 | 32 | 5.26
upv_061itit | 8 | 0 | 30 | 21.05
upv_062itit | 9 | 0 | 29 | 23.68

Temporally restricted questions represented a challenge for the systems, which generally achieved a lower-than-average accuracy on this sub-category. The Universidad Politécnica de Valencia achieved the best performance, 23.68% (see Table 18). The evaluation process did not present particular problems, although it was more demanding than usual because of the need to check the supporting text snippets. All runs were in any case assessed by two judges. The inter-assessor agreement was 90.14% on average, most disagreements being between U and X. A couple of cases of disagreement between R and W were due merely to trivial mistakes.

6.7 Portuguese as Target

This year five research groups took part in the tasks with Portuguese as target language, submitting ten runs: seven in the monolingual task, two with English as source, and one with Spanish.
Two new groups joined for Portuguese, the University of Porto and the Brazilian NILC, while LCC participated with an English-Portuguese run only. The Universidade de Évora did not participate this year. Table 19 presents the overall results for the 188 non-list questions. We present values both taking into account only the first answer to each question and, for the only system where this makes any difference, all answers, assessing a question as right (or partially right) if any answer, irrespective of position, was right (or partially right). We have also distinguished inexact answers (X) between too little and too much information, coded respectively as X- and X+. Just like last year, Priberam achieved the best results by a clear margin. Moreover, their Spanish-Portuguese run, prib061espt, despite using a different (although closely related) language as source, managed to achieve the second best result. On the other hand, the overall results for both Priberam and Esfinge show only a small improvement compared to 2005. It remains to be seen whether this year's questions were more difficult or whether the systems themselves underwent few changes.

Table 19: Results of the runs with Portuguese as target for non-list questions: first answers only, and all answers (marked with *). There were only 18 NIL questions in the Portuguese dataset. The confidence weighted score is computed for non-list questions only.
Run | R (#) | W (#) | X+ (#) | X- (#) | U (#) | Overall accuracy (%) | Accuracy over F (%) | Accuracy over D (%) | NIL precision | NIL recall
esfg061ptpt | 50 | 138 | 7 | 2 | 3 | 25.0 | 24.34 | 27.08 | 13.59 | 63.64
esfg062ptpt | 46 | 140 | 6 | 6 | 1 | 23.5 | 21.19 | 29.16 | 16.04 | 94.44
nilc061ptpt | 0 | 189 | 1 | 8 | 2 | 0.0 | 0.00 | 0.00 | - | -
nilc062ptpt | 3 | 190 | 0 | 5 | 2 | 1.5 | 1.97 | 0.00 | 8.57 | 16.67
prib061ptpt | 134 | 58 | 6 | 1 | 1 | 67.0 | 64.47 | 75.00 | 33.33 | 55.56
uporto061ptpt | 23 | 172 | 0 | 0 | 0 | 11.5 | 10.06 | 17.02 | 9.74 | 94.74
uporto062ptpt | 26 | 162 | 2 | 3 | 0 | 13.0 | 12.38 | 17.02 | 9.55 | 88.24
esfg061enpt | 29 | 164 | 3 | 2 | 2 | 14.5 | 13.82 | 16.67 | 12.77 | 100.00
lcc_061enpt | 18 | 166 | 2 | 10 | 4 | 9.0 | 8.63 | 8.70 | 28.57 | 3.67
lcc_061enpt* | 61 | 112 | 3 | 18 | 7 | 30.5 | 36.1 | 12.5 | - | -
prib061espt | 67 | 124 | 2 | 2 | 5 | 33.5 | 26.97 | 54.16 | 14.71 | 27.78

We also provide in Table 20 the overall accuracy considering (and evaluating) independently all the different answers provided by the systems.

Table 20: Results of the runs with Portuguese as target: all answers
Run | R (#) | W (#) | X+ (#) | X- (#) | U (#) | Overall accuracy (%)
esfg061ptpt | 50 | 142 | 11 | 2 | 3 | 24.04
esfg062ptpt | 46 | 144 | 7 | 6 | 1 | 22.55
nilc061ptpt | 0 | 188 | 1 | 8 | 2 | 0.00
nilc062ptpt | 3 | 190 | 0 | 5 | 2 | 1.50
prib061ptpt | 134 | 58 | 6 | 1 | 1 | 67.00
uporto061ptpt | 36 | 173 | 0 | 0 | 0 | 17.14
uporto062ptpt | 42 | 164 | 3 | 6 | 0 | 19.44
esfg061enpt | 29 | 166 | 3 | 2 | 2 | 14.36
lcc_061enpt | 141 | 1211 | 11 | 49 | 50 | 9.64
prib061espt | 65 | 114 | 2 | 2 | 5 | 34.57

Table 21 shows the results for each answer type (loc ≡ location, mea ≡ measure, org ≡ organisation, per ≡ person, man ≡ manner, obj ≡ object, oth ≡ other, tim ≡ time). In parentheses we display the subset of temporally restricted questions, and we add the list questions, in order to provide the full picture.
Table 21: Results of the assessment of the monolingual Portuguese runs: first answers only, except for lists, for which (for this table) one correct member of the list made the answer count as correct. Figures in parentheses are the (temporally restricted + list) subsets.
Correct answers | D: obj | D: org | D: oth | D: per | F: loc | F: mea | F: org | F: oth | F: per | F: tim | Total | %
(questions) | 7 | 8 | 24 | 9 | 25 (1+0) | 21 (2+0) | 23 (4+3) | 30 (5+3) | 34 (10+3) | 19 (0+1) | 200 (22+10) |
esfg061ptpt | 2 | 4 | 5 | 2 | 9 | 3 | 2 (0+0) | 7 (1+1) | 13 (3+0) | 3 (0+0) | 50 (4+1) | 25.0
esfg062ptpt | 2 | 4 | 6 | 2 | 8 | 3 | 1 (0+0) | 7 (1+1) | 10 (3+0) | 3 (0+0) | 46 (4+1) | 24.5
nilc061ptpt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0
nilc062ptpt | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 (1+0) | 0 | 1 (0+0) | 3 (1+0) | 1.5
prib061ptpt | 6 | 7 | 15 | 8 | 18 | 12 | 14 (0+2) | 18 (1+2) | 22 (2+2) | 14 (0+1) | 134 (3+7) | 67.0
uporto061ptpt | 1 | 0 | 2 | 5 | 3 | 2 | 0 (0+0) | 3 (1+1) | 6 (3+0) | 1 (0+0) | 23 (4+1) | 11.5
uporto062ptpt | 1 | 0 | 2 | 5 | 4 | 1 | 0 | 5 (1+1) | 5 (3+0) | 3 (0+0) | 26 (4+1) | 13.0
combination | 6 | 8 | 18 | 8 | 19 | 12 | 15 | 20 | 25 | 14 | 145 | 72.5
esfg061enpt | 1 | 3 | 2 | 2 | 7 | 2 | 1 (0+0) | 5 (1+1) | 5 (2+0) | 1 (0+0) | 29 (3+1) | 15.4
lcc_061enpt | 1 | 2 | 2 | 0 | 2 | 1 | 1 (0+0) | 0 (0+1) | 2 (1+0) | 5 (0+0) | 18 (1+1) | 9.0
pribe061espt | 3 | 4 | 12 | 7 | 7 | 8 | 5 (0+0) | 7 (1+0) | 7 (2+0) | 7 (0+0) | 67 (3+0) | 35.6

A virtual run, called combination, is included in Table 21 and computed as follows: if any of the participating systems found a right answer, it is considered right in the combination run. Ideally, this combination run measures the potential achievement of cooperation among all participants. However, for Portuguese this combination does not significantly outperform the best performance: Priberam alone accounts for 92.4% of the combination run.

We have also analysed the size in words of both answers and justification snippets, as displayed in Table 22 (computations were made excluding NIL answers). Interestingly, Priberam provided the shortest justifications.

Table 22: Size of justifying snippets, in words
Run | Answers (#) | Non-NIL answers (#) | Average answer size | Average answer size (R only) | Average snippet size | Average snippet size (R only)
esfg061ptpt | 208 | 105 | 3.4 | 3.2 | 108.8 | 108.5
esfg062ptpt | 204 | 98 | 3.8 | 3.5 | 109.1 | 105.5
nilc061ptpt | 200 | 200 | 5.7 | - | 5.7 | -
nilc062ptpt | 200 | 165 | 4.9 | - | 4.4 | -
prib061ptpt | 200 | 170 | 3.7 | 3.8 | 31.5 | 30.3
uporto061ptpt | 210 | 29 | 3.1 | 3.2 | 39.7 | 32.7
uporto062ptpt | 216 | 59 | 3.0 | 2.8 | 43.1 | 33.7
esfg061enpt | 202 | 61 | 3.5 | 3.5 | 95.3 | 106.1
lcc_061enpt | 1463 | 1449 | 5.2 | 4.1 | 35.2 | 34.6
prib061espt | 200 | 166 | 3.5 | 4.3 | 31.3 | 29.1

In Table 23 we compare the accuracy of the systems on the 22 temporally restricted questions in the Portuguese question set with their scores on the non-temporally restricted ones and with their overall performance.

Table 23: Accuracy on temporally restricted questions (all answers considered), compared to non-temporally restricted ones and to overall accuracy
Run | Questions with at least one correct answer (#) | Accuracy for T.R.Q. (%) | Accuracy for non-T.R.Q. (%) | Total accuracy (%)
esfg061ptpt | 4 | 18.18 | 24.73 | 24.04
esfg062ptpt | 4 | 18.18 | 23.08 | 22.55
nilc061ptpt | 0 | 0.00 | 0.00 | 0.00
nilc062ptpt | 1 | 4.55 | 1.12 | 1.50
prib061ptpt | 7 | 31.82 | 71.35 | 67.00
uporto061ptpt | 4 | 18.18 | 10.11 | 10.95
uporto062ptpt | 4 | 18.18 | 11.34 | 12.04
esfg061enpt | 3 | 13.64 | 14.44 | 14.36
lcc_061enpt | 5 | 2.82 | 4.35 | 4.17
prib061espt | 3 | 13.64 | 35.96 | 33.50

Finally, a total of twelve questions were defined by the organization as requiring a list as the proper answer. The fact that the systems had to find out by themselves whether multiple or single answers were expected was a new feature this year, and it was not handled conveniently by most systems. In fact, two systems (Priberam and NILC) completely ignored this and provided a single answer to every question, while two other systems, although attempting to deal with list questions, seemed to fail to identify them appropriately: RAPOSA (UPorto) provided multiple answers only to non-list questions, and Esfinge produced 12 answers for ten questions.
In fact, only LCC presented multiple answers systematically, yielding an average of 7.32 answers per question, while no other group exceeded 1.1. We believe further study should be devoted to list questions in the coming years, since the distinction between closed and open lists, although acknowledged, was not properly taken into consideration. We therefore chose to handle all these questions alike, assigning them the following accuracy score: the number of correct answers (with X counting as ½) divided by the sum of the number of existing answers in the collections and the number of wrong distinct answers provided by the system. The results are displayed in Table 24. For closed lists (where "one" answer might contain all the items, such as "Lituânia, Estónia e Letónia"), we still counted the answers individually (3 in this example).

Table 24: Results for list questions

Question  Known answers  esfg061ptpt  esfg062ptpt  nilc061ptpt  nilc062ptpt  prib061ptpt  uporto061ptpt  uporto062ptpt  esfg061enpt  lcc_061enpt  esfg061espt
205       3              0/1          0/1          0/1          0/1          0/1          0/1            0/1            0/1          0/10         0/1
399       3              0/1          0/1          0/1          0/1          3/3          0/1            0/1            0/1          3/9          0/1
400       3              0.5/3        0.5/3        0/1          0/1          3/3          0/1            0/1            0.5/3        0/8          3/3
759       3              0/1          0/1          0.5/1        0.5/1        1/1          0/1            0/1            0/1          0.5/10       0/1
770       3              0.5/1        0.5/1        0/1          0/1          1/1          0/1            0/1            0/1          2/10         0/1
784       5              0/1          0/1          0/1          0/1          1/1          0/1            0/1            0/1          1/9          0/1
785       3              0/1          0/1          0/1          0/1          0.5/1        0/1            0/1            0/1          0/10         0/1
786       3              0/1          0/1          0.5/1        0.5/1        1/1          0/1            0/1            0/1          2/10         0/1
795       5              0/1          0/1          0/1          0/1          1/1          0/1            0/1            0/1          2/7          0/1
score     -              0.030        0.030        0.037        0.037        0.396        0              0              0.019        0.113        0.011

6.8 Spanish as Target

Participation in the Spanish as Target subtask is still growing. Nine groups, two more than last year, submitted 17 runs: 12 monolingual, 3 from English, 1 from French and 1 from Portuguese. Tables 25 and 26 summarise the results of the monolingual and cross-lingual runs respectively, reporting the number of Right (R), Wrong (W), Inexact (X) and Unsupported (U) answers. The tables also show the accuracy (as a percentage) on factoids (F), factoids with temporal restriction (T), definitions (D) and list questions (L). Best values are marked in bold face. The best performing systems have improved their results (as seen in Figure 5), mainly with respect to factoids; however, performance on questions with a temporal restriction did not vary significantly. Last year the answering of definition questions about persons and organisations was almost solved; although this year's set of definition questions was more realistic, systems still slightly improved their performance.
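To make the column definitions of Tables 25 and 26 concrete, the following minimal Python sketch computes, from a list of assessed first answers, the R/W/X/U counts and the overall and per-type accuracies. It is only an illustration of the measures described above, not the official evaluation script: the tuple representation and the type codes "F", "T", "D" and "L" are our own.

    from collections import Counter

    # Each assessed first answer is a pair (question_type, judgement), where
    # question_type is one of "F", "T", "D", "L" and judgement is one of
    # "R" (right), "W" (wrong), "X" (inexact), "U" (unsupported).
    def summarise_run(assessed_answers):
        counts = Counter(judgement for _, judgement in assessed_answers)  # R, W, X, U totals
        overall = 100.0 * counts["R"] / len(assessed_answers)             # overall accuracy (%)
        by_type = {}
        for qtype in ("F", "T", "D", "L"):
            judged = [j for t, j in assessed_answers if t == qtype]
            by_type[qtype] = 100.0 * sum(j == "R" for j in judged) / len(judged) if judged else None
        return counts, overall, by_type

    # Tiny made-up example with three questions:
    print(summarise_run([("F", "R"), ("D", "W"), ("T", "R")]))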
Table 25: Results with Spanish as target, monolingual runs. The bracketed figures give the number of questions of each type: F [108], T [40], D [42], L [10], NIL [20]. Under NIL, F, P and R are the F-measure, precision and recall on NIL questions; r is the correlation coefficient between the systems' self-score and the correctness of their answers; the last column is the answer-extraction percentage discussed below.

Run           R (#)  R (%)  W (#)  X (#)  U (#)  % F    % T    % D    % L    NIL F  NIL P  NIL R  r       % Answer Extraction
pribe061      105    52.50  86     4      5      55.56  30.00  69.05  40.00  0.44   0.34   0.60   -       84.68
inao061       102    51.00  86     3      9      47.22  35.00  83.33  20.00  0.46   0.38   0.60   0.216   86.44
vein061       80     40.00  112    3      5      32.41  25.00  83.33  -      0.34   0.21   0.80   0.133   86.02
alia061       72     36.00  105    15     8      38.89  22.50  50.00  -      0.34   0.22   0.75   0.322   69.23
upv_061       70     35.00  119    5      6      37.04  25.00  47.62  -      0.43   0.33   0.65   0.194   70.71
upv_062       57     28.50  123    6      14     27.78  25.00  40.48  -      0.41   0.32   0.60   0.163   66.28
aliv061       56     28.00  123    8      13     29.63  22.50  35.71  -      0.34   0.33   0.35   0.190   65.12
aliv062       56     28.00  132    6      6      26.85  25.00  40.48  -      0.33   0.26   0.45   0.153   72.73
mira062       41     20.50  148    4      7      21.30  17.50  23.81  10.00  0.35   0.35   0.35   0.145   43.62
sinaiBruja06  39     19.50  146    6      9      16.67  17.50  33.33  -      0.23   0.13   0.90   -0.119  79.59
mira061       37     18.50  154    3      6      21.30  15.00  16.67  10.00  0.34   0.26   0.50   0.136   51.39
aske061       27     13.50  143    1      29     15.74  12.50  9.52   10.00  0.08   0.20   0.05   0.199   62.79

List questions were introduced this year, so their evaluation deserves some attention. We distinguished two types of list question: conjunctive and disjunctive (as presented in [1]). Conjunctive list questions ask for a specific set of items and are judged Right only if all the items are present in the answer, for example "Nombre los tres Beatles que siguen vivos" (Name the three Beatles who are still alive). Disjunctive list questions ask for an undetermined number of items, for example "Nombre luchadores de Sumo" (Name Sumo wrestlers). In both cases only the first answer of each system was evaluated.

Table 26: Results with Spanish as target, cross-lingual runs (columns as in Table 25)

Run           R (#)  R (%)  W (#)  X (#)  U (#)  % F    % T    % D    % L  NIL F  NIL P  NIL R  r      % Answer Extraction
pribe061ptes  72     36.00  123    3      2      39.81  27.50  38.10  20   0.29   0.29   0.30   -      78.26
alia061enes   41     20.50  134    9      16     17.59  12.50  40.48  -    0.31   0.19   0.80   0.142  65.08
lcc_061enes   38     19.00  141    14     7      20.37  25.00  14.29  -    0.35   0.35   0.35   0.067  55.07
aske061fres   23     11.50  162    -      15     13.89  10.00  7.14   10   0.08   0.17   0.05   0.302  53.49
aske061enes   12     6.00   178    -      10     6.48   2.50   7.14   10   0.10   1.00   0.05   0.091  40.00

Regarding the NIL questions, Tables 25 and 26 show the harmonic mean (F) of precision (P) and recall (R). The best performing systems have again increased their performance on NIL questions (see Table 27). The correlation coefficient r between the self-score and the correctness of the answers has increased for the majority of systems, although the results are not yet good enough. This year a supporting text snippet was requested, so we have also evaluated the systems' ability to extract the answer when the snippet contains it. The last column of Tables 25 and 26 shows the percentage of cases in which the correct answer was correctly extracted. This information is very useful for diagnosing whether a lack of performance is due to passage retrieval or to answer extraction.

Figure 5: Evolution of the best performing systems, 2003-2006: best overall accuracy, best accuracy on factoids and best accuracy on definitions for each campaign.

Regarding the cross-lingual runs, it is worth mentioning that Priberam achieved, in the Portuguese to Spanish task, a result comparable to the monolingual runs.
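The NIL F-measure reported in Tables 25 and 26, and tracked in Table 27 below, is the harmonic mean of NIL precision and recall. The sketch below assumes the usual reading of these measures – precision as the fraction of returned NIL answers that were correct, recall as the fraction of the 20 NIL questions answered with NIL; this reading is our interpretation, not a quotation from the track guidelines.

    def nil_f_measure(correct_nil, returned_nil, total_nil_questions=20):
        """Harmonic mean of NIL precision and recall (assumed definitions, see above)."""
        precision = correct_nil / returned_nil if returned_nil else 0.0
        recall = correct_nil / total_nil_questions
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Hypothetical counts roughly matching inao061's published P = 0.38 and R = 0.60
    # (Table 25): nil_f_measure(12, 32) gives approximately 0.46.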
Table 27: Evolution of the best results on NIL questions

Year  F-measure
2003  0.25
2004  0.30
2005  0.38
2006  0.46

All the answers were assessed anonymously, considering all systems' answers simultaneously, question by question. Inter-annotator agreement was evaluated over 985 answers assessed by two judges: only 2.5% of the judgements differed, and the resulting kappa value was 0.93.

7 Conclusions

The QA track at CLEF 2006 has once again demonstrated the interest in Question Answering in languages other than English. In fact, both the number of participants and the number of submitted runs have grown, following the positive trend of the previous campaigns. Equally positive was the fact that, despite the loss of Finnish, two additional languages from Eastern Europe were added, strengthening the cross-linguality of QA@CLEF. The balance between tradition and innovation – i.e. the introduction of list questions and supporting text snippets – has proved to be a good solution, which allows both newcomers and veterans to test their systems against adequately challenging tasks and, at the same time, to compare themselves with previous exercises. Generally speaking, the results show an improvement in performance, with best accuracy significantly higher than in previous campaigns, both in the monolingual and in the bilingual tasks.
As far as the organisation of the campaign is concerned, the introduction of new elements such as list questions and supporting snippets implied a significant increase in workload, both in the question collection and in the evaluation phase, which was particularly demanding for language groups with a large number of participants. A better distribution of the workload, together with solutions to speed up the evaluation process, including the automatic assessment of part of the submissions, will be essential in the next campaigns.
A future perspective for QA is certainly outlined by the two pilot tasks offered in 2006, i.e. AVE and WiQA, the latter in particular representing a significant step towards a more realistic scenario, in which queries are carried out on the Web. For these reasons, a quick integration of these experiments into the main task is hoped for.

Acknowledgements

The authors would like to thank Donna Harman for her valuable feedback and advice, and Diana Santos for her precious contribution to the organisation of the campaign and the revision of this paper. Paulo Rocha is thankful for the many useful comments from, and the overall discussion with, Diana Santos on the Portuguese part. Paulo Rocha was supported by the Portuguese Fundação para a Ciência e Tecnologia within the Linguateca project, through grant POSI/PLP/43931/2001, co-financed by POSI. Bogdan Sacaleanu was supported by the German Federal Ministry of Education and Research (BMBF) through the projects HyLaP and COLLATE II.

References

1. QA@CLEF 2006 Organizing Committee: Guidelines 2006. http://clef-qa.itc.it/guidelines.html
2. WiQA Website: http://ilps.science.uva.nl/WiQA/
3. AVE Website: http://nlp.uned.es/QA/AVE/
4. Herrera, J., Peñas, A., Verdejo, F.: Question Answering Pilot Task at CLEF 2004. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.): Multilingual Information Access for Text, Speech and Images. Lecture Notes in Computer Science, Vol. 3491. Springer-Verlag, Berlin Heidelberg New York (2005) 581–590
5. Magnini, B., Vallin, A., Ayache, C., Erbach, G., Peñas, A., de Rijke, M., Rocha, P., Simov, K., Sutcliffe, R.: Overview of the CLEF 2004 Multilingual Question Answering Track.
In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.): Multilingual Information Access for Text, Speech and Images. Lecture Notes in Computer Science, Vol. 3491. Springer-Verlag, Berlin Heidelberg New York (2005) 371–391
6. Rocha, P., Santos, D.: CLEF: Abrindo a porta à participação internacional em avaliação de RI do português. In: Santos, D. (ed.): Avaliação conjunta: um novo paradigma no processamento computacional da língua portuguesa. IST Press, Lisbon (2006) (in press)
7. Santos, D., Rocha, P.: The Key to the First CLEF with Portuguese: Topics, Questions and Answers in CHAVE. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.): Multilingual Information Access for Text, Speech and Images. Lecture Notes in Computer Science, Vol. 3491. Springer-Verlag, Berlin Heidelberg New York (2005) 821–832
8. Spärck Jones, K.: Is Question Answering a Rational Task? In: Bernardi, R., Moortgat, M. (eds.): Questions and Answers: Theoretical and Applied Perspectives. Second CoLogNET-ElsNET Symposium, Amsterdam (2003) 24–35
9. Vallin, A., Magnini, B., Giampiccolo, D., Aunimo, L., Ayache, C., Osenova, P., Peñas, A., de Rijke, M., Sacaleanu, B., Santos, D., Sutcliffe, R.: Overview of the CLEF 2005 Multilingual Question Answering Track. In: Cross Language Evaluation Forum: Working Notes for the CLEF 2005 Workshop, Vienna, Austria, 21–23 September 2005
10. Voorhees, E.M.: Overview of the TREC 2002 Question Answering Track. In: Voorhees, E.M., Buckland, L.P. (eds.): Proceedings of the Eleventh Text Retrieval Conference (TREC 2002). NIST Special Publication 500-251, Washington DC (2002) 115–123