
OVERVIEW OF THE CLEF 2008 MULTILINGUAL QUESTION ANSWERING TRACK

Pamela Forner

forner@celct.it 1

Anselmo Peñas

anselmo@lsi.uned.es 5

Eneko Agirre

e.agirre@ehu.es 2

Iñaki Alegria

i.alegria@ehu.es 10

Corina Forăscu

Nicolas Moreau

moreau@elda.org 6

Petya Osenova

petya@bultreebank.org 0

Prokopis Prokopidis

prokopis@ilsp.gr 7

Paulo Rocha

Paulo.Rocha@di.uminho.pt 8

Bogdan Sacaleanu

bogdan@dfki.de 3

Richard Sutcliffe

Erik Tjong Kim Sang

e.f.tjong.kim.sang@rug.nl 11

0 BTB, Bulgaria
1 CELCT, Trento, Italy
2 Computer Science Department, University of Basque Country, Spain
3 DFKI, Germany
4 DLTG, University of Limerick, Ireland
5 Departamento de Lenguajes y Sistemas Informáticos, UNED, Madrid, Spain
6 ELDA/ELRA, Paris, France
7 ILSP Greece, Athena Research Center
8 Linguateca, DEI UC, Portugal
9 UAIC and RACAI, Romania
10 University of Basque Country, Spain
11 University of Groningen


The QA campaign at CLEF [1] was mainly the same as that proposed last year. The results and the analyses reported by last year's participants suggested that the changes introduced in the previous campaign had led to a drop in systems' performance. For this year's competition it was therefore decided to essentially replicate last year's exercise. Following last year's experience, some QA pairs were grouped in clusters. Every cluster was characterized by a topic (not given to participants). The questions in a cluster contained co-references between one of them and the others. Moreover, as last year, the systems were given the possibility to search for answers in Wikipedia as a document corpus besides the usual newswire collection. In addition to the main task, three additional exercises were offered, namely the Answer Validation Exercise (AVE), the Question Answering on Speech Transcriptions (QAST), which continued last year's successful pilot, and Word Sense Disambiguation for Question Answering (QA-WSD). As a general remark, it must be said that the task still proved to be very challenging for participating systems. In comparison with last year's results, the Best Overall Accuracy dropped significantly from 41.75% to 19% in the multi-lingual subtasks, while it increased a little in the monolingual sub-tasks, going from 54% to 63.5%.

1 Introduction

QA@CLEF 2008 was carried out according to the spirit of the campaign, consolidated in previous years. Besides the classical main task, three additional exercises were proposed:
• the main task: several monolingual and cross-language sub-tasks were offered; Bulgarian, English, French, German, Italian, Portuguese, Romanian, Greek, Basque and Spanish were proposed as both query and target languages;
• the Answer Validation Exercise (AVE) [2]: in its third round, it was aimed at evaluating answer validation systems based on textual entailment recognition. In this task, systems were required to emulate human assessment of QA responses and decide whether an Answer to a Question is correct or not according to a given Text. Results were evaluated against the QA human assessments;
• the Question Answering on Speech Transcripts (QAST) [3,14]: continuing last year’s successful pilot task, it aimed at providing a framework in which QA systems could be evaluated when the answers to factual and definition questions must be extracted from spontaneous speech transcriptions;
• the Word Sense Disambiguation for Question Answering (QA-WSD) [4]: a pilot task which provided the questions and collections with already disambiguated word senses in order to study their contribution to QA performance.

As far as the main task is concerned, following last year's experience, the exercise consisted of topic-related questions, i.e. clusters of questions which were related to the same topic and contained co-references between one question and the others. The requirement for questions related to a topic necessarily implies that the questions refer to common concepts and entities within the domain in question. This is accomplished either by co-reference or by anaphoric reference to the topic, implicitly or explicitly expressed in the first question or in its answer.

Moreover, besides the usual news collections provided by ELRA/ELDA, articles from Wikipedia were considered as an answer source. Some questions could have answers only in one collection, i.e. either only in the news corpus or in Wikipedia.

As a general remark, this year we had the same number of participants as in the 2007 campaign, but the number of submissions went up. Due to the complexity of the innovations introduced in 2007 (topics and anaphora, list questions, the Wikipedia corpus), the questions became considerably more difficult and the performance of systems dropped dramatically, so people were disinclined to continue the following year (i.e. 2008), reversing the positive trend in participation registered in the previous campaigns.

As reflected in the results, the task proved to be even more difficult than expected. Results improved in the monolingual subtasks but are still very low in the cross-lingual subtasks.

This paper describes the preparation process and presents the results of the QA track at CLEF 2008. In section 2, the tasks of the track are described in detail. The results are reported in section 3. In section 4, some final analysis about this campaign is given.

2 Task Description

As far as the main task is concerned, the consolidated procedure was followed, capitalizing on the experience of the task proposed in 2007.

The exercise consisted of topic-related questions, i.e. clusters of questions which were related to the same topic and contained co-references between one question and the others. Neither the question types (F, D, L) nor the topics were given to the participants.

The systems were fed with a set of 200 questions, which could concern facts or events (Factoid questions), definitions of people, things or organisations (Definition questions), or lists of people, objects or data (List questions), and were asked to return up to three exact answers per question, where exact meant that neither more nor less than the information required was given.

The answer needed to be supported by the docid of the document in which the exact answer was found, and by portion(s) of text which provided enough context to support the correctness of the exact answer. Supporting texts could be taken from different sections of the relevant documents, and could sum up to a maximum of 700 bytes. There were no particular restrictions on the length of an answer-string, but unnecessary pieces of information were penalized, since the answer was then marked as ineXact. As in previous years, the exact answer could be copied and pasted verbatim from the document, even if it was grammatically incorrect (e.g. the inflectional case did not match the one required by the question). However, systems were also allowed to use natural language generation in order to correct morpho-syntactical inconsistencies (e.g., in German, changing dem Presidenten into der President if the question implies that the answer is in nominative case), and to introduce grammatical and lexical changes (e.g., QUESTION: What nationality is X? TEXT: X is from the Netherlands EXACT ANSWER: Dutch).
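The submission constraints described above (at most three answers per question, a docid and supporting text for each answer, and no more than 700 bytes of supporting text in total) can be checked mechanically. The following is a minimal sketch rather than an official validator: the class and field names are hypothetical, and counting bytes in UTF-8 is an assumption, since the text only states the 700-byte limit.

```python
from dataclasses import dataclass, field

MAX_ANSWERS_PER_QUESTION = 3  # "up to three exact answers per question"
MAX_SUPPORT_BYTES = 700       # supporting texts may sum to at most 700 bytes


@dataclass
class Answer:
    """One candidate answer; field names are hypothetical, not the official schema."""
    exact: str                 # exact answer string, or "NIL"
    docid: str = ""            # id of the supporting document
    snippets: list = field(default_factory=list)  # supporting text portions


def support_size(answer, encoding="utf-8"):
    """Total size in bytes of all supporting snippets (encoding is an assumption)."""
    return sum(len(s.encode(encoding)) for s in answer.snippets)


def check_answers(answers):
    """Return a list of constraint violations for one question (empty if none)."""
    problems = []
    if len(answers) > MAX_ANSWERS_PER_QUESTION:
        problems.append(
            f"{len(answers)} answers given, at most {MAX_ANSWERS_PER_QUESTION} allowed"
        )
    for i, a in enumerate(answers, start=1):
        if a.exact == "NIL":
            continue  # NIL answers carry neither docid nor supporting text
        if not a.docid:
            problems.append(f"answer {i}: missing docid")
        if not a.snippets:
            problems.append(f"answer {i}: missing supporting text")
        size = support_size(a)
        if size > MAX_SUPPORT_BYTES:
            problems.append(
                f"answer {i}: supporting text is {size} bytes, limit is {MAX_SUPPORT_BYTES}"
            )
    return problems
```

Such a check only covers the formal requirements; whether an answer is exact, inexact or unsupported remains a human judgement.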

The subtasks were both:
• monolingual, where the language of the question (Source language) and the language of the news collection (Target language) were the same;
• cross-lingual, where the questions were formulated in a language different from that of the news collection.

Two new languages were added this year, Basque and Greek, both as source and target languages. In total, eleven source languages were considered, namely Basque, Bulgarian, Dutch, English, French, German, Greek, Italian, Portuguese, Romanian and Spanish. All these languages were also considered as target languages. The proposed subtasks were:
• 10 monolingual, i.e. Bulgarian (BG), German (DE), Greek (EL), Spanish (ES), Basque (EU), French (FR), Italian (IT), Dutch (NL), Portuguese (PT) and Romanian (RO);
• 33 cross-lingual (as customary in recent campaigns, in order to prepare the cross-language subtasks for which at least one participant had registered, some target language question sets were translated into the combined source languages).

However, as Table 2 shows, not all the proposed tasks were actually carried out by the participants.

As long-established, the monolingual English (EN) task was not available as it seems to have been already thoroughly investigated in TREC campaigns. English was still both source and target language in the cross-language tasks.

2.1 Questions Grouped by Topic

The procedure followed to prepare the test set was the same as that used in the 2007 campaign. First of all, each organizing group, responsible for a target language, freely chose a number of topics. For each topic, one to four questions were generated. Topics could be not only named entities or events, but also other categories such as objects, natural phenomena, etc. (e.g. George W. Bush; Olympic Games; notebooks; hurricanes; etc.). The set of ordered questions was related to the topic as follows:
• the topic was named either in the first question or in the first answer;
• the following questions could contain co-references to the topic expressed in the first question/answer pair.

Topics were not given in the test set, but could be inferred from the first question/answer pair. For example, if the topic was George W. Bush, the cluster of questions related to it could have been: Q1: Who is George W. Bush?; Q2: When was he born?; Q3: Who is his wife?

The requirement for questions related to the same topic necessarily implies that the questions refer to common concepts and entities within the domain. The most common form is pronominal anaphoric reference to the topic declared in the first question, e.g.: Q4: What is a polygraph?; Q5: When was it invented?

However, other forms of co-reference occurred in the questions. Here is an example: Q6: Who wrote the song "Dancing Queen"?; Q7: How many people were in the group?

Here the group refers to an entity expressed not in the question but only in the answer. However the QA system does not know this and has to infer it, a task which can be very complex, especially if the topic is not provided in the test set.

2.2 Document collections

Besides the data collections composed of news articles provided by ELRA/ELDA (see Table 3), Wikipedia was also considered.

The Wikipedia pages in the target languages, as found in the version of November 2006, could be used. Romanian had Wikipedia (footnote 2) as the only document collection, because there was no Romanian newswire corpus. The “snapshots” of Wikipedia were made available for download both in XML and HTML versions. The answers to the questions had to be taken from actual entries or articles of Wikipedia pages. Other types of data such as images, discussions, categories, templates, revision histories, as well as any files with user information and meta-information pages, had to be excluded.

One of the major reasons for using Wikipedia was to make a first step towards web-formatted corpora in which to search for answers. In fact, as such large information sources are now available on the web, this may be considered a desirable next level in the evolution of QA systems. An important advantage of Wikipedia is that it is freely available for all languages so far considered. However, the variation in the size of Wikipedia across languages is still problematic.

2.3 Types of Questions

As far as the question types are concerned, as in previous campaigns, the following three categories were considered:

1. Factoid questions: fact-based questions asking for the name of a person, a location, the extent of something, the day on which something happened, etc. We consider the following 8 answer types for factoids:
– PERSON, e.g.: Q8: Who was called the “Iron-Chancellor”? A8: Otto von Bismarck.
– TIME, e.g.: Q9: What year was Martin Luther King murdered? A9: 1968.
– LOCATION, e.g.: Q10: Which town was Wolfgang Amadeus Mozart born in? A10: Salzburg.
– ORGANIZATION, e.g.: Q11: What party does Tony Blair belong to? A11: Labour Party.
– MEASURE, e.g.: Q12: How high is Kanchenjunga? A12: 8598m.
– COUNT, e.g.: Q13: How many people died during the Terror of Pol Pot? A13: 1 million.
– OBJECT, e.g.: Q14: What does magma consist of? A14: Molten rock.
– OTHER, i.e. everything that does not fit into the other categories above, e.g.: Q15: Which treaty was signed in 1979? A15: Israel-Egyptian peace treaty.

Footnote 2: http://static.wikipedia.org/downloads/November_2006/ro/

TARGET LANG.      COLLECTION                     PERIOD      SIZE
[BG] Bulgarian    Sega                           2002        120 MB (33,356 docs)
                  Standart                       2002        93 MB (35,839 docs)
                  Novinar                        2002
[DE] German       Frankfurter Rundschau          1994        320 MB (139,715 docs)
                  Der Spiegel                    1994/1995   63 MB (13,979 docs)
                  German SDA                     1994        144 MB (71,677 docs)
                  German SDA                     1995        141 MB (69,438 docs)
[EL] Greek        The Southeast European Times   2002
[EN] English      Los Angeles Times              1994        425 MB (113,005 docs)
                  Glasgow Herald                 1995        154 MB (56,472 docs)
[ES] Spanish      EFE                            1994        509 MB (215,738 docs)
                  EFE                            1995        577 MB (238,307 docs)
[EU] Basque       Egunkaria                      2001/2003
[FR] French       Le Monde                       1994
                  Le Monde                       1995
                  French SDA                     1994
                  French SDA                     1995
[IT] Italian      La Stampa                      1994
                  Italian SDA                    1994
                  Italian SDA                    1995
[NL] Dutch        NRC Handelsblad                1994/1995
                  Algemeen Dagblad               1994/1995
[PT] Portuguese   Público                        1994
                  Público                        1995
                  Folha de São Paulo             1994
                  Folha de São Paulo             1995

2. Definition questions: questions such as “What/Who is X?”, divided into the following subtypes:

– PERSON, i.e., questions asking for the role/job/important information about someone, e.g.: Q16: Who is Robert Altmann? A16: Film maker.
– ORGANIZATION, i.e., questions asking for the mission/full name/important information about an organization, e.g.: Q17: What is the Knesset? A17: Parliament of Israel.
– OBJECT, i.e., questions asking for the description/function of objects, e.g.: Q18: What is Atlantis? A18: Space Shuttle.
– OTHER, i.e., questions asking for the description of natural phenomena, technologies, legal procedures etc., e.g.: Q19: What is Eurovision? A19: Song contest.

3. Closed list questions: i.e., questions that require one answer containing a determined number of items, e.g.: Q20: Name all the airports in London, England. A20: Gatwick, Stansted, Heathrow, Luton and City.

As only one answer was allowed, all the items had to be present in sequence in the document and copied, one next to the other, in the answer slot.

In addition, all types of questions could contain a temporal restriction, i.e. a temporal specification that provided important information for the retrieval of the correct answer, for example:

Q21: Who was the Chancellor of Germany from 1974 to 1982? A21: Helmut Schmidt.

Q22: Which book was published by George Orwell in 1945? A22: Animal Farm.

Q23: Which organization did Shimon Perez chair after Isaac Rabin’s death?

A23: Labour Party Central Committee.

Some questions could have no answer in the document collection, and in that case the exact answer was "NIL" and the answer and support docid fields were left empty. A question was assumed to have no right answer when neither human assessors nor participating systems could find one.

The distribution of the questions among these categories is described in Table 4. Each question set was then translated into English, which served as an interlingua during the translation of the datasets into the other languages for the activated cross-lingual subtasks.

2.4 Formats

As far as the format is concerned, this year too both input and output files were formatted as XML. For example, the first four questions in the EN-FR test set, i.e. English questions that hit a French document collection, were represented as follows:

<input>
  <q target_lang="FR" source_lang="EN" q_id="0001" q_group_id="1600">Which is the largest bird in Africa?</q>
  <q target_lang="FR" source_lang="EN" q_id="0002" q_group_id="1600">How many species of ostriches are there?</q>
  <q target_lang="FR" source_lang="EN" q_id="0003" q_group_id="1601">Who served as a UNICEF goodwill ambassador between 1988 and 1992?</q>
  <q target_lang="FR" source_lang="EN" q_id="0004" q_group_id="1601">What languages did she speak?</q>
  ...
</input>

An example of system output which answered the above questions was the following:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE output SYSTEM "QA-CLEF-OUTPUT.dtd">
<output>
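Since the topics themselves are not given, a system has to recover the question clusters from the q_group_id attribute of the input file. The sketch below shows one way to read the input format illustrated above and group the questions by cluster; the function name and the returned structure are hypothetical, and only the element and attribute names come from the example.

```python
import xml.etree.ElementTree as ET

# The four EN-FR questions from the example above.
SAMPLE = """<input>
<q target_lang="FR" source_lang="EN" q_id="0001" q_group_id="1600">Which is the largest bird in Africa?</q>
<q target_lang="FR" source_lang="EN" q_id="0002" q_group_id="1600">How many species of ostriches are there?</q>
<q target_lang="FR" source_lang="EN" q_id="0003" q_group_id="1601">Who served as a UNICEF goodwill ambassador between 1988 and 1992?</q>
<q target_lang="FR" source_lang="EN" q_id="0004" q_group_id="1601">What languages did she speak?</q>
</input>"""


def load_clusters(xml_text):
    """Group questions by q_group_id, keeping file order inside each cluster."""
    clusters = {}
    for q in ET.fromstring(xml_text).iter("q"):
        question = (q.get("q_id"), (q.text or "").strip())
        clusters.setdefault(q.get("q_group_id"), []).append(question)
    return clusters


if __name__ == "__main__":
    for group_id, questions in load_clusters(SAMPLE).items():
        print(group_id, [text for _, text in questions])
```

Keeping the questions of a cluster in file order matters, because later questions (such as "What languages did she speak?") can only be resolved against the first question/answer pair of the same cluster.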

2.5 Evaluation

As far as the evaluation process is concerned, no changes were made with respect to the previous campaigns. Human judges assessed the exact answer (i.e. the shortest string of words which is supposed to provide the exact amount of information to answer the question) as:
• R (Right) if correct;
• W (Wrong) if incorrect;
• X (ineXact) if it contained less or more information than that required by the query;
• U (Unsupported) if either the docid was missing or wrong, or the supporting snippet did not contain the exact answer.

Most assessor-groups managed to guarantee a second judgement of all the runs.

As regards the evaluation measures, the main one was accuracy, defined as the average of SCORE(q) over all 200 questions q, where SCORE(q) is 1 if the first answer to q in the submission file is assessed as R, and 0 otherwise. In addition, most assessor groups computed the following measures:
• the Confidence Weighted Score (CWS): answers are sorted in decreasing order of confidence, and CWS rewards systems that give correct answers at the top of the ranking [16];
• the Mean Reciprocal Rank (MRR) over N assessed answers per question (to consider the three answers), that is, the mean over all questions of the reciprocal of the rank of the first correct label. If the first correct label is ranked as the 3rd label, then the reciprocal rank (RR) is 1/3. If none of the first N responses contains a correct label, RR is 0. RR is 1 if the highest ranked label matches the correct label.
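The three measures can be illustrated with a short sketch. It assumes that each question is represented by the ranked list of assessment labels (R, W, X, U) given to its answers; the function names are hypothetical. For CWS the text only cites [16], so the formula used here, the mean over ranks i of (number of correct answers among the first i) / i, is an assumption based on the usual TREC-style definition.

```python
def accuracy(runs):
    """Share of questions whose first answer was assessed as R.

    runs: one list of labels per question, in the order the answers were returned.
    """
    return sum(1 for labels in runs if labels and labels[0] == "R") / len(runs)


def mrr(runs, n=3):
    """Mean reciprocal rank of the first R label among the first n answers."""
    total = 0.0
    for labels in runs:
        for rank, label in enumerate(labels[:n], start=1):
            if label == "R":
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(runs)


def cws(first_labels_by_confidence):
    """Confidence weighted score (assumed TREC-style formulation).

    first_labels_by_confidence: label of the first answer to each question,
    sorted by decreasing system confidence.
    """
    correct_so_far = 0
    total = 0.0
    for i, label in enumerate(first_labels_by_confidence, start=1):
        if label == "R":
            correct_so_far += 1
        total += correct_so_far / i
    return total / len(first_labels_by_confidence)


if __name__ == "__main__":
    runs = [["R"], ["W", "R"], ["W", "W", "R"], ["X", "U", "W"]]
    print(accuracy(runs), mrr(runs), cws(["R", "W", "R", "W"]))
```

Note that X and U answers count as incorrect in all three measures, in line with the definition of SCORE(q) above.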

3 Results

As far as accuracy is concerned, scores were generally far lower than usual, as Figure 1 shows. Although comparison between different languages and years is not possible, Figure 1 shows some trends which characterized this year’s competition: best accuracy in the monolingual task increased with respect to last year, returning to the values recorded in 2006. But systems, even those that participated in all previous campaigns, did not achieve a brilliant overall performance. Apparently systems could not handle the new challenges adequately, although they improved their performance on issues already treated in previous campaigns.

In more detail, best accuracy in the monolingual task reached 63.5%, almost ten points up with respect to last year, while the overall performance of the systems was quite low: average accuracy was 23.63%, practically the same as last year. By contrast, performance in the cross-language tasks dropped drastically: best accuracy reached only 19% compared to 41.75% in the previous year, more than 20 points lower, while average accuracy was more or less the same as in 2007 (13.24% compared to 10.9%).

[Figure 1: accuracy in the monolingual (Mono) and bilingual tasks, CLEF 03 to CLEF 08]

By contrast, best accuracy over the bilingual tasks decreased considerably. This is also true for average performances. This year a small increase was recorded in the bilingual tasks, but it seems that the high level of difficulty of the question sets particularly affected the bilingual tasks, and the task proved to be difficult even for veterans.

3.1 Participation

             America   Europe   Asia   Australia   TOTAL
CLEF 2003       3         5       0        0          8
CLEF 2004       1        17       0        0         18
CLEF 2005       1        22       1        0         24
CLEF 2006       4        24       2        0         30
CLEF 2007       3        16       1        1         21
CLEF 2008       1        20       0        0         21

The number of participants has remained almost the same as in 2007 (see Table 5). As noted above, this is probably the consequence of the new challenges introduced last year in the exercise.

The geographical distribution also remained almost unchanged, even though there was no participation from Australia and Asia. No runs were submitted for either the Italian or the Greek tasks.

However, the number of submitted runs increased from a total of 37 registered last year to 51 (see Table 6). The breakdown of participants and runs, according to language, is shown in Table 4 (Section 2.3). As in previous campaigns, more participants chose the monolingual tasks, which once again proved to be more approachable.

The following subsections give a more detailed analysis of the results in each language, with specific information on the performances of the participating systems in the single sub-tasks and on the different types of questions, together with the relevant statistics and comments.

3.2 Basque as Target

In the first year working with Basque as target, only one research group submitted runs for evaluation in the track with Basque as target language: the Ixa group from the University of the Basque Country. They sent four runs: one monolingual, one English-Basque and two Spanish-Basque.

The Basque question set consisted of 145 factoid questions, 39 definition questions and 16 list questions. 39 questions contained a temporal restriction, and 10 had no answer in the Gold Standard. 40 answers were retrieved from Wikipedia, the rest from the news collections. Half of the questions were linked to a topic, so the second (and sometimes the third) question was more difficult to answer.

The news articles were from the Egunkaria newspaper for the years 2000, 2001 and 2002, and the Wikipedia information was the export corresponding to the year 2006.

Table 7 shows the evaluation results for the four submitted runs (one monolingual and three cross-lingual). The table shows the number of Right, Wrong, ineXact and Unsupported answers, as well as the percentage of correctly answered Factoids, Temporally restricted questions, Definition and List questions.

The monolingual run (ixag081eueu.xml) achieved an accuracy of 13%, lower than most systems for other target languages in the 2007 evaluation but better than some of them. It should be underlined that Basque is a highly inflected language, which makes the matching of terms and entities more complex, and that this is its first participation. The system achieved better accuracy on factoid questions (15.9%). No correct answers were retrieved for list questions. It should also be noted that 57 answers were NIL (only four of them correct); participants can perhaps improve on this aspect.

Looking at the cross-lingual runs, the loss of accuracy with respect to the monolingual system is a bit more than 50% for the two best runs. This percentage is quite similar to that of runs for other target languages in 2007. The overall accuracy is the same for both (English-Basque and Spanish-Basque), but they agree on only five correct answers (each system gives six other correct answers). The second Spanish-Basque system gets poorer results and is only slightly better in inexact answers. These runs also return a lot of NIL answers.

3.3 Bulgarian as Target

This year, contrary to our optimistic expectations, only one run by one group (BTB) was performed for Bulgarian. As the table above shows, the result is far from satisfying. Again, the definitions were detected better in comparison to other question types. Also, the difference between the detection of factoids and of temporally restricted questions is negligible. Compared with previous years, there was a decrease both in the number of participating groups and in system performance.

3.4 Dutch as Target

The questions for the Dutch subtask of CLEF-QA 2008 were written by four native speakers. They selected random articles from either Wikipedia or the news collection and composed questions based on the topics of the articles.

The quartet produced a total of 222 question-answer pairs from which they selected a set of 200 that satisfied the type distribution requirements of the task organizers. An overview of the question types and answer types can be found in Table 9.

This year, only one team took part in the question answering task with Dutch as target language: the University of Groningen. The team submitted two monolingual runs and two cross-lingual runs (English to Dutch). All runs were assessed twice by a single assessor. This resulted in a total of eight conflicts (1%). These were corrected. The results of the assessment can be found in Table 10.

The two cross-lingual runs gron081ennl and gron082ennl produced exactly the same answers.

The best monolingual run (gron082nlnl) achieved exactly the same score as the best run of 2007 (25.5%). The same is true for the best cross-lingual run (13.5%). The fact that the two scores are in the same range as last year is no big surprise, since the task has not changed considerably this year and all scores have been achieved by the same system.

As in 2007, the system performed better for definition questions than for other question types. The definition questions could be divided into two subtypes: those that asked for a definition (26) and those that contained a definition and asked for the name of the defined object (12). The monolingual runs performed similarly for both subtypes, but the cross-lingual runs did not contain a correct answer to any question of the second subtype.

None of the runs obtained any points for the list questions. The answers contained some parts that were correct but none of them were completely correct. We were unable to award points for partially correct answers in the current assessment scheme.

All the runs were produced by the same system and the differences between the runs are small. The cross-lingual runs contained seven correct answers that were not present in any of the monolingual runs (for questions 20, 25, 120, 131, 142, 150 and 200). Eight questions were only answered correctly in a single monolingual run (1, 28, 54, 72, 83, 143, 193 and 199). Thirty-five questions were answered correctly in two runs, three in three runs and seventeen in all four runs. 137 questions failed to receive any correct answer.

3.5 English as Target


Creation of Questions. The task this year was exactly the same as in 2007 and moreover the three collections were the same: Glasgow Herald, LA Times and Wikipedia. However, given the considerable interest in the Wikipedia which has been shown by Question Answering groups generally, it was decided to increase the number of questions drawn from it to 75% overall, with just 25% coming from the two newspaper collections. This means that 40 of the 160 Factoids came from the newspapers, together with seven of the 30 Definitions and two of the ten Lists. These questions were divided equally between the Glasgow Herald and LA Times. All the remainder were drawn from the Wikipedia.

Considerable care was taken in the selection of the questions. The distribution by answer type was controlled exactly as in previous years. As requested by the organisers there were exactly twenty each of Factoid target type PERSON, TIME, LOCATION, MEASURE, COUNT, ORGANIZATION, OBJECT and OTHER. Similarly for Definitions there were eight PERSON, seven ORGANIZATION, seven OBJECT and eight OTHER. For Lists there were four OTHER, two each of PERSON and ORGANIZATION, and one each of LOCATION and OBJECT.

In addition to the above distribution, we also controlled the distribution of topics for the question groups, something which was made practicable by the use of the Wikipedia. Questions were drawn from a number of predefined subject fields: countries, towns, roads and bridges, shops, politicians and politics, sports and sports people, foods and vegetables, cars, classical music including instruments, popular music, literature, poetry and drama, philosophy, films, architecture, languages, science, consumer goods, and finally organisations. Questions were distributed among these topics. The maximum in any topic was twenty (sports) and the minimum was two (shops). For the majority there were between four and six question groups. For each such topic, one or more questions were set depending on what information the texts contained. As a change from last year, the organisers asked us to include 100 singleton topics. This effectively meant that half the questions in the overall set of 200 were simple "one-off" queries as were set in CLEF prior to 2007 and for the earlier TREC campaigns.

Questions were entered via a web interface developed by the organisers last year. However, this year they improved it considerably, for example allowing modifications to be made to existing entries. This was a great help and a commendable effort on their part.

Summary Statistics. Five cross-lingual runs with English as target were submitted this year, as compared with eight in 2007 and thirteen in 2006. Four groups participated in three languages, Dutch, German and Romanian. Each group worked with only one source language, and only DCUN submitted two runs. The rest submitted only one run.

Assessment Procedure. Last year we used the excellent Web-based assessment system developed originally for the QiQA task by University of Amsterdam. However, we were asked not to use this in 2008 because it only allows one answer per question per system to be assessed and it was required to assess multiple answers per question per system. For this reason we used a Web-based tool developed by UNED in Madrid.

All answers were double-judged. Where the assessors differed, the case was reviewed and a decision taken. There were 63 judgement differences in total. Three of the runs contained multiple answers to individual questions in certain cases, and these were all assessed, as per the requirement of the organisers. If we assume that the number of judgements was in fact 200 questions * five runs, i.e. 1,000, we can compute a lower bound for the agreement level. This gives a figure of (1,000 - 63)/1,000, i.e. 93.7%. The equivalent figure for 2007 (called Agreement Level 2 in the Working Notes for last year) was 97.6%. Given that we have computed a lower bound this year (and not therefore the exact figure) this seems acceptable.
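The lower bound above can be reproduced in a few lines. The figure is a lower bound because some runs contained several answers per question, so the true number of judgements is at least 1,000 and the same 63 differences weigh less against a larger total. The function name below is hypothetical.

```python
def agreement_lower_bound(n_questions, n_runs, n_differences):
    """Agreement level assuming exactly one judged answer per question per run."""
    judgements = n_questions * n_runs
    return (judgements - n_differences) / judgements


if __name__ == "__main__":
    # 200 questions, five runs, 63 judgement differences
    print(f"{agreement_lower_bound(200, 5, 63):.1%}")
```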

Results Analysis. Of the five runs with English as target, wlvs081roen was the best with an accuracy of 19.00% overall. They also did very well on the definitions, scoring 66.67%. The only source language for which there was more than one run was German, for which there were three submissions from two groups. dfki081deen scored the best with 14.00%, followed by dcun081deen with 8.00% and dcun082deen with 0.50%. dfki also did very well on definitions, with an accuracy of 60.00%. Interestingly, none of the systems answered any of the list questions correctly. Only dcun082deen answered one list question inexactly.

If we compare the results this year with those of last year, when the task was very similar, performance has improved here. The best score in 2007 was achieved by wolv071roen with 14.00%, which has now improved to 19.00%. Similarly, dfki071deen scored 7.00% in 2007 but increased this to 14.00% this year in dfki081deen. An attempt was made to set easier questions this year, which might have affected performance. In addition, many more questions came from the Wikipedia in 2008, with only a minority being drawn from the newspaper corpora.

3.6 QA-WSD subtask

The QA-WSD task brings semantic and retrieval evaluation together. The participants were offered the same queries and document collections as for the main QA exercise, but with the addition of word sense tags as provided by two automatic word sense disambiguation (WSD) systems. Contrary to the main QA task, Wikipedia articles are not included, and thus systems need to reply to the questions that have an answer in the news document collection. The goal of the task is to test whether WSD can be used beneficially for Question Answering, and is closely related to the Robust-WSD subtask of the ad-hoc track in CLEF 2008. The exercise scenario is event-targeted QA on a news document collection. In the QA-WSD track only English monolingual and Spanish to English bilingual tasks are offered, i.e. English is the only target language, and queries are available in both English and Spanish. The queries were the same as for the main QA exercise, and the participation followed the same process, except for the use of the sense-annotated data.

The goal of this task is to evaluate whether word sense information can help in certain queries. For this reason, participants were required to send two runs for each of the monolingual/bilingual tasks in which they participated: one which does not use sense annotations and another one which does. Whenever possible, the only difference between the two runs should be the use or not of the sense information. Participants who sent a single run would be discarded from the evaluation.

The WSD data is based on WordNet version 1.6 and was supplemented with freely available data from the English and Spanish WordNets in order to test different expansion strategies. Two leading WSD experts ran their systems [17][18] and provided those WSD results for the participants to use.

The task website [4] provides additional information on data formats and resources.

Results

From the 200 questions provided to participants, only 49 queries had a correct answer in the news collection. The table below provides the results for the participant on those 49 questions.

        Run without WSD   Run with WSD
R #            8                7
W #           41               42
X #            0                0
U #            0                0

The first run does not use WSD, while the second uses the sense tags returned by the NUS WSD system. The WSD tags were used in the passage retrieval module. The use of WSD does not provide any improvement, and causes one more error. For the sake of completeness we also include below the results on all 200 queries. Surprisingly, the participant managed to find two correct answers (one in the WSD run) for the Wikipedia questions in the news collection.

        Run without WSD   Run with WSD
R #           10                8
W #          188              189
X #            0                0
U #            2                3
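The comparison between the two runs can be made explicit by converting the assessment counts into accuracies. The following is a minimal sketch, assuming that accuracy is the share of Right (R) answers among all assessed answers; the function name and the dictionary labels are illustrative choices, while the counts are those reported for this subtask.

```python
def accuracy(right: int, wrong: int, inexact: int, unsupported: int) -> float:
    """Percentage of Right (R) answers among all assessed answers."""
    total = right + wrong + inexact + unsupported
    if total == 0:
        raise ValueError("no assessed answers")
    return 100.0 * right / total


# (R, W, X, U) counts reported for the QA-WSD participant.
# The dictionary keys are descriptive labels chosen for this sketch.
counts = {
    "49 questions, no WSD": (8, 41, 0, 0),
    "49 questions, WSD": (7, 42, 0, 0),
    "200 questions, no WSD": (10, 188, 0, 2),
    "200 questions, WSD": (8, 189, 0, 3),
}

for label, (r, w, x, u) in counts.items():
    print(f"{label}: {accuracy(r, w, x, u):.2f}%")
```

On the 49 answerable questions, the difference between the two runs amounts to a single question.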

3.7 French as Target

This year only one group took part in the evaluation tasks using French as a target language: the French group Synapse Développement. Last year's second participant, the Language Computer Corporation (LCC, USA), did not send any submission this time.

Synapse submitted three runs in total:
• one monolingual run: French to French (FR-to-FR),
• two bilingual runs: English-to-French (EN-to-FR) and Portuguese-to-French (PT-to-FR).

In the following, these will be referred to as:
• syn08frfr (for FR-to-FR),
• syn08enfr (for EN-to-FR),
• syn08ptfr (for PT-to-FR).

As last year, three types of questions were proposed: factual, definition and closed list questions. Participants could return one exact answer per question and up to two runs. Some questions (10%) had no answer in the document collection, and in this case the exact answer is "NIL".

The French test set consists of 200 questions:
• 135 Factual (F),
• 30 Definition (D),
• 35 closed List questions (L).

Among these 200 questions, 66 were temporally restricted questions (T) and 12 were NIL questions (i.e. a “NIL” answer was expected, meaning that there is no valid answer for this question in the document collection).


Table 14 shows the final results of the assessment of the 3 runs submitted by Synapse. For each run, the following statistics are provided:
• The number of correct (R), wrong (W), inexact (X) and unsupported (U) answers,
• The accuracy calculated within each of the categories of questions: F, D, T and L questions,
• The number of NIL answers and the proportion of correct ones (i.e. corresponding to a NIL question),
• The Confidence Weighted Score (CWS) measure,
• The accuracy calculated over all answers.
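Two of the statistics listed above can be illustrated with a short sketch. It assumes that CWS follows the definition introduced at TREC 2002 [16]: answers are ranked by decreasing system confidence, and the score is the mean, over all ranks i, of the fraction of right answers among the first i. The function names and the toy data are hypothetical and serve only to illustrate the computation.

```python
from typing import Sequence, Tuple


def overall_accuracy(judgements: Sequence[bool]) -> float:
    """Fraction of questions whose answer was judged Right."""
    return sum(judgements) / len(judgements)


def confidence_weighted_score(answers: Sequence[Tuple[float, bool]]) -> float:
    """CWS for (confidence, is_right) pairs, one pair per question.

    Assumes the TREC 2002 definition: rank answers by decreasing
    confidence, then average (right answers in first i ranks) / i.
    Ties keep their input order because sorted() is stable.
    """
    ranked = sorted(answers, key=lambda a: a[0], reverse=True)
    right_so_far = 0
    total = 0.0
    for rank, (_, is_right) in enumerate(ranked, start=1):
        right_so_far += int(is_right)
        total += right_so_far / rank
    return total / len(ranked)


# Hypothetical toy data: four questions with self-assigned confidences.
toy_answers = [(0.9, True), (0.8, False), (0.4, True), (0.1, False)]
print(overall_accuracy([right for _, right in toy_answers]))
print(confidence_weighted_score(toy_answers))
```

Under this definition, a system is rewarded for placing its right answers at the top of the confidence ranking, so two runs with the same accuracy can obtain different CWS values.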

Figure 2 shows the best scores for systems using French as target in the last five CLEF QA campaigns. For the monolingual task, the Synapse system returned 113 correct answers (accuracy of 56.5%), slightly more than last year (accuracy of 54.0%). The performance of the bilingual runs is quite low, with an accuracy of 18.0% for EN-to-FR and 16.5% for PT-to-FR. It cannot be fairly compared to the results of CLEF 2007, because Synapse did not submit bilingual runs last year. Last year, LCC obtained an accuracy of 41.7% for EN-to-FR, but did not submit anything this year.

It appears that the level of performance strongly depends on the type of questions. The monolingual run scores very high on the definition questions (86.7%). The lowest performance is obtained with closed list questions (37.1%).

This is even more obvious when looking at the bilingual runs. While the systems performed fairly well on the definition questions (50.0% and 43.3% for EN-to-FR and PT-to-FR respectively), they could not cope with the closed list questions: the PT-to-FR system gave only one correct closed list answer, and the EN-to-FR system could not answer any of these questions. The bilingual runs did not reach high accuracy on factoid and temporally restricted questions either. This year, the complexity of the task, in particular regarding closed list questions, seems to have been hard to cope with for the bilingual systems.

The complexity of the task is also reflected in the number of NIL answers. The monolingual system returned 20 NIL answers (to be compared with the 12 expected). The bilingual systems returned 60 (EN-to-FR) and 67 (PT-to-FR) NIL answers, i.e. at least five times as many as expected.

It is also interesting to look at the results when categorizing questions by the size of the topic they belong to. This year, topics could contain from 1 single question to 4 questions. The CLEF 2008 set consists of:
• 52 single question topics,
• 33 topics with 2 questions (66 questions in total),
• 18 topics with 3 questions (54 questions in total),
• 7 topics with 4 questions (28 questions in total).

Table 15, Table 16 and Table 17 give the results of each run according to the size of the topics.

The monolingual system (Table 15) is not sensitive to the size of the topic question set. By contrast, the performance of the bilingual systems (Table 16 and Table 17) decreases by half when comparing the 1- and 2-question sets to the 3- and 4-question sets. A possible explanation is that the bilingual systems perform poorly with questions containing anaphoric references (which are more likely to occur in the 3- and 4-question sets).
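A breakdown of this kind can be sketched as follows. The topic counts are those given above for the French test set; the grouping function, its name and the toy judgements are hypothetical and only illustrate how accuracy per topic size might be derived from per-question assessments.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Number of topics per topic size in the French test set (from the text).
TOPICS_PER_SIZE = {1: 52, 2: 33, 3: 18, 4: 7}
total_topics = sum(TOPICS_PER_SIZE.values())
total_questions = sum(size * n for size, n in TOPICS_PER_SIZE.items())


def accuracy_by_topic_size(
    judgements: Iterable[Tuple[str, bool]]
) -> Dict[int, float]:
    """Accuracy per topic size from (topic_id, is_right) pairs."""
    by_topic = defaultdict(list)
    for topic_id, is_right in judgements:
        by_topic[topic_id].append(is_right)
    right = defaultdict(int)
    asked = defaultdict(int)
    for answers in by_topic.values():
        size = len(answers)  # topic size = number of questions in the topic
        right[size] += sum(answers)
        asked[size] += size
    return {size: right[size] / asked[size] for size in sorted(asked)}


# Hypothetical toy assessments: one topic each of size 1, 2 and 3.
toy = [
    ("t1", True),
    ("t2", True), ("t2", False),
    ("t3", False), ("t3", False), ("t3", True),
]
print(total_topics, total_questions)
print(accuracy_by_topic_size(toy))
```

The first print also serves as a consistency check: the four topic sizes account for all 200 questions of the test set.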

In conclusion, there was unfortunately only one participant this year. In particular, it would have been interesting to see how the LCC group, which submitted a bilingual run last year, would have performed this year.

This decrease in participation can be explained by the discouragement of some participants. Some have complained that the task gets harder each year (e.g. this year, there were more closed list questions and anaphoric references than last year), which can result in a decrease in the systems' performance.

This year, the number and complexity of closed list questions were clearly higher than in the previous year. In the same way, there were more temporally restricted questions, more topics (comprising from 2 to 4 questions) and more anaphoric references. It seems that this higher level of difficulty particularly impacted the bilingual tasks. In spite of this, the monolingual Synapse system performed slightly better than last year.

3.8 German as Target

Three research groups submitted runs for evaluation in the track having German as target language: the German Research Center for Artificial Intelligence (DFKI), the Fern Universität Hagen (FUHA) and the Universität Koblenz-Landau (LOGA). All groups provided system runs for the monolingual scenario; DFKI and FUHA submitted runs for the cross-language English-German scenario, and FUHA also had runs for the Spanish-German scenario.

Compared to the previous editions of the evaluation forum, this year saw an increase in the accuracy of the best-performing system and of an aggregated virtual system for the monolingual task, and a decrease in the accuracy of the best-performing system and of an aggregated virtual system for the cross-language tasks.

The test set questions covered 120 topics, distributed as follows: 74 topics consisting of 1 question, 24 topics of 2 related questions, 10 topics of 3 related questions, and 12 topics of 4 related questions. The distribution of the topics over the document collections (CLEF vs. Wikipedia) is presented in Table 18.

According to Table 19, the most frequent topic types were OTHER (32), OBJECT (29) and ORGANIZATION (24), with the first two types more present in the Wikipedia collection of documents (WIKI).

As regards the source of the answers, 97 questions from 57 topics asked for information out of the CLEF document collection and the rest of 103 from 63 topics for information from Wikipedia. Table 20 shows a breakdown of the test set questions by the expected answer type (EAType) for each collection of data.

3.9 Portuguese as Target

The Portuguese track had six different participants: besides the veteran groups of Priberam, Linguateca, Universidade de Évora, INESC and FEUP, we had a new participant this year, Universidade Aberta. No bilingual task occurred this year.

In this fourth year of Portuguese participation, Priberam repeated the top place of its previous years, with University of Évora behind. Again we added the classification X-, meaning incomplete, keeping the classification X+ for answers with extra text or other kinds of inexactness. In Table 22 we present the overall results (all tables in these notes refer exclusively to the first answer by each system).

To provide a more direct comparison with pre-2006 results, in Table 23 we present the results both for the first question of each topic (which we believe is more readily comparable to such results) and for the linked questions.

On the whole, compared to last year, Priberam and Senso (UE) improved their results, which were already the best. The INESC system and Esfinge (Linguateca) also showed some improvement; at a lower level, Raposa (FEUP) showed similar results. The system of Universidade Aberta appeared with good results compared to some veteran systems. We leave it to the participants to comment on whether it might have been caused by harder questions or changes (or lack thereof) in the systems.

Unlike last year, the results over linked questions are significantly different from (and below) those over non-linked questions. Question 180 was wrongly worded, referring to "Aida's opera Verdi" instead of the other way around, which also affected two linked questions. Therefore, we accepted both NIL answers and correct ones for those questions.

Table 24 shows the results for each answer type of definition questions, while Table 25 shows the results for each answer type of factoid questions (including list questions). As can be seen, four out of six systems perform clearly better when it comes to definitions than to factoids. Senso in particular has a high accuracy on definitions.

We included in both Table 24 and Table 25 a virtual run, called combination, in which a question is considered correct if at least one participating system found a valid answer. The objective of this combination run is to show the potential achievement when combining the capacities of all the participants. The combination run can be considered, to some extent, the state of the art in monolingual Portuguese question answering. All definition questions were answered by at least one system. The system with the best results, Priberam, correctly answered 64.8% of the questions with at least one correct answer. In all, 130 questions were answered by more than one system.
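The construction of such a virtual run can be sketched in a few lines. This is only an illustration under the assumption that each run is represented by the set of question identifiers it answered correctly; the run names, identifiers and helper functions are hypothetical and do not correspond to the actual submissions.

```python
from collections import Counter
from typing import Dict, Set


def combination_run(runs: Dict[str, Set[int]]) -> Set[int]:
    """Questions answered correctly by at least one run."""
    combined: Set[int] = set()
    for correct_ids in runs.values():
        combined |= correct_ids
    return combined


def answered_by_several(runs: Dict[str, Set[int]], minimum: int = 2) -> Set[int]:
    """Questions answered correctly by at least `minimum` runs."""
    counts = Counter(qid for ids in runs.values() for qid in ids)
    return {qid for qid, n in counts.items() if n >= minimum}


# Hypothetical toy data: correctly answered question ids per run.
toy_runs = {"run_a": {1, 2, 3}, "run_b": {2, 5}, "run_c": set()}

combined = combination_run(toy_runs)
# Share of the "answerable" questions that a single run got right.
share_a = len(toy_runs["run_a"] & combined) / len(combined)
print(sorted(combined), share_a, sorted(answered_by_several(toy_runs)))
```

The share computed for a single run against the combination corresponds to the kind of figure quoted above for the best system.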

In Table 26, we present some values concerning answer and snippet size.

Temporally restricted questions: Table 27 presents the results of the 17 temporally restricted questions. As in previous years, the effectiveness of the systems to answer those questions is visibly lower than for non-TRQ questions.


List questions: ten questions were defined as list questions, all closed-list factoids with two to five elements each (footnote 3). The results have not improved, with UE getting two correct answers, Priberam three and all other systems zero. There were however seven cases of incomplete answers (i.e. answering only some elements of the list), although only two of them with more than one element of the answer.

3 There were some open list questions as well, but they were classified and evaluated as ordinary factoids.

Answer source: Table 28 presents the distribution of questions by source during their selection, as well as the distribution of sources used by the different runs and their correctness.

3.10 Romanian as Target

In the third year of Romanian participation in QA@CLEF, and the second one with Romanian addressed as a target language, the question generation was based on the collection of Romanian Wikipedia pages frozen in November 2006 (footnote 4), the same corpus as in the previous edition (footnote 5).

Creation of Questions. The questions were generated starting from the corpus and based on the Guidelines for Question Generation (footnote 6), the Guidelines for Participants (footnote 7) and the final decisions taken after email discussions between the organizers. The 200 questions are distributed according to Table 29, where for each type of question and expected answer we also indicate the temporally restricted questions out of the total number of questions. Without counting the NIL questions, 100% of the questions have their answer in the Wikipedia collection.

[Table 29: distribution of the questions by question type and expected answer type (PERSON, TIME, LOC., ORG., MEASURE, COUNT, OBJECT, OTHER), with temporally restricted questions in parentheses. Totals: 162 (44) factoid, 28 (2) definition, 10 (2) list, 8 NIL.]

4 http://static.wikipedia.org/downloads/November_2006/ro/

5 At http://static.wikipedia.org/downloads/ the frozen versions of Wikipedia exist for April 2007 and June 2008, for all languages involved in QA@CLEF.

6 http://celct.isti.cnr.it/ClefQA/QA@CLEF08_Question_Generation_Guidelines.pdf

7 http://nlp.uned.es/clef-qa/QA@CLEF08_Guidelines-forParticipants.pdf

As the Guidelines for Question Generation have not changed since the previous edition, there were no major difficulties in creating the Romanian gold standard (GS) for the 2008 QA@CLEF. The working version of the GS was uploaded to the question generation interface developed at CELCT (Italy), by filling in all the required fields.

For the topic-related questions (clusters of up to four questions, related to one same topic) we kept about the same number as in the previous edition: in 2007 we had 122 topics and now there are 119 topics. The percentage of topic-linked questions is illustrated in Table 30, showing that 127 questions were grouped under 46 topics; hence 63.5% of the total 200 questions were linked in topics with more than one question.

[Table 30: number of topics and questions by topic size (4, 3, 2 and 1 questions per topic).]

In fact the questions contain not 127, but only 51 anaphoric elements of various types, so that 25.5% of the questions are linked through coreferential relations. Personal, possessive or demonstrative pronouns were used in most of the cases to create anaphoric relations. The antecedents are mainly the focus of the previous question, or the previous answer. A few such questions require inference in order to be correctly answered. For example, in order to correctly answer the FTime question "When was the first Esperanto dictionary for Romanian published?" and then the L-Other question "Name all the grammatical cases of this artificial language.", one needs to correctly link the anaphor "artificial language" to its antecedent, which is "Esperanto" and not "Romanian" (also a language, but not artificial); this is possible by establishing, based on a text snippet, that Esperanto is an artificial language.

The 8 NIL questions, even though they seem somewhat unnatural, were created by including questions about facts that are impossible from a human perspective; for example, the question "In which year did Paul Kline publish his work about the natural phenomena called hail?" has no answer in any of the articles about the psychologist. Another type of NIL questions are those based on inference: the question "How many bicameral Parliaments are there in Cuba?" is a NIL question because in all wiki articles one can find that Cuba has a unicameral parliament. Another type of NIL questions that we created (with an answer in English, but not in Romanian) cannot be good items either in a cross-lingual evaluation where the answers are to be found in any language, or in an evaluation based on an open text collection such as the web. The question "What is a micron?" has no answer in the Romanian wiki articles from 2006, but it can have an answer in other Romanian webpages and, moreover, in the English wiki articles it has more than one correct answer, depending on the domain where the term is used (in the metric system or in vacuum engineering).

For the LIST type we created only questions whose answers are to be found in one and the same text section. The 2007 evaluation for Romanian showed that "open list" questions (with answers in various sections of an article or even in various articles) are difficult to handle; therefore we made the LIST questions easier.

Systems' analysis and evaluation. As in the 2007 edition, this year two Romanian groups took part in the monolingual task with Romanian as a target language: the Faculty of Computer Science from the Al. I. Cuza University of Iasi (UAIC), and the Research Institute for Artificial Intelligence from the Romanian Academy (ICIA), Bucharest. Each group submitted two runs, with an average of 2.4 answers per question for ICIA and 1.92 for UAIC. The 2008 general results are presented in Table 31 below.

The statistics include a system, named combined, obtained through the combination of the 4 participating RO-RO systems. Because at evaluation time we observed that there are correct answers not only in the first position, but also in the second or the third, the combined system considers that an answer is R if there exists at least one R answer among all the answers returned by the four systems. If there is no R answer, the same strategy is applied to X, U and finally W answers. This "ideal" system makes it possible to calculate the percentage of the questions (and their type) answered by at least one of the four systems in any of the maximum 3 answers returned for a question.
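The priority rule used by the combined system can be sketched as follows. The sketch assumes that all judgements collected for one question (over every system and every returned answer) are available as a list of R/X/U/W labels; the function name and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List

# Priority order described in the text: R first, then X, U and finally W.
PRIORITY = ("R", "X", "U", "W")


def combined_judgement(judgements: Iterable[str]) -> str:
    """Best judgement among all answers returned for one question."""
    seen = set(judgements)
    for label in PRIORITY:
        if label in seen:
            return label
    raise ValueError("no R/X/U/W judgement supplied")


# Hypothetical toy data: judgements pooled over all systems and answers.
toy: Dict[str, List[str]] = {
    "q1": ["W", "X", "R"],
    "q2": ["W", "U", "X"],
    "q3": ["W", "W"],
}
combined = {qid: combined_judgement(labels) for qid, labels in toy.items()}
print(combined)
```

Because every answer position is pooled, this virtual system is an upper bound on what the individual runs achieve with their first answer only.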

All systems crashed on the LIST questions. The best results were obtained by ICIA for DEFINITION questions, whereas UAIC performed best with the FACTOID questions. The combined system suggests that a joint system, developed by both groups, would substantially improve the general results for Romanian.

Using in a first stage the web interface for assessing the QA runs, developed at UNED in Spain, the assessment took into consideration one question with all its answers at a time, ensuring that the same evaluation criteria were applied to all answers. The judgment of the answers was based on the same Guidelines as in 2007; we therefore kept the same criteria as in 2007, in order to ensure consistency within the Romanian language, which also makes it possible to evaluate the evolution of the systems from one year to another. For example, one could easily see that the UAIC systems had most of the answers for the DEFINITION questions evaluated as ineXact, because the answers were judged as being "longer than the minimum amount of information required" and hence "unnecessary pieces of information were penalized". Since all the 2007 and 2008 answers were evaluated this way, we considered it more important to have uniformly applied rules within one language than to change the evaluation in order to be consistent across languages. On the other hand, the ICIA answers judged as ineXact are due to answers that are too long, to snippets shortened so much that they do not contain the answer, or to the answer and the snippet having no connection.

Table 31. Results in the monolingual task, Romanian as target language.

The evaluation was made more difficult because two of the submitted runs contained the answers in a totally arbitrary order, with topic-related questions having their answers in various parts of the submitted file. While in the first stage the UNED interface was of great help, after the xml file was generated with all the evaluations, the corrections needed a thorough manual inspection. Anyway, it was nice to find out that the answer to the question "Which terrorist organization does Osama bin Laden belong to?" is "Pentagon".


3.11 Spanish as Target

Participation in the Spanish as Target subtask decreased from 5 groups in 2007 to 4 groups this year. 6 runs were monolingual and 3 runs were cross-lingual. Table 32 shows the summary of system results with the number of Right (R), Wrong (W), Inexact (X) and Unsupported (U) answers. The table also shows the accuracy (in percentage) of factoids (F), factoids with temporal restriction (T), definitions (D) and list questions (L).

Table 32. Summary of system results, Spanish as target.

Run           R    W    X   U   %F      %T      %D   %L   Overall accuracy (%)
prib081eses   86   105  5   4   41,13   41,67   75   20   42,5
inao082eses   44   152  3   1   19,35   8,33    80   5    22
inao081eses   42   156  1   1   15,32   8,33    95   5    21
qaua082eses   39   156  4   1   22,58   13,89   30   -    19,5
mira081eses   32   156  3   9   12,90   2,78    75   -    16
mira082eses   29   159  3   9   11,29   2,78    70   -    14,50
qaua081enes   25   173  -   2   11,29   16,67   20   5    12,50
qaua082enes   18   176  3   3   9,68    8,33    15   -    9
mira081fres   10   185  2   3   5,65    -       15   -    5

Table 33 shows that the first question of the topic group is answered much more easily than the rest of the questions, which require resolving some references to previous questions and answers.

Regarding NIL questions, Table 34 shows the harmonic mean (F) of precision and recall for self-contained questions, linked questions and all questions, taking into account only the first answer. In most of the systems, NIL is not given as second or third candidate answer.
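The NIL measure can be sketched as follows, assuming that precision is the share of returned NIL answers that were correct and recall is the share of NIL questions for which NIL was returned. The function name and the counts in the example are hypothetical and are not taken from the results tables.

```python
from typing import Tuple


def nil_scores(returned_nil: int, correct_nil: int, expected_nil: int) -> Tuple[float, float, float]:
    """Precision, recall and their harmonic mean (F) for NIL answers.

    returned_nil: number of questions answered NIL by the system
    correct_nil:  number of those NIL answers that were right
    expected_nil: number of questions whose gold answer is NIL
    """
    precision = correct_nil / returned_nil if returned_nil else 0.0
    recall = correct_nil / expected_nil if expected_nil else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts for illustration only.
p, r, f = nil_scores(returned_nil=20, correct_nil=8, expected_nil=10)
print(f"P={p:.2f} R={r:.2f} F={f:.4f}")
```

A system that returns NIL very often can reach a high recall, but its precision, and therefore its F value, stays low.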

The correlation coefficient r between the self-score and the correctness of the answers (shown in Table 35) is similar to the one obtained last year; it is still not good enough, which explains the low results in the CWS and K1 [6] measures.

Table 35. Correlation coefficient r between self-score and correctness.

Run           r
prib081eses   0,26
inao082eses   0,14
inao081eses   0,19
qaua082eses   0,27
mira081eses   0,27
mira082eses   0,29
qaua081enes   0,26
qaua082enes   0,20
mira081fres   0,15
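A correlation of this kind can be computed as in the sketch below, which assumes that r is the Pearson coefficient between each answer's self-assigned confidence score and a 0/1 indicator of its correctness. The function name and the sample values are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: confidence scores and 0/1 correctness per answer.
self_scores = [0.9, 0.8, 0.3, 0.2]
correctness = [1, 1, 0, 0]
r_value = pearson_r(self_scores, correctness)
print(round(r_value, 3))
```

A value close to 1 would mean that the confidence scores separate right from wrong answers well, which is what measures such as CWS reward.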

With respect to the source of the answers, Table 36 shows that in this second year of using Wikipedia, this collection is now the main source of correct answers for most of the systems (with the exception of U. of Alicante).

Table 36. Source of the correct answers.

Run           % of correct answers   % of correct answers   % of correct
              found in EFE           found in Wikipedia     answers found NIL
prib081eses   36,97                  60,50                  2,52
inao082eses   24,14                  68,97                  6,90
inao081eses   25                     70                     5
qaua082eses   48,53                  42,65                  8,82
mira081eses   23,26                  69,77                  6,98
mira082eses   21,62                  70,27                  8,11
qaua081enes   52,27                  29,55                  18,18
qaua082enes   48,57                  34,29                  17,14
mira081fres   33,33                  41,67                  25

4 Conclusions

This year we proposed the same evaluation setting as in the 2007 campaign. In fact, last year the task was changed considerably and this affected the general level of results and also the level of participation in the QA task. This year participation increased slightly, but the task still proved to be very difficult. Wikipedia increased its presence as a source of questions and answers. In line with last year's conclusions, Wikipedia seemed to be a good source for finding answers to simple factoid questions.

Moreover, the overall decrease in accuracy was probably due to linked questions. This fact confirms that topic resolution is a weak point for QA systems.

Only 5 out of 11 target languages had more than one participating group. Thus, from the perspective of evaluation methodology, a comparison between systems working under similar circumstances cannot be accomplished, and this impedes one of the major goals of campaigns such as QA@CLEF, i.e. the comparison of systems, which could lead to improvements in approaching the problematic issues of QA.

In six years of QA experimentation, a lot of resources and know-how have been accumulated. Nevertheless, systems do not show a brilliant overall performance, even those that have participated in most QA campaigns, and they still do not seem to handle suitably the different challenges proposed.

In conclusion, it is clear that a redefinition of the task should be considered for the next campaign. This new definition of the task should permit the evaluation and comparison of systems even when they work in different languages. The new setting should also take as reference a real user scenario, perhaps with a new document collection.

Acknowledgements. Special thanks to Danilo Giampiccolo (CELCT, Trento, Italy), who has given his precious advice and valuable support at many levels for the preparation and realization of the QA track at CLEF 2008.

Jesús Herrera has been partially supported by the Spanish Ministry of Education and Science (TIN2006-14433-C02-01 project).

Anselmo Peñas has been partially supported by the Spanish Ministry of Science and Technology within the Text-Mess-INES project (TIN2006-15265-C06-02).

Paulo Rocha was supported by the Linguateca project, jointly funded by the Portuguese Government and the European Union (FEDER and FSE), under contract ref. POSC/339/1.3/C/NAC

References

1. QA@CLEF Website: http://clef-qa.itc.it/

2. AVE Website: http://nlp.uned.es/QA/ave/

3. QAST Website: http://www.lsi.upc.edu/~qast/

4. QA-WSD Website: http://ixa2.si.ehu.es/qawsd/

5. QA@CLEF 2007 Organizing Committee: Guidelines 2007. http://clef-qa.itc.it/2007/download/QA@CLEF07_Guidelines-for-Participants.pdf

6. Hartrumpf, S., Glöckner, I., Leveling, J.: University of Hagen at QA@CLEF 2007: Coreference Resolution for Questions and Answer Merging. This volume.

7. Herrera, J., Peñas, A., Verdejo, F.: Question Answering Pilot Task at CLEF 2004. In: Peters, C., Clough, P., Gonzalo, J., Jones, Gareth J.F., Kluck, M., Magnini, B. (eds.): Multilingual Information Access for Text, Speech and Images. Lecture Notes in Computer Science, Vol. 3491. Springer-Verlag, Berlin Heidelberg New York (2005) 581-590.

8. Ion, R.: Word Sense Disambiguation Methods Applied to English and Romanian. PhD thesis, Romanian Academy, Bucharest (2007).

9. Ion, R., Mititelu, V.B.: Constrained Lexical Attraction Models. In: Nineteenth International Florida Artificial Intelligence Research Society Conference, pp. 297-302. AAAI Press, Menlo Park, California, USA (2006).

10. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33 (1977) 159-174.

11. Laurent, D., Séguéla, P., Nêgre: Cross Lingual Question Answering using QRISTAL for CLEF 2007. This volume.

12. Magnini, B., Giampiccolo, D., Forner, P., Ayache, C., Jijkoun, V., Osenova, P., Peñas, A., Rocha, P., Sacaleanu, B., Sutcliffe, R.: Overview of the CLEF 2006 Multilingual Question Answering Track. In: Evaluation of Multilingual and Multi-modal Information Retrieval. Lecture Notes in Computer Science, Vol. 4730. Springer-Verlag, Berlin Heidelberg New York (2007) 223-256.

13. Peñas, A., Rodrigo, Á., Verdejo, F.: Overview of the Answer Validation Exercise 2007. This volume.

14. Turmo, J., Comas, P., Ayache, C., Mostefa, D., Rosset, S., Lamel, L.: Overview of QAST 2007.

15. Vallin, A., Magnini, B., Giampiccolo, D., Aunimo, L., Ayache, C., Osenova, P., Peñas, A., de Rijke, M., Sacaleanu, B., Santos, D., Sutcliffe, R.: Overview of the CLEF 2005 Multilingual Question Answering Track. In: Accessing Multilingual Information Repositories. Lecture Notes in Computer Science, Vol. 4022. Springer-Verlag, Berlin Heidelberg New York (2006) 307-331.

16. Voorhees, E.: Overview of the TREC 2002 Question Answering Track. In: NIST Special Publication 500-251: The Eleventh Text REtrieval Conference (TREC 2002). National Institute of Standards and Technology, USA (2002).

17. Agirre, E., Lopez de Lacalle, O.: UBC-ALM: Combining k-NN with SVD for WSD. In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007), pp. 341-345. Prague, Czech Republic (2007).

18. Chan, Y.S., Ng, H.T., Zhong, Z.: NUS-PT: Exploiting Parallel Texts for Word Sense Disambiguation in the English All-Words Tasks. In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007), pp. 253-256. Prague, Czech Republic (2007).