<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>APPENDIX B: Results of the Multiple Language Question Answering Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Vallin</string-name>
          <email>vallin@itc.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Herrera</string-name>
          <email>jesus.herrera@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpto. Lenguajes y Sistemas Informáticos, UNED</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ITC-Irst</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2004</year>
      </pub-date>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Run Characteristics and Results</title>
      <p>List of Run Characteristics
(*) The DAEDALUS group submitted the results after the scheduled deadline.</p>
      <p>Results for Main Tasks
In the following six pages the results for the main QA tasks are given. They are grouped by target
language, so that there is a separate table per language; several tasks can be grouped under the same target
language.</p>
      <p>Each table provides the following information:
- the name of the submitted run;
- the task in which the group participated;
- the number of answers contained in each submission (divided into Right, Wrong, ineXact and
Unsupported). In all the tasks there were 200 questions, and systems were allowed to return just one response
per question. Nevertheless, some runs contain fewer than 200 answers, because some questions that contained
mistakes were discarded;
- the overall accuracy of each run (i.e. the percentage of Right answers);
- the accuracy over the Factoid questions;
- the accuracy over the Definition questions (test sets contained around 20 of them);
- the systems’ Precision and Recall in recognising the questions that did not have an answer (the correct
answer-string was “NIL”);
- the Confidence-weighted Score, which takes into account the systems’ ability to rank the answers according
to confidence. This additional measure ranges between 0 (no correct response at all) and 1 (all the answers are
correct and the system is always confident about them). Since the confidence value was not mandatory, the
Confidence-weighted Score was not computed for all the runs.
</p>
      <p>[Results tables per target language, with columns for Right, Wrong, ineXact and Unsupported answers, overall/Factoid/Definition accuracy, NIL Precision and Recall, and the Confidence-weighted Score.]</p>
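The Confidence-weighted Score described above can be sketched as follows. This is a minimal illustration, assuming the answers are sorted by decreasing self-reported confidence; it is not the official evaluation script.

```python
def confidence_weighted_score(ranked_correct):
    """Confidence-weighted Score over Q ranked answers.

    ranked_correct: one boolean per question, sorted by the system's
    decreasing self-reported confidence; True means the answer was
    assessed as Right.
    Returns (1/Q) * sum over i of (number correct in first i ranks) / i.
    """
    q = len(ranked_correct)
    correct_so_far = 0
    total = 0.0
    for i, right in enumerate(ranked_correct, start=1):
        if right:
            correct_so_far += 1
        total += correct_so_far / i
    return total / q

# All answers correct and confidently ranked -> 1.0;
# no correct answers at all -> 0.0, matching the range given above.
print(confidence_weighted_score([True, True, True]))  # 1.0
print(confidence_weighted_score([False, False]))      # 0.0
```

Because early ranks contribute to every partial sum, the measure rewards systems that place the answers they are most confident about at the top of their submission.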
      <p>An additional pilot task was set up only for Spanish. Unlike the main tasks, it proposed list questions and
questions that required more sophisticated temporal reasoning.</p>
      <p>The following table describes the results of the run alivpilot, submitted by the University of Alicante, that was
the only participating team. Results have been grouped by type of question (definition, factoid, list, temporally
restricted by date, temporally restricted by event and temporally restricted by period).</p>
      <p>In addition, a couple of the posed questions had no answer in the corpus (NIL) but the system did not recognise
them.</p>
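NIL recognition is scored with the usual precision/recall definitions. The sketch below is our simplified illustration of those definitions; the function and variable names are ours, not taken from the evaluation scripts.

```python
def nil_precision_recall(returned_nil, gold_nil):
    """NIL precision and recall.

    returned_nil: set of question ids the system answered with "NIL";
    gold_nil: set of question ids whose correct answer-string is "NIL".
    """
    correct = returned_nil & gold_nil
    precision = len(correct) / len(returned_nil) if returned_nil else 0.0
    recall = len(correct) / len(gold_nil) if gold_nil else 0.0
    return precision, recall

# System answers NIL for questions {1, 2, 3}; the true NIL questions
# are {2, 3, 4}: precision 2/3, recall 2/3.
p, r = nil_precision_recall({1, 2, 3}, {2, 3, 4})
print(p, r)
```

A system that never returns NIL, like the alivpilot run on these questions, gets a recall of 0 regardless of how well it answers the rest of the test set.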
      <p>The table provides the following information:
- the number of questions;
- the number of known distinct answers, i.e., the number of different and correct answers retrieved by the
University of Alicante system in its exercise and by humans during the pre-assessment process;
- the number of given answers;
- the number of questions with at least 1 correct answer, i.e., questions with at least 1 answer assessed as
Right;
- the number of given correct answers;
- the system's recall in recognising correct answers, i.e., the ratio between the number of given correct answers
and the number of known distinct answers;
- the system's precision in recognising correct answers, i.e., the ratio between the number of given correct
answers and the number of given answers;
- the K-measure value; this metric ranges in [-1, 1] and rewards systems that:
• answer as many questions as possible,
• give as many different right answers for each question as possible,
• give as few wrong answers to each question as possible,
• assign higher values of the score to right answers,
• assign lower values of the score to wrong answers,
• answer the questions having fewer known answers;
- the correlation coefficient (r) between the confidence score and the human assessment; human assessment equals
1 when an answer is assessed as Right and 0 otherwise; r gives an idea of the quality of the system's
self-scoring.</p>
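The recall, precision and correlation measures just listed can be sketched as below. This is a minimal sketch under the ratio definitions given above; the K-measure itself involves per-question weighting and is not reproduced here.

```python
def pilot_recall(given_correct, known_distinct):
    # ratio of given correct answers to known distinct answers
    return given_correct / known_distinct

def pilot_precision(given_correct, given_total):
    # ratio of given correct answers to all given answers
    return given_correct / given_total

def assessment_correlation(scores, assessments):
    """Pearson r between self-confidence scores and 0/1 human assessment.

    Returns None when either variable has zero variance, i.e. r is
    Not Available (cf. the dagger note in the pilot-task table).
    """
    n = len(scores)
    mean_s = sum(scores) / n
    mean_a = sum(assessments) / n
    cov = sum((s - mean_s) * (a - mean_a)
              for s, a in zip(scores, assessments))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_a = sum((a - mean_a) ** 2 for a in assessments)
    if var_s == 0 or var_a == 0:
        return None
    return cov / (var_s * var_a) ** 0.5

# 5 correct answers out of 20 known distinct answers.
print(pilot_recall(5, 20))  # 0.25
# High confidence on the Right answer, low on the Wrong one.
print(assessment_correlation([0.9, 0.1], [1, 0]))
```

A positive r means the system tends to assign higher confidence to answers that assessors judge Right, which is exactly the self-scoring quality the last bullet describes.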
    </sec>
    <sec id="sec-2">
      <title>Results of the Pilot Task</title>
      <p>[Results table for the run alivpilot, with one row per question type (Definition, Factoid, List, Temp. Date, Temp. Event, Temp. Period) and a Total row, reporting the number of questions, recall, precision and the K-measure. Reported K values: N/A †, -0.089, 0.284, N/A, 0.255, 0.648.]</p>
      <p>† r is Not Available because 0 was given for every component of any variable.</p>
    </sec>
  </body>
</article>