<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>APPENDIX B: Results of the Multiple Language Question Answering Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Vallin</string-name>
          <email>vallin@itc.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Herrera</string-name>
          <email>jesus.herrera@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpto. Lenguajes y Sistemas Informáticos, UNED</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ITC-Irst</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2004</year>
      </pub-date>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Run Characteristics and Results</title>
      <p>List of Run Characteristics
(*) The DAEDALUS group submitted the results after the scheduled deadline.</p>
      <p>Results for Main Tasks
In the following six pages the results for the main QA tasks are given. They are grouped by target
language, so that there is a separate table per language; several tasks can be grouped under the same target
language.</p>
      <p>Each table provides the following information:
- the name of the submitted run;
- the task in which the group participated;
- the number of answers contained in each submission (divided into Right, Wrong, ineXact and
Unsupported). In all the tasks there were 200 questions, and systems were allowed to return just one response
per question. Nevertheless, some runs contain fewer than 200 answers, because some questions that contained
mistakes were discarded;
- the overall accuracy of each run (i.e. the percentage of Right answers);
- the accuracy over the Factoid questions;
- the accuracy over the Definition questions (test sets contained around 20 of them);
- the systems’ Precision and Recall in recognising the questions that did not have an answer (the correct
answer-string was “NIL”);
- the Confidence-weighted Score, which takes into account the systems’ ability to rank the answers according
to confidence. This additional measure ranges between 0 (no correct response at all) and 1 (all the answers are
correct and the system is always confident about them). Since the confidence value was not mandatory, the
Confidence-weighted Score was not computed for all the runs.
</p>
      <p>[Results tables per target language, with columns for Right, Wrong, ineXact and Unsupported answers, overall/Factoid/Definition accuracy, NIL Precision and Recall, and the Confidence-weighted Score.]</p>
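The Confidence-weighted Score described above can be sketched as follows. This is a minimal illustration, assuming the answers are sorted by decreasing self-reported confidence; it is not the official evaluation script.

```python
def confidence_weighted_score(ranked_correct):
    """Confidence-weighted Score over Q ranked answers.

    ranked_correct: one boolean per question, sorted by the system's
    decreasing self-reported confidence; True means the answer was
    assessed as Right.
    Returns (1/Q) * sum over i of (number correct in first i ranks) / i.
    """
    q = len(ranked_correct)
    correct_so_far = 0
    total = 0.0
    for i, right in enumerate(ranked_correct, start=1):
        if right:
            correct_so_far += 1
        total += correct_so_far / i
    return total / q

# All answers correct and confidently ranked -> 1.0;
# no correct answers at all -> 0.0, matching the range given above.
print(confidence_weighted_score([True, True, True]))  # 1.0
print(confidence_weighted_score([False, False]))      # 0.0
```

Because early ranks contribute to every partial sum, the measure rewards systems that place the answers they are most confident about at the top of their submission.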
      <p>An additional pilot task was set up only for Spanish. Unlike the main tasks, it proposed list questions and
questions that required more sophisticated temporal reasoning.</p>
      <p>The following table describes the results of the run alivpilot, submitted by the University of Alicante, that was
the only participating team. Results have been grouped by type of question (definition, factoid, list, temporally
restricted by date, temporally restricted by event and temporally restricted by period).</p>
      <p>In addition, a couple of the posed questions had no answer in the corpus (NIL) but the system did not recognise
them.</p>
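NIL recognition is scored with the usual precision/recall definitions. The sketch below is our simplified illustration of those definitions; the function and variable names are ours, not taken from the evaluation scripts.

```python
def nil_precision_recall(returned_nil, gold_nil):
    """NIL precision and recall.

    returned_nil: set of question ids the system answered with "NIL";
    gold_nil: set of question ids whose correct answer-string is "NIL".
    """
    correct = returned_nil & gold_nil
    precision = len(correct) / len(returned_nil) if returned_nil else 0.0
    recall = len(correct) / len(gold_nil) if gold_nil else 0.0
    return precision, recall

# System answers NIL for questions {1, 2, 3}; the true NIL questions
# are {2, 3, 4}: precision 2/3, recall 2/3.
p, r = nil_precision_recall({1, 2, 3}, {2, 3, 4})
print(p, r)
```

A system that never returns NIL, like the alivpilot run on these questions, gets a recall of 0 regardless of how well it answers the rest of the test set.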
      <p>The table provides the following information:
- the number of questions;
- the number of known distinct answers, i.e., the number of different and correct answers retrieved by the
University of Alicante system in its exercise and by humans during the pre-assessment process;
- the number of given answers;
- the number of questions with at least 1 correct answer, i.e., questions with at least 1 answer assessed as
Right;
- the number of given correct answers;
- the system's recall in recognising correct answers, i.e., the ratio between the number of given correct answers
and the number of known distinct answers;
- the system's precision in recognising correct answers, i.e., the ratio between the number of given correct
answers and the number of given answers;
- the K-measure value; this metric ranges in [-1, 1] and rewards systems that:
• answer as many questions as possible,
• give as many different right answers for each question as possible,
• give as few wrong answers to each question as possible,
• assign higher values of the score to right answers,
• assign lower values of the score to wrong answers,
• answer the questions having fewer known answers;
- the correlation coefficient (r) between the confidence score and the human assessment; human assessment equals
1 when an answer is assessed as Right and 0 otherwise; r gives an idea of the quality of the system's
self-scoring.</p>
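The recall, precision and correlation measures just listed can be sketched as below. This is a minimal sketch under the ratio definitions given above; the K-measure itself involves per-question weighting and is not reproduced here.

```python
def pilot_recall(given_correct, known_distinct):
    # ratio of given correct answers to known distinct answers
    return given_correct / known_distinct

def pilot_precision(given_correct, given_total):
    # ratio of given correct answers to all given answers
    return given_correct / given_total

def assessment_correlation(scores, assessments):
    """Pearson r between self-confidence scores and 0/1 human assessment.

    Returns None when either variable has zero variance, i.e. r is
    Not Available (cf. the dagger note in the pilot-task table).
    """
    n = len(scores)
    mean_s = sum(scores) / n
    mean_a = sum(assessments) / n
    cov = sum((s - mean_s) * (a - mean_a)
              for s, a in zip(scores, assessments))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_a = sum((a - mean_a) ** 2 for a in assessments)
    if var_s == 0 or var_a == 0:
        return None
    return cov / (var_s * var_a) ** 0.5

# 5 correct answers out of 20 known distinct answers.
print(pilot_recall(5, 20))  # 0.25
# High confidence on the Right answer, low on the Wrong one.
print(assessment_correlation([0.9, 0.1], [1, 0]))
```

A positive r means the system tends to assign higher confidence to answers that assessors judge Right, which is exactly the self-scoring quality the last bullet describes.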
    </sec>
    <sec id="sec-2">
      <title>Results of the Pilot Task</title>
      <p>[Results table for the run alivpilot, with one row per question type (Definition, Factoid, List, Temp. Date, Temp. Event, Temp. Period) and a Total row, reporting the number of questions, recall, precision and the K-measure. Reported K values: N/A †, -0.089, 0.284, N/A, 0.255, 0.648.]</p>
      <p>† r is Not Available because 0 was given for every component of any variable.</p>
    </sec>
  </body>
</article>