<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anselmo Peñas</string-name>
          <email>anselmo@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Álvaro Rodrigo</string-name>
          <email>alvarory@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felisa Verdejo</string-name>
          <email>felisa@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpto. Lenguajes y Sistemas Informáticos</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <abstract>
        <p>The Answer Validation Exercise at the Cross Language Evaluation Forum is aimed at developing systems able to decide whether the answer of a Question Answering system is correct or not. We present here the exercise description, the changes in the evaluation methodology with respect to the first edition, and the results of this second edition (AVE 2007). The changes in the evaluation methodology had two objectives. The first was to quantify the gain in performance when more sophisticated validation modules are introduced in QA systems. The second was to bring systems based on Textual Entailment to the Automatic Hypothesis Generation problem, which is not itself part of the Recognising Textual Entailment (RTE) task but is needed in the Answer Validation setting. Nine groups participated with 16 runs in 4 different languages. Compared with the QA systems, the results show evidence of the potential gain that more sophisticated AV modules introduce in the QA task.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Textual Entailment</kwd>
        <kwd>Answer Validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The first Answer Validation Exercise (AVE 2006) [7] was launched last year in order to promote the development and evaluation of subsystems aimed at validating the correctness of the answers given by QA systems. In some sense, systems must emulate human assessment of QA responses and decide whether an answer is correct or not according to a given text. This automatic Answer Validation is expected to be useful for improving the performance of QA systems [5]. However, the evaluation methodology in AVE 2006 did not allow this improvement to be quantified, and thus the exercise was modified in AVE 2007.</p>
      <p>Figure 1 shows the relationship between the QA main track and the Answer Validation Exercise. The main track provides the questions made by the organization and the responses given by the participant systems once they are judged by humans.</p>
      <p>[Figure 1. Relationship between the QA main track and the Answer Validation Exercise. Diagram labels: Question Answering Track; Questions; Systems' answers; Systems' Supporting Texts; Human Judgements (R, W, X, U); Mapping (ACCEPT, REJECT); Answer Validation Exercise; Systems' Validation (ACCEPT, REJECT); Evaluation; AVE Track results.]</p>
      <p>Another difference with respect to AVE 2006 is the input given to the participant systems. Last year we promoted an architecture based on Textual Entailment, trying to bring research groups working on machine learning to Question Answering. Thus, we provided the hypotheses already built from the questions and answers [6] (see Figure 2). The exercise was then similar to the RTE Challenges [1] [2] [3], where systems must decide whether or not the supporting text entails the hypothesis.</p>
      <p>In this edition, on the contrary, we left open the problem of Automatic Hypothesis Generation for those systems based on Textual Entailment. In this way, the task is more realistic and closer to the Answer Validation problem, where systems receive a triplet (Question, Answer, Supporting Text) instead of a pair (Hypothesis, Text) (see Figure 2).</p>
      <p>[Figure 2. Diagram contrasting AVE 2006 and AVE 2007. Labels: Question; Candidate answer; Supporting Text; Automatic Hypothesis Generation; Hypothesis; Textual Entailment (AVE 2006); Answer Validation (AVE 2007); ACCEPT, REJECT.]</p>
      <p>Section 2 describes the exercise in more detail. The development and testing collections are described in Section 3. Section 4 discusses the evaluation measures. Section 5 reports the results obtained by the participants and, finally, Section 6 presents some conclusions and future work.</p>
      <preformat>&lt;q id="116" lang="EN"&gt;
  &lt;q_str&gt;What is Zanussi?&lt;/q_str&gt;
  &lt;a id="116_1" value=""&gt;
    &lt;a_str&gt;was an Italian producer of home appliances&lt;/a_str&gt;
    &lt;t_str doc="Zanussi"&gt;Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought&lt;/t_str&gt;
  &lt;/a&gt;
  &lt;a id="116_2" value=""&gt;
    &lt;a_str&gt;who had also been in Cassibile since August 31&lt;/a_str&gt;
    &lt;t_str doc="en/p29/2998260.xml"&gt;Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.&lt;/t_str&gt;
  &lt;/a&gt;
  &lt;a id="116_4" value=""&gt;
    &lt;a_str&gt;3&lt;/a_str&gt;
    &lt;t_str doc="1618911.xml"&gt;(1985) 3 Out of 5 Live (1985) What Is This?&lt;/t_str&gt;
  &lt;/a&gt;
&lt;/q&gt;</preformat>
      <p>Figure 3. Excerpt of the input: (Answer, Supporting Text) pairs grouped by question.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Exercise Description</title>
      <p>In this edition, participant systems received a set of triplets (Question, Answer, Supporting Text) and had to return a value for each triplet, either accepting or rejecting it. In more detail, the input was a set of (Answer, Supporting Text) pairs grouped by Question (see Figure 3). Systems had to consider the Question and validate each of the (Answer, Supporting Text) pairs. The number of answers to be validated per question depended on the number of participant systems in the Question Answering main track.</p>
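      <p>As an illustration of how such input could be read, the following minimal sketch parses one question group into (Question, Answer, Supporting Text) triplets with the Python standard library. The element and attribute names (q, q_str, a, a_str, t_str, id) are taken from the excerpt in Figure 3; the shortened sample text, the function name read_triplets and the assumption that a single q element is the root are ours, and the real collection files may wrap several q elements in an enclosing element.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened excerpt that follows the layout of Figure 3.
SAMPLE = """
<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
"""


def read_triplets(xml_text):
    """Yield (q_id, a_id, question, answer, supporting_text) for one question group."""
    q = ET.fromstring(xml_text)
    question = q.findtext("q_str")
    for a in q.findall("a"):
        yield (q.get("id"), a.get("id"), question,
               a.findtext("a_str"), a.findtext("t_str"))


triplets = list(read_triplets(SAMPLE))
for q_id, a_id, question, answer, text in triplets:
    print(q_id, a_id, repr(answer))
```
]]></preformat>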
      <p>Participant systems had to return one of the following values for each answer, according to the response format (see Figure 4):</p>
      <preformat>q_id a_id [SELECTED|VALIDATED|REJECTED] confidence</preformat>
      <p>VALIDATED indicates that the answer is correct and supported by the given text. There is no restriction on the number of VALIDATED answers (from zero to all).</p>
      <p>SELECTED indicates that the answer is VALIDATED and that it is the one chosen as the output of a hypothetical QA system. The SELECTED answers are evaluated against the QA systems of the Main Track. No more than one answer per question can be marked as SELECTED. At least one of the VALIDATED answers must be marked as SELECTED.</p>
      <p>REJECTED indicates that the answer is incorrect or that there is not enough evidence of its correctness. There is no restriction on the number of REJECTED answers (from zero to all).</p>
      <p>This configuration allowed us to compare the responses of the AV systems with those of the QA systems, and to obtain some evidence about the gain in performance that sophisticated AV modules can give to QA systems (see below).</p>
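      <p>The constraints on the response values can be checked mechanically. The sketch below is only an illustration: it assumes whitespace-separated fields and a numeric confidence, and it reads the rule on SELECTED answers as "at most one SELECTED per question, and one SELECTED whenever any answer of the question is accepted". The function names and the sample run are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

VALUES = {"SELECTED", "VALIDATED", "REJECTED"}


def parse_response_line(line):
    """Split one 'q_id a_id VALUE confidence' line into typed fields."""
    q_id, a_id, value, confidence = line.split()
    if value not in VALUES:
        raise ValueError(f"unknown value: {value}")
    # Assumption: the confidence is a decimal number.
    return q_id, a_id, value, float(confidence)


def check_run(lines):
    """Return the q_ids whose answers break the SELECTED constraints."""
    per_question = defaultdict(list)
    for line in lines:
        q_id, _a_id, value, _confidence = parse_response_line(line)
        per_question[q_id].append(value)
    bad = []
    for q_id, values in per_question.items():
        selected = values.count("SELECTED")
        accepted = selected + values.count("VALIDATED")
        # At most one SELECTED; if anything is accepted, one must be SELECTED.
        if selected > 1 or (accepted > 0 and selected == 0):
            bad.append(q_id)
    return bad


run = [
    "116 116_1 SELECTED 0.9",
    "116 116_2 REJECTED 0.8",
    "117 117_1 VALIDATED 0.6",
    "117 117_2 REJECTED 0.7",
]
problems = check_run(run)
print(problems)
```
]]></preformat>
      <p>In this hypothetical run, question 117 is reported because it has a VALIDATED answer but no SELECTED one.</p>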
    </sec>
    <sec id="sec-3">
      <title>3. Collections</title>
      <p>Since our objective was to compare AVE results with the QA main track results, we had to ensure that AV systems received no extra information. Grouping all the answers to the same question could provide extra information based on counting answer redundancies, which QA systems might not be considering. For this reason we removed duplicated answers inside the same question group. In fact, if an answer was contained in another answer, the shorter one was removed. Finally, NIL answers, void answers and answers with a supporting snippet longer than 700 characters (the maximum permitted in the main track) were discarded when building the collections. This processing led to a reduction in the number of answers to be validated (see Tables 1 and 2): from 11.2% in the Italian test collection to 88.3% in the Bulgarian development collection.</p>
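      <p>The filtering steps described above can be sketched as follows. This is not the organizers' script but a minimal illustration under stated assumptions: answers are compared as exact, case-sensitive strings, a NIL answer is the literal string "NIL", and when two answers are identical the first one is kept. The function name filter_answers and the sample group are hypothetical.</p>
      <preformat><![CDATA[
```python
MAX_SNIPPET_CHARS = 700  # maximum snippet length permitted in the main track


def filter_answers(pairs):
    """Filter one question group of (answer, supporting_text) pairs."""
    kept = []
    for answer, snippet in pairs:
        answer = answer.strip()
        if not answer or answer == "NIL":      # void and NIL answers
            continue
        if len(snippet) > MAX_SNIPPET_CHARS:   # oversized supporting snippets
            continue
        kept.append((answer, snippet))

    result = []
    for i, (answer, snippet) in enumerate(kept):
        # An answer is redundant if it is contained in another kept answer;
        # among exact duplicates only the first occurrence survives.
        redundant = any(
            j != i and answer in other and (answer != other or j < i)
            for j, (other, _) in enumerate(kept)
        )
        if not redundant:
            result.append((answer, snippet))
    return result


group = [
    ("Zanussi", "t1"),
    ("Zanussi", "t2"),
    ("an Italian producer", "t3"),
    ("an Italian producer of home appliances", "t4"),
    ("NIL", ""),
    ("", "t5"),
    ("long snippet answer", "x" * 701),
]
filtered = filter_answers(group)
print(filtered)
```
]]></preformat>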
      <p>For the assessments, we reused the QA judgements because they were made considering the supporting snippets, in the same way that AV systems must. The relation between QA assessments and AVE judgements was the following:</p>
      <list list-type="bullet">
        <list-item><p>Answers judged as Correct have a value equal to VALIDATED.</p></list-item>
        <list-item><p>Answers judged as Wrong or Unsupported have a value equal to REJECTED.</p></list-item>
        <list-item><p>Answers judged as Inexact have a value equal to UNKNOWN and are ignored for evaluation purposes.</p></list-item>
        <list-item><p>Answers not evaluated at the QA main track (if any) are also tagged as UNKNOWN and are also ignored in the evaluation.</p></list-item>
      </list>
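      <p>This mapping amounts to a small lookup table. In the sketch below the one-letter codes are an assumption about how the QA judgements are encoded (R for Correct, W for Wrong, X for Inexact, U for Unsupported); the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
# Assumed letter codes: R = Correct, W = Wrong, X = Inexact, U = Unsupported.
QA_TO_AVE = {
    "R": "VALIDATED",
    "W": "REJECTED",
    "U": "REJECTED",
    "X": "UNKNOWN",
}


def ave_judgement(qa_judgement):
    """Map a QA assessment to its AVE value; unassessed answers are UNKNOWN."""
    return QA_TO_AVE.get(qa_judgement, "UNKNOWN")


def evaluable(judged_answers):
    """Keep only the (answer_id, ave_value) pairs that count in the evaluation."""
    mapped = [(a_id, ave_judgement(j)) for a_id, j in judged_answers]
    return [(a_id, value) for a_id, value in mapped if value != "UNKNOWN"]


judged = [("1", "R"), ("2", "W"), ("3", "X"), ("4", "U"), ("5", None)]
usable = evaluable(judged)
print(usable)
```
]]></preformat>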
      <sec id="sec-3-1">
        <title>3.1. Development Collections</title>
        <p>Development collections were obtained from the QA@CLEF 2006 [6] main track questions and answers. Table
1 shows the number of questions and answers for each language together with the percentage that these answers
represent over the number of answers initially available, and the number of answers with VALIDATED and
REJECTED values.</p>
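        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption>
            <p>Number of questions and answers in the AVE 2007 development collections</p>
          </caption>
          <table>
            <tbody>
              <tr><td>Questions</td><td>187</td><td>200</td><td>200</td><td>200</td><td>192</td><td>198</td><td>200</td></tr>
              <tr><td>Answers (final)</td><td>504</td><td>1121</td><td>1817</td><td>1503</td><td>476</td><td>528</td><td>817</td></tr>
              <tr><td>% over available answers</td><td>31.5%</td><td>62.28%</td><td>53.44%</td><td>50.1%</td><td>47.6%</td><td>44%</td><td>40.85%</td></tr>
              <tr><td>VALIDATED</td><td>135</td><td>130</td><td>265</td><td>263</td><td>86</td><td>100</td><td>153</td></tr>
              <tr><td>REJECTED</td><td>369</td><td>991</td><td>1552</td><td>1240</td><td>390</td><td>428</td><td>664</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>These collections were available to participants, after their registration at CLEF, at http://nlp.uned.es/QA/ave/</p>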
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Test Collections</title>
        <p>Test collections were obtained from the QA@CLEF 2007 main track. In this edition, questions were grouped by topic [4]. The first question of a topic was self-contained, in the sense that no information outside the question is needed to answer it. However, the rest of the questions in a topic can refer to implicit information linked to the previous questions and answers of the topic group (anaphora, co-reference, etc.).</p>
        <p>For the AVE 2007 test collections we only made use of the self-contained questions (the first one of each topic group) and their respective answers given by the participant systems in QA.</p>
        <p>The change of the task produced a lower participation in the main track because systems were not tuned in time. This fact, together with the consideration of fewer questions and the elimination of redundancies, led to a reduction of the evaluation corpora in AVE 2007.</p>
        <p>Table 2 shows the number of questions and the number of answers to be validated (or rejected) in the test collections, together with the percentage that these answers represent over the answers initially available.</p>
        <p>[Table 2. AVE 2007 test collections. Surviving columns: English: 56, 70, 11.67%, 49, 21; Romanian: 100, 127, 52.05%, 45, 58, 24.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>4. Evaluation of the Answer Validation Exercise</title>
        <p>In [7] it was argued why the AVE evaluation is based on the detection of the correct answers. Instead of using overall accuracy as the evaluation measure, we proposed the use of precision (1), recall (2) and F-measure (3) (harmonic mean) over the answers that must be VALIDATED. In other words, we proposed to quantify the systems' ability to detect whether there is enough evidence to accept an answer.</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\mathit{precision} = \frac{|\mathit{predicted\_correctly\_as\_SELECTED\_or\_VALIDATED}|}{|\mathit{predicted\_as\_SELECTED\_or\_VALIDATED}|}</tex-math>
        </disp-formula>
        <disp-formula id="eq-2">
          <label>(2)</label>
          <tex-math>\mathit{recall} = \frac{|\mathit{predicted\_correctly\_as\_SELECTED\_or\_VALIDATED}|}{|\mathit{CORRECT\_answers}|}</tex-math>
        </disp-formula>
        <disp-formula id="eq-3">
          <label>(3)</label>
          <tex-math>F = \frac{2 \cdot \mathit{recall} \cdot \mathit{precision}}{\mathit{recall} + \mathit{precision}}</tex-math>
        </disp-formula>
        <p>Results can be compared between systems, but always taking as reference the following baselines:</p>
        <list list-type="order">
          <list-item><p>A system that accepts all answers (returns VALIDATED or SELECTED in 100% of cases).</p></list-item>
          <list-item><p>A system that accepts 50% of the answers (random).</p></list-item>
        </list>
        <p><sup>1</sup> Assessments not available at the time this report was submitted.</p>
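        <p>Measures (1) to (3) can be computed from two tables indexed by answer id. The sketch below is illustrative only: the dictionary layout, the function name and the convention of returning 0 when a denominator is empty are our assumptions. Answers whose judgement is UNKNOWN are skipped, since they are ignored in the evaluation.</p>
        <preformat><![CDATA[
```python
ACCEPT = {"SELECTED", "VALIDATED"}


def precision_recall_f(predicted, gold):
    """predicted: a_id -> system value; gold: a_id -> VALIDATED, REJECTED or UNKNOWN."""
    tp = accepted = correct = 0
    for a_id, truth in gold.items():
        if truth == "UNKNOWN":        # ignored for evaluation purposes
            continue
        is_accepted = predicted.get(a_id) in ACCEPT
        is_correct = truth == "VALIDATED"
        accepted += is_accepted
        correct += is_correct
        tp += is_accepted and is_correct
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = tp / accepted if accepted else 0.0
    recall = tp / correct if correct else 0.0
    f = 2 * recall * precision / (recall + precision) if tp else 0.0
    return precision, recall, f


gold = {"a1": "VALIDATED", "a2": "VALIDATED", "a3": "REJECTED",
        "a4": "UNKNOWN", "a5": "REJECTED"}
predicted = {"a1": "SELECTED", "a2": "REJECTED", "a3": "VALIDATED",
             "a4": "VALIDATED", "a5": "REJECTED"}
scores = precision_recall_f(predicted, gold)
print(scores)
```
]]></preformat>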
        <p>However, this is an intrinsic evaluation, which is not enough for comparing AVE results with QA results in order to obtain some evidence about the benefit of incorporating more sophisticated validation systems into the QA architecture. Some recent work [5] has shown how the use of textual entailment can improve the accuracy of QA systems. Our aim was to obtain evidence of this improvement in a comparative and shared evaluation.</p>
        <p>For this reason, a new measure (4), very easy to understand, was applied in AVE 2007. Since answers were grouped by question and AV systems were requested to SELECT one or none of them, the resulting behaviour is comparable to that of a QA system: for each question there is no more than one SELECTED answer. The proportion of correctly selected answers is a measure comparable to the accuracy used in the QA Main Track and, therefore, we can compare AV systems taking as reference the performance of the QA systems over the questions involved in the AVE test collections.</p>
        <disp-formula id="eq-4">
          <label>(4)</label>
          <tex-math>\mathit{qa\_accuracy} = \frac{|\mathit{answers\_SELECTED\_correctly}|}{|\mathit{questions}|}</tex-math>
        </disp-formula>
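        <p>A minimal sketch of measure (4), assuming that the SELECTED answer of each question is stored by question id (None when nothing was selected) and that the gold judgements are stored by answer id. The names and the toy data are hypothetical.</p>
        <preformat><![CDATA[
```python
def qa_accuracy(selected, gold, n_questions):
    """selected: q_id -> selected a_id or None; gold: a_id -> AVE judgement."""
    hits = sum(
        1 for a_id in selected.values()
        if a_id is not None and gold.get(a_id) == "VALIDATED"
    )
    return hits / n_questions


selected = {"q1": "a1", "q2": "b2", "q3": None, "q4": "d1"}
gold = {"a1": "VALIDATED", "b2": "REJECTED", "d1": "VALIDATED"}
accuracy = qa_accuracy(selected, gold, n_questions=4)
print(accuracy)
```
]]></preformat>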
        <p>This measure has an upper bound given by the proportion of questions that have at least one correct answer (in their corresponding group). This upper bound corresponds to a perfect selection of the correct answers given by all the QA systems at the main track. The normalization of qa_accuracy with this upper bound is given in (5). We will also refer to this measure as the percentage of the perfect selection (normalized_qa_accuracy × 100).</p>
        <disp-formula id="eq-5">
          <label>(5)</label>
          <tex-math>\mathit{normalized\_qa\_accuracy} = \frac{|\mathit{answers\_SELECTED\_correctly}|}{|\mathit{questions\_with\_correct\_answers}|}</tex-math>
        </disp-formula>
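        <p>The relation between qa_accuracy, its upper bound and the normalized measure (5) can be seen with a small worked example. The counts below are invented for illustration and do not come from the AVE collections.</p>
        <preformat><![CDATA[
```python
def normalized_qa_accuracy(n_selected_correctly, n_questions_with_correct_answers):
    """Measure (5): correctly SELECTED answers over questions that can be answered."""
    return n_selected_correctly / n_questions_with_correct_answers


def upper_bound(n_questions_with_correct_answers, n_questions):
    """Best qa_accuracy reachable by a perfect selection."""
    return n_questions_with_correct_answers / n_questions


# Hypothetical counts: 200 questions, 120 of them with at least one
# correct answer in their group, 90 answers selected correctly.
n_questions, n_answerable, n_hits = 200, 120, 90

qa_acc = n_hits / n_questions                       # measure (4)
bound = upper_bound(n_answerable, n_questions)
normalized = normalized_qa_accuracy(n_hits, n_answerable)

# Dividing qa_accuracy by its upper bound gives the same value as (5).
print(qa_acc, bound, normalized, qa_acc / bound)
```
]]></preformat>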
        <p>Besides the upper bound, results of qa_accuracy can be compared with the following baseline: a system that validates 100% of the answers and randomly selects one of them. Thus, this baseline can be seen as the average proportion of correct answers per question group (6).</p>
        <disp-formula id="eq-6">
          <label>(6)</label>
          <tex-math>\mathit{random\_qa\_accuracy} = \frac{1}{|\mathit{questions}|} \sum_{q \in \mathit{questions}} \frac{|\mathit{correct\_answers\_of}(q)|}{|\mathit{answers\_of}(q)|}</tex-math>
        </disp-formula>
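        <p>The random baseline (6) can be sketched as an average over question groups. The input layout (a list of judgements per question) is an assumption, as is the choice of passing only the answers that take part in the evaluation; the toy data are hypothetical.</p>
        <preformat><![CDATA[
```python
def random_qa_accuracy(groups):
    """groups: q_id -> list of AVE judgements for the answers of that question."""
    total = 0.0
    for judgements in groups.values():
        if judgements:  # a question without answers contributes 0
            total += judgements.count("VALIDATED") / len(judgements)
    return total / len(groups)


groups = {
    "q1": ["VALIDATED", "REJECTED"],
    "q2": ["REJECTED", "REJECTED", "REJECTED", "REJECTED"],
    "q3": ["VALIDATED"],
}
baseline = random_qa_accuracy(groups)
print(baseline)
```
]]></preformat>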
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>Nine groups (two fewer than in the previous edition) participated in four different languages. Table 3 shows the participant groups and the number of runs they submitted per language. Again, English and Spanish were the most popular, with 8 and 5 runs respectively.</p>
      <p>Tables 4-7 show the results for all participant systems in each language. Results cannot be compared between languages, since the number of answers to be validated and the proportion of correct ones differ for each language (they come from the real submissions of the QA systems). Together with the systems' precision, recall and F-measure, the values of the two baselines are shown: the results of a system that always accepts all answers (validates 100% of the answers), and the results of a hypothetical system that validates 50% of the answers.</p>
      <p>[Table 3. Participant groups: Fernuniversität in Hagen, U. Évora, U. Iasi, DFKI, INAOE, U. Alicante, Text Mess project, U. Jaén, UNED.]</p>
      <p>In our opinion, F-measure is an appropriate measure to identify the systems that perform better, measuring their ability to detect the correct answers and only them. However, we wanted to obtain some evidence about the improvement that more sophisticated AV systems could provide to QA systems. Tables 8-11 show the rankings of systems (merging QA and AV systems) according to the QA accuracy calculated only over the subset of questions considered in AVE 2007. With the exception of Portuguese, where there is only one participant group, there are AV systems for each language able to achieve more than 70% of the perfect selection. In German and English, the best AV systems obtained better results than the QA systems, achieving 93% of the perfect selection in the case of German.</p>
      <p>In general, the groups that participated in both the QA Main Track and AVE obtained better results with the AV system than with the QA one. This can be due to two factors: either they need to extract more and better candidate answers, or they do not use their own AV module to rank them properly in the QA system.</p>
      <p>All the participant groups in AVE 2007 reported the use of an approach based on Textual Entailment. Five of the nine groups (FUH, U. Iasi, INAOE, U. Évora and DFKI) have also participated in the Question Answering Track, showing that techniques developed for Textual Entailment are in the process of being incorporated in the QA systems participating at CLEF.</p>
      <p>Table 12 shows the techniques used by the AVE participant systems. In general, the groups that performed some kind of syntactic or semantic analysis addressed Automatic Hypothesis Generation by combining the question and the answer. However, in some cases the generated hypothesis was directly in a logic form instead of a textual sentence.</p>
      <p>All the participants reported the use of lexical processing. Lemmatization and part-of-speech tagging were commonly used. On the other hand, only a few systems used first-order logic representations, performed semantic analysis and took the validation decision with a theorem prover.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>In this second edition of the Answer Validation Exercise, techniques developed for Recognizing Textual Entailment have been employed widely, although the exercise was defined to be closer to the real answer validation application.</p>
      <p>We have refined the evaluation methodology in order to consider the QA systems performance as a
reference for AV systems evaluation. Thus, new measures have been defined together with their respective
baselines: qa_accuracy and the percentage of the perfect selection (normalized_qa_accuracy).</p>
      <p>With respect to the development of test collections, the new evaluation framework led us to reduce redundancies in the sets of answers. This process reduced the size of the testing collections, discarding around 50% of the candidate answers. The training and testing collections resulting from AVE 2006 and 2007 are available at http://nlp.uned.es/QA/ave for researchers registered at CLEF.</p>
      <p>Results show that AV systems are able to detect correct answers, improving the results of QA systems. In fact, except for Portuguese (where there is only one participant at AVE), all the systems are far from the random behaviour and closer to the perfect selection (from 70% to 93%).</p>
      <p>All systems use lexical processing, most of them introduce a syntactic level, and only a few make use of semantics and logic. Groups that participated in both the QA and AVE tracks show better performance in the selection of answers than the results obtained by the whole QA system. This fact points to the need to consider the evidence given by the AV modules in order to generate more and better candidate answers. In this way, the approach of looping the AV module with the generation of candidate answers should be considered instead of an approach based solely on the ranking of candidate answers.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the Spanish Ministry of Science and Technology within the Text-Mess-INES project (TIN2006-15265-C06-02), the Education Council of the Regional Government of Madrid and the European Social Fund. We are grateful to all the people involved in the organization of the QA track (especially to the coordinators at CELCT, Danilo Giampiccolo and Pamela Forner).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Roy</given-names>
            <surname>Bar-Haim</surname>
          </string-name>
          , Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini and
          <string-name>
            <given-names>Idan</given-names>
            <surname>Szpektor</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The Second PASCAL Recognising Textual Entailment Challenge</article-title>
          .
          <source>In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment</source>
          , Venice, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Ido</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Glickman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The PASCAL Recognising Textual Entailment Challenge</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          , Volume
          <volume>3944</volume>
          , Jan 2006
          , Pages
          <fpage>177</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Giampiccolo</surname>
          </string-name>
          , Bernardo Magnini, Ido Dagan and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <article-title>The Third PASCAL Recognizing Textual Entailment Challenge</article-title>
          .
          <source>ACL-PASCAL Workshop on Textual Entailment and Paraphrasing</source>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Giampiccolo</surname>
          </string-name>
          et al.
          <year>2007</year>
          .
          <article-title>Overview of the CLEF 2007 Multilingual Question Answering Track</article-title>
          .
          <source>Working Notes of CLEF</source>
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Harabagiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hickl</surname>
          </string-name>
          .
          <article-title>Methods for Using Textual Entailment in Open-Domain Question Answering</article-title>
          .
          <source>In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL</source>
          , pages
          <fpage>905</fpage>
          -
          <lpage>912</lpage>
          , Sydney,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          , Danilo Giampiccolo, Pamela Forner, Christelle Ayache, Valentin Jijkoun, Petya Osenova, Anselmo Peñas, Paulo Rocha, Bogdan Sacaleanu, and Richard Sutcliffe,
          <year>2007</year>
          .
          <article-title>Overview of the CLEF 2006 Multilingual Question Answering Track</article-title>
          .
          <source>CLEF 2006, Lecture Notes in Computer Science LNCS 4730</source>
          . Springer-Verlag, Berlin.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Anselmo</given-names>
            <surname>Peñas</surname>
          </string-name>
          , Álvaro Rodrigo, Valentín Sama, Felisa Verdejo,
          <year>2007</year>
          .
          <article-title>Overview of the Answer Validation Exercise 2006</article-title>
          .
          <source>CLEF 2006, Lecture Notes in Computer Science LNCS 4730</source>
          . Springer-Verlag, Berlin.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>