=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-QACLEF-TellezValeroEt2007
|storemode=property
|title=INAOE at AVE 2007: Experiments in Spanish Answer Validation
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-QACLEF-TellezValeroEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/Tellez-ValeroMV07
}}
==INAOE at AVE 2007: Experiments in Spanish Answer Validation==
Alberto Téllez-Valero, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda
Laboratorio de Tecnologías del Lenguaje
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico.
{albertotellezv, mmontesg, villasen}@inaoep.mx

Abstract

This paper describes the INAOE answer validation system evaluated at the Spanish track of AVE 2007. The system is based on a supervised learning approach that considers two kinds of attributes: on the one hand, attributes indicating the textual entailment between the given support text and the hypothesis constructed from the question and the answer; on the other hand, new features denoting certain answer restrictions imposed by the question's type and format. In order to extract these attributes the system uses several tools, namely a lemmatizer, a POS tagger, a named entity recognizer and a superficial syntactic parser. The experimental results are encouraging; they show that the proposed system achieved an F-measure of 52.91% and outperformed the standard baseline by 15 percentage points.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Answer Validation, Question Answering, Textual Entailment Recognition, Supervised Learning

1 Introduction

Given a question, a candidate answer and a support text, an answer validation system must decide whether to accept or reject the candidate answer. In other words, it must determine whether the specified answer is correct and supported.

Answer validation systems have traditionally been based on the idea of recognizing the textual entailment between the support text and an affirmative sentence (called the hypothesis) created from the combination of the question and the answer. Several approaches have been tried for this recognition, ranging from simple ones that take advantage of lexical overlaps to more complex ones founded on the use of a logic representation [6].

The approach based on lexical overlaps is quite simple, but it has achieved surprisingly competitive results. Representative methods of this approach decide that H (the hypothesis) is entailed by T (the support text) considering only characteristics such as named entity overlaps [8], n-gram overlaps [4], and the size of the longest common subsequence (LCS) [1,5].

Simplicity is both the strength and the weakness of this approach. For instance, [8] can easily recognize the textual entailment between "Lucy visits some friends" (H) and "Lucy goes with some friends" (T). However, it fails when there is a high word overlap between H and T but no entailment relation, for example between H and the text "Lucy has some wonderful friends". The method presented in [4] has the same problem; nevertheless, it is much more restrictive when evaluating the overlap between H and T, and therefore it tends to produce better results. The methods based on LCSs [1,5] are less sensitive to the word overlap rate, but they remain very sensitive to changes in word order (such as the alternation between active and passive voice). For instance, they will fail to recognize the entailment between "Lucy visits some friends" (H) and "Some friends are visited by Lucy" (T).
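To make the overlap and LCS criteria concrete, the following minimal Python sketch (ours, purely for illustration; it is not the implementation of any of the cited systems) computes a word-overlap rate and an LCS length over plain lowercased tokens. It reproduces the two failure modes just discussed: overlap alone cannot separate the entailed from the non-entailed Lucy example, and the order-sensitive LCS shrinks as soon as the support text switches to the passive voice.

    def lcs_len(a, b):
        """Length of the longest common subsequence of two token lists."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a)):
            for j in range(len(b)):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        return dp[-1][-1]

    def overlap_rate(h_tokens, t_tokens):
        """Fraction of hypothesis tokens that also occur in the support text."""
        t_set = set(t_tokens)
        return sum(tok in t_set for tok in h_tokens) / len(h_tokens)

    H  = "lucy visits some friends".split()
    T1 = "lucy goes with some friends".split()       # paraphrase, entailment holds
    T2 = "lucy has some wonderful friends".split()   # high overlap, no entailment
    T3 = "some friends are visited by lucy".split()  # passive voice

    print(overlap_rate(H, T1), overlap_rate(H, T2))  # both 0.75: overlap cannot tell them apart
    print(lcs_len(H, T1), lcs_len(H, T3))            # 3 vs 2: word order hurts the LCS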
Finally, all overlap-based methods have problems dealing with situations where the answer does not satisfy simple type restrictions imposed by the question. For instance, in the example of Table 1 the candidate answer is clearly incorrect, but it will be validated because of the high lexical similarity between H and T.

Table 1. Incorrect answer validation using overlap-based methods

  Question:          What is the world record in the high jump?
  Answer:            Javier Sotomayor
  Support text (T):  The world record in the high jump, obtained by Javier Sotomayor, is 2.45 meters.
  Hypothesis (H):    Javier Sotomayor is the world record in the high jump

The system described in this paper adopts several ideas from recent systems (in particular from [1,4,5]). It is based on a supervised learning approach that uses a combination of all the previously used features (word overlaps and LCSs). In addition, it includes some new characteristics that reduce the problems discussed above. In particular:

• It considers only content words in the computation of the word overlaps and the LCS. This represents a middle point between using all words [4] and using only named entities [8].
• It computes the LCS taking POS tags into consideration. This makes it possible to obtain larger subsequences and also to deal with synonymy.
• It applies a simple syntactic transformation to the generated hypothesis in order to simulate the alternation between active and passive voice. In other words, it generates two hypotheses (H and H´) that combine the given question and answer in different ways. The inclusion of this additional hypothesis improves the LCS computation in many cases.
• It uses some manually constructed lexical patterns to help handle support texts containing an apposition phrase. This idea was taken from the COGEX logic-based system [9]. Its goal is to make explicit the relation between the two elements of the apposition phrase.
• Finally, it includes some new features denoting certain answer restrictions imposed by the question's class. These new features avoid validating answers that do not correspond to the expected semantic type of the answer.

The following sections give some details on the proposed system. In particular, Section 2 describes the main characteristics of our system, whereas Section 3 presents our evaluation results on the AVE 2007 Spanish track. Finally, Section 4 discusses some general conclusions about the performance of the proposed approach.

2 Our Answer Validation System

Figure 1 shows the general architecture of our system. It consists of three main phases: preprocessing, feature extraction and classification. The following subsections describe these processes.

Figure 1. INAOE's system for Spanish answer validation (the question and answer enter the preprocessing phase, which constructs the hypotheses H and H´; the feature extraction phase performs the textual entailment analysis against the support text T and the question-answer analysis; the classification phase outputs the validation decision, ACCEPTED or REJECTED, with a confidence value)

2.1 Preprocessing

The main task of this initial phase is to construct two distinct hypotheses combining the given question and answer. In order to do that, the system first applies a superficial syntactic analysis to the question; this analysis is done with FreeLing [2], an open-source suite of language analyzers. Then, using the obtained syntactic tree, it generates both hypotheses.

The first hypothesis (H) is constructed by replacing the nominal phrase that contains the interrogative particle with the given answer. For instance, given the question "How many inhabitants are there in Longyearbyen?" and the answer "180 millions of inhabitants", this approach generates the hypothesis H = "180 millions of inhabitants are there in Longyearbyen".

The second hypothesis (H´) is obtained by a simple transformation of H. The idea is to detect the main verb phrase of H (that is, the main verb phrase of the question) and then interchange its surrounding nominal phrases. This way, the second hypothesis for our example is H´ = "in Longyearbyen are there 180 millions of inhabitants".
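As an illustration of this preprocessing step, the sketch below builds H and H´ from a question that has already been chunked into labeled phrases. The chunk labels, helper names and flat-list representation are assumptions made only for the example; the actual system operates on the syntactic tree produced by FreeLing.

    def build_hypotheses(chunks, answer):
        """chunks: ordered list of (label, text) pairs for the question.
        Returns (H, H2): H replaces the interrogative noun phrase with the answer;
        H2 additionally swaps the phrases surrounding the main verb phrase."""
        # H: substitute the noun phrase holding the interrogative particle
        h_chunks = [(lab, answer if lab == "NP-wh" else txt) for lab, txt in chunks]

        # H': locate the main verb phrase and exchange its neighboring phrases
        labels = [lab for lab, _ in h_chunks]
        v = labels.index("VP")
        h2_chunks = h_chunks[:]
        if 0 < v < len(h2_chunks) - 1:
            h2_chunks[v - 1], h2_chunks[v + 1] = h2_chunks[v + 1], h2_chunks[v - 1]

        to_text = lambda cs: " ".join(txt for _, txt in cs)
        return to_text(h_chunks), to_text(h2_chunks)

    # "How many inhabitants are there in Longyearbyen?"
    question_chunks = [("NP-wh", "How many inhabitants"),
                       ("VP", "are there"),
                       ("PP", "in Longyearbyen")]
    H, H2 = build_hypotheses(question_chunks, "180 millions of inhabitants")
    print(H)   # 180 millions of inhabitants are there in Longyearbyen
    print(H2)  # in Longyearbyen are there 180 millions of inhabitants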
2.2 Feature Extraction

As mentioned above, our system considers two kinds of attributes: on the one hand, attributes indicating the textual entailment between the support text and the constructed hypotheses; on the other hand, new features that denote certain answer restrictions imposed by the question's type. These attributes are extracted by two different modules of the system, the textual entailment analysis and the question-answer analysis.

Textual Entailment Analysis

The textual entailment analysis of the pairs (T, H) and (T, H´) consists of two stages: (i) compute the term overlap, and (ii) compute the term sequence overlap. In order to avoid a high matching rate caused by functional terms (such as prepositions and determiners), in both cases we only consider the occurrence of content terms (nouns, verbs, adjectives and adverbs). Besides, we use word lemmas, which yields a better term overlap.

The term overlap for the pair (T, H) is computed by simply counting the common content words in the support text (T) and the hypothesis (H); it is not necessary to compute the term overlap between T and H´ since it would be exactly the same. The following features are generated from this analysis:

(1) The rate of noun overlap between (T, H)
(2) The rate of verb overlap between (T, H)
(3) The rate of adjective overlap between (T, H)
(4) The rate of adverb overlap between (T, H)
(5) The rate of date overlap between (T, H)
(6) The rate of number overlap between (T, H)

Similarly to [1,5], we compute the term sequence overlap by extracting the longest common subsequence (LCS). However, unlike these previous approaches, our method considers only the occurrence of content words and allows the inclusion of POS tags inside the sequence. In this case it is necessary to compute the LCS of (T, H) as well as of (T, H´); nevertheless, only the longest of the two subsequences is used. This way we generate the following feature from this analysis:

(7) The size of the LCS of (T, H) or (T, H´) divided by the size of H

It is important to remember that the presence of apposition phrases in the support text is detrimental to the LCS. In order to alleviate this problem we propose using some manually constructed transformation patterns that rewrite an apposition of the form "The X, Y, ..." into "The X (V) Y ...", inserting a verb placeholder (V) that makes explicit the relation between the two elements of the apposition. The example in Table 2 illustrates the application of these patterns as well as the inclusion of POS tags inside the LCS.

Table 2. Application of transformation patterns and POS tags in the LCS computation

  Question:           What is the quinua?
  Answer:             Cereal
  Support text (T):   The quinua, an American cereal of great nutritional value, ...

  Original analysis
  Hypotheses:         Cereal is the quinua (H), The quinua is cereal (H´)
  LCS (of size = 2):  quinua cereal

  Using the transformation patterns
  Support text (T´):  The quinua (V) an American cereal of great nutritional value ...
  Hypotheses:         Cereal is the quinua (H), The quinua is cereal (H´)
  LCS (of size = 3):  quinua (V) cereal
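The sketch below illustrates one way to compute features (1)-(7) from lemmatized and POS-tagged text. It is only an approximation, not the system's implementation: the tag names, the per-category overlap definition, the normalization of feature (7) by the number of content elements of H, and especially the reading of "POS tags inside the sequence" as bare POS placeholders that match any token of that part of speech (a reading suggested by the "(V)" element of Table 2) are our assumptions.

    # Tokens are (lemma, pos) pairs, as if produced by a lemmatizer and POS tagger
    # (FreeLing in the actual system). A bare POS placeholder has lemma None.
    CONTENT = {"NOUN", "VERB", "ADJ", "ADV", "DATE", "NUM"}

    def overlap_rates(t, h):
        """Features (1)-(6): per-category rate of H content lemmas also found in T."""
        rates = {}
        for cat in CONTENT:
            t_cat = {lem for lem, pos in t if pos == cat}
            h_cat = [lem for lem, pos in h if pos == cat]
            rates[cat] = (sum(lem in t_cat for lem in h_cat) / len(h_cat)
                          if h_cat else 0.0)
        return rates

    def match(a, b):
        """Elements match if their lemmas are equal, or if one is a bare POS
        placeholder (lemma None) whose tag equals the other element's tag."""
        (la, pa), (lb, pb) = a, b
        return la == lb or ((la is None or lb is None) and pa == pb)

    def lcs_size(t, h):
        """Length of the LCS between the content elements of two sequences."""
        t = [e for e in t if e[1] in CONTENT]
        h = [e for e in h if e[1] in CONTENT]
        dp = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
        for i, te in enumerate(t):
            for j, he in enumerate(h):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if match(te, he)
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        return dp[-1][-1]

    def feature_7(t, h, h2):
        """Feature (7): the larger LCS of (T, H) and (T, H') divided by the size of H."""
        h_len = len([e for e in h if e[1] in CONTENT])
        return max(lcs_size(t, h), lcs_size(t, h2)) / h_len if h_len else 0.0

    # Table 2 example: the "(V)" inserted by the apposition pattern matches "be" in H'.
    T = [("quinua", "NOUN"), (None, "VERB"), ("american", "ADJ"), ("cereal", "NOUN"),
         ("great", "ADJ"), ("nutritional", "ADJ"), ("value", "NOUN")]
    H = [("cereal", "NOUN"), ("be", "VERB"), ("quinua", "NOUN")]
    H2 = [("quinua", "NOUN"), ("be", "VERB"), ("cereal", "NOUN")]
    print(lcs_size(T, H), lcs_size(T, H2), feature_7(T, H, H2))  # 1 3 1.0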
Question-Answer Analysis

There are two common situations related to the presence of an incorrect answer. The first one is that the semantic class of the extracted answer does not correspond to the expected class of the answer (in accordance with the given question); for instance, having the answer "yesterday" for the question "How many inhabitants are there in Longyearbyen?". The second situation occurs when the question asks about a specific fact and the answer refers to a different one; for instance, answering "eight" to the example question, using as support text "...when eight animals parade by the principal street in Longyearbyen, a town of a thousand of inhabitants".

Our system includes two new features that attempt to capture these situations:

(8) A Boolean value indicating whether a general-class restriction is satisfied, and
(9) A Boolean value indicating whether a specific-type restriction is satisfied

The general-class restriction is TRUE if the semantic class of the extracted answer and the expected class of the answer are equal; otherwise it is set to FALSE. We consider three general classes: quantity, date, and proper noun. The question classification (i.e., the definition of the expected class of the answer) is done with the KNN supervised algorithm with K = 1, trained on all questions from the previous question answering CLEF campaigns.

In order to determine the specific target fact concerned by the question, the following procedure is performed: (i) construct the syntactic tree of the question, and (ii) extract the principal noun of the noun phrase that contains the interrogative particle. Applying this procedure to the example question, the word "inhabitants" is selected as the specific target fact.

Once the specific target fact has been extracted from the question, it is possible to evaluate the specific-type restriction. Its value is set to TRUE if the specific target fact occurs in the support text, in the immediate answer context (one content word to the right or left); in any other case its value is set to FALSE. Therefore, the candidate answer "eight" has its value set to FALSE, since its immediate context ("eight animals") does not contain the noun "inhabitants". On the contrary, the candidate answer "thousand" has its value set to TRUE, since the noun "inhabitants" occurs in its immediate context ("town thousand inhabitants").

It is important to notice that a specific target fact cannot be established for every question (consider, for example, the question "When was Amintore Fanfani born?"). In these cases we consider, by default, that all candidate answers satisfy the specific-type restriction.

2.3 Classification

This final module generates the answer validation decision by means of a supervised learning approach, in particular a support vector machine classifier. This classifier decides whether the answer is accepted or rejected on the basis of the nine previously described features along with the following two additional ones:

(10) The question category (i.e., factoid, definition, or list)
(11) The question interrogative particle (i.e., who, where, when, etc.)

An evaluation of the proposed features during the development phase, using the information gain algorithm, showed that the noun overlap and the LCS size are the most discriminative features. The general ranking of the eleven features in decreasing order is: 1, 7, 6, 11, 10, 8, 2, 5, 9, 4, and 3.
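The paper does not state which SVM implementation was used. The toy example below is only an illustration of how the eleven features could feed a support vector classifier: scikit-learn, the hand-made feature vectors and the simplified encoding of the two categorical features are all assumptions on our part, and the decision margin merely stands in for the confidence value shown in Figure 1.

    from sklearn.svm import SVC

    # Feature vector layout, following features (1)-(11). The two categorical
    # features, question category (10) and interrogative particle (11), would be
    # one-hot encoded in practice; only one indicator column each is shown here.
    #   0-5  overlap rates: nouns, verbs, adjectives, adverbs, dates, numbers
    #   6    normalized LCS size (feature 7)
    #   7-8  general-class and specific-type restriction flags (features 8-9)
    #   9    question category is "factoid" (toy encoding of feature 10)
    #   10   interrogative particle is "how many" (toy encoding of feature 11)
    X = [
        [0.8, 0.5, 0.0, 0.0, 0.0, 1.0, 0.60, 1, 1, 1, 1],  # looks supported
        [0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.10, 0, 0, 1, 1],  # looks unsupported
        [0.9, 1.0, 0.0, 0.0, 0.0, 0.0, 0.75, 1, 1, 0, 0],
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05, 0, 1, 0, 0],
    ]
    y = ["ACCEPTED", "REJECTED", "ACCEPTED", "REJECTED"]

    clf = SVC().fit(X, y)
    print(clf.predict(X[:1]))            # expected: ['ACCEPTED']
    print(clf.decision_function(X[:1]))  # signed margin; its magnitude can serve as a confidence score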
3 Experimental Evaluation

3.1 Training and Test Sets

The training set available for the AVE 2007 Spanish task consists of 1817 answers, of which 15% are validated answers and the remaining 85% are rejected. In order to avoid low recall on the validated answers we assembled a more balanced training set by joining answers from the training sets of AVE 2006 and AVE 2007. This new training set contains 2022 answers, of which 44% are validated and 56% rejected. On the other hand, the evaluation set for the Spanish AVE 2007 contains 564 answers (22.5% validated and 77.5% rejected) corresponding to 170 different questions. Details on these sets are given in [7].

3.2 Results

This section describes the experimental results of our participation in the AVE 2007 Spanish task. This year we submitted two different runs. The first run (RUN 1) used the system exactly as described in the previous section. The second run (RUN 2) used a different learning method; instead of a single support vector machine classifier, it employed an ensemble of this classifier, implemented with the AdaBoostM1 algorithm in Weka [3].

Table 3 shows the evaluation results corresponding to our two submitted runs. It also shows (in the last row) the results for a 100% YES baseline (i.e., an answer validation system that validates all given answers). The results indicate that our methods achieved a very high recall and a middle-level precision, which means that they validate most of the correct answers, but also some incorrect ones. It is important to point out that our best result (RUN 1) outperformed the baseline by 15 percentage points of F-measure, the same margin as the best evaluated system at AVE 2006 [6].

Table 3. General evaluation of the INAOE system

              TP    FP    TN    FN    Precision   Recall    F-measure
  RUN 1      109   176   248    18    38.25%      85.83%    52.91%
  RUN 2       91   131   293    36    40.99%      71.65%    52.15%
  100% YES   127   424     -     -    23.05%      100%      37.46%
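As a quick arithmetic check of Table 3 (the helper below is ours, not part of the AVE evaluation scripts), the reported precision, recall and F-measure follow directly from the TP, FP and FN counts; for the 100% YES baseline FN is zero, since every answer is validated.

    def prf(tp, fp, fn):
        """Precision, recall and balanced F-measure from the confusion counts."""
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return p, r, 2 * p * r / (p + r)

    for name, counts in [("RUN 1", (109, 176, 18)),
                         ("RUN 2", (91, 131, 36)),
                         ("100% YES", (127, 424, 0))]:
        print(name, [round(100 * v, 2) for v in prf(*counts)])
    # RUN 1    [38.25, 85.83, 52.91]
    # RUN 2    [40.99, 71.65, 52.15]
    # 100% YES [23.05, 100.0, 37.46]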
This year the AVE organizers decided to include a new evaluation measure, called qa-accuracy, which aims to evaluate the influence of answer validation systems on the question answering task. In order to compute this measure the answer validation systems must select only one validated answer for each question; the qa-accuracy then expresses the rate of correctly selected answers.

Table 4 presents the qa-accuracy results of our two runs. It also shows the results obtained by an "ideal" answer validation system (i.e., a system that, whenever possible, always selects a correct answer). Here it is necessary to clarify that, because only 101 questions (out of the whole set of 170) have a correct candidate answer, it is impossible to obtain a qa-accuracy of 100%.

The results of Table 4 are not conclusive. However, it is interesting to note that our QA system (which was the best one in 2005 and the second best in 2006 in the Spanish QA task) [10] obtains an accuracy of 35.88% on the same question set, ignoring the evaluation of the correct NIL questions. This fact indicates, in some way, that answer validation is useful and that it could produce interesting improvements over current QA systems.

Table 4. Evaluation results obtained by the qa-accuracy measure

              Selected answers
              Total   Right   Wrong   Inexact   QA-accuracy
  RUN 1       129     76      47      6         44.71%
  RUN 2       107     62      40      5         36.47%
  IDEAL       101     101     -       -         59.41%

In order to make a more detailed evaluation of our system we also measured its precision over the subset of 101 questions that have a correct candidate answer. In this case, RUN 1 validated the correct candidate answer for 75% of the questions, and RUN 2 for 61%. For the rest of the questions (a subset of 69 questions), where no correct candidate answer exists, RUN 1 correctly answered NIL in 49% of the cases, whereas RUN 2 correctly responded NIL in 61% of the questions.

4 Conclusions

This paper described the INAOE answer validation system that was evaluated at the Spanish track of AVE 2007. The system adopts several ideas from recent overlap-based methods; basically, it is based on a supervised learning approach that uses a combination of all previously used features, in particular the word overlaps and the longest common subsequences. However, it includes some new notions that extend and improve these previous methods. In particular: (i) it considers only content words in the computation of the word overlaps and the LCS; (ii) it computes the LCS taking POS tags into consideration; (iii) it applies a syntactic transformation to the generated hypothesis in order to simulate the active and passive voices; (iv) it uses some manually constructed lexical patterns to help handle support texts containing an apposition phrase; and (v) it includes some new features denoting certain answer restrictions imposed by the question's class.

The evaluation results are encouraging; they show that the proposed system achieved an F-measure of 52.91% and that it outperformed the standard baseline by 15 percentage points. Moreover, they indicate that our system is especially precise (75% accuracy) in selecting the correct answer for a question when such an answer exists inside the set of candidate answers.

Acknowledgements. This work was done under partial support of CONACYT (project grant 43990 and scholarship 171610). We would also like to thank the CLEF organizing committee as well as the EFE agency for the resources provided.

References

1. Bosma W., and Callison-Burch C. Paraphrase Substitution for Recognizing Textual Entailment. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2006), Alicante, Spain, September 2006.
2. Carreras X., Chao I., Padró L., and Padró M. FreeLing: An Open-Source Suite of Language Analyzers. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, 2004.
3. Freund Y., and Schapire R. Experiments with a New Boosting Algorithm. In Proceedings of the International Conference on Machine Learning, pages 148-156, Morgan Kaufmann, San Francisco, 1996.
4. Herrera J., Rodrigo A., Peñas A., and Verdejo F. UNED Submission to AVE 2006. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2006), Alicante, Spain, September 2006.
5. Kozareva Z., Vázquez S., and Montoyo A. Adaptation of a Machine-Learning Textual Entailment System to a Multilingual Answer Validation Exercise. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2006), Alicante, Spain, September 2006.
6. Peñas A., Rodrigo A., Sama V., and Verdejo F. Overview of the Answer Validation Exercise 2006. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2006), Alicante, Spain, September 2006.
7. Peñas A., Rodrigo A., Sama V., and Verdejo F. Overview of the Answer Validation Exercise 2007. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2007), Budapest, Hungary, September 2007. In this volume.
8. Rodrigo A., Peñas A., and Verdejo F. The Effect of Entity Recognition in Answer Validation. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2006), Alicante, Spain, September 2006.
9. Tatu M., Iles B., and Moldovan D. Automatic Answer Validation Using COGEX. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2006), Alicante, Spain, September 2006.
10. Téllez A., Juárez A., Hernández G., Delicia C., Villatoro E., Montes M., and Villaseñor L. INAOE's Participation at QA@CLEF 2007. In Working Notes for the Cross Language Evaluation Forum Workshop (CLEF 2007), Budapest, Hungary, September 2007. In this volume.