<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The contribution of the University of Alicante to AVE 2007</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name><given-names>Óscar</given-names> <surname>Ferrández</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Daniel</given-names> <surname>Micol</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Rafael</given-names> <surname>Muñoz</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Manuel</given-names> <surname>Palomar</surname></string-name>
        </contrib>
        <aff id="aff1">
          <institution>Natural Language Processing and Information Systems Group, Department of Computing Languages and Systems, University of Alicante</institution>
          ,
          <addr-line>San Vicente del Raspeig, Alicante 03690</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we discuss a system used to recognize entailment relations within the AVE framework. This system creates representations of text snippets by means of a variety of lexical measures and syntactic structures. Once these representations have been created, we compare those corresponding to the text and to the hypothesis and try to determine whether an entailment relation holds between them. The hypotheses are generated by merging the answers with their corresponding questions, applying a set of regular expressions designed for this purpose. In the experiments performed, our system obtained maximum F-measure scores of 0.40 and 0.39 for the development and test English corpora, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>Answer Validation</kwd>
        <kwd>Recognizing Textual Entailment</kwd>
        <kwd>Lexical Similarity</kwd>
        <kwd>Syntactic Trees</kwd>
        <kwd>Algorithms</kwd>
        <kwd>Semantic Similarity</kwd>
        <kwd>Experimentation</kwd>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Answer Validation Exercise (AVE) is a two-year-old track within the Cross-Language
Evaluation Forum (CLEF) 2007. AVE provides an evaluation framework for answer validation in
Question Answering (QA) systems. Automatic answer validation would be useful for
improving the performance of QA systems, helping humans in the assessment of QA systems' output, and
developing better criteria for collaborative QA systems.</p>
      <p>Systems must emulate human assessment of QA responses and decide whether an answer to a question
is correct or not according to a given text. This year, the participants receive a set of triplets
(Question, Answer, Supporting Text) and must return a boolean value for each triplet indicating
whether the answer is supported by the text. The AVE task is therefore closely related to the
recognition of textual entailment, since answer validation can be considered a particular case of such a relation.</p>
      <p>
        With our participation, we want to evaluate our system within the very realistic environment
that AVE provides. In addition, AVE boosts the direct applicability of our system in the field of QA,
which is very appealing. Our system is designed to recognize textual entailment relations. In fact,
we participated in the ACL-PASCAL Third Recognising Textual Entailment (RTE) Challenge
[<xref ref-type="bibr" rid="ref3">3</xref>] this year [<xref ref-type="bibr" rid="ref2">2</xref>]. To apply our system to the AVE competition we had to make some adjustments
that will be explained in detail later.
      </p>
      <p>The remainder of this paper is structured as follows. The following section presents our
approach for our participation in AVE. The third section illustrates the experiments carried out and the
results obtained. Finally, the fourth section draws conclusions and proposes future work based on
our current research.</p>
    </sec>
    <sec id="sec-2">
      <title>The AVE Approach</title>
      <p>The proposed approach attempts to detect whether the text, which can be considered as a passage
returned by a QA system, entails or implies the given answer; if this occurs, the answer is
then justified. To determine whether this relation holds, our approach detects lexical and syntactic
implications between two text snippets: the text (or passage) and the hypothesis, which is
created from the question and the answer. We propose several methods that mainly rely
on lexical and syntactic inferences in order to address the recognition task. The next subsections
summarize the procedure followed to apply our approach to AVE.</p>
      <sec id="sec-2-1">
        <title>Corpora Processing</title>
        <p>The corpora provided by the AVE organizers have the following format:</p>
        <preformat>&lt;q id=1 lang=EN&gt;
  &lt;q_str&gt;Who was Yasser Arafat?&lt;/q_str&gt;
  &lt;a id=1 value=XXX&gt;
    &lt;a_str&gt;Palestine Liberation Organization Chairman&lt;/a_str&gt;
    &lt;t_str doc=XXX&gt;President Clinton appealed ... &lt;/t_str&gt;
  &lt;/a&gt;
  &lt;a id=2 value=XXX&gt; ... &lt;/a&gt;
  ...
&lt;/q&gt;</preformat>
        <p>Each question (tag q) contains a string (q_str), which is the question formulated in natural
language. In addition, q can have one or more answers, and each answer (a_str) is associated with
a text (t_str) that is required to determine whether the answer is entailed by it.</p>
        <p>Since our system is designed to determine implications between two text snippets, the best way
to adapt the AVE corpus to our system is the following: for each question-answer pair,
convert it into an affirmative sentence and detect whether there is an entailment with its associated text.</p>
        <p>Therefore, we generated a set of regular expressions to manage these situations. Table 1 shows
these regular expressions together with the number and percentage of question-answer pairs solved.
The AVE organizers provide a set of patterns intended for this purpose, but we used our own because,
by the time those patterns were published, we had already adapted our system to the aforementioned
regular expressions.</p>
        <p>Our system was applied to the English corpora from AVE. These corpora contain 1121
question-answer pairs in the development corpus and 202 in the test one.</p>
        <p>Finally, to complete the explanation of the corpora processing, we would like to mention what
occurs when a pair does not match any of the generated regular expressions. We propose two
solutions: in the first one, called automatic, the tokens of the answer are linked together with
the tokens of the corresponding question, while in the second one, called semi-automatic, we
manually reviewed these pairs and created the affirmative sentences.</p>
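        <p>To make the conversion concrete, the following minimal Python sketch applies such rewrite rules. The two patterns and their templates are simplified illustrations in the spirit of Table 1, not our exact production rules, and the fallback corresponds to the automatic solution described above.</p>
        <preformat>import re

# Illustrative rewrite rules; the regular expressions and templates
# below are simplified assumptions, not the exact ones in our system.
PATTERNS = [
    # "Who was X?" + answer A  -&gt;  "A was X"
    (re.compile(r"^Who (\S+) (.+?)\??$", re.I), "{answer} {1} {2}"),
    # "What is X?" + answer A  -&gt;  "X is A"
    (re.compile(r"^What (\S+) (.+?)\??$", re.I), "{2} {1} {answer}"),
]

def build_hypothesis(question, answer):
    """Turn a question-answer pair into an affirmative sentence."""
    for regex, template in PATTERNS:
        m = regex.match(question.strip())
        if m:
            # "" pads index 0 so {1}, {2} line up with the regex groups
            return template.format("", *m.groups(), answer=answer)
    # "Automatic" fallback: link the answer tokens to the question tokens
    return question.strip().rstrip("?") + " " + answer

print(build_hypothesis("Who was Yasser Arafat?",
                       "Palestine Liberation Organization Chairman"))
# Palestine Liberation Organization Chairman was Yasser Arafat</preformat>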
        <table-wrap id="table-1">
          <label>Table 1</label>
          <caption><p>Regular expressions and the number (and percentage) of question-answer pairs solved by each one.</p></caption>
          <table>
            <thead>
              <tr><th>Regular expression</th><th>English development corpus</th><th>English test corpus</th></tr>
            </thead>
            <tbody>
              <tr><td>(What) (\S+) (.+)</td><td>350 (31.22%)</td><td>92 (45.54%)</td></tr>
              <tr><td>(Which) (\S+) (\S+) (.+)</td><td>165 (14.72%)</td><td>10 (4.95%)</td></tr>
              <tr><td>(Who) (\S+) (.+)</td><td>179 (15.97%)</td><td>36 (17.82%)</td></tr>
              <tr><td>(Where) (\S+) (.+)</td><td>76 (6.78%)</td><td>8 (3.96%)</td></tr>
              <tr><td>(How many) (\S+) (.+)</td><td>96 (8.56%)</td><td>14 (6.93%)</td></tr>
              <tr><td>(How much) (.+)</td><td>12 (1.07%)</td><td>0 (0.0%)</td></tr>
              <tr><td>Total</td><td>878 (78.32%)</td><td>160 (79.21%)</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-6">
        <title>Entailment Recognition Modules</title>
        <p>The core of our system is composed of two modules, each of which attempts to recognize the textual
entailment relation from a different perspective: the lexical one and the syntactic one. The following
subsections describe both of them briefly. For further information, please refer to [<xref ref-type="bibr" rid="ref1">1</xref>]
and [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      </sec>
      <sec id="sec-2-2">
        <title>Lexical module</title>
        <p>
          This method relies on the computation of a wide variety of lexical measures,
which basically consist of overlap metrics. Some researchers have already used this kind of metrics
[<xref ref-type="bibr" rid="ref7">7</xref>]. However, our approach does not use any semantic knowledge.
        </p>
        <p>Prior to the calculation of the measures, all texts and the hypotheses created by merging the
question-answer pairs by means of regular expressions are tokenized and lemmatized. Later on, a
morphological analysis is performed as well as stemming. Once these steps are completed,
we create several data structures that contain the tokens, stems, lemmas, functional words (nouns,
verbs, adjectives, adverbs and figures such as numbers and dates) and the most relevant ones
(nouns and verbs only) corresponding to the text and the hypothesis. The lexical measures
are applied over these structures, which allows us to determine which of them are more
suitable for recognizing entailment relations.</p>
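        <p>As an illustration of this preprocessing, the following Python sketch builds the mentioned data structures. NLTK is used here merely as a stand-in, since the paper does not name the tools employed; the part-of-speech tag prefixes are an assumption based on the Penn Treebank tag set.</p>
        <preformat>import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# requires: nltk.download("punkt"); nltk.download("wordnet");
#           nltk.download("averaged_perceptron_tagger")
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def preprocess(snippet):
    """Build the token, stem, lemma, functional-word and relevant-word
    structures for one text snippet."""
    tokens = nltk.word_tokenize(snippet)
    tagged = nltk.pos_tag(tokens)
    return {
        "tokens": tokens,
        "stems": [stemmer.stem(t) for t in tokens],
        "lemmas": [lemmatizer.lemmatize(t) for t in tokens],
        # functional words: nouns, verbs, adjectives, adverbs and figures
        "functional": [t for t, pos in tagged
                       if pos[:2] in ("NN", "VB", "JJ", "RB", "CD")],
        # most relevant words: only nouns and verbs
        "relevant": [t for t, pos in tagged
                     if pos[:2] in ("NN", "VB")],
    }</preformat>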
        <p>The following measures are implemented in our system; a sketch of two of them is given after the list.</p>
        <list list-type="bullet">
          <list-item><p>Simple matching: the word overlap between text and hypothesis is initialized to zero. If a
word in the hypothesis also appears in the text, an increment of one unit is added. The final
weight is normalized by dividing it by the length of the hypothesis.</p></list-item>
          <list-item><p>Levenshtein distance: similar to simple matching, but using the mentioned distance as the
similarity measure between words. When the distance is zero, the increment value is one; if
the distance is equal to one, the increment is 0.9; otherwise, it is the inverse of the obtained
distance.</p></list-item>
          <list-item><p>Consecutive subsequence matching: this measure assigns the highest relevance to the
appearance of consecutive subsequences. We generate all possible sets of consecutive
subsequences, from length two up to the length in words, from the text and the hypothesis. The
sets of length two extracted from the hypothesis are compared to the sets of the same length
from the text; if the same element is present in both the text set and the hypothesis set, one
unit is added to the accumulated weight. This procedure is applied to all sets of different
lengths extracted from the hypothesis. Finally, the sum of the weights obtained from each set
of a specific length is normalized by the number of sets corresponding to this length, and the
final accumulated weight is also normalized by the length of the hypothesis in words minus
one. One should note that this measure does not consider non-consecutive subsequences. In
addition, it assigns the same relevance to all consecutive subsequences of the same length,
and the longer the subsequence is, the more relevant it is considered.</p></list-item>
          <list-item><p>Tri-grams: two sets containing letter tri-grams belonging to the text and the hypothesis
are created. Every occurrence in the hypothesis' tri-gram set that also appears in the text's
increases the accumulated weight by one unit. The calculated weight is then normalized by
dividing it by the total number of tri-grams within the hypothesis.</p></list-item>
          <list-item><p>ROUGE measures: ROUGE measures have already been tested for the automatic evaluation
of summaries and machine translation [<xref ref-type="bibr" rid="ref4">4</xref>]. For this reason, and considering the impact of
n-gram overlap metrics in textual entailment, integrating these measures (ROUGE-N with
n=2 and n=3, ROUGE-L, ROUGE-W and ROUGE-S with s=2 and s=3) in our system is
very appealing. We have implemented them as defined in [<xref ref-type="bibr" rid="ref4">4</xref>].</p></list-item>
        </list>
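        <p>As announced above, here is a minimal Python sketch of the first two measures. It assumes already-tokenized, lowercased snippets with non-empty token lists; per-word best-match against the text is one plausible reading of the Levenshtein variant described above.</p>
        <preformat>def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def simple_matching(text_tokens, hyp_tokens):
    """Word overlap, normalized by the hypothesis length."""
    text_set = set(text_tokens)
    hits = sum(tok in text_set for tok in hyp_tokens)
    return hits / len(hyp_tokens) if hyp_tokens else 0.0

def levenshtein_matching(text_tokens, hyp_tokens):
    """Distance 0 scores 1, distance 1 scores 0.9, otherwise the inverse
    of the smallest distance to any text word."""
    total = 0.0
    for h in hyp_tokens:
        d = min(levenshtein(h, t) for t in text_tokens)
        total += 1.0 if d == 0 else 0.9 if d == 1 else 1.0 / d
    return total / len(hyp_tokens) if hyp_tokens else 0.0</preformat>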
        <p>In order to detect entailment relations, several machine learning classifiers were considered,
the Bayesian Network being the best suited to our needs. We have used the Bayesian Network
implementation from Weka [<xref ref-type="bibr" rid="ref9">9</xref>], considering each lexical measure as a feature for the training and test
stages of our system.</p>
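        <p>For illustration, the snippet below shows the shape of this classification step. Scikit-learn's Gaussian naive Bayes stands in for Weka's Bayesian Network, and the feature values are made up; each column is one lexical measure, each row one text-hypothesis pair.</p>
        <preformat>import numpy as np
from sklearn.naive_bayes import GaussianNB

# One row per text-hypothesis pair; one column per lexical measure
# (simple matching, Levenshtein matching, tri-grams, ...). Values invented.
X_train = np.array([[0.80, 0.85, 0.60],
                    [0.20, 0.30, 0.10]])
y_train = np.array([1, 0])  # 1 = entailment holds, 0 = it does not

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict(np.array([[0.70, 0.80, 0.50]])))  # [1]</preformat>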
      </sec>
      <sec id="sec-2-3">
        <title>Syntactic module</title>
        <p>This module aims to provide a good accuracy rate by using a few syntactic submodules that behave
collaboratively: tree construction, filtering and graph node matching.</p>
        <list list-type="bullet">
          <list-item><p>Tree generation: the first submodule constructs the corresponding syntactic dependency trees.
For this purpose, MINIPAR [<xref ref-type="bibr" rid="ref5">5</xref>] output is generated and then parsed for each text and
hypothesis of our corpus. Phrase tokens, along with their grammatical information, are stored
in an in-memory data structure that represents a tree, which is equivalent to the mentioned
syntactic dependency tree.</p></list-item>
          <list-item><p>Tree filtering: once the tree has been constructed, we discard irrelevant data in order to
reduce our system's response time and noise. For this purpose we have generated a database
of relevant grammatical categories (see Table 2) that allows us to remove from the tree all
tokens whose category does not belong to this list. The resulting tree has the same structure
as the original, but does not contain any stop words or irrelevant tokens, such as determiners
or auxiliary verbs. A pruning sketch is given after this list.</p></list-item>
          <list-item><p>Graph node matching: in this stage we perform a graph node matching process, termed
alignment, between the text and the hypothesis (recall that the hypothesis has been created
from the question-answer pair by means of regular expressions; see Section 2.1). This operation
consists in finding pairs of tokens in both trees whose lemmas are identical, no matter whether
they occupy the same position within the tree. Some authors have already designed similar
matching techniques, such as the ones described in [<xref ref-type="bibr" rid="ref8">8</xref>]. However, these include semantic
constraints that we decided not to consider, since we wanted to address the recognition task
from an exclusively syntactic perspective.</p></list-item>
        </list>
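        <p>As mentioned in the tree-filtering step, the sketch below prunes irrelevant nodes. The category names are loose assumptions rather than MINIPAR's actual tag set, and promoting the surviving children of a removed node is one plausible way to keep the remaining structure intact.</p>
        <preformat># Categories kept in the tree, loosely following Table 2 (assumed names).
RELEVANT = {"V", "N", "NUM", "VBE", "A", "ADV", "NN", "HAVE", "BE"}

def filter_tree(node):
    """node: dict with keys 'token', 'category' and 'children'.
    Returns a list of surviving subtrees (a forest): irrelevant nodes
    are dropped and their surviving children promoted one level up."""
    kept_children = []
    for child in node["children"]:
        kept_children.extend(filter_tree(child))
    if node["category"] in RELEVANT:
        return [dict(node, children=kept_children)]
    return kept_children  # drop this node, keep its relevant descendants</preformat>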
        <p>Let τ and λ represent the text's and the hypothesis' syntactic dependency trees, respectively. We
assume we have found a word, namely β, present in both τ and λ. Now let γ be the weight
assigned to β's grammatical category (Table 2), σ the weight of β's grammatical relationship
(Table 3), μ an empirically calculated value that represents the weight difference between
tree levels, and δ<sub>β</sub> the depth of the node that contains the word β in λ. We define the
function φ(β) = γ · σ · μ<sup>−δ<sub>β</sub></sup> as the one that calculates the relevance of a word in our system.
The experiments performed reveal that the optimal value for μ is 1.1.</p>
        <table-wrap id="table-2">
          <label>Table 2</label>
          <caption><p>Relevant grammatical categories and their weights.</p></caption>
          <table>
            <thead>
              <tr><th>Grammatical category</th><th>Weight</th></tr>
            </thead>
            <tbody>
              <tr><td>Verbs, verbs with one argument, verbs with two arguments, verbs taking clause as complement</td><td>1.0</td></tr>
              <tr><td>Nouns, numbers</td><td/></tr>
              <tr><td>Be used as a linking verb</td><td/></tr>
              <tr><td>Adjectives, adverbs, noun-noun modifiers</td><td/></tr>
              <tr><td>Verbs Have and Be</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For a given pair (τ, λ), we define the set ξ as the one that contains all words present in both
trees, that is, ξ = τ ∩ λ. The similarity rate between τ and λ, denoted by ψ, is then
ψ(τ, λ) = Σ<sub>ν∈ξ</sub> φ(ν). One should note that our similarity measure is required to be
independent of the hypothesis length. Thus, we define the normalized similarity rate as
ψ(τ, λ) = Σ<sub>ν∈ξ</sub> φ(ν) / Σ<sub>β∈λ</sub> φ(β). Once the similarity value has been
calculated, it is provided to the user together with the corresponding text-hypothesis pair
identifier. It is the user's responsibility to choose an appropriate threshold representing the
minimum similarity rate to be considered an entailment between text and hypothesis; all values
under this threshold are marked as not entailed. The development corpus will help us to establish
this threshold properly.</p>
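        <p>The following sketch puts φ and the normalized ψ together. The per-node weights are placeholders that would come from Tables 2 and 3; only μ = 1.1 is taken from the text above.</p>
        <preformat>from dataclasses import dataclass

MU = 1.1  # empirically optimal weight difference between tree levels

@dataclass
class Node:
    lemma: str
    gamma: float  # grammatical-category weight (Table 2, assumed)
    sigma: float  # grammatical-relationship weight (Table 3, assumed)
    depth: int    # depth of the node within the hypothesis tree

def phi(n):
    """Relevance of a word: phi(beta) = gamma * sigma * mu**(-delta)."""
    return n.gamma * n.sigma * MU ** (-n.depth)

def normalized_similarity(text_lemmas, hyp_nodes):
    """psi(tau, lambda): relevance of the hypothesis words also found
    in the text, normalized by the total relevance of the hypothesis."""
    matched = sum(phi(n) for n in hyp_nodes if n.lemma in text_lemmas)
    total = sum(phi(n) for n in hyp_nodes)
    return matched / total if total else 0.0</preformat>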
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>In AVE, all pairs must be tagged with one of the following values:</p>
      <list list-type="bullet">
        <list-item><p>VALIDATED indicates that the answer is correct and supported, although it is not the one
selected.</p></list-item>
        <list-item><p>SELECTED indicates that the answer is VALIDATED and is the one chosen as the output
of a hypothetical QA system. One of the VALIDATED answers per question should be
marked as SELECTED.</p></list-item>
        <list-item><p>REJECTED indicates that the answer is incorrect or that there is not enough evidence of its
correctness.</p></list-item>
      </list>
      <p>Since our system returns a numeric value to determine the entailment, we decided to mark as
SELECTED the pair with the highest true entailment score among all pairs belonging to the
same question. If two or more pairs share the highest score, one of them is chosen randomly;
a sketch of this decision step follows.</p>
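      <p>A minimal sketch of this tagging step, assuming a per-pair entailment score and a threshold tuned on the development corpus (both hypothetical here); ties are broken by first occurrence rather than truly at random.</p>
      <preformat>def tag_pairs(pairs, threshold):
    """pairs: iterable of (question_id, answer_id, score).
    Marks pairs scoring at or above the threshold VALIDATED, the rest
    REJECTED, then promotes the best VALIDATED pair of each question
    to SELECTED."""
    tags, best = {}, {}
    for qid, aid, score in pairs:
        validated = score &gt;= threshold
        tags[(qid, aid)] = "VALIDATED" if validated else "REJECTED"
        if validated and (qid not in best or score &gt; best[qid][1]):
            best[qid] = (aid, score)
    for qid, (aid, _) in best.items():
        tags[(qid, aid)] = "SELECTED"
    return tags</preformat>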
      <p>Regarding the framework that the AVE organizers propose to evaluate the systems, apart from
the well-known measures of Precision, Recall and F over the YES pairs (those considered as
VALIDATED or SELECTED), we would like to point out a new measure, called Q-A accuracy.
This measure only considers the accuracy obtained from correct SELECTED values and attempts
to simulate the decision that could be made by a QA system. However, for our system it is quite
difficult to establish one of the VALIDATED values as SELECTED, since the differences between
true entailment scores are usually minimal. This happens because no semantic knowledge is
considered. Therefore, although the system is able to determine lexical and syntactic implications,
in the case of SELECTED values this does not seem to be enough.</p>
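      <p>A sketch of how such a measure can be computed; using the total number of questions as the denominator is our reading of the definition, not a statement of the official metric.</p>
      <preformat>def qa_accuracy(selected, judgements, num_questions):
    """selected: {question_id: answer_id} chosen by the system.
    judgements: {(question_id, answer_id): True/False} human assessments.
    Returns the fraction of questions whose SELECTED answer is correct."""
    correct = sum(1 for qid, aid in selected.items()
                  if judgements.get((qid, aid), False))
    return correct / num_questions</preformat>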
      <p>Table 4 shows the different experiments carried out and the results obtained by our system.
The proposed baseline was generated by setting all pairs as VALIDATED, which was useful to
evaluate the gain of the remaining experiments.</p>
      <sec id="sec-3-1">
        <title>Corpus</title>
        <sec id="sec-3-1-1">
          <title>Development</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>Test</title>
          <p>Run
baseline
lex automatic
lex semi-automatic
syn automatic
syn semi-automatic
baseline
lex semi-automatic
syn semi-automatic</p>
      <p>Two main experiments were carried out for our participation in AVE. The first one applies the
lexical module to detect VALIDATED and SELECTED pairs, whereas the second one only uses
syntactic information (obtained by the syntactic module) to solve the implications. These runs were
named lex and syn, respectively. A simple combination of both modules, for instance deciding the
judgment depending on the accuracy of each one for true and false implications, does not improve
the results. We therefore believe that subsequent work could combine these modules in a
collaborative way rather than by means of such simpler techniques.</p>
      <p>Moreover, each run (lex or syn) was processed with the two types of corpus, automatic and
semi-automatic, created from the original AVE corpora (see section 2.1). Table 4 reveals that,
although the semi-automatic experiments obtain better results, the effort needed to generate this
corpus is not worthwhile in comparison with the accuracy gain obtained.</p>
      <p>The approach that achieved the best results is lex. This is due to the fact that there are some
cases where the hypothesis' construction does not make sense and consequently the syntactic tree
is incorrectly generated. These situations occur when the answer has a grammatical category
inconsistent with the one expected by the question (for instance, when the answer is a quantity or
a date but the question asks for a person's name).</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>This paper presents two independent approaches that consider mainly lexical and syntactic
information. Throughout this paper we expose and analyze a wide variety of lexical measures as well
as syntactic structure comparisons that attempt to recognize the textual implications required for
the AVE task.</p>
      <p>
        The approach that obtained the best results, and thus the optimal one for our participation,
was the lexical one, obtaining F-measure scores of 0.40 and 0.39 for the development and test
corpora, respectively. However, we would like to point out that the results obtained in challenges or
competitions on recognizing entailment relations depend on the idiosyncrasies of the corpora
used. For instance, whereas AVE generates its corpora directly from the output of several QA
systems, the RTE challenge constructs its corpora by means of a review process involving several
annotators and different sources (see the RTE-3 overview [<xref ref-type="bibr" rid="ref3">3</xref>] and our participation in this challenge
[<xref ref-type="bibr" rid="ref2">2</xref>]).
      </p>
      <p>Future work will be related to the development of a semantic module. This module will be
able to construct characterized representations of the text using named entities and role
labeling in order to extract semantic information from the pair of text snippets. In addition,
once the semantic module has been implemented, subsequent work will be to combine all the
modules in an efficient way: each module should perform the recognition individually as well as
support it together with the rest of the modules.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research has been partially funded by the QALL-ME consortium, a 6th Framework
Research Programme of the European Union (EU), contract number FP6-IST-033860, and by the
Spanish Government under CICyT project number TIN2006-1526-C06-01. It has also been
supported by undergraduate research fellowships financed by the Spanish Ministry of Education
and Science, and by project ACOM06/90 financed by the Spanish Generalitat Valenciana.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Óscar</given-names> <surname>Ferrández</surname></string-name>,
          <string-name><given-names>Daniel</given-names> <surname>Micol</surname></string-name>,
          <string-name><given-names>Rafael</given-names> <surname>Muñoz</surname></string-name>, and
          <string-name><given-names>Manuel</given-names> <surname>Palomar</surname></string-name>.
          <article-title>DLSITE-1: Lexical analysis for solving textual entailment recognition</article-title>.
          <source>In Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems</source>,
          Paris, France, <year>June 2007</year>. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Óscar</given-names> <surname>Ferrández</surname></string-name>,
          <string-name><given-names>Daniel</given-names> <surname>Micol</surname></string-name>,
          <string-name><given-names>Rafael</given-names> <surname>Muñoz</surname></string-name>, and
          <string-name><given-names>Manuel</given-names> <surname>Palomar</surname></string-name>.
          <article-title>A perspective-based approach for solving textual entailment recognition</article-title>.
          <source>In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing</source>,
          pages <fpage>66</fpage>–<lpage>71</lpage>, Prague, Czech Republic, <year>June 2007</year>.
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Danilo</given-names> <surname>Giampiccolo</surname></string-name>,
          <string-name><given-names>Bernardo</given-names> <surname>Magnini</surname></string-name>,
          <string-name><given-names>Ido</given-names> <surname>Dagan</surname></string-name>, and
          <string-name><given-names>Bill</given-names> <surname>Dolan</surname></string-name>.
          <article-title>The third PASCAL recognizing textual entailment challenge</article-title>.
          <source>In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing</source>,
          pages <fpage>1</fpage>–<lpage>9</lpage>, Prague, Czech Republic, <year>June 2007</year>.
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Chin-Yew</given-names> <surname>Lin</surname></string-name>.
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>.
          In Stan Szpakowicz and Marie-Francine Moens, editors,
          <source>Text Summarization Branches Out: Proceedings of the Association for Computational Linguistics Workshop</source>,
          pages <fpage>74</fpage>–<lpage>81</lpage>, Barcelona, Spain, <year>July 2004</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Dekang</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Dependency-based Evaluation of MINIPAR</article-title>
          .
          <source>In Workshop on the Evaluation of Parsing Systems</source>
          , Granada, Spain,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Daniel</given-names> <surname>Micol</surname></string-name>,
          <string-name><given-names>Óscar</given-names> <surname>Ferrández</surname></string-name>,
          <string-name><given-names>Rafael</given-names> <surname>Muñoz</surname></string-name>, and
          <string-name><given-names>Manuel</given-names> <surname>Palomar</surname></string-name>.
          <article-title>DLSITE-2: Semantic similarity based on syntactic dependency trees applied to textual entailment</article-title>.
          <source>In Proceedings of the TextGraphs-2 Workshop</source>,
          pages <fpage>73</fpage>–<lpage>80</lpage>, Rochester, New York, United States of America, <year>April 2007</year>.
          North American Chapter of the Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Jeremy</given-names> <surname>Nicholson</surname></string-name>,
          <string-name><given-names>Nicola</given-names> <surname>Stokes</surname></string-name>, and
          <string-name><given-names>Timothy</given-names> <surname>Baldwin</surname></string-name>.
          <article-title>Detecting Entailment Using an Extended Implementation of the Basic Elements Overlap Metrics</article-title>.
          <source>In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment</source>,
          pages <fpage>122</fpage>–<lpage>127</lpage>, Venice, Italy, <year>April 2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Rion</given-names> <surname>Snow</surname></string-name>,
          <string-name><given-names>Lucy</given-names> <surname>Vanderwende</surname></string-name>, and
          <string-name><given-names>Arul</given-names> <surname>Menezes</surname></string-name>.
          <article-title>Effectively using syntax for recognizing false entailment</article-title>.
          <source>In Proceedings of the North American Chapter of the Association for Computational Linguistics</source>,
          pages <fpage>33</fpage>–<lpage>40</lpage>, New York City, New York, United States of America, <year>June 2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Ian H.</given-names> <surname>Witten</surname></string-name> and
          <string-name><given-names>Eibe</given-names> <surname>Frank</surname></string-name>.
          <source>Data Mining: Practical machine learning tools and techniques</source>.
          2nd Edition, Morgan Kaufmann, San Francisco, <year>2005</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>