<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>University of Hagen at CLEF 2008: Answer Validation Exercise</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ingo Glöckner</string-name>
          <email>iglockner@fernuni-hagen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Experimentation, Measurement, Verification</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Information and Communication Systems (IICS), FernUniversität in Hagen</institution>
          ,
          <addr-line>58084 Hagen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <abstract>
        <p>RAVE (Real-time Answer Validation Engine) is a logic-based answer validator/selector designed for application in real-time question answering. RAVE uses the same toolchain for deep linguistic analysis and the same background knowledge as its predecessor (MAVE), which took part in the AVE 2007. However, a full logical answer check as in MAVE was not considered suitable for real-time answer validation since it requires parsing of all answer candidates. Therefore RAVE uses a simplified validation model where the prover only checks if the support passage contains a correct answer at all. This move from logic-based answer validation to logical validation of supporting snippets permits RAVE to avoid any parsing of answers, i.e. the system only needs a parse of the question and pre-computed snippet analyses. In this way very low validation/selection times can be achieved. Machine learning is used for assigning local validation scores using both logic-based and shallow features. The resulting local validation scores are improved by aggregation. One of the key features of RAVE is its innovative aggregation model, which is robust against duplicated information in the support passages. In this model, the effect of aggregation is controlled by the lexical diversity of the support passages for a given answer. If the support passages have no terms in common, then the aggregation has maximal effect and the passages are treated as providing independent evidence. Repetition of a support passage, by contrast, has no effect on the results of aggregation at all. In order to obtain a richer basis for aggregation, an active validation approach was chosen, i.e. the original pool of support passages in the AVE 2008 test set was enhanced by retrieving additional support passages from the CLEF corpora. This technique already proved effective in the AVE 2007. The development of RAVE is not finished yet, but the system already achieved an F-score of 0.39 and a selection rate of 0.61 compared to optimal selection. Judging from last year's runs of MAVE (with a 0.93 selection rate and F-score of 0.72), this may look disappointing. However, the AVE task for German was much more difficult this year, and the F-score gain of RAVE (over the 100% yes baseline) and qa-accuracy gain (compared to random selection) are better than in last year's runs of MAVE. (Funding by the DFG (Deutsche Forschungsgemeinschaft) under contract HE 2847/10-1 (LogAnswer) is gratefully acknowledged.)</p>
      </abstract>
      <kwd-group>
        <kwd>H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Information filtering, Selection process</kwd>
        <kwd>I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods – Predicate Logic, Semantic networks</kwd>
        <kwd>I.2.7 [Artificial Intelligence]: Natural Language Processing</kwd>
        <kwd>General Terms: Experimentation, Measurement, Verification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Answer validation for question answering (QA) systems is often construed as a problem of recognizing
textual entailment (RTE). In this approach, the question (e.g. ‘When was the Eiffel tower constructed?’)
and the answer candidate to be validated (e.g. ‘1889’) are turned into a textual hypothesis (‘The Eiffel
tower was constructed in 1889’) which is then checked against a supporting snippet extracted from the
document collection – a task which can be implemented as a logical entailment test. The starting point
for the current work is MAVE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an answer validator for German which adopts this logic-based
approach. While MAVE achieved excellent results in the AVE 2007, the system was not yet suitable for
answer validation in interactive QA. When used for real-time question answering, the validator must be
able to evaluate hundreds of candidate answers in a few seconds. To achieve this, a departure from the RTE
approach is inevitable. The reason is that there is no time available to construct a logical representation
for the answer candidates, i.e. syntactic-semantic parsing of hundreds of answer candidates (or hundreds
of textual hypotheses constructed from question and answers) at query time is not feasible for real-time
QA. There is sufficient time for parsing the question, however (since only one question must be analyzed
for each query of a user), and there is also sufficient time for parsing all documents and possible support
passages (since the corpus is known in advance, so the linguistic analysis of all documents in the corpus
can be done at indexing time). Based on these observations, a new approach based on logical passage
validation was proposed [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The basic idea is that of limiting the logical validation to a logical verification
of the provided snippets: the system does not form a hypothesis from question and answer, but only tries to
prove the logical representation of the question from the logical representation of the support passage. In
this way the system can determine if the snippet contains a correct answer at all (not necessarily identical
to the answer candidate). If the snippet does not contain a correct answer, then the snippet does not provide
evidence for any answer candidate extracted from it, and the answer-support pair can be rejected on logical
grounds. If the considered snippet does contain a correct answer, however, then additional (extra-logical)
criteria are needed for verifying that a given answer candidate is indeed the correct answer contained in the
passage. By avoiding the parsing of answers, the current validator, RAVE, can meet the requirements on
processing speeds: an exact logical validation for 200 retrieved passages is accomplished in 2.4 seconds on
average [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There is also an anytime variant of the method which allows the system to generate validated
answers with a user-specified maximum latency [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ].
      </p>
      <p>Another important improvement of RAVE compared with its predecessor is the treatment of replicated
information in aggregation. This is perhaps not so important for the Wikipedia corpus (which contains little
redundancy) but it is crucial when working with news corpora (which may contain several duplicates and
variants of the same news item) and when setting up a multi-stream system, since the same answer candidate
together with the same supporting snippets can occur in several QA streams. The problem that a simple
aggregation model faces here is ‘spurious’ aggregated evidence, i.e. repetition of a support passage (or of
minor variants) could result in an unwanted increase of the confidence score. To eliminate this problem,
a replication-tolerant aggregation model was designed for RAVE. In this model, repetition of an
answer-support passage pair does not affect aggregation results at all.2 On the other hand, the effect of aggregation
is maximal when the aggregated supporting snippets have no terms in common.</p>
      <p>2To be precise, the contribution of repeated answer-snippet pairs is the maximum score of any of these items; see Sect. 2.9.</p>
      <p>
        The paper is organized as follows: Sect. 2 explains the system architecture of RAVE. The focus is
placed on the validation core; for details on the actual use of RAVE in QA systems see [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ]. Sect. 3
presents the results of RAVE in the AVE 2008, together with ablation studies which reveal the effect of
various system components. The paper closes with a discussion of the progress made and also sketches
necessary improvements.
      </p>
    </sec>
    <sec id="sec-2">
      <title>System description</title>
      <sec id="sec-2-1">
        <title>Overview</title>
        <p>The architecture of the RAVE system is shown in Fig. 1. The input to the system comprises a question
together with answer candidates for the question and the supporting text snippets. In order to introduce
more redundancy for aggregation, a first step of support pool enhancement is used which retrieves
additional supporting passages for the candidate answers from the document collections. Notice that this step is
not part of the core system since there is usually sufficient redundancy in the QA streams. The question is
subjected to a deep linguistic analysis. In the AVE, parsing the supporting snippets during validation was
also necessary due to the presence of illegal document ids and corrupted snippets in the test set. In regular
operation, however, RAVE never parses any snippets at query time, since the deep parse of the snippet
can always be fetched from the pre-analyzed document collections. The question classification serves to
identify the descriptive core of the question, and it also determines the expected answer type (EAT) and
the question category. Depending on the question classification, the system performs a number of sanity
tests on the answer candidates which eliminate trivial answers, for example. The remaining answers are
validated based on the results of shallow feature extraction (like lexical overlap) and logic-based feature
extraction. A local score is then computed by applying a classifier obtained by machine learning.
Aggregation is used to determine a combined score for each answer which captures the joint evidence of all snippets
supporting the answer (including the retrieved, auxiliary snippets introduced by the support pool
enhancement). After aggregation, these auxiliary support passages are eliminated in the support pool reduction
step since they have served their purpose of refining answer evidence. The final step then involves the
selection of the best answer based on the aggregated evidence for the answer and the justification strength of
the considered validation item in the test set. The remaining answers are classified as validated or rejected
depending on a given threshold.</p>
        <p>In formal terms, the AVE task to be solved by RAVE can be described as follows. The AVE test set
consists of validation items i ∈ I given by the question string qi, the answer candidate ai and the supporting
snippet si. For convenience, let Q = {qi : i ∈ I} denote the set of all questions in the test set and let
Iq = {i ∈ I : qi = q} be the set of all validation items for a given question q ∈ Q.</p>
        <p>
          The goal of the answer validation and selection task is that of assigning a validation decision vi ∈
{REJECTED, SELECTED, VALIDATED} and a confidence value ci ∈ [0, 1] to each i ∈ I such that at
most one answer for each question is selected, i.e. |{i ∈ Iq : vi = SELECTED}| ≤ 1. Moreover answers
can only be validated if an answer has been selected as the best answer, i.e. if {i ∈ Iq : vi = SELECTED} =
∅, then {i ∈ Iq : vi = VALIDATED} = ∅ as well.
        </p>
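        <p>To make these constraints concrete, the following sketch (illustrative Python; the names Decision and check_constraints are invented for this example and are not part of RAVE) tests an assignment of validation decisions against both conditions:</p>
        <preformat>
# Sketch: checking the two AVE decision constraints described above.
from collections import defaultdict
from enum import Enum

class Decision(Enum):
    REJECTED = 0
    SELECTED = 1
    VALIDATED = 2

def check_constraints(items):
    """items: list of (question, decision) pairs, one per validation item."""
    per_question = defaultdict(list)
    for q, v in items:
        per_question[q].append(v)
    for decisions in per_question.values():
        n_sel = sum(1 for v in decisions if v is Decision.SELECTED)
        n_val = sum(1 for v in decisions if v is Decision.VALIDATED)
        if n_sel &gt; 1:                  # at most one SELECTED answer per question
            return False
        if n_sel == 0 and n_val &gt; 0:   # VALIDATED requires a SELECTED answer
            return False
    return True
        </preformat>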
      </sec>
      <sec id="sec-2-2">
        <title>Support Pool Enhancement</title>
        <p>
          As in the previous year, the AVE 2008 test set was constructed in such a way that every answer
candidate occurs only once, i.e. there is virtually no redundancy available in the test set. Typical answer
selection techniques rely heavily on redundancy, though, since the existence of many support passages for
a given answer often indicates that the answer is correct. In a logic-based answer validation and selection
setting, redundancy also helps increase robustness: If several snippets are available which support the same
answer, then there is a better chance that a deep linguistic analysis exists for at least one of the support items
for an answer. On the other hand, if there is only one support item for the answer (i.e. no redundancy) and
if parsing of the support item fails, then the prover cannot be applied for logical validation. Thus, it makes
sense to search for additional supporting snippets for each answer in the document collections, and add
these snippets to the support pool as auxiliary validation items.3 RAVE starts from the pool of
‘regular’ or ‘original’ validation items i ∈ Iq for a question q. Then Aq = {ai : qi = q} is the set of
answer candidates for q in the test set. An existing QA environment – in this case the IRSAW system [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] – is
used to actively search for additional supporting snippets for each of the answer candidates in Aq. These
are filtered from the set of answer/support pairs generated by the IRSAW QA streams for the question. This
process results in a number of new support items for the answers in Aq. In order to improve recall, answers
are clustered into groups of minor variants with the same ‘answer key’ κ(ai) by applying a simplification
function κ on the answers.4 Not only exact matches of answer candidates pass the filter, but also those
which only share the same answer key. This process results in a set of auxiliary validation items i ∈ Iq′
with qi = q and κ(ai) = κ(aj) for some original answer aj ∈ Aq, and a supporting snippet si for ai
found by the IRSAW QA system for the considered question. The original and auxiliary validation items
are joined into the enhanced validation pool Iq∗ = Iq ∪ Iq′ for q that contains all original and auxiliary
support passages for the answers to be validated. Notice that exact duplicates are not included into Iq′, i.e.
each combination of answer key and supporting snippet is added only once.5 Though RAVE does not use
support pool enhancement when applied in actual QA systems, its discernment of ‘regular’ support items
(which may be shown to the user) and ‘auxiliary’ support items (which only serve to improve selection,
but may not be shown to the user) can be very useful in practice. One obvious application is the use of
auxiliary document collections whose licensing conditions do not permit snippets from these collections to
be shown to third parties. When treated as a source of ‘unofficial’ support items that are only internally used
for deciding which answer candidates are the correct ones, such a corpus with rigid licensing conditions
can still prove useful. Another obvious scenario is multilingual QA. In this case, considering snippets in
a language that the user does not understand can still be valuable for improving results by aggregation.
However, presenting such snippets to the user would be pointless.
        </p>
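        <p>The answer-key clustering described above relies on the simplification function κ (see footnote 4). A minimal sketch of such a function, assuming a small invented stopword list in place of RAVE’s actual lexical resources:</p>
        <preformat>
# Sketch of an answer-key function kappa: drop accents, remove
# insignificant words, and remove whitespace.
import unicodedata

STOPWORDS = {"im", "jahr", "der", "die", "das"}  # illustrative stand-in

def kappa(answer):
    tokens = [t for t in answer.lower().split() if t not in STOPWORDS]
    key = unicodedata.normalize("NFD", "".join(tokens))
    return "".join(c for c in key if unicodedata.category(c) != "Mn")

print(kappa("im Jahr 2001") == kappa("2001"))  # True, as in footnote 4
        </preformat>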
      </sec>
      <sec id="sec-2-3">
        <title>Deep Linguistic Analysis</title>
        <p>
          RAVE uses WOCADI [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], a syntactic-semantic parser for German, for computing deep linguistic analyses
of the question and of the documents. As mentioned above, parsing of the snippets at validation time is
normally not needed, since all documents are known in advance and can be pre-analyzed. A full parse is
found for about 60% of the sentences in the CLEF corpora; in this case a semantic representation in the
MultiNet formalism [8] is constructed which forms the basis for logical validation. If a full parse fails, then
WOCADI still produces results of a morpho-lexical analysis (lemmata, possible word senses, numerals,
named entity types, results of compound decomposition) which can be utilized for implementing a fallback
validation method that works for arbitrary sentences.
        </p>
        <p>3Such an ‘active validation approach’ already proved useful for MAVE in the AVE 2007.</p>
        <p>
          4The chosen simplification function [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] drops accents, removes whitespace and eliminates insignificant words from the answers.
For example, κ(im Jahr 2001) = 2001 = κ(2001).
        </p>
        <p>5Thus for all i, j ∈ Iq∗, if κ(ai) = κ(aj) and si = sj, then i = j.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Question Classification</title>
        <p>A rule-based approach is used for question classification, with a total of currently 127 classification rules.
Consider the query Nenne mir drei Beispiele für Vulkaninseln! (‘Name three examples of volcanic
islands!’), for example. The classification rules of RAVE then determine the expected answer type (EAT; in
this case: island-name ), the query category (factual), the desired number of results (three) and the
descriptive core of the query (Vulkaninseln, i.e. volcanic islands). The rules also remove parts of the query which
are not part of the descriptive question core. In this case, Nenne drei Beispiele (‘Name three examples’) will
be removed from the logical representation of the query and also from the set of word senses and numbers
used for the lexical overlap test.
</p>
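        <p>As a rough illustration of this rule-based classification, the following sketch shows how a single rule could derive the query category, the desired number of results and the descriptive core for the example above. The rule format and the NUMBER_WORDS table are invented for this sketch, and the determination of the EAT is omitted; RAVE’s actual rule formalism is not described in this paper.</p>
        <preformat>
# Toy classification rule for imperative list questions like
# 'Nenne mir drei Beispiele für Vulkaninseln!' (EAT determination omitted).
import re

RULE = re.compile(r"^Nenne (mir )?(?P&lt;count&gt;\w+) Beispiele für (?P&lt;core&gt;.+)!$")
NUMBER_WORDS = {"drei": 3}

def classify(question):
    m = RULE.match(question)
    if m is None:
        return None
    return {
        "category": "factual",
        "count": NUMBER_WORDS.get(m.group("count"), 1),
        "core": m.group("core"),  # descriptive core of the query
    }

print(classify("Nenne mir drei Beispiele für Vulkaninseln!"))
# {'category': 'factual', 'count': 3, 'core': 'Vulkaninseln'}
        </preformat>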
      </sec>
      <sec id="sec-2-5">
        <title>Sanity Tests</title>
        <p>A number of sanity tests are applied to the answers and support passages in order to eliminate false
positives. The MAVE system used ‘soft’ sanity checks (which merely reduce the confidence score of an answer
candidate); these checks were applied late (after aggregation) in order to prevent the effect of a failed sanity
check from being evened out by aggregation. RAVE, by contrast, uses strict sanity checks which are applied at
the very moment that an answer or passage enters into the system. In this case, aggregation cannot
compromise the effect of a failed sanity check because validation items which fail one of the tests are discarded
immediately. The RAVE prototype currently uses the following sanity checks:
• A test for trivial answers, which checks if the answer merely repeats content of the question.
Example: Was ist die Arche Noah? (‘What is Noah’s Ark?’) – Die Arche Noah (‘Noah’s Ark’).
• A test for non-informative definitions , which eliminates incomplete answers to definition questions
involving relational nouns or nominalizations. Examples: Wer war Antonio Gaudi? (‘Who was
Antonio Gaudi?’) – Gegenspieler (‘opponent’); Was ist ein Echolot? (‘What is Sonar?’) – zur
Detektion (‘for detection’).
• A test for failure of temporal restrictions, currently implemented as a simple year check on support
passages. Example: Wer war Russlands Verteidigungsminister 1994? (‘Who was Russia’s Minister
of Defence in 1994?’) – rejected snippet: Russlands Verteidigungsminister Gratschow zu Besuch in
Bonn (‘Russia’s Minister of Defence, Gratschow, is visiting Bonn’).
• Two tests for incompatible measurement units, which apply only to MEASURE questions:
– The question requests a certain class of measurement units (e.g. length units) by naming the
measurement dimension (‘What length. . . ’, ‘How long. . . ’), but the answer contains an
incompatible measurement unit. Example: Bei welcher Temperatur schmilzt Eisen? (‘At which
temperature does iron melt?’) – 1,86 m (‘1.86 m’).
– The question requests a specific measurement unit, but the measurement unit in the answer
differs. Example: Mit wie viel Dollar ist der UNESCO-Friedenspreis dotiert? (‘How many
Dollars is the UNESCO Prize for Peace endowed with?’) – 1 Million DM.
</p>
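        <p>A minimal sketch of the first of these tests, assuming a simple normalized-containment check (RAVE’s actual implementation is not specified in this paper):</p>
        <preformat>
# Sketch of the trivial-answer test: does the answer merely repeat
# content of the question?
import string
import unicodedata

def normalize(text):
    """Lowercase, strip accents and punctuation."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def is_trivial_answer(question, answer):
    return normalize(answer) in normalize(question)

print(is_trivial_answer("Was ist die Arche Noah?", "Die Arche Noah"))  # True
        </preformat>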
      </sec>
      <sec id="sec-2-6">
        <title>Shallow Feature Extraction</title>
        <p>RAVE extracts 17 shallow features which only depend on the results of morpho-lexical analysis and on the
question classification. Therefore these shallow features can be computed for arbitrary snippets.</p>
        <p>The following passage features only depend on the question and on the support passage (i.e. they need
not be recomputed for each answer candidate):
• failedMatch Number of lexical concepts and numerals in the question which cannot be matched with
the candidate document.
• matchRatio Relative proportion of lexical concepts and numerals in the question which find a match
in the candidate document.
• failedNames Proper names mentioned in the question, but not in the passage.
• containsBrackets Indicates that the passage contains a pair of parentheses.
• knownEat Indicates that the expected answer type is known, i.e. question classification has
succeeded.
• testableEat Indicates that the expected answer type is fully supported by the current implementation
of the answer type check.
• eatFound Indicates that an occurrence of the expected answer type has been found in the snippet.
• isDefQuestion Indicates that the question is a definition question.
• defLevel Indicates that the snippet contains a defining verb or apposition (defLevel = 2) or a relative
clause (defLevel = 1); else defLevel = 0.</p>
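        <p>As an illustration of the overlap-based features in the list above, the following sketch computes failedMatch, matchRatio and failedNames by plain set matching; the matching via synonyms and nominalizations mentioned below is omitted, and all inputs are assumed to be pre-normalized terms.</p>
        <preformat>
# Sketch of three passage features based on lexical overlap.

def passage_features(question_terms, question_names, passage_terms):
    """question_terms: lexical concepts and numerals of the question core;
    question_names: proper names in the question;
    passage_terms: terms of the candidate passage."""
    unmatched = [t for t in question_terms if t not in passage_terms]
    matched = len(question_terms) - len(unmatched)
    return {
        "failedMatch": len(unmatched),
        "matchRatio": matched / len(question_terms) if question_terms else 0.0,
        "failedNames": sum(1 for n in question_names if n not in passage_terms),
    }

print(passage_features(
    question_terms={"eiffelturm", "erbaut"},
    question_names={"eiffelturm"},
    passage_terms={"der", "eiffelturm", "wurde", "1889", "erbaut"},
))  # {'failedMatch': 0, 'matchRatio': 1.0, 'failedNames': 0}
        </preformat>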
        <p>
          The matching technique used to obtain the values of these shallow features also takes into account
synonyms and nominalizations [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The containsBrackets feature was introduced since the queried
information is often contained in parentheses, e.g. Mount Everest (8,848 m). Thus the presence of brackets in the
passage affects the relevance judgement.
        </p>
        <p>The above answer-type related passage features are independent of any answer candidate. They
describe aspects of the question classification and check for the presence of an expression with the expected
answer type in the passage. The implementation of the answer type check in RAVE is not complete yet.
While all answer types can be extracted from a successful parse of the snippet, the system is currently only
able to extract proper name types (i.e., types of actual named entities) from answers and from snippets with
a failed parse. The feature testableEat indicates if the expected answer type can always be checked (i.e. if
the EAT is a named entity type), or if the answer type check is only possible for a support passage with
known parse.</p>
        <p>When using the validator in actual QA systems, an additional irScore feature is used which contains
the original retrieval score of the support passage assigned by the retrieval system. Incorporating a retrieval
score based on the tf/idf measure has a positive effect (this is known from tests on the training data), but in
the AVE test set, the feature is of course not available.</p>
        <p>In addition to the passage-related features, RAVE also uses several answer-related features which
depend on the answer candidate. These features must be computed for each considered answer/passage
combination.</p>
        <p>• awFailedMatch Number of lexical concepts and numerals in the answer which cannot be matched
with the snippet sentence.
• awMatchRatio Relative proportion of lexical concepts and numerals in the answer which find a match
in the snippet sentence.
• awFailedNames Proper names mentioned in the answer, but not in the snippet sentence.
• synthFailedMatch Number of lexical concepts and numerals in the question or in the answer which
cannot be matched with the snippet sentence.
• synthMatchRatio Relative proportion of lexical concepts and numerals in the question or in the
answer which find a match in the candidate document.
• synthFailedNames Proper names mentioned in the question or in the answer, but not in the snippet
sentence.
• awLength Length of the answer, i.e. the number of characters.</p>
        <p>• awEatMatch Indicates that the actual answer type of the answer matches the expected answer type.</p>
        <p>When using the validator in actual QA systems, there is also a producerScore feature which represents
the relevance score assigned by the answer source that generated the answer candidate. RAVE then uses
separate models for each stream in order to account for the individual characteristics of each QA source. This
customization is not possible in the AVE task since the QA sources and their characteristics are not known.</p>
      </sec>
      <sec id="sec-2-6a">
        <title>Logic-Based Feature Extraction</title>
        <p>In order to obtain a more reliable basis for validation, RAVE also computes a set of logic-based features,
given that parsing of question and snippet has succeeded. The basic idea of the logical passage validation
approach is that proving the question from the supporting snippet can reveal whether the snippet contains
a correct answer at all. RAVE thus attempts to prove the logical representation of the question from the
logical representation of the supporting snippet and from its background knowledge.6 The logical query is
represented as a conjunction of literals. Consider the query Nennen Sie einige einfache Elementarteilchen.
(‘Please name some elementary particles’). The logical query which results from parsing and subsequent
question analysis is: prop(FOCUS, einfach.1.1) ∧ sub(FOCUS, elementarpartikel.1.1). The example
demonstrates how synonyms are used for the normalization of word senses: elementarteilchen.1.1 (elementary
particle) is replaced by the canonical elementarpartikel.1.1. The FOCUS variable represents the queried
information. If a proof of the query succeeds, then the FOCUS variable gets bound to an entity in the
logical representation of the snippet. The WOCADI parser provides word alignment information which can
be used to recover the substring of the snippet which corresponds to the binding of the FOCUS variable.
Due to the availability of alignment information, this logic-based answer extraction can be performed very
quickly. It is used to define several extraction-related features. The basic idea behind these features is
that the ability of the system to verbalize the answer binding might also have something to say about the
relevance of the passage.</p>
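        <p>The following toy sketch illustrates the idea of logical passage validation: the query is a conjunction of literals containing the FOCUS variable, and a binding is sought under which as many literals as possible hold in the snippet representation. RAVE uses a genuine theorem prover together with background knowledge; the brute-force matcher below is only meant to make the mechanism concrete.</p>
        <preformat>
# Toy matcher for a conjunctive query with one variable (FOCUS).
FOCUS = "FOCUS"

def prove(query, facts):
    """Return (number of proven literals, binding) for the best binding."""
    constants = {arg for fact in facts for arg in fact[1:]}
    best = (0, None)
    for c in constants:
        grounded = [tuple(c if a == FOCUS else a for a in lit) for lit in query]
        proven = sum(1 for lit in grounded if lit in facts)
        if proven &gt; best[0]:
            best = (proven, c)
    return best

# Query from the example above:
query = [("prop", FOCUS, "einfach.1.1"), ("sub", FOCUS, "elementarpartikel.1.1")]
# Facts of a fictitious snippet representation:
facts = {("prop", "x1", "einfach.1.1"), ("sub", "x1", "elementarpartikel.1.1")}
print(prove(query, facts))  # (2, 'x1'): complete proof, FOCUS bound to x1
        </preformat>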
        <p>
          RAVE extracts useful information even in the case that a proof of the complete query fails. In this case
the prover returns the longest proven fragment of the query along with the answer substitution determined
from the proof of this fragment. RAVE then extracts the number of proven literals and other features from
the results of this partial proof. The system also supports relaxation (i.e. a failed proof can be restarted after
simplifying the query by skipping a problematic literal), but this mechanism was switched off in the AVE
since it has high computational cost and only a modest effect on validation quality [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
• skippedLitsLb Number of known failed literals. Since relaxation is switched off, this feature is 0 if a
complete proof succeeds and 1 if a complete proof fails.
• skippedLitsUb Number of known failed literals, plus literals with unknown status. Since no
relaxation is used, this feature equals the number of all literals minus the length of the longest proven
fragment of the query.
• litRatioLb Fraction of actually proved literals compared to the total number of query literals, i.e.
1 − skippedLitsUb/allLits.
• litRatioUb Fraction of potentially provable literals vs. all literals, i.e. 1 − skippedLitsLb/allLits.
• boundFocus Indicates that a binding for the queried variable was found.
• extractedFocus Signals that RAVE has managed to extract the substring of the snippet which
corresponds to the computed binding of the FOCUS variable.
• npFocus Indicates that the answer string obtained by the logical answer extraction method is a
nominal phrase (NP).
• focusEatMatch Indicates that the answer type of the answer binding found by the prover matches the
expected answer type.
• focusDefLevel Indicates that the answer binding found by the prover corresponds to an apposition
(result: 2) or to an NP with a relative clause (result: 1), else 0.</p>
        <p>
          6RAVE uses the same sources of background knowledge as the MAVE system described in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>It should be pointed out that all logic-based features depend only on the question and on the supporting
snippet, i.e. one proof per snippet is sufficient to determine these features. No attempt is made at this
time to compare the answer string extracted as a side-effect of logical passage validation with the answer
candidates to be checked. There is obvious room for improvement here, and features which relate the
answer candidates to the results of logical validation/extraction will be added in a later version of RAVE.</p>
      </sec>
      <sec id="sec-2-7">
        <title>ML-Based Local Scoring</title>
        <p>
          A machine learning approach is used in order to assign an evidence score ηi ∈ [0, 1] to each i ∈ Iq∗
which estimates the probability that answer ai is both correct and properly supported by snippet si. A
basic technique suitable for the task was developed in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], using the Weka toolbench [10]. Bagging of
decision trees is used for learning a model which then serves for obtaining probability estimates. Due to
the unbalanced training data (there are typically few positive exemplars and a large number of negative
exemplars), cost-sensitive learning by re-weighting of training examples was applied [10, p. 165]. In order
to emphasize the results of interest (i.e. the correct answers), a weight of β = 0.3 for false positives and
of 1.0 for false negatives was used throughout.7 Class probabilities were obtained from the numbers of
positive and negative exemplars at the leaves of the decision trees in the usual way. However, as a result of
the re-weighting of training examples, these numbers refer to the re-weighted training set and do not reflect
the true class probabilities. This effect is corrected by applying the function ρ(x) = β x/(1 − x + β x)
to the relative frequencies of the YES class at the leaves of the decision trees. In particular, for β = 0.3,
one obtains ρ(0.5) ≈ 0.23 as the appropriate threshold for the validation decision. This basic approach
was used for learning separate models for factual vs. definition questions and for the full set of features vs.
shallow-only features. However, the AVE 2008 development set was considered too small for providing
useful training data. Therefore existing training sets for the IRSAW system [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] were re-used as training sets
for RAVE. These training sets cover a total of 10,447 annotated answer candidates and 27,919 passages
generated in an IRSAW run on the CLEF 2007 questions.8 Notice that all these models were trained
on single-sentence snippets. For the moment, RAVE handles multi-sentence snippets by iterating over
all sentences and choosing the maximum sentence score for ηi. RAVE also remembers the sentence that
produced the maximum score; this best sentence is later needed for aggregation. Moreover there is a special
treatment of full-sentence answers to definition questions. If a full-sentence answer to a definition question
is recognized, and if the answer sentence is part of the snippet, then RAVE judges the correctness of the
long answer by a classifier trained to recognize sentences which contain answers to definition questions. If
the score obtained in this way exceeds the result obtained by the regular method determining the evaluation
score, then the result of the full-sentence classifier is chosen for ηi.
        </p>
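        <p>The probability correction and the resulting decision threshold follow directly from the definitions above; a minimal sketch:</p>
        <preformat>
# With re-weighted training data (weight beta for false positives), leaf
# frequencies refer to the re-weighted set; rho maps them back to
# estimates of the true class probability.
BETA = 0.3

def rho(x, beta=BETA):
    return beta * x / (1.0 - x + beta * x)

print(round(rho(0.5), 4))  # 0.2308: the threshold used for validation
        </preformat>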
      </sec>
      <sec id="sec-2-8">
        <title>Aggregation</title>
        <p>So far, we have local scores ηi that express the direct evidence for an answer ai judging from a particular
supporting snippet si. Intuitively, the plausibility of an answer candidate increases when there are multiple
support passages for the answer. In order to capture the joint evidence γ(a) for an answer judging from all
snippets supporting it, an aggregation technique is used. The aggregation model of RAVE was designed
to be robust against replicated information. That is, multiple copies of the same supporting snippet should
not result in an increase of the aggregated score since such copies do not provide independent evidence.
Assuming that more diversity of the snippets usually means better independence of the sources, strongly
overlapping snippets should also have little effect on the aggregated score, while snippets with no common
terms at all should achieve the strongest gain of the aggregation result. These requirements can be met by
the following aggregation model.</p>
        <p>7The Weka Bagging learner was chosen for learning the classifier, using default settings (averaging over results of 10
decision trees computed by the Weka REPTree decision tree learner). It was wrapped in a Weka CostSensitiveClassifier
configured for reweighting of training examples. The following Weka command line was used to generate the classifiers:
weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[0.0 0.3; 1.0 0.0]" -S 1 -W
weka.classifiers.meta.Bagging - -P 100 -S 1 -I 10 -W weka.classifiers.trees.REPTree
-M 2 -V 0.0010 -N 3 -S 1 -L -1</p>
        <p>8An answer was annotated YES if the answer is correct and supported, and NO otherwise. All answer candidates were produced
by the MIRA answer extractor of IRSAW which appears representative of mainstream QA technology.</p>
        <p>We assume a simplification mapping κ from answers to simplified answer keys (see Sect. 2.2; in
the simplest case, κ can also be the identity). The set of answer keys for a given question is Kq =
{κ(ai) : i ∈ Iq}. For k ∈ Kq, let Iq,k = {i ∈ Iq : κ(ai) = k}, i.e. Iq,k is the set of all support items for
the considered answer key. For each i ∈ Iq, let Ωi be the set of term occurrences9 in the supporting snippet
si. For a term occurrence ω ∈ Ωi, let t(ω) be the corresponding term. Further let Ti = {t(ω) : ω ∈ Ωi} be
the set of all terms in the support passage and Tk = ∪ {Ti : i ∈ Iq,k} the set of all terms which occur in a
passage for answer key k ∈ Kq. We abbreviate
μ(k, t) = min {(1 − ηi)^ν(i,t) : i ∈ Iq,k, t ∈ Ti} , (1)
where ν(i, t) = occ(t, i)/|Ωi| and occ(t, i) = |{ω ∈ Ωi : t(ω) = t}| is the occurrence count of term t in snippet i, and ηi is the
correctness probability of the passage estimated by the ML classifier. The aggregated support for an answer
key k ∈ Kq is then given by
γ(k) = 1 − ∏t∈Tk μ(k, t) . (2)</p>
        <p>We extend this to extracted answers by stipulating that γ(a) = γ(κ(a)).</p>
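        <p>A minimal sketch of this aggregation model, implementing (1) and (2) for a single answer key; the term extraction itself is assumed given, and each support item contributes the term occurrences of its best sentence (see below):</p>
        <preformat>
# Replication-tolerant aggregation over the support items of one answer key.
from collections import Counter

def aggregate(items):
    """items: list of (eta, term_occurrences) pairs."""
    mu = {}  # term: current minimum of (1 - eta) ** nu
    for eta, occurrences in items:
        counts = Counter(occurrences)
        total = sum(counts.values())
        for term, occ in counts.items():
            nu = occ / total               # nu(i, t) = occ(t, i) / |Omega_i|
            mu[term] = min(mu.get(term, 1.0), (1.0 - eta) ** nu)
    product = 1.0
    for factor in mu.values():
        product *= factor
    return 1.0 - product                   # gamma(k)

# Disjoint snippets act as independent evidence:
print(aggregate([(0.5, ["a", "b"]), (0.5, ["c", "d"])]))  # ~0.75
# A duplicated snippet adds nothing:
print(aggregate([(0.5, ["a", "b"]), (0.5, ["a", "b"])]))  # ~0.5
        </preformat>
        <p>The two example calls reproduce the boundary behaviour stated below: snippets without common terms aggregate like independent evidence, while repeating a snippet leaves the score unchanged.</p>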
        <p>In general, the effect of aggregation will be strongest when two aggregated passages have no terms in
common (i.e. the passages represent independent evidence), and there will be no effect of aggregation at
all when the same passage is encountered repeatedly. The following properties of the method are obvious:
a. Suppose that i, i′ ∈ Iq∗ with κ(ai) = κ(ai′), si = si′ (i.e. the two snippets are identical and support
equivalent answers). Further suppose without loss of generality that ηi ≥ ηi′ (otherwise i and i′ can
be swapped).10 Let γ(ai) be the aggregation result obtained from Iq∗ and γ′(ai) be the aggregation
result obtained from Iq∗ \ {i′}. Then γ′(ai) = γ(ai), i.e. the aggregation result is not affected by the
duplicate support item i′. Moreover, when there are repeated snippets, then the maximum score of
these support items will enter into the aggregation.
b. It generally holds that γmax(k) ≤ γ(k) ≤ γindep(k), where
γmax(k) = max {ηi : i ∈ Iq,k} (3)
γindep(k) = 1 − ∏i∈Iq,k (1 − ηi) . (4)
c. The equality γ(k) = γindep(k) holds exactly if either ηi = 1 for some i ∈ Iq,k, or if Ti ∩ Tj = ∅ for
all i, j ∈ Iq,k with ηi &gt; 0 and ηj &gt; 0. In particular, when there is no overlap between the snippets
at all, then the evidence provided by the snippets is treated as independent.</p>
        <p>The actual implementation of the method in RAVE is slightly more complicated because RAVE uses word
senses and numerals as the terms (rather than word forms or word stems). The point is that the occurrence of
a word only corresponds to a unique word sense if the WOCADI parser can determine a complete parse of
the snippet. When a parse fails, however, then the possible word senses for a word cannot be disambiguated.
The original approach is therefore changed in such a way that multiple word-sense alternatives for a given
word occurrence can be handled as well. A word occurrence ω now corresponds to a finite set of terms
T (ω) = {t1, . . . , tr} with r &gt; 0, as compared to the single term t(ω) in the original approach. If T (ω)
contains more than one alternative, then all elements in T (ω) are considered as occurring with weight
1/r. Now abbreviating Ti = ∪ {T (ω) : ω ∈ Ωi}, we adapt the definition of the term occurrence count for
t ∈ Ti as follows,
occ(t, i) = Σ{ω∈Ωi : t∈T (ω)} 1/|T (ω)| . (5)</p>
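        <p>A small sketch of this adapted occurrence count (5), where each ambiguous word occurrence distributes weight 1/r over its r sense alternatives:</p>
        <preformat>
# Weighted occurrence count under word-sense ambiguity.

def occ(term, occurrences):
    """occurrences: one set of sense alternatives T(omega) per word."""
    return sum(1.0 / len(alts) for alts in occurrences if term in alts)

# One disambiguated and one two-way ambiguous occurrence of 'bank.1.1':
print(occ("bank.1.1", [{"bank.1.1"}, {"bank.1.1", "bank.1.2"}]))  # 1.5
        </preformat>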
        <p>9A term occurrence is a pair (t, i) where t is a term and i is the position of the occurrence in the passage. In RAVE, word senses
and numerals (rather than words or word stems) are used as the terms.</p>
        <p>10If different QA streams with different ML-based models are used, the same answer/snippet pair can be assigned different scores
ηi, ηi′ since the streams may differ in reliability.</p>
        <p>Another modification of the basic method is concerned with the considered segment of the snippet. Rather
than extracting word senses and numerals from the whole snippet, RAVE only considers a single, ‘best’
sentence, viz the sentence which attained the best score when determining ηi. This modification is
motivated as follows: if the supporting snippets are longer than necessary, then the excess parts of the snippets
can wrongly suggest diversity (because the text surrounding the relevant information is different) though
the relevant sentence is in fact only duplicated. In order to avoid such negative effects of long snippets that
also contain irrelevant sentences, a single most relevant sentence is picked from each snippet.</p>
        <p>Notice that the actual implementation does not use equations (1) and (2) directly, but rather switches to
logarithms in order to improve numerical stability. Thus, the following equations are used:
μ′(k, t) = min {ν(i, t) · ln(1 − ηi) : i ∈ Iq,k, t ∈ Ti} , (6)
γ(k) = 1 − exp( Σt∈Tk μ′(k, t) ) . (7)
If ηi = 1 for some i ∈ Iq,k, we stipulate that γ(k) = 1, as before.</p>
        <p>After aggregation, the auxiliary supporting snippets are no longer needed. They are dropped to prevent
them from showing up in the final output of RAVE. This is achieved by reducing the support pool from Iq∗
to the original cases in Iq again.</p>
        <p>The global answer score γ(a) obtained by aggregation abstracts from individual support items. However,
the validation/selection decisions apply to answer/snippet pairs (ai, si). Therefore the final selection score
σi should not only depend on the aggregated score γ(ai) but also take the justification strength ηi of the
given snippet si into account. The submitted runs use the following formula for computing the σi score:
σi = ηi γ(ai) / max {ηj : j ∈ Iq∗, κ(ai) = κ(aj)} . (8)
This means that the ‘best’ support item in Iq∗ which supports a given answer a, i.e. i∗ ∈ Iq∗ with ai∗ = a and
ηi∗ = max {ηj : j ∈ Iq∗, κ(ai) = κ(aj)}, is boosted to the aggregated score σi∗ = γ(ai∗). The formula
shows some undesirable effects, though. For example, if the best support item for the considered answer
a is an auxiliary support item i∗ ∈ Iq′ with a direct score ηi∗ = 1, then the aggregation result γ(a) has no
effect at all, i.e. σi = ηi for all i ∈ Iq∗ with κ(ai) = κ(a). In order to avoid this effect, the current version
of RAVE (after the AVE) now uses the following improved formula:
σinew = ηi γ(ai) / max {ηj : j ∈ Iq, κ(ai) = κ(aj)} , (9)
where the index j ranges only over support items in the original set Iq. The modified formula will
boost the best original support item for a given answer, i.e. i∗ ∈ Iq with ai∗ = a and ηi∗ =
max {ηj : j ∈ Iq, κ(ai) = κ(aj)}, yielding the aggregated value σi∗ = γ(a).</p>
        <p>
          Instead of (9), it is also possible to use a simple weighted average of the aggregated and the direct score, viz
σi(λ) = λ γ(ai) + (1 − λ) ηi (10)
for some λ ∈ [0, 1].
        </p>
        <p>
          Based on the assignment of final selection scores σi, the system determines a choice of iopt ∈ Iq which
maximizes σi, i.e. σiopt = max {σi : i ∈ Iq}. The chosen iopt is marked as viopt = SELECTED if
σiopt ≥ θsel, where θsel ∈ [0, 1] is the selection threshold; otherwise iopt is marked as viopt = REJECTED.
In particular, θsel = 0 can be used in order to force selection (i.e. the best validation item for a question
will always be selected); this is usually a good choice for maximizing the number of correct selections. In
experiments aiming at a good F-score, a threshold of θsel = 0.23 ≈ ρ(0.5) was used.
        </p>
        <p>
          The non-best items i ∈ Iq \ {iopt} are classified as follows: if viopt = REJECTED, i.e. no selection
has been made, then vi = REJECTED for all i ∈ Iq \ {iopt} as well. On the other hand, if a selection has
been made, i.e. viopt = SELECTED, then we set vi = VALIDATED if σi ≥ θval and vi = REJECTED
otherwise, where θval ∈ [0, 1] is the decision threshold for validating the non-best items. In the experiments,
θval = 0.23 ≈ ρ(0.5) was used throughout.
        </p>
        <p>The confidence in this validation decision is given by ci = σi in the positive case that vi = SELECTED
or vi = VALIDATED, and by ci = 1 − σi if vi = REJECTED.</p>
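        <p>The selection and validation procedure of this subsection can be summarized in a short sketch (illustrative Python; decide is an invented name, and scores holds the final σ values of one question):</p>
        <preformat>
# Final decision step: the best item is SELECTED if it reaches theta_sel;
# the remaining items are VALIDATED or REJECTED via theta_val.

def decide(scores, theta_sel=0.23, theta_val=0.23):
    best = max(scores, key=scores.get)
    if scores[best] &gt;= theta_sel:
        decisions = {i: ("SELECTED" if i == best else
                         "VALIDATED" if scores[i] &gt;= theta_val else "REJECTED")
                     for i in scores}
    else:
        decisions = {i: "REJECTED" for i in scores}
    # confidence: sigma for positive decisions, 1 - sigma for rejections
    confidence = {i: scores[i] if decisions[i] != "REJECTED" else 1 - scores[i]
                  for i in scores}
    return decisions, confidence

print(decide({"i1": 0.8, "i2": 0.3, "i3": 0.1}))
# ({'i1': 'SELECTED', 'i2': 'VALIDATED', 'i3': 'REJECTED'},
#  {'i1': 0.8, 'i2': 0.3, 'i3': 0.9})
        </preformat>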
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <sec id="sec-3-1">
        <title>Preliminaries</title>
        <p>The AVE 2008 test set for German contains 1027 validation items for 119 questions. 111 of these
answer/support items are classified as correct, 854 as wrong and 66 as undecided. A 100% YES baseline (all
items accepted) thus achieves a precision of 0.12 and F-score of 0.21. A correct answer exists for 62 of
the questions, i.e. the qa-accuracy (correct selections divided by number of questions) is bounded by 0.52.
Random selection of an answer yields a qa-accuracy of 0.11 and selection rate (compared to optimal
selection) of 0.21. These numbers are very different from the situation in 2007 where the test set for German
had a precision of 0.25 and F-score of 0.40 for the 100% YES baseline, and a qa-accuracy of 0.28 and
selection rate as high as 0.52 for random selection.</p>
        <p>The support pool enhancement by mining for additional supporting snippets in the QA@CLEF corpora
for German resulted in 579 auxiliary answer-snippet pairs and an enhanced pool with 1,606 items total.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results of AVE 2008 Runs</title>
        <p>The results of RAVE in the AVE 2008 are shown in Table 1. The following column labels are used: f-score
(F-score, i.e. harmonic mean of precision and recall), f-gain (actual F-score divided by F-score for the
100% YES baseline), prec (precision for the YES class), p-gain (actual precision divided by precision of
the 100% YES baseline), recall for the YES class, qa-acc (qa-accuracy, i.e. correct selections divided by
number of questions), qa-perf (estimated qa-performance as defined in the AVE 2008 task description),
sel-rate (selection rate, i.e. successful selections divided by optimal selections), s-gain (selection gain,
i.e. actual selection rate divided by the selection rate for random selection). Both submitted runs used
equation (8) for combining the local snippet score and the aggregated score. Run1 also included the special
treatment of full-sentence answers to definition questions. Since it was not clear from the QA@CLEF
2008 guidelines whether such answers would be accepted or not, the second run was configured such that all
full-sentence answers were rejected. However, the low scores for Run2 demonstrate that defining sentences
were generally accepted by the annotators. Therefore only Run1, whose treatment of defining sentences
agrees with the official annotation, will be considered further in the following ablation experiments.</p>
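        <p>For reference, the baseline figures quoted in Sect. 3.1 can be reproduced from these definitions; a small sketch:</p>
        <preformat>
# F-score of the 100% YES baseline (precision 0.12, recall 1.0) and the
# f-gain of an F-score of 0.39 relative to that baseline.

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

baseline = f_score(0.12, 1.0)
print(round(baseline, 2))         # 0.21, as reported for the baseline
print(round(0.39 / baseline, 2))  # f-gain of Run1
        </preformat>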
      </sec>
      <sec id="sec-3-3">
        <title>Ablation Studies and Results for System Variants</title>
        <p>The following ablation experiments use a bug-fixed version of RAVE.11 In the run labels, ‘R’ refers to the revised method with the new σ
scores, see equation (9), ‘F’ means the use of θsel = 0.23 for F-score oriented runs, and ‘Q’ signals the use
of θsel = 0 for runs aiming at qa-accuracy. The RQ method corresponds to that used for Run1.</p>
        <p>11The following bugs have been fixed: An error in the test for non-informative answers to definition questions resulted in many
correct answers being dropped. The lexical overlap test takes compound decompositions into account, but the implementation did not
work. Synonym normalization of word senses was switched on for all questions, but should not be used for definition questions. The
implementation of the important awEatMatch feature had to be corrected. Finally, equation (8) was replaced by the improved (9).</p>
        <p>We will first consider effects related to aggregation. Table 3 lists the results obtained when using
equation (10) instead of (9) for a weighted average of the local score and the aggregated score with different
values λ ∈ {0, 0.25, 0.5, 0.75, 1} for the weight of the aggregated score. (This is symbolized by the letter
‘W’ and the value of λ shown as the suffix of the runs.) Choosing λ = 0 in the WF0 and WQ0 runs means
that no aggregation is used at all. Comparing WF0 with RF, and WQ0 with RQ, we find a strong effect of
aggregation (plus 8 percent points of F-score and a similar increase in the selection rate). The best F-score
of 0.47 is reached for λ = 0.75 and the best selection rate (of 0.65) for λ = 1. In this case, the selection
and validation decision depends only on the aggregated evidence for the answer ai, and the evidence ηi
from the considered snippet si itself has zero weight. Surprisingly, this method still has relatively good
F-score. The selection results for RF/WF1 and RQ/WQ1 are identical.</p>
        <p>Table 4 shows the results obtained when replacing the replication-tolerant aggregation scheme of Sect.
2.9 with either ‘best evidence’ aggregation score γmax given by (3), letter ‘B’, or with the independent
evidence aggregation score γindep given by (4), symbolized by ‘I’. It turns out that γmax generally performs
worse than the standard method of RAVE. The results of γindep are closer to those of the standard method
of RAVE. Since γindep is not stable against duplication of snippets, these results profit from the fact that
Iq∗ does not contain exact duplicates (but there still can be overlapping snippets).</p>
        <p>Table 5 shows the results of RAVE without active validation, i.e. when working with the original
answer-support items in Iq only. Since RAVE aggregates results for answer keys rather than exact
answer strings, it can exploit some minimal redundancy still present in the AVE08 test set. Therefore results
are better than in the WF0 and WQ0 runs that use no aggregation at all, but considerably worse than those
for support pool enhancement; see F-scores for RF (0.45), PRF (0.40) and WF0 (0.37).</p>
        <p>
          Table 6 shows the results of RAVE when run with the prover switched off. There is a drastic loss in
selection rate (e.g. a decrease by 15 percent points for SRQ compared to RQ), but surprisingly, a consistent
improvement in the F-score. The SWF1 run even achieved the best F-score of all system variants of
RAVE that were tested. These findings run counter to experience from many experiments on annotated
CLEF07 data all of which show a better F-score when adding logic-based features [
          <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
          ]. A possible
explanation is that eight of the eleven runs submitted for German were produced by QA systems which use
variants of RAVE as the validator – and repeating the same, logic-based validation criterion already used
when generating the runs is hardly effective in detecting the remaining false positives. The shallow-only
approach, by contrast, is implemented by a different classifier trained on a larger data set.12 It is probably
only this independence benefit which is reflected in the improved F-scores of the shallow technique.
        </p>
        <p>
          Finally, Table 7 shows the results of the MAVE system [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on the AVE 2008 test set, using clustering
of answers (‘C’) and optimizing either F-score or qa-accuracy (‘F’ vs. ‘Q’). Despite its use of a full
logical answer-validation (i.e., hypothesis-snippet proofs), MAVE does not outperform RAVE with its simple
logical passage test based on question-snippet proofs. The results may look disappointing compared to the
0.72 F-score and 0.93 selection rate of MAVE in the AVE 2007. However, MAVE’s F-score gain of 0.79 and
selection gain of 1.8 in the AVE 2007 are considerably lower than the current results in the AVE 2008.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Suitability for Real-Time Validation</title>
        <p>Since RAVE is designed for real-time answer validation in interactive QA systems, the actual processing
time needed for validation and aggregation is also of interest. For the standard method (RQ run), validation
and aggregation/selection took an average 126 ms per question (or 9.35 ms per validation item).13 When
switching off the support pool enhancement (PRQ run), validation time drops to an average 76 ms per
question (or 8.8 ms per validation item). For comparison, the validation time for shallow-only validation
without using the prover is 77 ms per question (or 5.7 ms per validation item) in the SRQ run, and 50 ms
per question (or 5.8 ms per validation item) without support pool enhancement. The extra effort required
for applying the prover to a validation item suitable for logical processing was an average 5.4 ms.</p>
        <p>12The data set used for training the shallow-only classifier also includes all answer candidates supported by non-parseable snippets.</p>
        <p>13These validation times do not include the time needed for parsing, since parsing reduces to the analysis of the question in normal
operation of RAVE. For the same reason, the time needed for support pool enhancement was also excluded. All processing times
were measured by running RAVE in a single thread on an Athlon64X2 4800+ CPU with a 2.4 GHz clock rate.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The main objective for the current work was developing a logic-based answer validator suitable for use in
real-time QA. The design of RAVE results from the premise that optimizing the prover is not sufficient to
achieve this goal. Therefore the RTE paradigm for answer validation was replaced by a simplified model
which only uses logic for passage validation. The actual processing times of RAVE on the AVE 2008 data
confirm that the method can be applied in real-time QA even for large numbers of answer candidates.</p>
      <p>Another key feature of RAVE is the replication-tolerant aggregation model. Redundancy is usually
helpful for answer validation. However, redundancy only means stronger evidence when the sources are
sufficiently independent. In particular, multiple copies or variants of the same document (e.g. news report
or press release) should not provide more weight than a single occurrence of the original document. The
aggregation model of RAVE leverages redundancy without being misled by replicated content. Experiments
confirm the superiority of the model over several alternative approaches.</p>
      <p>An interesting aspect of RAVE is the discernment of regular and auxiliary support passages; the latter
are only used for aggregation and never actually shown to the user. For example, snippets from corpora
with restrictive licensing can be valuable sources of information but may not be presented to the user for
legal reasons. The same holds for supporting passages in a language unknown to the user; in this case,
presenting the snippet in the final result would be pointless.</p>
      <p>RAVE achieved an F-score of 0.39 and qa-accuracy of 0.32 in the AVE 2008. These scores are lower
than those of MAVE in the last year. However, the test sets for German had very different characteristics in
these two years, and the F-score and qa-accuracy are not commensurable for different test sets. The F-score
gain compared to the 100% YES baseline [9], and the selection gain compared to random selection, are
better suited for a comparison across test sets since they relate the validation and selection results to the
obvious baselines. From this perspective, the RAVE results for 2008 even look better than those of MAVE
for 2007. Moreover RAVE outperforms MAVE on the 2008 test set. It must be admitted, though, that
the best individual QA system in the test set performed better than RAVE: It achieved a qa-accuracy of
0.38 (or selection rate of 0.73), compared to the qa-accuracy of 0.34 and selection rate of 0.65 of RAVE in
the RQ run. This clearly indicates that improvements of the validator are necessary. Generally speaking,
RAVE has good features for recognizing passages with a correct answer but (currently) poor means for
verifying that a given candidate equals this correct answer. Therefore the most important change to RAVE
will be the addition of features which judge the compatibility of the answer candidate with the result of the
question-passage proof. The answer-type check must also be improved since the current implementation is
very limited.</p>
      <p>On the other hand, the AVE imposed some adverse conditions on the validator so that RAVE will likely
perform better in actual applications. Firstly, the features used by RAVE normally include the retrieval
score of the passage retrieval system and a producer score assigned by the QA source; these features are
not available in the AVE test set. Cross-validation on the training set shows that support for these features
might improve the F-score by up to 8 percent points. Secondly, the AVE development set for German was too
small for training the classifiers. Therefore an existing training set of sufficient size had to be chosen whose
characteristics are not necessarily similar to those of the AVE 2008 test set. In practice, RAVE is normally
adjusted to each QA source by learning a separate model for each QA system. Such customization was not
possible in the AVE since the QA sources and their characteristics were not disclosed.</p>
      <p>
        RAVE is already part of two question answering systems. It serves as the answer validator in a
multi-stream system (IRSAW) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where it provides incremental, any-time answer validation capability.
Moreover RAVE is the core of the evolving LogAnswer system [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ], which uses the question-passage proofs
of RAVE for simultaneously extracting answer bindings and validating the corresponding answers. It is
too early at this time to decide which of the two approaches is superior to the other. However, the existing
integration of RAVE in both kinds of systems will make it possible to address this issue in the near future.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Ulrich</given-names> <surname>Furbach</surname></string-name>
          , Ingo Glöckner, Hermann Helbig, and Björn Pelzer.
          <article-title>The LogAnswer project at QA@CLEF 2008</article-title>
          .
          <source>In Working Notes for the CLEF 2008 Workshop</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Ingo</given-names> <surname>Glöckner</surname></string-name>
          .
          <article-title>University of Hagen at QA@CLEF 2007: Answer validation exercise</article-title>
          .
          <source>In Working Notes for the CLEF 2007 Workshop</source>
          , Budapest,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Ingo</given-names> <surname>Glöckner</surname></string-name>
          .
          <article-title>Towards logic-based question answering under time constraints</article-title>
          .
          <source>In Proc. of the 2008 IAENG Int. Conf. on Artificial Intelligence and Applications (ICAIA-08)</source>
          , Hong Kong,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Ingo</given-names> <surname>Glöckner</surname></string-name>
          , Sven Hartrumpf, and Johannes Leveling.
          <article-title>Logical validation, answer merging and witness selection: A study in multi-stream question answering</article-title>
          .
          <source>In Proc. RIAO-07</source>
          , Pittsburgh,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Ingo</given-names> <surname>Glöckner</surname></string-name>
          and Björn Pelzer.
          <article-title>Exploring robustness enhancements for logic-based passage filtering</article-title>
          .
          <source>In Proc. of KES-2008, Lecture Notes in Computer Science</source>
          . Springer,
          <year>2008</year>
          (to appear).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Sven</given-names> <surname>Hartrumpf</surname></string-name>
          .
          <article-title>Hybrid Disambiguation in Natural Language Analysis</article-title>
          . Der Andere Verlag, Osnabrück, Germany,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Sven</given-names> <surname>Hartrumpf</surname></string-name>
          , Ingo Glöckner, and Johannes Leveling.
          <article-title>University of Hagen at QA@CLEF 2008: Efficient Question Answering with Question Decomposition and Multiple Answer Streams</article-title>
          .
          <source>In Working Notes for the CLEF 2008 Workshop</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>