=Paper=
{{Paper
|id=Vol-1172/CLEF2006wn-QACLEF-Glockner2006
|storemode=property
|title=University of Hagen at QA@CLEF 2006: Answer Validation Exercise
|pdfUrl=https://ceur-ws.org/Vol-1172/CLEF2006wn-QACLEF-Glockner2006.pdf
|volume=Vol-1172
|dblpUrl=https://dblp.org/rec/conf/clef/Glockner06a
}}
==University of Hagen at QA@CLEF 2006: Answer Validation Exercise==
Ingo Glöckner
Intelligent Information and Communication Systems (IICS)
University of Hagen (FernUniversität in Hagen)
58084 Hagen, Germany
ingo.gloeckner@fernuni-hagen.de

Abstract

The paper features MAVE (MultiNet answer verification), a system for validating results of question-answering systems which serves as a testbed for robust knowledge processing based on multi-layered extended semantic networks (MultiNet [4]). The system utilizes logical inference rather than graph matching for recognising textual entailment, which makes it easily extensible by further knowledge encoded in logical axioms. In order to ensure robustness, the prover is embedded in a relaxation loop which successively skips ‘critical’ literals until a proof of the reduced query succeeds. MAVE uses the number of skipped literals as a robust indicator of logical entailment. Although the detection of ‘critical’ literals does not necessarily result in the minimal number of non-provable literals, the skipped-literal indicator performed very well in the AVE experiments. MAVE parses the hypothesis strings and is therefore sensitive to syntactic errors in hypothesis generation. A regular correction grammar is applied to alleviate this problem. The system also parses the original question (when available) and tries to prove it from the snippet representation. The skipped-literal count for the question is then combined with a similarity index for the hypothesis vs. the found answer. This method boosts recall by up to 13%. Additional indicators are used for recognizing false positives. This filter increases precision by up to 14% without compromising the recall of the system.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering, Selection process; H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness); I.2.3 [Artificial Intelligence]: Deduction and Theorem Proving—Inference engines; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods—Predicate logic, Semantic networks; I.2.7 [Artificial Intelligence]: Natural Language Processing

Keywords

Answer Validation, Question Answering, Recognising Textual Entailment (RTE), Robustness, MultiNet

1 Introduction

1.1 System description

MAVE is a knowledge-based system for answer validation which uses deep linguistic processing and logical inference for recognising textual entailment. The system assumes XML input as specified in the AVE 2006 description. A validation candidate p consists of a hypothesis string, constructed by substituting the answer of the QA system into the original question, and of the extracted snippet which contains the relevant text passage. Optionally, the question string can also be given. MAVE uses a standard cascaded architecture for processing validation candidates. The sequence of processing steps involves preprocessing, linguistic analysis, and indicator extraction (the results of logical inference are also represented by indicators). The final aggregation step determines the YES/NO decision and confidence score from a combination of the indicators. These processing stages and the involved components of MAVE are detailed in the following.
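Before going into the individual stages, the overall control flow can be pictured as a small driver function. The following Python sketch is purely illustrative; all names and signatures (Candidate, run_cascade, the stage functions passed as arguments) are hypothetical stand-ins for the MAVE components described in the remainder of this section.

# Minimal sketch of the cascaded architecture; names and signatures are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    hypothesis: str                   # question with the QA system's answer substituted in
    snippet: str                      # extracted text passage supporting the answer
    question: Optional[str] = None    # original question string, if available

def run_cascade(cand, preprocess, analyse, extract_indicators, aggregate, theta=2):
    """Preprocessing -> linguistic analysis -> indicator extraction -> aggregation."""
    hypo_net = analyse(preprocess(cand.hypothesis))
    snip_net = analyse(preprocess(cand.snippet))
    indicators = extract_indicators(hypo_net, snip_net, cand.question)
    imp = aggregate(indicators)                      # imperfection score, see Sect. 1.3
    decision = "YES" if imp <= theta else "NO"       # thresholded selection criterion
    return decision, imp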
Preprocessing

The hypotheses are very often syntactically ill-formed or semantically defective due to the primitive mechanism used for generating hypotheses from fixed templates. For example, the hypothesis might contain the phrase ‘im Jahre 26. April 1986’ (En: ‘in the year April 26, 1986’) rather than listing the date only. Sequences of two prepositions are also often found, e.g. ‘in in Lillehammer’ rather than ‘in Lillehammer’. These errors deteriorate the results of syntactic parsing, and a set of regular expressions is therefore applied in the preprocessing stage which corrects many of them. For example, sequences of local prepositions are reduced to a single local preposition. The snippet strings often need correction as well. Syntactic errors in the snippets mostly result from wrong XML-to-text conversion (contents of XML elements are concatenated without proper punctuation), from wrong segmentation (snippet extraction disregards sentence boundaries), from named entity recognizers which collapse multi-word units into new tokens like ‘Gianni Versace’, and from primitive mechanisms for reference resolution which simply replace pronouns by the named entity they most likely refer to. Some of these errors can be fixed by regular correction patterns, but the deterioration of extracted snippets by artefacts of the QA systems remains an issue.

Linguistic analysis

The WOCADI parser [1] is used for natural-language analysis. Linguistic processing starts with a morphological and lexical analysis, i.e. lemmatization, decomposition of noun compounds, and assignment of possible word meanings. These word meanings are expressed by lexical concept constants from the HaGenLex computational lexicon for German [3]. WOCADI then attempts a deep syntactic analysis, but it is robust against ungrammatical input and often returns a chunk parse (incomplete parse) in the event of such syntactic errors. The parser also constructs a semantic representation of the natural language expression, which is expressed in the MultiNet formalism [4] (a modern dialect of semantic networks). This process includes the disambiguation of word meanings, semantic role labeling, and facticity labeling of all discourse entities (which helps to distinguish real situations from hypothetical or negated cases). WOCADI handles intrasentential coreference resolution and also prepares the necessary data for a subsequent intersentential coreference resolution (e.g. for interpreting referring pronouns). The output format of the WOCADI parser, which contains the results of all analysis levels, is documented in [2].

Post-processing of representations and query construction

Snippets are allowed to span multiple sentences. The representations of individual snippet sentences must hence be integrated into a global meaning representation of the snippet text [5]. This assimilation process involves a resolution of intersentential anaphora, which is based on the coreference information generated by WOCADI. The post-processing further involves a simplification and normalization of the generated semantic networks. A list of synonyms and near-synonyms is used for replacing all lexical concept constants with the canonical synset representative. This eliminates the need for handling synonyms on the level of knowledge processing. The resulting semantic network for a snippet is represented as a logical conjunction of elementary facts which correspond to the edges of the network and to additional node attributes.
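For illustration, a snippet sentence like ‘Die Olympischen Winterspiele 1994 fanden in Lillehammer statt.’ (En: ‘The 1994 Winter Olympics took place in Lillehammer.’) might be flattened into a small conjunction of ground facts along the following lines. The encoding is a deliberately simplified sketch: the relation names and entity identifiers are chosen for readability and are not actual WOCADI/MultiNet output (the real relation inventory is defined in [4]).

# Hypothetical, strongly simplified encoding of a snippet representation as ground facts.
snippet_facts = [
    ("subs", "c1", "stattfinden.1.1"),   # c1 is a taking-place event
    ("arg",  "c1", "c2"),                # whose argument is the entity c2 ...
    ("sub",  "c2", "winterspiele.1.1"),  # ... an instance of 'Winterspiele'
    ("loc",  "c1", "c3"),                # the event is located at c3 ...
    ("sub",  "c3", "lillehammer.0"),     # ... which is Lillehammer
    ("temp", "c1", "year_1994"),         # and it is temporally located in 1994
]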
While the snippet facts are dynamically added to the knowledge base, the hypothesis representation (or the question representation) must be turned into a logical query, i.e. a conjunction of literals in which all entities mentioned in the hypothesis are represented by variables. A proof of such a query will bind the variables to the appropriate discourse entities mentioned in the snippet or to a Skolem term built from such entities. The resulting logical query is further refined by eliminating complicated constructions. This process eliminates the modalities ‘müssen’ (En: ‘must’) and ‘können’ (En: ‘can’), for example.

Indicator extraction

A total of 46 indicators are extracted from the hypothesis, the snippet and the question string. These include character-based indicators (like the length of the snippet or the proportion of non-alphabetic characters), lemma-based indicators (like the hypo-triviality ratio described below), concept-based indicators (like the presence of concepts which indicate negative polarity contexts), syntax-based indicators (e.g. the number of sentences in the snippet), indicators computed from the semantic network representation (like the indicator hypo-lexical-focus described below), and finally proof-based and matching-based indicators. It was not yet possible to consider all of these indicators in the scoring function of MAVE because there was no development collection for German. The eleven indicators which were chosen for the scoring criterion are explained at the end of this section.

Robust entailment proving

The most important indicators for deciding answer correctness are determined by the MAVE prover, an improved version of the resolution-based inference engine for MultiNet described in [5]. The theorem prover expects queries to be expressed as a conjunctive list of literals. The knowledge base comprises variable-free facts (these describe the MultiNet representations of the snippet, as well as lexico-semantic relationships and bindings of rule schemata) and implicative rules which derive a conjunction of conclusion literals from a conjunction of premise literals. All variables which only occur in the conclusion are considered existentially quantified. Because all such variables are replaced with Skolem terms, the application of an axiom will always result in a fully instantiated (variable-free) conclusion. The prover uses iterative deepening to find answers, i.e. a bounded depth-first search whose depth bound is incremented after each iterative deepening cycle. The iterative deepening process is stopped when a given maximum depth limit of 5 is reached or when a time limit is exceeded (20 seconds in the experiments). The performance of the prover is optimized by term indexing techniques and by literal sorting: the literals are assigned costs according to the estimated effort for proving the literal given its current instantiation pattern, e.g. the pattern (sub var non-var) for the literal (sub X lesen.1.1). The literals are then sorted in increasing order of estimated effort, where the estimate for each literal Li takes into account the instantiation pattern of bound vs. variable arguments which results from proving the literals L1, ..., Li−1. In order to achieve robust logical inference, the theorem prover is embedded in a relaxation loop which drops part of the conjunctive query when a proof of the current query fragment fails or cannot be found within the time limit.
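The relaxation strategy can be sketched schematically as follows. This is an illustrative Python rendering of the control flow only; prove_prefix and order_by_effort stand in for the actual MultiNet prover and for the least-effort literal ordering described above, and the ‘?’ prefix marking variables is an assumption of this toy encoding.

# Schematic sketch of the relaxation loop (hypothetical prover interface).
def variables(literal):
    # in this toy encoding, query variables are strings with a leading '?'
    return {arg for arg in literal if isinstance(arg, str) and arg.startswith("?")}

def relaxed_proof(literals, prove_prefix, order_by_effort):
    """Return (bindings, skipped), where 'skipped' counts the dropped query literals."""
    skipped = 0
    remaining = order_by_effort(literals, bound=set())
    bindings = {}
    while remaining:
        k, bindings = prove_prefix(remaining)        # longest provable prefix L1..Lk
        if k == len(remaining):                      # the whole fragment is provable
            break
        provable, rest = remaining[:k], remaining[k + 1:]
        skipped += 1                                 # skip the 'critical' literal L(k+1)
        bound = set().union(*map(variables, provable))      # variables bound so far
        remaining = provable + order_by_effort(rest, bound=bound)  # re-order the tail
    return bindings, skipped                         # 'skipped' is the failed-literal count

The skipped count returned here corresponds to the failed-literal indicators (hypo-failed-literals and question-failed-literals) used in the scoring function of Sect. 1.3.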
By successively removing ‘critical’ literals from the query, this process always finds a (possibly empty) query fragment provable from the assumed knowledge. The number of skipped query literals which are not contained in the provable fragment is used as a numeric indicator of logical entailment which is robust against slight errors in the logical representation of hypothesis and snippet, and also against gaps in the knowledge modeling. The selection of the ‘critical’ literal (which will be skipped in the next iteration of the relaxation loop) is based on the least-effort ordering of the literals. The prover keeps track of the longest prefix L1, ..., Lk of the ordered literal list for which a proof was found. When a full proof fails, the literal Lk+1 will be skipped and a new proof of the smaller fragment will be started. Notice that the literals Lk+2, ..., LN must be re-ordered because the instantiation patterns can change when Lk+1 is deleted from the literal list. The relaxation loop does not necessarily determine the smallest possible number of non-provable literals. This does not affect the success of the approach in practice, however (see Sect. 3).

Aggregation of indicators

An indicator evaluation language was developed which makes it easy to experiment with scoring criteria. In this language, all indicators can be treated like variables which are automatically bound to the appropriate value for each considered item in the test collection. In addition, aggregation functions like maximum, average etc. are available in the scoring language. The language also offers means for dealing with undefined indicator values. Expressions in this scoring language can then be translated into active scoring functions (which are ordinary functions in the Scheme programming language). Both the YES/NO decision of the run and the confidence score are computed from expressions in this scoring language.

1.2 Knowledge resources

The background knowledge base of the first MAVE prototype comprises 2,484 basic facts which interrelate 4,666 lexicalized concepts. About 2,000 of these lexico-semantic relationships were supplied by the computational lexicon HaGenLex. A typical example is the MultiNet relation CHEA, which holds between events described by verbs like ‘aktivieren’ (En: ‘to activate’) and the corresponding abstractum ‘Aktivierung’ (En: ‘activation’). The remaining 500 facts capture important relationships not covered by HaGenLex, e.g. about the connection between states (‘Spain’), state adjectives (‘Spanish’), and inhabitant names (‘Spaniard’). Apart from the facts, MAVE uses 103 implicative rules. 38 of these rules are so-called ‘R-axioms’, which define key properties of the MultiNet relations. The remaining 65 rules include meaning postulates for frequent verbs not covered by the HaGenLex relations, and also a rudimentary domain modelling which covers a few important topics like the birth and death of a person, being awarded a prize, and being a member or a leader of an organization. These topics were identified based on the known questions of QA@CLEF 2004 and 2005. HaGenLex is also the source of more than 1,000 high-quality synonyms which are used for normalizing concept names. 525 additional synonyms or near-synonyms for frequent terms were manually compiled from OpenThesaurus synsets. The resulting system comprises 636 synonym classes.
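The normalization based on these synonym classes amounts to a simple table lookup, sketched below. The synonym classes and concept names in the sketch are invented for the example; the real classes stem from HaGenLex and OpenThesaurus as described above.

# Toy sketch of synonym normalization: every lexical concept constant is mapped
# to the canonical representative of its synonym class (names are invented).
SYNSETS = [
    {"beginnen.1.1", "anfangen.1.1", "starten.1.1"},   # one synonym class
    {"bekommen.1.1", "erhalten.1.1", "kriegen.1.1"},
]
CANONICAL = {concept: sorted(synset)[0] for synset in SYNSETS for concept in synset}

def normalize(fact):
    """Replace known concept constants in a fact tuple by their canonical representative."""
    return tuple(CANONICAL.get(arg, arg) for arg in fact)

print(normalize(("subs", "c1", "starten.1.1")))   # -> ('subs', 'c1', 'anfangen.1.1')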
Not only the representations of query and snippet, but also the fact knowledge base and the implicative axioms are automatically normalized by replacing known synonyms with a canonical representative of the synset. This normalization increases the coverage of the model because distinct lexical constants are mapped to a single canonical representation. The background knowledge which refers to the canonical representation is then applicable to all lexical variants expressing the underlying concept. Due to lack of time, it was not possible to integrate additional resources like GermaNet. This means that knowledge modeling in the first MAVE prototype is rather sparse. However, most of the hypothesis-snippet entailments or question-snippet entailments in the AVE test set are not very knowledge intensive, and the robust inference mechanism of MAVE is specifically designed for absorbing a few gaps in knowledge modeling.

1.3 Indicators and scoring function

Defining the scoring criterion

The following basic scoring criterion is used by MAVE. The returned number reflects the imperfection score of the considered hypothesis-snippet pair. A perfect logical entailment is scored 0, while an entailment relationship with slight errors (detailed below) will be scored by a small integer depending on the number of detected mismatches. A large value of 1000 will be returned to indicate that the pair should be rejected in any case. The imperfection score imp-score(p) of a hypothesis pair p is given by the expression:

(max (* (max (? (>= hypo-triviality 0.8) 0)
             (= hypo-num-sentences 2))
        1000)
     (min (? (max (* (max (< hypo-collapse 0.7) hypo-lexical-focus) 1000)
                  (+ hypo-missing-constraints hypo-failed-literals))
             1000)
          (? (max (* question-lexical-focus (= hypo-delta-matched 0) 1000)
                  (+ question-failed-literals
                     question-missing-constraints
                     (- hypo-delta-concepts hypo-delta-matched)))
             1000)))

Indicator evaluation was implemented in the Scheme programming language, which explains the prefix notation of the operators in the above expression. Note that the indicators can be undefined in certain cases. The functions max, +, - and * are assumed to be defined only if all their arguments are defined. The special expression (? A B) is used to handle undefined values: it returns A if A is defined and B otherwise. The basic indicators on which the computation of the imp-score is based are defined as follows.

• hypo-triviality – Lemma repetition ratio.
This criterion serves to identify trivial hypotheses like ‘Gianni Versace was Gianni Versace’ which are logically valid but not acceptable as an answer. In order to detect such false positives, the hypo-triviality criterion measures the ratio of lemmata which occur an even number of times (certain lemmata like German ‘sein’, En: ‘be’, are filtered out before determining this ratio). Trivial hypotheses of the above kind result in a high repetition score. Empirically, a threshold of hypo-triviality ≥ 0.8 was found to identify such hypotheses reliably. In the case that there is not sufficient data for computing the hypo-triviality indicator (because the syntactic processing failed completely), the assumption of a non-trivial hypothesis is a reasonable default. This assumption is captured by the expression (? (>= hypo-triviality 0.8) 0), i.e. no error condition by default.

• hypo-num-sentences – The number of sentences in the hypothesis.
Due to odd punctuation and syntax errors, the parser might wrongly consider a hypothesis as consisting of two sentences although in fact we only have a single-sentence hypothesis.
We therefore know that there is something wrong with the analysis of the parser if hypo-num-sentences = 2. Such hypotheses get the default evaluation NO.

• hypo-collapse – Query extraction control index.
When there is no perfect parse of a sentence, the result of semantic analysis can be incomplete. Compared to the correct query representation, the constructed logical query will contain too few literals in this case and will thus be underspecific. In order to detect this cause of false positives, we define the query extraction control measure

hypo-collapse = #hypothesis-literals / #hypo-content-words

which relates the size of the constructed query (the number of literals in the representation of the hypothesis) to the number of content words in the hypothesis string. A value of hypo-collapse < 0.7 was found to be a reliable indicator for the case that the logical representation is too small compared to the size of the hypothesis. A proof of the logical hypothesis is not useful for judging answer correctness in this case because it likely corresponds to a false positive.

• hypo-lexical-focus – Hint at a potential false positive.
This binary criterion identifies a situation where the logical hypothesis representation is too permissive compared to the intended meaning of the hypothesis. This problem in query construction mainly affects hypotheses involving measurable properties (like ‘8585 meters high’). A modification of the parser will eliminate the need to consider this case in the future.

• hypo-missing-constraints – Control index for numerical constraints.
Apart from regular words, the hypothesis can also contain numbers. These numbers must occur as restrictions in the constructed logical representations (for example, a restriction of the year to 1994). Although the syntactic-semantic parser normally treats these numeric constraints as expected, it turned out that they are occasionally dropped when a complete parse of the sentence fails. The indicator hypo-missing-constraints was introduced to recognize this case. It is defined as

hypo-missing-constraints = #numerical-tokens-in-hypothesis − #numerical-constraints-in-logical-hypothesis.

The imp-score criterion will treat each missing constraint like a skipped literal which could not be established by logical inference.

• hypo-failed-literals – Number of non-provable literals.
As explained above, the prover is embedded in a relaxation loop which skips non-provable literals until a proof of the remaining literals succeeds. In this way, we obtain the indicator hypo-failed-literals (the number of non-provable literals in the hypothesis representation). A value of 0 means strict logical entailment, while a small integer indicates a few mismatches or knowledge gaps.

• question-failed-literals – Number of non-provable literals for the question.
The MAVE system also tries to prove the original question (rather than the hypothesis) from the snippet. The indicator question-failed-literals counts the number of literals in the logical question representation which had to be skipped in order to prove the remaining question literals.

• question-missing-constraints – Control index for numerical constraints in the question.
The indicator counts the number of numeric tokens in the original question which are not expressed in the logical representation of the question.
The evaluation criterion imp-score treats these missing constraints like a skipped literal which cannot be established in the logical proof. See the indicator hypo-missing-constraints for motivation.

• question-lexical-focus – Hint at a potential false positive.
Due to a bug in semantics construction, the logical representation of questions involving measurable adjectives is too permissive in a few cases, which means a higher risk of a false positive. These situations are detected by the binary indicator question-lexical-focus, the analogue of hypo-lexical-focus for questions. The problem will be fixed in future revisions of MAVE and the indicators can then be omitted.

• hypo-delta-concepts – Number of answer words in the hypothesis.
A proof of the original question will only support the given hypothesis if the answer from which the hypothesis has been built is similar to the answer found in the proof of the question. As the basis for establishing such a similarity, we extract the answer part of the hypothesis (knowing that the hypothesis has been constructed from the question and from the original answer determined by the QA system). hypo-delta-concepts counts the number of content-bearing tokens in the hypothesis which likely stem from the original answer of the QA system and not from the question.

• hypo-delta-matched – Number of matched answer words.
In order to relate the answer of the QA system to a proof of the original question based on the snippet, we use the matching coefficient hypo-delta-matched. Based on the proof of the logical question representation over the snippet and the known origin of the literals in the sentences of the snippet, we first identify the sentence in the snippet which contains most of the literals supporting the proof. The content-bearing and numeric tokens in the answer part of the hypothesis are then matched against the lexical concepts and cardinality specifications in the MultiNet representation of this central sentence (matching is performed on the level of lexical concepts in order to profit from lemmatization and synonym normalization). In this way, we obtain the number of those answer words which can be matched with the selected snippet sentence. The focus on a most relevant sentence of the snippet helps detect the case that a hypothesis refers to parts of the snippet not related to the answer.

Defining the selection criterion

A simple threshold θ is used to determine YES/NO decisions from the basic evaluation score as defined above, i.e.

select(p) = YES if imp-score(p) ≤ θ, and select(p) = NO otherwise.

Thus all hypothesis-snippet pairs with an imperfection score imp-score(p) ≤ θ will be evaluated YES, while those items with more than θ imperfections will be rejected.
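Read as an ordinary function, the Scheme scoring expression and the thresholded selection rule roughly correspond to the Python transcription below. This is a sketch of one possible reading, not MAVE code: indicator names are hyphenated in the original scoring language, boolean sub-expressions are rendered here as 0/1, and undefined indicator values are modelled as None, matching the semantics of the (? A B) construct.

# One possible Python reading of imp-score(p) and the selection criterion (a sketch).
def imp_score(ind):
    """ind maps every indicator name to its value, or to None if it is undefined."""
    def defined(*keys):
        return all(ind.get(k) is not None for k in keys)

    # (? (>= hypo-triviality 0.8) 0): assume a non-trivial hypothesis by default
    trivial = int(ind["hypo_triviality"] >= 0.8) if defined("hypo_triviality") else 0
    two_sentences = int(ind.get("hypo_num_sentences") == 2)
    reject = max(trivial, two_sentences) * 1000           # hard rejection tests

    # imperfections measured on the proof of the hypothesis
    if defined("hypo_collapse", "hypo_lexical_focus",
               "hypo_missing_constraints", "hypo_failed_literals"):
        hypo = max(max(int(ind["hypo_collapse"] < 0.7), ind["hypo_lexical_focus"]) * 1000,
                   ind["hypo_missing_constraints"] + ind["hypo_failed_literals"])
    else:
        hypo = 1000                                        # fallback of (? ... 1000)

    # imperfections measured on the proof of the question plus answer matching
    if defined("question_lexical_focus", "hypo_delta_matched", "question_failed_literals",
               "question_missing_constraints", "hypo_delta_concepts"):
        question = max(ind["question_lexical_focus"] * int(ind["hypo_delta_matched"] == 0) * 1000,
                       ind["question_failed_literals"] + ind["question_missing_constraints"]
                       + (ind["hypo_delta_concepts"] - ind["hypo_delta_matched"]))
    else:
        question = 1000                                    # fallback of (? ... 1000)

    return max(reject, min(hypo, question))

def select(ind, theta):
    """Thresholded selection criterion: YES iff imp-score(p) <= theta."""
    return "YES" if imp_score(ind) <= theta else "NO"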
Defining the confidence ranking

The assignment of confidence scores is based on the assumption that for YES decisions, the results with the least number of imperfections (as expressed by the evaluation criterion) are most reliable. Specifically, results with an imperfection score of 0 (no mismatches at all) should be fully reliable with score 1, and all other YES decisions should get a smaller but positive confidence evaluation which depends on the number of detected errors. A simple linear function is used to obtain this behaviour. As to the NO decisions, a larger number of imperfections (non-provable literals) means more evidence that the answer is indeed false. Assuming that σ or more non-provable literals (and other imperfections) mean ‘totally wrong’, we capture this as follows:

confidence(p) = 1 − η · imp-score(p) / (θ + 1)                      if select(p) = YES
confidence(p) = min(1, (imp-score(p) − θ − 1) / (σ − θ − 1))        otherwise,

where 0 ≤ θ < σ − 1. The parameter 0 < η < 1 + 1/θ can be used to balance the YES scale of confidence scores against the NO scale of confidence scores.

2 Description of Runs

The two submitted runs both instantiate the general scheme for selecting valid hypotheses and for assigning confidence scores.

• In the recall-oriented first run, θ = 2 mismatches (‘imperfections’) were allowed for YES decisions.
• In the more precision-oriented second run, θ = 1 mismatch was allowed for YES decisions.

We chose σ = 8 in both runs, i.e. the confidence score for NO decisions reaches 1 if the imp-score detects 8 or more mismatches. It was intended to use η = 1 in all runs. However, the confidence scores in the submitted run #1 were based on η = 1.2 by mistake.

3 Evaluation

The viability of the basic approach of robust inferential entailment checking is witnessed by Table 1, which lists the results of MAVE in the answer validation exercise. ‘Accuracy’ measures the proportion of correct decisions, and CWS is the confidence-weighted score (for computing the CWS, σ = 8 and η = 1 were used throughout; notice that for θ = 2, exactly the same results are obtained for η = 1 and η = 1.2, i.e. the numbers shown for θ = 2 indeed correspond to the submitted first run). Although the approach does not search for a minimal set of failed literals, but only skips literals according to a simple ‘longest partial proof’ strategy, the computed failed-literal counts are obviously significant with respect to the validity of entailment: there is a consistent decrease in precision if more imperfections θ (usually skipped literals) are allowed, but also the intended boost in recall. The table further reveals that the two submitted runs (which admit one or two mismatches) represent the best compromise across all evaluation measures. Assuming provability of all literals (θ = 0) is too restrictive in terms of recall, while admitting θ = 3 or more mismatches considerably lowers precision.

MAVE features the use of question proofs for validating hypotheses, by comparing the answer part of the hypothesis with the lexical concepts in the snippet sentence which contributes most to the proof of the question. This mechanism is intended to increase recall when the hypothesis is syntactically ill-formed. In this case, no proof of the hypothesis is possible because no logical representation can be constructed. However, the question can usually be parsed, so that a proof of the question can be tried. The hypothesis is then compared with the answer determined by the proof of the question in a way which does not assume a parse of the hypothesis (i.e. only based on the results of lexical analysis). To see the benefits of this technique, consider Table 2, which lists the results when no question-related indicators are used (i.e. when omitting the second argument of min in the definition of imp-score(p)). For θ = 1, the recall decreases by 9.6% when no question-related information is available; for θ = 2, the question
indicators even contribute 13.0% of recall. Precision is lowered somewhat when using question data (by 11.2% for θ = 1 and 10.0% for θ = 2) but remains at a high level, and we get an overall improvement, as witnessed by the higher F measure and accuracy in the submitted runs.

θ (run)      Precision   Recall   F measure   Accuracy   CWS
0            0.8496      0.2720   0.4120      0.8051     0.8505
1 (run #2)   0.7238      0.3711   0.4906      0.8085     0.8457
2 (run #1)   0.5878      0.4929   0.5362      0.7859     0.8331
3            0.4880      0.5779   0.5292      0.7418     0.8134
4            0.4107      0.6771   0.5112      0.6750     0.7883
5            0.3658      0.7337   0.4882      0.6138     0.7638
6            0.3289      0.7705   0.4610      0.5477     0.7429

Table 1: Results of the MAVE system for the AVE-2006 test collection

θ            Precision   Recall   F measure   Accuracy   CWS
0            0.9452      0.1955   0.3239      0.7952     0.8418
1            0.8361      0.2748   0.4137      0.8044     0.8413
2            0.6882      0.3626   0.4750      0.7987     0.8375
3            0.5597      0.4646   0.5077      0.7738     0.8280
4            0.4508      0.5326   0.4883      0.7198     0.8138
5            0.3963      0.6062   0.4793      0.6693     0.7987
6            0.3615      0.6544   0.4657      0.6230     0.7832

Table 2: MAVE results based on hypothesis proofs only

Table 3 shows that this increase in performance can only be achieved when both hypothesis and question indicators are combined. The table lists the results of MAVE when no information about hypothesis provability is used (i.e. when omitting the first argument of min in the definition of imp-score(p)). In fact, there is a clear decay in recall when deciding the validity of a hypothesis only based on the proof of the question and a subsequent matching with the hypothesis. Interestingly, the simplistic method of validating the hypothesis from a proof of the question and a comparison with the relevant snippet segment does not result in a loss of precision.

θ            Precision   Recall   F measure   Accuracy   CWS
0            0.8571      0.2380   0.3725      0.7987     0.8504
1            0.7468      0.3343   0.4618      0.8044     0.8481
2            0.6031      0.4476   0.5138      0.7873     0.8393
3            0.5081      0.5326   0.5201      0.7532     0.8244
4            0.4379      0.6487   0.5228      0.7027     0.8046
5            0.3910      0.7167   0.5060      0.6486     0.7822
6            0.3451      0.7450   0.4717      0.5811     0.7602

Table 3: MAVE results based on question proofs only

The definition of imp-score(p) also involves indicators (like hypo-triviality, hypo-collapse or question-lexical-focus) which serve to detect false positives. Table 4 shows the results obtained when omitting these indicators from the scoring function. Specifically, it demonstrates that the proposed indicators are effective in removing false positives, because precision drops by 14.4 percent when the false positive indicators are ignored. This effect is most marked when the recall base is small (most false positives like ‘Gianni Versace is Gianni Versace’ are tautologies and thus provable for θ = 0). Recall is not altered significantly, i.e. the indicators are sufficiently selective for removing mostly false examples.

θ            Precision   Recall   F measure   Accuracy   CWS
0            0.7059      0.2720   0.3926      0.7888     0.8149
1            0.6000      0.3909   0.4734      0.7817     0.8052
2            0.5072      0.4986   0.5029      0.7525     0.7879
3            0.4319      0.5836   0.4964      0.7027     0.7650
4            0.3725      0.6827   0.4820      0.6136     0.7376
5            0.3390      0.7394   0.4648      0.5726     0.7128
6            0.3093      0.7790   0.4428      0.5078     0.6925

Table 4: MAVE results without false positive tests

4 Conclusion

We have presented the MAVE system for robust inferential answer validation in the MultiNet framework. Despite its small knowledge base, the system works reasonably well in the AVE exercise, mainly because most negative cases are strikingly false and easily recognized. Moreover, a relevant number of the positive cases need limited inference or are even trivial (i.e. constructed from a substring of the snippet).
Thus AVE systems can be pretty simple-minded as long as QA systems use simple techniques, and they must become smarter only as the QA systems which generate the answers become smarter as well. Apart from that, the relative success of the MAVE prototype is explained by its robust proving strategy, which tolerates a few mismatches and gaps in the coded knowledge. The most obvious improvement of MAVE will be the integration of large lexico-semantic resources like GermaNet; this step is necessary to boost recall beyond 50%. Now that the annotated AVE 2006 corpus for German is available as a development collection, the design of improved scoring metrics and the training of statistical classifiers also become an option.

References

[1] Sven Hartrumpf. Hybrid Disambiguation in Natural Language Analysis. Der Andere Verlag, Osnabrück, Germany, 2003.

[2] Sven Hartrumpf, Hermann Helbig, Constantin Jenge, and Rainer Osswald. BenToWeb deliverable D6.2. Technical report, FernUniversität in Hagen, 2006.

[3] Sven Hartrumpf, Hermann Helbig, and Rainer Osswald. The semantically based computer lexicon HaGenLex – Structure and technological environment. Traitement automatique des langues, 44(2):81–105, 2003.

[4] Hermann Helbig. Knowledge Representation and the Semantics of Natural Language. Springer, Berlin, 2006.

[5] S. Marthen. Untersuchungen zur Assimilation größerer Wissensbestände aus textueller Information. Master's thesis, FernUniversität in Hagen, Hagen, Germany, 2002.