=Paper=
{{Paper
|id=Vol-1178/CLEF2012wn-QA4MRE-GlocknerEt2012
|storemode=property
|title=The LogAnswer Project at QA4MRE 2012
|pdfUrl=https://ceur-ws.org/Vol-1178/CLEF2012wn-QA4MRE-GlocknerEt2012.pdf
|volume=Vol-1178
}}
==The LogAnswer Project at QA4MRE 2012==
The LogAnswer Project at QA4MRE 2012

Ingo Glöckner and Björn Pelzer

Intelligent Information and Communication Systems Group (IICS), University of Hagen, 59084 Hagen, Germany (ingo.gloeckner@fernuni-hagen.de)
Department of Computer Science, Artificial Intelligence Research Group, University of Koblenz-Landau, Universitätsstr. 1, 56070 Koblenz (bpelzer@uni-koblenz.de)

Abstract. For QA4MRE 2012, we have improved our prototype of a purely logic-based answer validation system that was originally developed for QA4MRE 2011. The system uses core functionality of the LogAnswer question answering (QA) system available on the web, which was combined and extended to best meet the demands of the QA4MRE task. In particular, coreference resolution is used to allow knowledge processing on the document level, and a fragment of OpenCyc has been integrated to allow more flexible reasoning. In addition to a general consolidation of the original prototype that took part in QA4MRE 2011, the current system was optimized for accuracy on non-rejected questions. While last year's system only achieved an accuracy of 0.24 for answered questions, the various measures taken to improve the rejection decision now yield an accuracy of 0.31 for answered questions in the QA4MRE 2012 task. Results show that these improvements were sufficient to raise the overall quality of answers to a c@1 score of 0.26, which is clearly above the c@1 baseline of 0.20 for random guessing. Given the difficulty of the reading tests in QA4MRE, we consider this result promising, recalling the near-baseline scores of most QA4MRE participants in 2011.

(LogAnswer is supported by the DFG with grants FU 263/12-2 and GL 682/1-2.)

1 Introduction

The DFG-funded project LogAnswer investigates the use of deep linguistic processing and logical reasoning for QA, with a special emphasis on the issues of robustness and efficiency. An experimental QA system for German (also called LogAnswer) [2] that evolved from this research can be tested online (www.loganswer.de); a screenshot is shown in Fig. 1. The system took part in the QA@CLEF and ResPubliQA QA system evaluations [3,4,5]. Recently, we have focused on the issue of providing automatic QA support for human Q&A portals. Users of the German FragWikia! portal (http://frag.wikia.com/) can subscribe to the automatic answer service on the LogAnswer website. LogAnswer keeps track of new questions at FragWikia! and sends an answer email whenever it finds sufficiently good results for a question of a registered user.

[Fig. 1. Screenshot of the LogAnswer QA system, using QA@CLEF data; the displayed example question asks how many people died when the MS Estonia sank, answered with "more than 900 people".]

In this new use case for QA, we face the challenge that there is often little lexical overlap between question and answer. Moreover, the system must validate answers in order to achieve better precision. This is because users do not necessarily expect the system to answer at all in this scenario. But if it does post a response, then the answer should have a high probability of correctness, so that users are not annoyed by machine-generated answers. This scenario resembles the present QA4MRE machine reading exercise [6] in that there is often no close relationship between question and answers, and in the need to validate answers rather than always providing a ranked list of answer candidates.
Despite these similarities, which make participation in QA4MRE appealing for our project, applying the existing LogAnswer system (even with the precision-oriented refinements for QA portals) to the QA4MRE task does not make sense:

– LogAnswer normally finds answers from scratch, based on a candidate retrieval backend whose retrieval scores are considered in determining the final ranking. In QA4MRE, however, there is no retrieval score to serve as one of the starting points for reranking.
– LogAnswer usually works with a sentence-level segmentation of documents, while QA4MRE may demand a justification of answers from several parts of a document or even from several documents.
– Since the validation features of LogAnswer depend on sentence-level segmentation, and since the retrieval score is one of these features, the validation models of LogAnswer as determined by a learning-to-rank approach are not applicable.
– The existing QA@CLEF-based training set for the machine learning (ML) technique cannot be reused either, because it shows a strong effect of lexical overlap (unlike QA4MRE) and because it comprises sentence-level items only.

Owing to these difficulties, which were already present in QA4MRE 2011, we decided last year to build a new LogAnswer prototype for the QA4MRE task [7], using only those basic techniques from the regular LogAnswer system that seemed useful for the task:

– The core system for QA4MRE was designed to achieve logical answer validation on document-level representations;
– Existing LogAnswer functionality was reused whenever possible;
– Instead of using the supplied "QA4MRE Background Collection" or a web-based validation approach, we decided to strengthen the background knowledge of LogAnswer by integrating part of OpenCyc;
– Generation of explanations for a chosen answer was added.

For QA4MRE 2012, our goal was a general consolidation of the LogAnswer prototype built for QA4MRE and, in particular, an improvement of its capabilities for rejecting wrong answers.

2 System Description

The LogAnswer prototype for QA4MRE relies on a deep linguistic analysis:

– The reading tests, questions and answer candidates are parsed by the WOCADI parser [8].
– WOCADI computes a meaning representation for each sentence in the form of a Multilayered Extended Semantic Network (MultiNet [9]).
– The question is then subjected to a question classification, based on a system of 248 classification rules that operate on the generated semantic networks.

In QA4MRE 2011, the question classification was of minor importance compared to its role in regular QA, since no predictive annotation and no expected answer type check was used. For QA4MRE 2012, some effort was spent on utilizing the computed question category, specific answer types (as known from the former QA@CLEF and ResPubliQA exercises) and other information determined by the question classification rules in order to eliminate questions that the system is unable to answer:

– LogAnswer is designed for answering single-sentence questions. Therefore, questions that span more than one sentence are now discarded.
– LogAnswer has no special means for validating list answers (e.g. answers involving a coordination such as 'London and Paris'). Therefore, questions that target several answers (such as 'Which cities. . . ') are now recognized and discarded, since they are not likely to be answered correctly.
– Questions are now generally discarded if no expected answer type can be established, except if the question is classified into the 'DEFINITION' category (for definition questions).

In the following we sketch the construction of the logical representation on the document level.

– Coreference resolution is essential to the success of logic-based processing on whole documents: under the unique name assumption, the combined information about a discourse entity can only be utilized if all mentions of the entity in the text are represented by the same constant. We therefore use a coreference resolver, CORUDIS [8], for handling anaphoric pronouns and other forms of coreference.
– The coreference resolver is complemented by a simplistic solution for identifying the speaker from the manuscript of a talk, which was activated in the QA4MRE runs. The method simply chooses the first name mentioned at the beginning of the manuscript. Knowing the speaker is necessary for resolving first-person deixis ('Ich'/'I').
– In a process called assimilation, the text representation is constructed from all logical sentence representations, thereby using the results of coreference analysis and speaker recognition.

In order to capture more inferences, ca. 20,000 subsumption relationships were extracted from OpenCyc and translated into our concept system for German, see Section 3. For efficiency reasons, this subsumption hierarchy is not added to the background knowledge seen by the prover. Instead, the text representation is enriched by explicitly declaring each instance of a concept an instance of all of its superconcepts as well (a short sketch of this expansion step is given further below). Example:

– A document mentions a tsunami.
– The logical representation contains a statement like SUB(c24, tsunami.1.1), expressing that entity c24 is an instance of a tsunami (with HaGenLex word sense index 1.1, see [10]).
– From the OpenCyc link, we know that every tsunami is a 'Naturkatastrophe' (natural disaster).
– The OpenCyc expansion step then adds an assertion expressing that c24 is also an instance of a natural disaster, viz. SUB(c24, naturkatastrophe.1.1).

Let us now turn to hypothesis construction. By a hypothesis, we mean a declarative statement formed from the question and the candidate answer. Consider the question 'What is Bono's attitude with respect to the digital age?' and the possible answer 'enthusiasm'. In this case, the textual hypothesis would be 'Bono's attitude with respect to the digital age is enthusiasm'. In our prototype, the template-based construction of a textual hypothesis from question and answer, followed by parsing, is only used as a fallback. This is because text-based hypothesis construction is problematic for German due to mismatches of word forms with respect to syntactic case. Instead, our system normally constructs the logical hypothesis directly on the logical level, by combining the literal lists that represent question and answer.

Based on the logical hypothesis (as a list of literals) and the logical representation of the document for the given reading test, our system then uses robust logical processing for justifying or rejecting the hypothesis (and in turn, the answer candidate). This procedure follows the general principle of logical answer validation:

– Try to prove the logical hypothesis from the logical representation of the reading test document and from the general background knowledge of the system.
– If such a proof succeeds, then the answer from which the hypothesis was constructed is considered correct.
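As an illustration of the OpenCyc-based expansion step referenced above, the following minimal Python sketch adds superconcept instances to the text representation. The data structures and names are ours, not LogAnswer's; we merely assume that the extracted subsumptions are available as a map from each concept to its direct superconcepts.

```python
# Hypothetical sketch of the expansion of SUB literals with OpenCyc-derived
# superconcepts. 'subsumptions' maps a concept to its direct superconcepts;
# all names are illustrative, not taken from the actual LogAnswer code.

def superconcept_closure(concept, subsumptions):
    """Collect all (transitive) superconcepts of a concept."""
    closure, stack = set(), [concept]
    while stack:
        for parent in subsumptions.get(stack.pop(), ()):
            if parent not in closure:
                closure.add(parent)
                stack.append(parent)
    return closure

def expand_sub_literals(text_literals, subsumptions):
    """Add SUB(entity, superconcept) facts for every SUB literal in the text."""
    expanded = list(text_literals)
    for predicate, entity, concept in text_literals:
        if predicate == "SUB":
            for parent in superconcept_closure(concept, subsumptions):
                expanded.append(("SUB", entity, parent))
    return expanded

# Example from the text: a tsunami entity also becomes a natural disaster.
subsumptions = {"tsunami.1.1": ["naturkatastrophe.1.1"]}
literals = [("SUB", "c24", "tsunami.1.1")]
print(expand_sub_literals(literals, subsumptions))
# [('SUB', 'c24', 'tsunami.1.1'), ('SUB', 'c24', 'naturkatastrophe.1.1')]
```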
In practice, linguistic analyses contain errors, coreference resolution can fail, and the background knowledge is incomplete. Therefore, the proof attempt may fail for correct hypotheses, and we cannot conclude from a failed proof that the hypothesis is indeed wrong. To ensure sufficient robustness, we thus add a relaxation loop to the logical validation process:

– Remove problematic hypothesis literals one at a time until a proof of the remaining hypothesis fragment succeeds.
– Impose a limit on the number of admissible relaxation steps in order to keep processing time within reasonable bounds.

The skipped literal count and other criteria related to the relaxation result are then used as features that affect answer scoring.

QA4MRE 2011 required systems to provide one or more so-called provenance elements that justify the chosen answer. This requirement was simplified in this year's task, which only requires general information about the resources involved to be provided. Still, we did not eliminate the extraction of detailed justifications from our system, since it proved useful for computing validation scores that depend on the coherence of the found justifications. The computation of provenance information (i.e. of the individual justifying elements that together explain the answer) is based on prover results:

– The names of all used axioms (implicative rules) are added as provenance elements, since axiom names in our system are chosen in a meaningful way.
– Used facts from the background knowledge generally express lexical-semantic relations, such as CHPA(hoch.1.1, höhe.1.1) (i.e., the property 'high' corresponds to the attribute 'height'). For a reader who knows the MultiNet documentation, such a fact is self-explanatory and can be added as a provenance element.
– For literals from the text representation such as SUB(c24, tsunami.1.1), the system presents a witness sentence rather than the used fact itself. These witness sentences are also considered provenance elements and thus included in the computed justification.

Since no sufficient training data was available for applying a machine-learning technique that automatically determines validation scores (the QA4MRE 2011 data for LogAnswer was considered too small for learning a useful model), we used an ad-hoc scoring metric that closely resembles the validation criterion already used last year. Many of the validation features normally used by LogAnswer had to be omitted, since it was not clear how to utilize them in a simple hand-crafted scoring criterion. As last year, we started from a basic scoring metric ϱ defined as the arithmetic mean of:

– ϱ1 = α^(#skipped-lits), where α = 0.7 and #skipped-lits is the number of hypothesis literals skipped due to relaxation.
– ϱ2 = α^(#skipped-lits) · β^(#unknown-status-lits), where β = 0.8 and #unknown-status-lits is the number of non-skipped hypothesis literals not yet proved in the last proof attempt of the relaxation loop.
– ϱ3 = 1 − #skipped-q-lits / #all-q-lits (optimistic proportion of provable question literals).
– ϱ4 = #proved-q-lits / #all-q-lits (pessimistic proportion of proved question literals).
– ϱ5 = 1 − #skipped-a-lits / #all-a-lits (optimistic proportion of provable answer literals).
– ϱ6 = #proved-a-lits / #all-a-lits (pessimistic proportion of proved answer literals).

This basic validation score is then combined with a criterion intended to capture the coherence of the justification of the answer as found by the system using a relaxation proof. In order to determine this coherence score, we organize the witness sentences into blocks of adjacent sentences. We call a block of adjacent witness sentences 'unconnected' to earlier blocks if it is not linked to these blocks by coreference. Let u be the number of unconnected blocks of witness sentences for the given answer a.
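The six components above are straightforward to compute from the counts delivered by the relaxation proof. As a minimal illustration (not the actual LogAnswer code; all names are ours), the basic score ϱ could be obtained as follows:

```python
# Hypothetical sketch of the hand-crafted basic validation score (rho).
# The counts would come from the relaxation proof described above.

ALPHA, BETA = 0.7, 0.8  # weights given in the text

def basic_score(skipped_lits, unknown_status_lits,
                skipped_q_lits, proved_q_lits, all_q_lits,
                skipped_a_lits, proved_a_lits, all_a_lits):
    rho1 = ALPHA ** skipped_lits
    rho2 = (ALPHA ** skipped_lits) * (BETA ** unknown_status_lits)
    rho3 = 1 - skipped_q_lits / all_q_lits   # optimistic: non-skipped counts as provable
    rho4 = proved_q_lits / all_q_lits        # pessimistic: only proved literals count
    rho5 = 1 - skipped_a_lits / all_a_lits
    rho6 = proved_a_lits / all_a_lits
    return (rho1 + rho2 + rho3 + rho4 + rho5 + rho6) / 6
```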
The final answer selection score σ is then computed from the basic score ϱ and the unconnected block count u as σ = ϱ · γ^(u−1), where γ = 0.7 in our experiments.

The final step in solving an individual reading test item is the selection of the best answer, or alternatively the decision not to commit to any answer at all, leaving the question unanswered.

– The prototype generally refuses to answer if question analysis fails (i.e., if there is no full parse of the question or the question literal pattern is empty).
– An answer candidate is also rejected if hypothesis construction fails (e.g. if the constructed hypothesis contains no literals associated with the original answer).

As opposed to our system setup in QA4MRE 2011 described in [7], the prototype no longer rejects answer candidates for which provenance generation fails. In the case that an empty justification is extracted for the answer, the system instead determines σ based on u = 1 (i.e. assuming that the candidate answer is justified by a single witness sentence). If all answers are rejected for one of the reasons mentioned, then the question is left unanswered. Otherwise the answer with the maximum score is chosen.

3 Resources Used

The background knowledge of LogAnswer comprises:

– A system of 49,000 synonym classes involving 112,000 lexemes, used for synonym normalization (concepts are replaced by a canonical representative);
– About 10,500 facts, e.g.
  – nominalization relationships ('to attack' vs. 'the attack'),
  – adjective-attribute relationships ('high' vs. 'height'),
  – gender-specific profession names ('Professor' vs. 'Professorin' [female professor] in German);
– Approx. 120 rules for basic inferences (e.g. for making use of the lexical-semantic relationships listed above).

In order to enhance the background knowledge, the system is further equipped with knowledge from OpenCyc (http://sw.opencyc.org/). To this end, the HaGenLex [10] concept system of German word senses was linked to a fragment of OpenCyc using a semi-automatic alignment technique. We started with the automatic generation of hypotheses for HaGenLex-OpenCyc links:

– HaGenLex → wikipedia.de topic or redirect → DBPedia map (de-en) → wikipedia.en-OpenCyc link, or alternatively:
– HaGenLex → HaEnLex → wikipedia.en topic/redirect → wikipedia.en-OpenCyc link,

where HaEnLex is a core lexicon for English designed analogously to HaGenLex and interlinked with the German lexicon. The automatic generation of link hypotheses was followed by a manual validation, checking several tens of thousands of links generated in this way for their actual correctness. The checked links are then used for knowledge extraction: the validated alignment lets us map synonyms, subsumptions, and disjointness declarations from OpenCyc into our HaGenLex concept system. A final refinement consists in a check for cyclic subsumptions and inconsistencies, which are found automatically but had to be fixed by hand. Moreover, there is an automatic elimination of redundancy (i.e., an optimization of subsumption chains), as illustrated by the sketch below.
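The paper does not spell out how the subsumption chains are optimized. Purely as an illustration of what such a redundancy elimination could look like, the following sketch (assuming the extracted subsumptions form a directed acyclic graph; all names are ours) drops a direct subsumption link whenever it is already implied by a chain of other links:

```python
# Hypothetical sketch of redundancy elimination on extracted subsumptions:
# remove a direct link A -> C if C is still reachable from A via other links
# (a simple transitive reduction of a DAG).

def transitive_reduction(subsumptions):
    """subsumptions: dict mapping a concept to the set of its direct superconcepts."""
    def reachable(start, skip_edge):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for parent in subsumptions.get(node, set()):
                if (node, parent) == skip_edge or parent in seen:
                    continue
                seen.add(parent)
                stack.append(parent)
        return seen

    reduced = {c: set(parents) for c, parents in subsumptions.items()}
    for concept, parents in subsumptions.items():
        for parent in parents:
            # the direct edge is redundant if 'parent' stays reachable without it
            if parent in reachable(concept, skip_edge=(concept, parent)):
                reduced[concept].discard(parent)
    return reduced
```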
This process resulted in 20,000 new redundancy-free subsumption relationships, 1,500 new synonyms, and 4,000 disjointness declarations, which substantially extend the background knowledge available to LogAnswer.

4 Experimental Results

Two runs were submitted to QA4MRE 2012, using German as both the source and target language. These runs were based on the following system configurations:

– loga12011dede: only elementary synonyms (spelling variants, wrong spellings) and elementary background knowledge (e.g. simple temporal and spatial regularities), no OpenCyc integration.
– loga12023dede: full background knowledge of LogAnswer including the OpenCyc expansion.

The main evaluation results obtained for the 160 test questions of QA4MRE 2012 are shown in Table 1. Here, 'aw' is the number of answered (non-rejected) questions, 'corr' is the number of answered questions with a correct choice of the answer, 'acc|aw' is the accuracy achieved for answered questions, 'acc' is the overall accuracy, and 'c@1' is the c@1 score achieved by the given run.

Table 1. Main results of the LogAnswer prototype in QA4MRE 2012

Run            aw  corr  acc|aw  acc   c@1
loga12011dede  96  28    0.29    0.18  0.25
loga12023dede  96  30    0.31    0.19  0.26

The system attempts to answer 96 questions, with 28 correct answers in the first run and 30 correct answers in the second run, which is supported by the full background knowledge of LogAnswer. Thus, there is a slight positive effect of the added knowledge, including the OpenCyc integration. The achieved c@1 score of 0.26 clearly exceeds the c@1 score of 0.20 for random guessing (a short sketch of the c@1 computation is given at the end of this section). The accuracy of 0.31 for answered questions in the second run is substantially higher than the corresponding value of 0.24 achieved last year [7]. Moreover, the c@1 score has improved from 0.22 to the current 0.26. This indicates a positive effect of the consolidation of the system and of the refinements for recognizing questions that the system is unable to handle.

Table 2 shows the results by topic. T1, T2, T3 and T4 are the c@1 scores achieved for topics 1 (AIDS), 2 (Climate Change), 3 (Music and Society), and 4 (Alzheimer), respectively; 'all' is the overall c@1 score for comparison.

Table 2. Topic results of the LogAnswer prototype in QA4MRE 2012

Run            T1    T2    T3    T4    all
loga12011dede  0.26  0.07  0.42  0.23  0.25
loga12023dede  0.29  0.11  0.42  0.23  0.26

While the self-assessment of the system with respect to failed answer attempts has been improved, mainly by better exploiting the results of question classification, some problems of the prototype with the QA4MRE tasks remain:

– Poor document quality. One of the main problems for deep linguistic processing was missing blanks, which caused adjacent words to be wrongly merged into single tokens. There was also no structuring of the documents by blank lines at all, so that headlines were not clearly separated from regular text. Due to the resulting parsing failures, many sentences became invisible to logic-based validation.
– Spelling errors in questions. The WOCADI parser found a full parse for 132 of the 160 questions. Parsing problems in the remaining 28 questions were mostly due to spelling errors or wrong grammar. These problems made it impossible for the system to answer the corrupted questions.
– Linguistic diversity of answers. Due to the rich question types in the reading tests, answers were syntactically diverse and hard to parse (e.g. incomplete sentences).
– Missing background knowledge. The regularities needed to understand the correctness of an answer were complex and hard to capture by logical rules.
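As announced above, here is a small illustrative check of the c@1 values in Table 1, assuming the standard c@1 definition used in the QA4MRE evaluations; the snippet is ours and not part of the LogAnswer system.

```python
# c@1 rewards unanswered questions in proportion to the accuracy on answered ones:
#   c@1 = (n_correct + n_unanswered * n_correct / n) / n
def c_at_1(n_correct: int, n_answered: int, n_total: int) -> float:
    n_unanswered = n_total - n_answered
    return (n_correct + n_unanswered * n_correct / n_total) / n_total

print(f"{c_at_1(28, 96, 160):.4f}")  # 0.2450 -> reported as 0.25 (loga12011dede)
print(f"{c_at_1(30, 96, 160):.4f}")  # 0.2625 -> reported as 0.26 (loga12023dede)
```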
5 Conclusions and Future Work

We have described the LogAnswer prototype for QA4MRE 2012, which resulted from last year's system through a general consolidation and through refinements for better recognizing questions that the system cannot answer. These improvements have raised the c@1 score of the system to 0.26, which is substantially better than the expected c@1 score of 0.20 for random guessing. There was also a slight positive effect of using background knowledge, including the OpenCyc integration, compared to using no background knowledge at all. However, defects in the reading test documents (missing blanks and structuring) still make it difficult to apply deep linguistic processing and subsequent logical reasoning for validating answers. No 'shallow' techniques were used as a fallback when the deep linguistic analysis fails, since lexical overlap and similar shallow linguistic criteria do not help much in QA4MRE. It would have been best to complement logical answer validation with web-based validation or with hit scores obtained from the QA4MRE Background Collection. The modest c@1 score achieved by LogAnswer can also be attributed to the fact that the QA4MRE background collections were not yet used by the system.

Looking at concrete questions, a decomposition of coordinated answers such as 'London and Paris' would have been useful, but for the time being the system discards questions that target such answers. From a more general perspective, we will continue working on techniques for increasing the precision of results in order to make our system more useful in the QA portals use case. One concrete approach that we are currently working on is the use of techniques from Case-Based Reasoning (CBR). These techniques could help to utilize user feedback for continually improving validation quality.

References

1. Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G., Kurimo, M., Mandl, T., Peñas, A., Petras, V., eds.: Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, Revised Selected Papers. Springer, Heidelberg (2009)
2. Furbach, U., Glöckner, I., Pelzer, B.: An application of automated reasoning in natural language question answering. AI Communications 23(2-3) (2010) 241–265
3. Glöckner, I., Pelzer, B.: Combining logic and machine learning for answering questions. In: [1]
4. Glöckner, I., Pelzer, B.: The LogAnswer project at CLEF 2009. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece (September 2009)
5. Glöckner, I., Pelzer, B.: The LogAnswer project at ResPubliQA 2010. In: CLEF 2010 Working Notes (September 2010)
6. Peñas, A., Hovy, E., Forner, P., Rodrigo, A., Sutcliffe, R., Sporleder, C., Forascu, C., Benajiba, Y., Osenova, P.: Overview of QA4MRE at CLEF 2012: Question Answering for Machine Reading Evaluation. In: CLEF 2012 Evaluation Labs and Workshop Working Notes Papers, Rome, Italy (2012)
7. Glöckner, I., Pelzer, B., Dong, T.: The LogAnswer project at QA4MRE 2011. In: CLEF 2011 Working Notes (September 2011)
8. Hartrumpf, S.: Hybrid Disambiguation in Natural Language Analysis. Der Andere Verlag, Osnabrück, Germany (2003)
9. Helbig, H.: Knowledge Representation and the Semantics of Natural Language. Springer, Berlin (2006)
10. Hartrumpf, S., Helbig, H., Osswald, R.: The semantically based computer lexicon HaGenLex. Traitement automatique des langues 44(2) (2003) 81–105