=Paper=
{{Paper
|id=Vol-1177/CLEF2011wn-QA4MRE-GlocknerEt2011
|storemode=property
|title=The LogAnswer Project at QA4MRE 2011
|pdfUrl=https://ceur-ws.org/Vol-1177/CLEF2011wn-QA4MRE-GlocknerEt2011.pdf
|volume=Vol-1177
}}
==The LogAnswer Project at QA4MRE 2011==
Ingo Glöckner1, Björn Pelzer2, and Tiansi Dong1

1 Intelligent Information and Communication Systems Group (IICS), University of Hagen, 59084 Hagen, Germany, {ingo.gloeckner,tiansi.dong}@fernuni-hagen.de
2 Department of Computer Science, Artificial Intelligence Research Group, University of Koblenz-Landau, Universitätsstr. 1, 56070 Koblenz, bpelzer@uni-koblenz.de

Abstract. We present the prototype of a purely logic-based answer validation system that was developed for QA4MRE. While the prototype uses core functionality of the LogAnswer question answering (QA) system available on the web, it had to be designed from scratch in order to meet the demands of the QA4MRE task. Specific improvements include the use of coreference resolution to allow knowledge processing on the document level, the integration of a fragment of OpenCyc to allow more flexible reasoning, and the extraction of 'provenance information' that explains the preference of the system for the chosen answer candidate. The results show that the new prototype was not yet mature at the time of the participation in the QA4MRE task. However, the system has a solid architecture that allows powerful logic-based answer validation, and our analysis of the submitted runs clearly identifies the improvements necessary to unfold the potential of the system. (LogAnswer is supported by the DFG with grants FU 263/12-2 and GL 682/1-2.)

1 Introduction

LogAnswer [2] is a question answering (QA) system for German that can be tested on the web (http://www.loganswer.de/), using the German Wikipedia as the document collection. The system was repeatedly evaluated in the QA@CLEF and ResPubliQA QA system evaluations [3,4,5]. Recently, we have focused on the use of LogAnswer as a virtual forum participant in human Q&A portals on the web (such as Yahoo! Answers). Experiments using data from the German QA portal FragWikia! (http://frag.wikia.com/) show that questions from real users are substantially more difficult to handle than the artificial questions of the earlier QA@CLEF experiments [6,7]. In particular, the relationship between the question and the relevant text passage that answers it can be rather loose, with little lexical overlap between question and answer phrases. We thus embraced the new QA4MRE challenge of CLEF 2011, which also fosters the development of QA technologies that no longer assume a simple lexical relationship between questions and answer passages.

It was not possible, though, to apply the existing LogAnswer system to the QA4MRE task, because the system is based on some implicit assumptions that no longer hold in the QA4MRE context. First of all, the existing LogAnswer system uses a sentence-level segmentation, i.e. the candidate retrieval delivers single sentences and the answer selection models can operate on single sentences only. In the QA4MRE task, however, the content of a full document must be processed, and the restriction to single sentences does not make sense. The evaluation models of LogAnswer further depend on the retrieval score produced by the retrieval system, while in QA4MRE there is a fixed reading test document and no retrieval step. Finally, the training sets for the answer selection models of LogAnswer were obtained from the earlier QA@CLEF competitions, where there was often a close lexical relationship between the question and the answer sentence.
Consequently, it was not possible to reuse the existing validation features and the existing answer selection models of LogAnswer. The switch to document-level validation forced us to construct a new prototype specifically for QA4MRE, using only those basic techniques from LogAnswer that seemed useful for the task. Since a new system had to be built, many advanced aspects already covered by the existing LogAnswer system (use of fallback techniques, sophisticated validation models obtained by machine learning, etc.) were dropped for the time being. We state the following goals for the QA4MRE participation:

– Develop a core solution for the QA4MRE task that implements logic-based answer validation for document-level segmentation, reusing functionality of the first LogAnswer prototype whenever possible.
– Instead of using the background collection provided by QA4MRE, or external evidence as provided by hit counts of search engines, we decided to focus on strengthening the reasoning capabilities of our system by extending its background knowledge. The specific goal was to couple LogAnswer to (a fragment of) OpenCyc, so that its background knowledge covers more possible paraphrases.
– Another goal for QA4MRE was the addition of functionality that can extract an explanation of why the chosen answer was preferred over the alternatives. In our case, such a justification for the best answer includes the concrete logical rules and facts and the relevant sentences necessary to prove the answer.

In the paper, we first present the system description of the prototype built for QA4MRE. We then show the results of the system on the QA4MRE test set for German. Finally, we discuss various problems that became visible in a first error analysis of the QA4MRE runs.

2 System Description

2.1 Parsing of Reading Tests, Questions and Candidate Answers

The reading tests, questions and answer candidates were parsed by applying the WOCADI parser [8]. It computes a meaning representation for each sentence in the form of a Multilayered Extended Semantic Network (MultiNet) [9].

2.2 Question Classification

The LogAnswer system uses a question classification in order to determine the category of the question (factoid or definition question) and the expected answer type (e.g. PERSON or DATE). In the QA4MRE context, we lacked training data for exploiting an expected answer type check, so we ignored the answer type. Still, the distinction between factoid questions and definition questions was important for our QA4MRE runs, since it affects the amount of background knowledge being used. Specifically, LogAnswer uses a very small synonym system for answering definition questions (it only contains spelling variants and very obvious synonyms that everybody takes for granted). In this way, definitions that merely state a synonym are not regarded as trivial answers in this synonym mode. For factoid questions, on the other hand, the full system of all synonyms known to LogAnswer is used. The sketch below illustrates this category-dependent selection.
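The following sketch illustrates how the question category controls the amount of lexical-semantic knowledge that is activated. It is an illustration only: the classifier heuristic, the knowledge sets and all names (classify_question, SPELLING_VARIANTS, FULL_SYNONYMS) are our own assumptions and do not reflect the actual LogAnswer implementation.

```python
# Sketch of category-dependent knowledge selection (Sect. 2.2); hypothetical.
from enum import Enum

class QuestionCategory(Enum):
    FACTOID = "factoid"
    DEFINITION = "definition"

# Hypothetical knowledge bases: a tiny set of spelling variants vs. the
# full synonym system of the QA system.
SPELLING_VARIANTS = {("fotografie.1.1", "photographie.1.1")}
FULL_SYNONYMS = SPELLING_VARIANTS | {("beginn.1.1", "anfang.1.1")}

def classify_question(question: str) -> QuestionCategory:
    """Very rough stand-in for the real question classifier."""
    q = question.lower()
    if q.startswith("was ist") or q.startswith("was bedeutet"):
        return QuestionCategory.DEFINITION
    return QuestionCategory.FACTOID

def synonym_knowledge(category: QuestionCategory) -> set:
    # Definition questions: only spelling variants, so that an answer which
    # merely states a synonym is not trivially provable.
    if category is QuestionCategory.DEFINITION:
        return SPELLING_VARIANTS
    return FULL_SYNONYMS

print(len(synonym_knowledge(classify_question("Was ist ein Tsunami?"))))
```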
Let us now consider the additional processing steps applied to the parses of the reading test documents that serve to further elaborate the document representation for subsequent logical processing.

2.3 Coreference Resolution

When parsing the reading tests, which are multi-sentence documents, the coreference resolver CORUDIS [8] is applied. The discourse entities mentioned in the individual sentences are represented by nodes in the corresponding semantic network. CORUDIS finds groups of such nodes that denote the same individual. Among other coreference phenomena, this allows CORUDIS to handle anaphoric pronouns that refer to entities introduced earlier in the text. Note that coreference resolution, and the identification of all constants that refer to the same individual, are essential to the success of logic-based processing on the document level. This is because, due to the unique name assumption, the combined information about all mentions of a discourse entity can only be utilized if all mentions of this entity are represented by the same constant.

2.4 Recognition of the Speaker of a Talk

A special task not handled by CORUDIS is the recognition of the speaker of a talk. In the reading test documents, which are manuscripts of talks, the speakers refer to themselves in the first person singular as "ich" (I), and the name of the speaker is only mentioned once at the beginning of the manuscript. Knowing the speaker is essential for QA4MRE, since many questions refer to the name of the speaker of the talk. The LogAnswer prototype for QA4MRE contains a simplistic solution for identifying the speaker from the manuscript of a talk: the method simply chooses the first name mentioned in the manuscript. All mentions of the first person singular in the text are then identified with the found speaker, so that the corresponding pieces of information become accessible via the speaker's name.

2.5 Assimilation of Sentence Representations

By assimilation we mean the integration of the semantic representations of individual sentences of a text into a semantic representation of the text as a whole. In the case of semantic networks, this means that the graphs are merged, keeping each edge that occurs in several sentence networks only once. Coreference resolution and the special treatment of the first person singular ensure that mentions of the same entity in different sentences are finally represented by the same node in the network (modulo errors of coreference resolution, of course). The semantic networks are translated into a logical representation by treating nodes as constants and edges as literals based on the edge symbol as a binary predicate. The logical representation of the text thus becomes a large conjunction of ground literals.

2.6 OpenCyc-Based Concept Expansion

For QA4MRE, we have aligned our concept system (of HaGenLex [10] concepts for German) with a fragment of OpenCyc (we used the OWL version of OpenCyc, http://sw.opencyc.org/). As described in Sect. 2.12, this process resulted in about 20,000 additional redundancy-free subconcept-superconcept relationships between HaGenLex concepts. We felt that integrating all of these subsumption relationships into each reading test would place too much load on the prover. Therefore, a map from lexical concept ids to all known superconcepts (as mediated by the OpenCyc link) is computed in advance. If an entity mentioned in the text belongs to a certain base concept, then it is automatically marked as belonging to all known superconcepts based on this pre-computed map. We thus use OpenCyc (via an alignment to our own concept system for German) to enrich the logical representation of the documents. For example, suppose that a document mentions a tsunami. From the OpenCyc link, we know that every tsunami is a "Naturkatastrophe" (natural disaster). The original sentence representation then contains a statement like SUB(c243, tsunami.1.1), expressing that entity c243 is an instance of a tsunami (with HaGenLex word sense index 1.1). The OpenCyc expansion then adds an assertion SUB(c243, naturkatastrophe.1.1), expressing that c243 is also an instance of a natural disaster.
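A minimal sketch of the document-level literal construction (Sect. 2.5) and the precomputed superconcept expansion (Sect. 2.6) is given below. The Literal tuple, the SUPERCONCEPTS map content and the function names are our own simplifications for illustration, not the actual MultiNet or LogAnswer data structures.

```python
# Illustrative sketch of assimilation (Sect. 2.5) and OpenCyc-based concept
# expansion (Sect. 2.6). Names and data structures are hypothetical.
from typing import NamedTuple

class Literal(NamedTuple):
    pred: str          # e.g. "SUB" (instance-of / subconcept edge)
    args: tuple        # constants or lexical concept ids

# Precomputed map from lexical concept ids to all known superconcepts,
# as mediated by the OpenCyc alignment (assumed content).
SUPERCONCEPTS = {
    "tsunami.1.1": {"naturkatastrophe.1.1"},
}

def assimilate(sentence_networks: list[set[Literal]]) -> set[Literal]:
    """Merge sentence-level networks into one document network: edges that
    occur in several sentence networks are kept only once."""
    document: set[Literal] = set()
    for net in sentence_networks:
        document |= net
    return document

def expand_document(document: set[Literal]) -> set[Literal]:
    """Add SUB literals for all known superconcepts of mentioned concepts."""
    expanded = set(document)
    for lit in document:
        if lit.pred == "SUB":
            entity, concept = lit.args
            for sup in SUPERCONCEPTS.get(concept, ()):
                expanded.add(Literal("SUB", (entity, sup)))
    return expanded

# Example: a sentence mentioning a tsunami also becomes provable evidence
# for questions about a natural disaster.
doc = assimilate([{Literal("SUB", ("c243", "tsunami.1.1"))}])
assert Literal("SUB", ("c243", "naturkatastrophe.1.1")) in expand_document(doc)
```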
2.7 Hypothesis Construction

By a hypothesis, we mean a declarative statement formed from the question and the candidate answer. For example, given the question "Was ist Bonos Einstellung zum digitalen Zeitalter?" (What is Bono's attitude with respect to the digital age?) and the possible answer "Enthusiasmus" (enthusiasm), we may form the hypothesis "Bonos Einstellung zum digitalen Zeitalter ist Enthusiasmus" (Bono's attitude with respect to the digital age is enthusiasm). The construction of a hypothesis is crucial for answer validation since it is the validity of the hypothesis (given the reading test document and the knowledge of the system) that eventually decides whether the answer is justified or not. The hypothesis can be formed on the textual level, but for our purposes it is only important to know the logical representation of the hypothesis, which can be constructed from the logical analysis of the question and of the answer. The question literals and answer literals are then combined into a conjunction of hypothesis literals as explained in [11].

The benefit of hypothesis construction on the logical level is that it works well for highly inflecting languages like German. In fact, a template-based construction of textual hypotheses followed by another application of the parser is problematic for German due to mismatches of word forms with respect to syntactic case. On the other hand, QA4MRE comes with a diversity of answers expressed as nominal phrases, prepositional phrases, numbers, adjectives, adverbs, (causal or modal) subordinate clauses, or infinitive constructions. Not all of these answers can be parsed by WOCADI, as required for hypothesis construction. Therefore we decided to implement a traditional, textual hypothesis construction as a fallback solution to be applied when the direct construction of the hypothesis on the logical level fails. Here, part of the question is replaced by the candidate answer, the resulting textual hypothesis is parsed by WOCADI, and a literal pattern for the hypothesis is formed from the parsing result. This fallback technique was only implemented for special question types, such as causal questions, modal questions, "wann" (when) questions, "wie viele" (how many) questions, and "in welchem Jahr" (in which year) questions.
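The following sketch shows, in strongly simplified form, how question literals and answer literals can be combined into one hypothesis conjunction, with the textual fallback triggered when one of the literal patterns is missing. The encoding (a focus variable bound to an answer constant, the Literal tuple, and the relation names in the toy example) is our own simplification, not the actual MultiNet/LogAnswer representation described in [11].

```python
# Illustrative sketch of hypothesis construction on the logical level (Sect. 2.7).
from typing import NamedTuple, Optional

class Literal(NamedTuple):
    pred: str
    args: tuple

def build_hypothesis(question_lits: list[Literal],
                     answer_lits: list[Literal],
                     focus_var: str,
                     answer_const: str) -> Optional[list[Literal]]:
    """Combine question and answer literals into one conjunction by
    identifying the question focus with the answer entity. Returns None if
    either pattern is empty; in that case the textual fallback (replace part
    of the question by the answer and re-parse) would be tried."""
    if not question_lits or not answer_lits:
        return None
    bound = [Literal(l.pred, tuple(answer_const if a == focus_var else a
                                   for a in l.args))
             for l in question_lits]
    return bound + list(answer_lits)

# Toy encoding of "What is Bono's attitude towards the digital age?" plus the
# candidate answer "Enthusiasmus" (names are hypothetical).
q = [Literal("ATTR", ("bono.0", "X")), Literal("SUB", ("X", "einstellung.1.1"))]
a = [Literal("SUB", ("c_ans", "enthusiasmus.1.1"))]
print(build_hypothesis(q, a, focus_var="X", answer_const="c_ans"))
```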
2.8 Robust Logical Processing

While LogAnswer normally combines logic-based and shallow techniques, we opted for a purely logic-based validation approach in QA4MRE (mainly due to the lack of training data for a robust multi-criteria approach). In principle, the logic-based validation is accomplished by trying to prove the logical hypothesis from the logical representation of the reading test document and from the general background knowledge of the system. If such a proof succeeds, then the answer from which the hypothesis was constructed is considered correct, otherwise it is incorrect. However, due to errors of the linguistic analysis, failures of coreference resolution, and gaps in the background knowledge, it is not realistic to demand a perfect logic-based validation based on a complete proof of the hypothesis. Thus, if a complete proof of the query fails, then the system resorts to query relaxation. We have described the use of relaxation in LogAnswer elsewhere [2]. For the purposes of this paper, it is sufficient to know that the system will remove problematic hypothesis literals one at a time until a proof of the remaining hypothesis fragment succeeds. In practice, we impose a limit on the number of admissible relaxation steps.

In general, the result of this kind of robust logical processing is only a proof of a hypothesis fragment. From this we cannot conclude that the hypothesis as a whole would be provable assuming ideal knowledge and an error-free linguistic analysis. However, the more literals must be skipped, the less likely it becomes that the hypothesis is indeed correct. Thus, we use the skipped literal count and other indicators related to the relaxation proof as the basis for scoring answers.

2.9 Computation of Scoring Metrics

Due to the lack of a training set for QA4MRE, we decided to use a primitive ad-hoc scoring formula. This formula omits many of the answer selection features normally used by LogAnswer, because we felt unable to integrate all these features in a hand-coded scoring metric.

In order to calculate the scoring metrics, the result of the last relaxation proof based on the hypothesis for the answer of interest is needed. The relaxation result consists of the list of hypothesis literals that the system managed to prove in the last proof attempt, the list of failed literals that were skipped by query relaxation, and a list of literals with unknown status. These are the literals that the prover was not able to prove in the last proof attempt of the relaxation loop, but that are also not yet classified as skipped literals (they may or may not be provable from the remaining query fragment, but we do not know, since the relaxation cycle has already been stopped). Using these criteria, we can not only split the hypothesis literals into proved-h-lits, skipped-h-lits and unknown-status-h-lits; we can also restrict attention to the hypothesis literals that stem from the question, obtaining proved-q-lits, skipped-q-lits and unknown-status-q-lits. Similarly, we can restrict attention to the hypothesis literals that stem from the answer, yielding the literal sets proved-a-lits, skipped-a-lits and unknown-status-a-lits. The basic scoring metric ϱ for an answer candidate is then defined as the arithmetic mean of the following six scoring criteria (a small computational sketch follows the list):

– ϱ1: based on the number of skipped literals and a parameter α = 0.7: ϱ1 = α^#skipped-lits
– ϱ2: additionally based on the number of literals with unknown status and a parameter β = 0.8: ϱ2 = α^#skipped-lits · β^#unknown-status-lits
– ϱ3: optimistic proportion of provable question literals: ϱ3 = 1 − #skipped-q-lits / #all-q-lits
– ϱ4: pessimistic proportion of proved question literals: ϱ4 = #proved-q-lits / #all-q-lits
– ϱ5: optimistic proportion of provable answer literals: ϱ5 = 1 − #skipped-a-lits / #all-a-lits
– ϱ6: pessimistic proportion of proved answer literals: ϱ6 = #proved-a-lits / #all-a-lits
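As a check of the formulas above, the following sketch computes the six criteria and their mean for one relaxation result. The RelaxationResult container and the assumption that #skipped-lits and #unknown-status-lits simply sum the question-derived and answer-derived counts are ours; the actual bookkeeping in LogAnswer may differ.

```python
# Illustrative computation of the hand-coded scoring criteria from Sect. 2.9,
# following the formulas given above. Field names are our own simplification.
from dataclasses import dataclass

ALPHA, BETA = 0.7, 0.8

@dataclass
class RelaxationResult:
    proved_q: int; skipped_q: int; unknown_q: int   # question literals
    proved_a: int; skipped_a: int; unknown_a: int   # answer literals

def basic_score(r: RelaxationResult) -> float:
    skipped = r.skipped_q + r.skipped_a
    unknown = r.unknown_q + r.unknown_a
    all_q = r.proved_q + r.skipped_q + r.unknown_q
    all_a = r.proved_a + r.skipped_a + r.unknown_a
    rho1 = ALPHA ** skipped
    rho2 = ALPHA ** skipped * BETA ** unknown
    rho3 = 1 - r.skipped_q / all_q      # optimistic, question literals
    rho4 = r.proved_q / all_q           # pessimistic, question literals
    rho5 = 1 - r.skipped_a / all_a      # optimistic, answer literals
    rho6 = r.proved_a / all_a           # pessimistic, answer literals
    return (rho1 + rho2 + rho3 + rho4 + rho5 + rho6) / 6.0

# Example: 4 of 5 question literals proved, 1 skipped; both answer literals proved.
print(basic_score(RelaxationResult(4, 1, 0, 2, 0, 0)))   # ~0.83
```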
The basic score ϱ as described so far is combined with a second score that tries to capture the coherence of the evidence for the answer, judging from the witness sentences that justify the answer (see Sect. 2.11 on provenance information for a description of how this list of sentences is computed). Given this set of sentences that were needed for the proof, the sentences are bundled into sequences of directly adjacent sentences without intermittent gaps. For example, if the proof of the hypothesis from the representation of the test document involves sentences number 1, 2, 4, 5, and 7 from the text, then we have three groups of adjacent sentences, {1, 2}, {4, 5}, {7}. We then iterate over these blocks in the natural order of the text. If one of the sentences in the current block involves a discourse entity (expressed by a constant) introduced in a block that was already considered, then the current block is viewed as connected to the earlier blocks, otherwise it is considered unconnected. We count the number u = u(a) of all unconnected blocks of witness sentences for the given answer a. The final answer selection score σ is computed from the basic score ϱ and the unconnected block count u as σ = ϱ · γ^(u−1), where γ = 0.7 in our experiments.
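A minimal sketch of this coherence factor follows. It assumes a simplified view in which a block counts as connected when it shares a discourse-entity constant with some earlier block; the data structures are illustrative only.

```python
# Illustrative sketch of the coherence factor from Sect. 2.9: witness sentences
# are grouped into blocks of adjacent sentences, blocks sharing no discourse
# entity with earlier blocks count as unconnected, and sigma = rho * gamma**(u-1).
GAMMA = 0.7

def adjacent_blocks(sentence_ids: list[int]) -> list[list[int]]:
    """Group sorted sentence numbers into runs without gaps, e.g.
    [1, 2, 4, 5, 7] -> [[1, 2], [4, 5], [7]]."""
    blocks: list[list[int]] = []
    for s in sorted(sentence_ids):
        if blocks and s == blocks[-1][-1] + 1:
            blocks[-1].append(s)
        else:
            blocks.append([s])
    return blocks

def unconnected_count(blocks: list[list[int]],
                      entities: dict[int, set[str]]) -> int:
    """entities maps a sentence number to the discourse-entity constants it
    mentions; a block is connected if it shares an entity with an earlier block."""
    u, seen = 0, set()
    for block in blocks:
        block_entities = set().union(*(entities.get(s, set()) for s in block))
        if not (block_entities & seen):
            u += 1
        seen |= block_entities
    return u

def final_score(rho: float, u: int) -> float:
    return rho * GAMMA ** (u - 1)

blocks = adjacent_blocks([1, 2, 4, 5, 7])
u = unconnected_count(blocks, {1: {"c1"}, 2: {"c1", "c2"},
                               4: {"c2"}, 5: {"c3"}, 7: {"c9"}})
print(final_score(0.8, u))   # u == 2 here: block [4, 5] is connected via c2
```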
2.10 Selection of the Best Answer, or NOA Decision

According to the QA4MRE guidelines, systems can either commit to one of the answer candidates for a given question, or they can refuse to answer the question, leaving it unanswered. Since the LogAnswer prototype developed for QA4MRE uses a purely logical validation approach, a no answer (NOA) response will always be generated if question processing fails (no full parse of the question or an empty question literal pattern). Moreover, the answer candidates for a question can only be evaluated if they admit the construction of a logical hypothesis (either directly from the answer parse, or indirectly by parsing a textual hypothesis), and if a non-empty set of answer literals is part of the generated hypothesis. Finally, the answer must come with valid provenance information (see Sect. 2.11), since the QA4MRE guidelines do not allow empty provenance data. If no answer candidate for a question fulfills these requirements, then a NOA decision is made. Otherwise the answer with the maximum score will be chosen, even if the score is so low that it indicates a poor quality of the validation, and even if there is no clear preference for one answer over another (i.e. even if there is no clear winner judging from the validation scores). We did not introduce a NOA threshold for validation scores due to the lack of development data.

2.11 Provenance Generation

The QA4MRE result specification requires systems to provide one or more so-called provenance elements for each answered question. These provenance elements are expected to provide an explanation that justifies the chosen answer. They can include sentences from the reading test, from documents in the QA4MRE background collection, or knowledge specific to the QA system.

The computation of provenance information is based on the results of the prover in the last proof of the relaxation cycle. The prover outputs the list of used axioms (i.e. used implicative rules) and the answer substitution (for the proved fragment of the query). The names of the used axioms are directly turned into provenance elements. Since these names are usually chosen in a meaningful way, showing the names of the axioms should be informative to users. Apart from the axiom names, it is important to reconstruct the used facts from the background knowledge and the relevant sentences of the reading test document itself. To this end, we first determine the list of all used facts (in our system, these facts are always variable-free, i.e. ground facts). We start with the literals from the instantiated hypothesis pattern and add all literals that occur in the premise of an instantiated used axiom. From this set we remove all literals that occur in the conclusion of any instantiated used axiom. The remaining literals are ground facts that occur either in the background knowledge of LogAnswer or in the logical representation of the reading test document.

Those facts that stem from the background knowledge of LogAnswer are easily identified since they are not contained in the representation of the reading test document. If a used fact is identified as stemming from the background knowledge, then it is directly turned into a provenance element. Since the arguments of such a fact are lexical concept identifiers and the predicate usually describes a lexical-semantic relationship (e.g. the relationship between a verb and its nominalization), the meaning of such a fact is self-explanatory.

For those used facts that are not background facts, the corresponding sentence from the reading test document should be included as provenance information, rather than the logical fact itself that was extracted from the sentence. For example, the fact SUB(c243, tsunami.1.1) represents a concrete tsunami mentioned in the text. Since a single sentence can cover a large number of such facts, it is preferable to present the sentence instead of the individual facts. In some cases there may be indeterminacy in the process of determining matching sentences, i.e. there can be several sentences whose meaning representation contains a considered fact. In order to find good configurations of witness sentences, we iterate over all used facts, starting with the facts with the smallest number of available witness sentences and proceeding to facts with more possible witness sentences. For the considered fact, we first check whether the fact is contained in one or more of the already chosen witness sentences. If yes, then the fact is already covered and no new witness sentence must be added. Otherwise we must choose a new witness sentence. To this end, we consider all witness sentences containing the considered fact. We then choose the sentence that covers the highest number of other used facts not covered by the witness sentences already chosen. In case of ties, we prefer the sentence that also has a higher total number of covered used facts (without the restriction to those used facts that still need justification). As the result of this process, we know a selection of witness sentences from the reading test document that together cover all considered used facts (a sketch of this greedy selection is given below). We thus add provenance elements for all of these sentences, in the order in which they occur in the text. The chosen witness sentences also affect the validation score of the considered answer, see Sect. 2.9.

The extraction of provenance information may fail in rare cases, resulting in empty provenance data. Since the task specification of QA4MRE requires at least one provenance element to be included for each non-NOA decision, the LogAnswer system will drop the best answer and resort to a NOA response instead if the provenance information is empty.
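The following sketch restates the greedy witness-sentence selection. The two mapping structures and the function name are illustrative assumptions, not the actual LogAnswer data model.

```python
# Illustrative sketch of the greedy witness-sentence selection (Sect. 2.11).
# `fact_to_sents` maps each used ground fact to the sentence numbers whose
# meaning representation contains it; `sent_to_facts` is the inverse view.

def select_witness_sentences(fact_to_sents: dict[str, set[int]],
                             sent_to_facts: dict[int, set[str]]) -> list[int]:
    chosen: set[int] = set()
    covered: set[str] = set()
    # Facts with fewer candidate sentences are handled first.
    for fact in sorted(fact_to_sents, key=lambda f: len(fact_to_sents[f])):
        if any(fact in sent_to_facts.get(s, set()) for s in chosen):
            covered.add(fact)
            continue  # already justified by a chosen sentence
        candidates = fact_to_sents[fact]
        # Prefer the sentence covering most still-uncovered used facts;
        # break ties by the total number of covered used facts.
        best = max(candidates,
                   key=lambda s: (len(sent_to_facts[s] - covered),
                                  len(sent_to_facts[s])))
        chosen.add(best)
        covered |= sent_to_facts[best] & set(fact_to_sents)
    return sorted(chosen)   # provenance elements in text order

facts = {"f1": {3}, "f2": {3, 5}, "f3": {5, 8}}
sents = {3: {"f1", "f2"}, 5: {"f2", "f3"}, 8: {"f3"}}
print(select_witness_sentences(facts, sents))   # [3, 5]
```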
2.12 Resources Used: OpenCyc Integration

Concerning the general background knowledge of LogAnswer and the resources used, we refer to [2]. In order to allow additional inferences, the existing background knowledge of LogAnswer was enhanced by knowledge from OpenCyc. To this end, we had to couple (part of) our concept system for German with the concept identifiers used in OpenCyc. We established this connection in two ways: First, we know corresponding English words for ca. 6,000 of our German lexemes, due to the evolving English lexicon HaEnLex, which is fully aligned with the German HaGenLex [10]. If an OpenCyc concept id was formed from the same word, or if it was linked to an English Wikipedia topic with the same name as the word of interest, then the OpenCyc concept was considered an alignment candidate for the original German lexeme. Alternatively, we used the following chain for generating alignment hypotheses: Starting from the German lexeme, we looked for matching topics or redirects in the German Wikipedia. Then the DBpedia (http://dbpedia.org) mapping from German to English topics was used, and from there the OpenCyc-Wikipedia link was used, again resulting in alignment candidates. Several tens of thousands of these alignment candidates were manually checked, and all subconcept-superconcept relationships were extracted (and mapped back into our German concept world) based on these manually verified alignments. After eliminating redundancy and inconsistencies, we ended up with ca. 20,000 subconcept-superconcept relationships and around 1,500 new synonymy relationships between German lexemes.

3 Experimental Results

3.1 System Configuration in the QA4MRE Runs

Two runs were submitted for the QA4MRE task for German, based on different configurations of the LogAnswer prototype. The loga1101dede run was subject to the restrictions of the QA4MRE guidelines on admissible resources for the first submitted run. Therefore, the synonym system used by LogAnswer in the first run was deliberately restricted to a very limited set of synonyms that only covers spelling variants but no synonyms proper (that is, the synonyms normally used for definition questions are used for all questions in the first run). The restriction was imposed since some of the synonyms stem from external resources (such as GermaNet), and external resources or ontologies may not be used in the first submitted run according to the guidelines. The rule-based background knowledge of LogAnswer was also filtered, by eliminating all lexical-semantic relations (such as subconcept-superconcept relationships) and by eliminating any rules expressing domain knowledge. We only kept general logical rules that specify the behaviour of the expressive means of MultiNet, and rules for exploiting simple temporal and local regularities. This filtering was made because rule bases and ontologies may not be used in the first run. We checked back with the QA4MRE organizers, and they confirmed that the general rules that were kept belong to the core functionality of the QA system, so that they were admissible in the first run.

The second run, loga1102dede, is not subject to the special restrictions on resources that the QA4MRE guidelines impose on the first run. So we activated the full knowledge used by LogAnswer, including the full set of synonyms and all other lexical-semantic facts and logical rules available in the system. Moreover, the OpenCyc integration was activated in the second run, resulting in about 20,000 additional subconcept-superconcept relationships to be utilized.

The two runs were generated with the following parameters: The initial proof depth for iterative deepening of the prover was set to 1 and the maximum proof depth to 2. Each single proof attempt in the relaxation process was allowed 2 seconds. Up to 5 relaxation steps were allowed for a given answer candidate.
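For reference, the two run configurations and the prover parameters just listed can be summarized as follows. The dataclass and its field names are our own illustration, not the actual LogAnswer configuration format.

```python
# Summary of the submitted run configurations and prover parameters (Sect. 3.1).
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    run_id: str
    synonyms: str                    # "spelling-variants-only" or "full"
    lexical_semantic_rules: bool     # subconcept/superconcept relations etc.
    opencyc_expansion: bool
    initial_proof_depth: int = 1
    max_proof_depth: int = 2
    proof_time_limit_s: float = 2.0
    max_relaxation_steps: int = 5

LOGA1101 = RunConfig("loga1101dede", "spelling-variants-only",
                     lexical_semantic_rules=False, opencyc_expansion=False)
LOGA1102 = RunConfig("loga1102dede", "full",
                     lexical_semantic_rules=True, opencyc_expansion=True)
```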
3.2 Results Achieved

The two LogAnswer runs differed in individual answer choices, but overall exactly the same scores were achieved: In both runs, LogAnswer committed to one of the answers for 88 out of the 120 questions, leaving 32 questions unanswered. Considering the answered questions, the chosen answer was correct in 21 cases. Thus, the overall accuracy was 0.18, and the accuracy for those questions that LogAnswer decided to answer was 0.24. The resulting c@1 score was 0.22. Compared to other systems, we find that LogAnswer scored slightly better than the average c@1 score of 0.21 over all participants. However, the result of LogAnswer cannot be satisfactory, since it is very close to random guessing (which would yield an expected c@1 score of 0.2), and the margin to the best system with a c@1 score of 0.57 is quite large.
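The reported figures can be reproduced from the raw counts with the c@1 measure used in QA4MRE, which credits unanswered questions in proportion to the accuracy on the answered ones; the small helper below is ours.

```python
# Reproducing the scores from Sect. 3.2 with the QA4MRE c@1 measure:
# c@1 = (n_correct + n_unanswered * n_correct / n_total) / n_total.
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    return (n_correct + n_unanswered * n_correct / n_total) / n_total

n_total, n_unanswered, n_correct = 120, 32, 21
print(n_correct / n_total)                       # 0.175 -> reported as 0.18
print(n_correct / (n_total - n_unanswered))      # ~0.239 -> reported as 0.24
print(c_at_1(n_correct, n_unanswered, n_total))  # ~0.222 -> reported as 0.22
```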
3.3 Error Analysis and Discussion

We start by pointing out a few problems related to the QA4MRE test set for German, then turn to the issues of the LogAnswer prototype developed for the QA4MRE task.

Poor Document Quality (Missing Blanks etc.) The documents in the German reading tests were of low encoding/formatting quality: There were many missing blanks (probably due to a systematic error when generating the test set), causing adjacent words to be wrongly merged into single tokens. In one case even the name of the speaker was wrongly merged with the next word, so that this essential piece of information for answering the questions was completely unavailable. Moreover, there was no structuring by blank lines at all, so that headlines were not clearly separated from the following body of regular text. The result of all this was a loss of useful information in the documents, a poor parsing rate, and the total failure of the coreference resolver (see below).

Non-Parseable Questions WOCADI found a full parse for 106 out of the 120 questions, which is quite satisfactory. We noticed that 3 out of the 14 problematic questions cannot be parsed due to spelling errors or wrong punctuation; these problems could have been avoided by a correction of the test set (a comma is missing in question 9 of reading test 3 and in question 8 of reading test 4; moreover, "das" in question 3 of reading test 7 is wrongly spelled "dass").

Parsing Rate for Answers We have already remarked in Sect. 2.7 that the syntactic diversity of the answers in the task poses problems to parsing and to hypothesis construction. LogAnswer is flexible in that it can construct logical hypotheses either directly from the question and answer parse, or alternatively from the parse of a textual hypothesis constructed from question and answer. However, it frequently happened that a non-parseable answer also caused the constructed textual hypothesis to parse poorly. These difficulties resulted in the generation of 31 NOA responses (another NOA response was due to failed provenance generation).

Let us now turn to the main issues of the system prototype that were revealed by the QA4MRE participation.

Failure of Coreference Resolution Due to the missing-blank issue of the documents, but also due to their length, the coreference resolver CORUDIS failed completely. It was not able to find a regular coreference resolution result (a so-called 'coreference partition') for any of the twelve reading test documents. While CORUDIS provided some fallback information, it was so sparse that it was practically useless. The failure of the coreference resolver means that no information at all is merged beyond the sentence level. Thus, document-level knowledge processing became virtually impossible.

Extraction of Provenance Information The extraction of provenance information worked correctly, although the results were often confusing because the struggling logical processing (and extreme relaxation) produced proofs of little explanatory value. In only one case, a correct answer of the LogAnswer system had to be dropped because the system was not able to extract any provenance information.

Missing Integration of OpenCyc Synonyms in the Second Run In general, several lexical concepts in our German concept space can be associated with the same OpenCyc concept; in this case, one can deduce a synonymy relationship between the German lexemes. As mentioned in Sect. 2.12, this mechanism resulted in the detection of about 1,500 new synonyms. Unfortunately, they were not included in the second LogAnswer run. So, by accident, about 1,500 possible links between German lexemes, and also the connection of these lexemes to the superconcepts mediated by OpenCyc, were lost.

4 Conclusions and Future Work

We have built a new LogAnswer prototype for the QA4MRE task. The system implements the main goals defined in the introduction, but it was clearly not yet mature at the time of the QA4MRE participation. Due to substantial problems with the quality of the reading test documents, and most importantly due to the total failure of the coreference resolver for these documents, it was impossible to successfully demonstrate the utility of the logic-oriented approach to answer validation in the task. In particular, the extension of the background knowledge of LogAnswer by additional subconcept-superconcept relationships from OpenCyc had no positive effect at all, since the qualitative requirements for using logical reasoning on the document level were obviously not met.

We did not try falling back to shallow linguistic methods (such as a lexical overlap measure) when logical processing fails. One reason for that was the lack of training data for QA4MRE. Moreover, the QA4MRE guidelines discourage the use of such lexical methods, pointing out that examples are chosen such that there is no close word-to-word relationship between hypothesis and answer passage. We conclude that it would have been best to complement answer selection by a criterion which does not assume a full semantic analysis of the documents, and which is also not based on lexical overlap. Examples are popularity or frequency criteria, for instance based on hits in the QA4MRE background collection, or based on the integration of a search engine (such as Google) for estimating hit counts.

Our attempt to work without the background collection can, for the moment, also be regarded as futile. However, it is possible that using an external collection (such as Wikipedia) will achieve similar effects as using the background collection. Moreover, the best way of utilizing the QA4MRE background collection is not obvious.

Looking at concrete questions, it is clear that question decomposition would have been helpful in many cases. Besides that, a decomposition of conjunctive answers (which express a conjunction of two or more subanswers) would also have been useful. (This should be much easier to achieve than question decomposition: the idea is to validate the individual conjuncts of a conjunctive answer separately, to use the minimum score of the conjuncts as the score of the total answer, and to merge the provenance elements of the individual answers.) An example of a conjunctive answer from the test set is "Edinburgh und Oslo" (Edinburgh and Oslo).
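The following sketch illustrates the conjunctive-answer decomposition just outlined: split the answer into conjuncts, validate each one separately, take the minimum score, and merge the provenance. The splitting heuristic and the validate() callback are assumptions; the callback stands in for the full hypothesis construction and relaxation proof of Sects. 2.7-2.11.

```python
# Illustrative sketch of conjunctive-answer decomposition (not implemented
# in the submitted runs; proposed as future work above).
import re
from typing import Callable

def split_conjuncts(answer: str) -> list[str]:
    # Split a conjunctive answer such as "Edinburgh und Oslo" into conjuncts.
    parts = re.split(r"\s+und\s+|\s*,\s*", answer.strip())
    return [p for p in parts if p]

def validate_conjunctive(answer: str,
                         validate: Callable[[str], tuple[float, list[str]]]
                         ) -> tuple[float, list[str]]:
    """validate(conjunct) is assumed to return (score, provenance_elements)."""
    scores, provenance = [], []
    for conjunct in split_conjuncts(answer):
        score, prov = validate(conjunct)
        scores.append(score)
        provenance.extend(prov)
    return min(scores), provenance

# Toy usage with a fake per-conjunct validator:
fake = lambda c: (0.6 if c == "Edinburgh" else 0.4, ["witness for " + c])
print(validate_conjunctive("Edinburgh und Oslo", fake))  # min score 0.4, merged provenance
```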
Knowing basic biographic data about a person would also have been useful, especially name variants (e.g. the full name of a person, a stage name or a pen name). We will account for this kind of information in the further development of LogAnswer. Our specific goal is a coupling with DBpedia, which could often have provided the required information in the QA4MRE task. The success of LogAnswer in the new task is important to us, since QA4MRE (as opposed to the earlier QA@CLEF tasks, in our opinion) contains questions of realistic difficulty, where the way in which the question and the text are phrased is not artificially similar. Therefore we hope that the envisioned changes to LogAnswer will also help us to improve our success in the application domain of our choice (i.e., providing QA support in human question answering portals).

Apart from the mentioned deficiencies of implementation and methods, the lack of a development set for German was also a problem. Without a development set, it was not possible to train validation models using techniques from machine learning (and due to the very different characteristics of QA4MRE compared to the earlier ResPubliQA or QA@CLEF tasks, a reuse of existing models seemed pointless). Finally, it was not possible to establish an optimal threshold for the NOA decision, so that no threshold at all was applied in our submitted runs. The basis for solving these problems is much better now, since experimental data from the first round of QA4MRE are available.

References

1. Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G., Kurimo, M., Mandl, T., Peñas, A., Petras, V., eds.: Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, Revised Selected Papers. Springer, Heidelberg (2009)
2. Furbach, U., Glöckner, I., Pelzer, B.: An application of automated reasoning in natural language question answering. AI Communications 23(2-3) (2010) 241–265
3. Glöckner, I., Pelzer, B.: Combining logic and machine learning for answering questions. In: [1]
4. Glöckner, I., Pelzer, B.: The LogAnswer project at CLEF 2009. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece (September 2009)
5. Glöckner, I., Pelzer, B.: The LogAnswer project at ResPubliQA 2010. In: CLEF 2010 Working Notes (September 2010)
6. Pelzer, B., Glöckner, I., Dong, T.: LogAnswer in question answering forums. In: 3rd International Conference on Agents and Artificial Intelligence (ICAART 2011), SciTePress (2011) 492–497
7. Dong, T., Furbach, U., Glöckner, I., Pelzer, B.: A natural language question answering system as a participant in human Q&A portals. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain (July 2011) 2430–2435
8. Hartrumpf, S.: Hybrid Disambiguation in Natural Language Analysis. Der Andere Verlag, Osnabrück, Germany (2003)
9. Helbig, H.: Knowledge Representation and the Semantics of Natural Language. Springer, Berlin (2006)
10. Hartrumpf, S., Helbig, H., Osswald, R.: The semantically based computer lexicon HaGenLex. Traitement automatique des langues 44(2) (2003) 81–105
11. Glöckner, I.: University of Hagen at QA@CLEF 2007: Answer validation exercise. In: Working Notes for the CLEF 2007 Workshop, Budapest (2007)