The LogAnswer Project at CLEF 2009

Ingo Glöckner1 and Björn Pelzer2

1 Intelligent Information and Communication Systems Group (IICS), University of Hagen, 59084 Hagen, Germany, ingo.gloeckner@fernuni-hagen.de
2 Department of Computer Science, Artificial Intelligence Research Group, University of Koblenz-Landau, Universitätsstr. 1, 56070 Koblenz, bpelzer@uni-koblenz.de

Abstract

The LogAnswer system, a research prototype of a question answering (QA) system for German, participates in QA@CLEF for the second time. The ResPubliQA task was chosen for evaluating the results of the general consolidation of the system and improvements concerning robustness and the processing of administrative language. LogAnswer uses a machine learning (ML) approach based on rank-optimizing decision trees for integrating logic-based and shallow (lexical) validation features. The paragraph with the highest rank is then chosen as the answer to the question. For ResPubliQA, LogAnswer was adjusted to the specifics of administrative documents, as found in the JRC Acquis corpus. In order to account for the low parsing rate for administrative texts, indexing, answer type recognition, and all validation features were extended to sentences with a failed parse. Moreover, support for questions that ask for a purpose, reason, or procedure was added. Compared to the first prototype of LogAnswer that participated in QA@CLEF 2008, there were no major changes in the resources employed. We have utilized the Eurovoc thesaurus for extracting definitions of abbreviations and acronyms, but this knowledge was not activated by the questions in the ResPubliQA test set. Two runs were submitted to ResPubliQA: the first run was obtained from the standard configuration of LogAnswer with full logic-based processing of results, while the second run was generated with the prover switched off. It simulates the performance of the system when all retrieved passages have a failed parse. The results obtained for the two runs were almost identical. Given that our parser for German has generated a useful logical representation for less than 30% of the sentences in the JRC Acquis corpus, it is not surprising that logical processing had a minor effect. A systematic analysis of the results of LogAnswer for the different question categories revealed an unfavorable decision in the processing of definition questions that will now be fixed. Moreover, questions asking for a procedure proved difficult to answer. On the positive side, the results of LogAnswer were particularly convincing for factoid questions and for questions that ask for reasons. With an accuracy of 0.40 and a c@1 score of 0.44, LogAnswer also outperformed the two official ResPubliQA baselines for German.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search Process, Selection process; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods—Predicate Logic, Semantic networks; I.2.7 [Artificial Intelligence]: Natural Language Processing

General Terms

Experimentation, Measurement, Verification

Keywords

Logical Question Answering, Questions beyond Factoids, Passage Reranking, Robust Inference

1 Introduction

The goal of the LogAnswer project1 is to further research into logic-based question answering.
Emphasis is placed on the problem of achieving acceptable response times in the logical QA framework, and on the problem of ensuring stable results despite the brittleness of a deep linguistic analysis and of logical reasoning. An early prototype of the LogAnswer QA system that evolved from this research took part in QA@CLEF 2008. After consolidation and improvement, LogAnswer now takes part in the CLEF QA systems evaluation for the second time. The ResPubliQA2 task was chosen for evaluating LogAnswer since GikiCLEF3 requires special geographic knowledge not available to LogAnswer, while the third QA system evaluation, QAST4, is not available for German. Therefore, only ResPubliQA provided a suitable testbed for evaluating our system.

1 Funding of this work by the DFG (Deutsche Forschungsgemeinschaft) under contracts FU 263/12-1 and HE 2847/10-1 (LogAnswer) is gratefully acknowledged.
2 http://celct.isti.cnr.it/ResPubliQA/
3 Cross-language Geographic Information Retrieval from Wikipedia, see http://www.linguateca.pt/GikiCLEF/
4 QA on Speech Transcripts, see http://www.lsi.upc.edu/~qast/2009/
5 see http://langtech.jrc.it/JRC-Acquis.html

The ResPubliQA question answering task is based on the JRC Acquis5 corpus of documents related to EU administration. As opposed to earlier QA@CLEF tasks, ResPubliQA does not require the extraction of exact answers from retrieved paragraphs. But ResPubliQA introduces new difficulties that make it a demanding task for the LogAnswer system:

• The JRC corpus is characterized by administrative language. The texts are syntactically complex and contain special structures (such as references to sections of regulations, and very long enumerations of items) that are difficult to analyze syntactically. Since logical processing of questions in LogAnswer depends crucially on the success of syntactic-semantic analysis, it was important to adjust the parser to the documents found in the JRC Acquis corpus. Moreover, LogAnswer had to be equipped with fallback methods in order to ensure a graceful degradation of results if linguistic analysis or the logic-based processing of the question fails.

• Compared to earlier QA@CLEF tasks, ResPubliQA also brings significant changes with respect to the considered types of questions. There are three new question categories (PURPOSE, PROCEDURE, REASON). Moreover, even for the familiar FACTOID category, there is a shift from simple questions asking for entities with a known type (like PERSON, LOCATION) to more general questions (for example, questions asking for preconditions) with answer type OTHER in earlier QA@CLEF terminology. The LogAnswer system had to be extended to recognize these types of questions in question classification and to find suitable paragraphs in the texts.

• ResPubliQA now expects a whole paragraph to be returned as the answer to a question. While answer selection in LogAnswer used to be strictly sentence-oriented, the emphasis of ResPubliQA on answer paragraphs made it necessary to consider information from several sentences in a paragraph.

• ResPubliQA introduces the c@1 score as the primary evaluation criterion. Apart from the number of correct results found by the system, the c@1 score also takes the quality of validation into account. The QA system must thus be able to recognize bad answers, and if in doubt, prefer to show no response. To this end, LogAnswer computes a quality score based on a large number of features (including logical validation).
For ResPubliQA, a suitable threshold had to be found such that cutting off answers with a quality score below this threshold results in an increase of the c@1 metric.

Apart from the general consolidation of the prototype, LogAnswer was extended for ResPubliQA in order to meet these requirements. However, any customization to idiosyncratic aspects of the JRC corpus (like the special style of expressing definitions found in regulations) was deliberately avoided, in favor of developing generic methods that are useful for other corpora as well. The overall goal of our participation in QA@CLEF was to evaluate the success of the measures taken to improve the LogAnswer system; this includes the refinements of LogAnswer based on our experience from QA@CLEF 2008, and also the extensions specifically developed for ResPubliQA (such as support for REASON or PURPOSE questions) that enhance the question answering capabilities of the prototype.

In the paper, we first introduce the LogAnswer system. Since the architecture of LogAnswer and many details of specific solutions are already published elsewhere [1, 2, 3, 5, 6], we focus on a description of the improvements and extensions that were made for ResPubliQA. We then discuss the results obtained by LogAnswer, including a detailed analysis of the strengths and weaknesses of the system and of typical problems that were encountered. In particular, we assess the effectiveness of the measures taken to prepare LogAnswer for the ResPubliQA task.

2 System Description

2.1 Overview of the LogAnswer System

LogAnswer is a question answering system that uses logical reasoning for validating possible answer passages and for identifying the actual answer in these passages. To this end, the documents in the corpus are translated into logical form. The system then tries to prove the logical representation of the question from the logical representation of the answer passage to be validated and from its general background knowledge. In order to gain robustness against gaps in the background knowledge and other sources of errors, the prover is embedded in a relaxation loop that gradually skips non-provable literals until a proof of the simplified fragment of the query succeeds. Provided that the validation score of the checked text passage is lowered accordingly, this mechanism helps to achieve a graceful degradation of result quality in the case of errors. In addition to the relaxation loop, the logical validation is complemented with so-called 'shallow' linguistic criteria (like the degree of lexical overlap between question and answer passage) in order to gain more robustness. A machine learning (ML) approach integrates the resulting criteria for result quality and generates a local score that judges the quality of the considered support passage (and possibly also the quality of the extracted exact answer string if such extraction takes place). In the event that a given answer is supported by several support passages, the evidence provided by the diverse support passages is aggregated in order to improve the ranking of the answer. This aggregation mechanism, which proved effective in [4], was retained for ResPubliQA so that the system could benefit from aggregation. Since ResPubliQA only requires answer paragraphs and no further answer extraction, extracted precise answer strings were dropped after aggregation, and only the best validated paragraph for each question (according to the aggregated evidence) was included in the ResPubliQA result.
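To make the relaxation loop more concrete, the following minimal sketch shows the control structure described above. It is not the actual LogAnswer implementation: prove stands in for a call to a theorem prover such as E-KRHyper, and choose_literal_to_skip is a hypothetical heuristic for picking the literal to drop next.

```python
# Minimal sketch of the relaxation loop (not the actual LogAnswer code).
# `prove` and `choose_literal_to_skip` are hypothetical stand-ins.

def relaxed_proof(query_literals, passage_clauses, background,
                  prove, choose_literal_to_skip, max_skips=3):
    """Try to prove the query from the passage plus background knowledge.

    On failure, skip one non-provable literal and retry, until a proof of
    the simplified query succeeds or the relaxation budget is spent.
    Returns (bindings_or_None, skipped_literals).
    """
    remaining = list(query_literals)
    skipped = []
    while True:
        bindings = prove(remaining, passage_clauses + background)
        if bindings is not None:            # proof of the (relaxed) query found
            return bindings, skipped
        if not remaining or len(skipped) >= max_skips:
            return None, skipped            # give up: too much relaxation
        literal = choose_literal_to_skip(remaining, passage_clauses)
        remaining.remove(literal)           # relax the query by one literal
        skipped.append(literal)


def relaxation_feature(skipped, query_literals):
    """Example validation feature: fraction of the query that survived."""
    total = len(query_literals)
    return 1.0 - len(skipped) / total if total else 0.0
```

In such a setup, the number of skipped literals enters the validation features, so that heavily relaxed proofs lower the score of the corresponding passage.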
Despite its use of logical knowledge processing, the response times of LogAnswer are in the order of a few seconds. This is possible because the time-consuming linguistic analysis of all documents is performed prior to indexing. Therefore the retrieval module can immediately provide all retrieved answer passages together with their pre-computed semantic analysis (in the form of a MultiNet [8]). Before logical processing of any retrieved passage starts, LogAnswer computes a set of shallow linguistic features that can be assessed very quickly. These shallow features are utilized by an ML approach for a first ranking of the retrieved passages. Depending on the specified time limit for answering the question, only the best passages according to this ranking are subjected to further logical processing and validation. Exact answer strings are extracted from the variable bindings determined by proving the question representation from the representation of the support passage. Therefore every extracted precise answer is already logically validated. Compared to the usual generate-and-test paradigm of extracting a large number of potential answer strings that are then validated, this approach offers a great efficiency advantage.

The general architecture of LogAnswer was presented in [1]. Some experiments concerning robustness are described in [3, 5]. Details on the E-KRHyper prover used by LogAnswer can be found in [9]. The current state of the system, including optimizations of the prover, the current set of features used for ranking candidates, and the ML technique used for learning the ranking, is described in [2].

2.2 Improvements of Document Analysis and Indexing

In the following, we describe the improvements and extensions of the LogAnswer prototype that were added for ResPubliQA 2009. We begin by explaining changes related to document analysis and indexing. Following that, we detail the changes that affect question processing.

Optimization of the WOCADI Parser for Administrative Language

LogAnswer uses the WOCADI parser [7] for a deep syntactic-semantic analysis of documents and questions. As shown by Table 1, the administrative language found in the JRC Acquis corpus poses severe problems to the parser: while more than half of the sentences in the German Wikipedia and in the German news corpora of the earlier CLEF evaluations are assigned a full parse, this number decreases to 26.2% for JRC Acquis. A similar picture arises if we consider all sentences that have at least a partial parse (see the 'partial parse' column in Table 1).

Table 1: Parsing rate of WOCADI for the JRC corpus before and after robustness enhancements and adjustment to administrative German. Results for the German CLEF news corpus and Wikipedia are shown for comparison. The 'partial parse' column includes both full and chunk parses.

Corpus                          full parse   partial parse
JRC-Acquis (original parser)    26.2%        54.0%
JRC-Acquis (adjusted parser)    29.5%        57.0%
CLEF News                       51.8%        84.1%
Wikipedia (Nov. 2006)           50.2%        78.6%

Several steps were taken in order to increase the parsing rate (and thus obtain logical representations for more sentences). First of all, an umlaut reconstruction technique was added – we noticed that the German umlaut characters ä, ö, ü and also the character ß were often expanded into ae, oe, ue and ss in the texts, which resulted in parsing errors. Another observation was that many sentences in the corpus are fully capitalized.
In order to improve the parsing rate for these sentences, the case sensitivity of the parser was switched off for sentences where the proportion of fully capitalized words exceeds a certain threshold.

Another problem is posed by the complex names of regulations, resolutions etc. that abound in administrative texts. An example of such a regulation name is "(EWG) Nr. 1408/71 [3]". References to such legal documents and specific sections thereof (e.g. paragraphs, articles) are highly domain specific and difficult to analyze for a general-purpose parser. In order to improve the quality of parsing results, the texts from the JRC corpus were subjected to a preprocessing step that recognizes complex names of legal documents.6 To this end, we employed an n-gram model considering the current token and up to three previous tokens. To account for data sparsity, the total probability of a token belonging to a section or not is estimated by a log-linear model summing up the logarithmic probabilities for unigrams, bigrams, trigrams and four-grams. The semantic representation of the complex name is then filled into the parsing result.

6 Many thanks to Tim vor der Brück for developing and training the recognizer for legal document names, and to Sven Hartrumpf for adjustments and extensions of the WOCADI parser.

As shown by the 'adjusted parser' row in Table 1, the various changes and adjustments of the WOCADI parser achieved a relative gain of 12.6% in the rate of full parses, and of 5.6% for partial parses. Even with these improvements, only 29.5% of the sentences in the JRC Acquis corpus are assigned a full parse (and thus a useful logical representation for the prover of LogAnswer to operate on). This made it very clear that, for ResPubliQA, LogAnswer had to be extended with techniques for handling non-parseable sentences.

Indexing Sentences with a Failed Parse

In the first LogAnswer prototype, only sentences with a full parse were indexed. Clearly such an approach is not possible for JRC Acquis since too much content would be lost. We therefore allowed sentences with a failed or poor parse to be indexed as well. This was a non-trivial task since LogAnswer not only indexes the lexical concepts that occur in a sentence. As described in [6], the system also indexes the possible answer types found in the sentences. Since the existing solution for extracting answer types was specialized to sentences with a full parse, it had to be complemented with fallback methods that can recognize expressions of the interesting types in arbitrary sentences. Based on trigger words, regular expressions, and a custom LALR grammar for recognizing numeric expressions, temporal expressions, and measurements, the system can now reliably judge whether a sentence from one of the documents contains one of the answer types of interest and index the sentence accordingly. A parse of the sentence is no longer required for answer type extraction.

We also tried to complement the special treatment of regulation names described above by a method that helps for non-parseable sentences. To this end, the tokenization computed by the WOCADI parser was enriched by the results of two additional tokenizers: the GermanAnalyzer of Lucene, and a special tokenizer for recognizing email addresses and URLs. In those cases where a token found by these special tokenizers is not contained in a token found by WOCADI, it was additionally used for indexing.
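As an illustration of the fallback answer type recognition described above, the sketch below uses regular expressions and trigger words to decide whether a sentence with a failed parse should be indexed as containing a temporal, measurement, or reason expression. The patterns and trigger lists are simplified examples, not the ones actually used in LogAnswer (which additionally relies on a LALR grammar).

```python
# Simplified sketch of the fallback answer-type recognition for sentences
# without a parse. Patterns and trigger words are illustrative only.
import re

TEMPORAL_PATTERN = re.compile(
    r"\b(\d{1,2}\.\s?(Januar|Februar|März|April|Mai|Juni|Juli|August|"
    r"September|Oktober|November|Dezember)\s?\d{4}|\d{4})\b")
MEASURE_PATTERN = re.compile(
    r"\b\d+(?:[.,]\d+)?\s?(?:kg|km|ha|EUR|ECU|%)(?!\w)")
REASON_TRIGGERS = {"weil", "da", "aufgrund", "Grund", "Begründung"}

def fallback_answer_types(sentence: str) -> set:
    """Return the set of answer type markers to index for an unparsed sentence."""
    types = set()
    if TEMPORAL_PATTERN.search(sentence):
        types.add("TIME")
    if MEASURE_PATTERN.search(sentence):
        types.add("MEASURE")
    tokens = set(re.findall(r"\w+", sentence))
    if tokens & REASON_TRIGGERS:
        types.add("REASON")
    return types

# Example: this sentence would be indexed with the TIME and MEASURE markers.
print(fallback_answer_types("Ab dem 1. Januar 2004 gilt ein Zollsatz von 12,5 %."))
```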
Support for New Question Categories

Support for the new question types PROCEDURE, PURPOSE, and REASON has also been added to LogAnswer. For that purpose, trigger words (and sometimes more complex patterns applied to the morpho-lexical analysis of the sentences) were formulated. They are used for recognizing sentences that describe methods, procedures, reasons, purposes, or goals. If the presence of one of the new answer types is detected in a sentence, then the answer type is also indexed for that sentence. Based on the answer type indexing, LogAnswer can systematically retrieve sentences of the intended type, which helps focus retrieval on the most promising sentences.

Apart from supporting the new question categories, the treatment of questions of the familiar types has also been improved. For example, we have experimented with the use of the Eurovoc7 thesaurus, by indexing all sentences that contain a known abbreviation from Eurovoc and its definition with a special ABBREV marker. Including this kind of knowledge had no effect on the ResPubliQA results, however, since there was no question involving an abbreviation from Eurovoc in the test set.

7 http://europa.eu/eurovoc/

Beyond Indexing Individual Sentences

One novel aspect of ResPubliQA was the requirement to submit answers in the form of full paragraphs. This suggests using retrieval on the paragraph level, or at least including some information beyond the sentence level so that questions can still be answered when the relevant information is scattered over several sentences. While LogAnswer was originally based on sentence-level indexing, we have now added an alternative paragraph-level index and also a document-level index. Moreover, a special treatment for anaphoric pronouns has been implemented. The WOCADI parser used by LogAnswer also includes a coreference resolver (CORUDIS, see [7]). Whenever CORUDIS establishes an antecedent for a pronoun, the description of the antecedent is used for enriching the description of the considered sentence in the index. For example, if the sentence to be indexed contains an occurrence of the pronoun 'sie' that refers to 'Bundesrepublik Deutschland', then 'Bundesrepublik' and 'Deutschland' are also added to the index. Moreover, the sentence is tagged as containing an expression that corresponds to the name of a country.

2.3 Improvements of Question Processing

In the following, we describe the changes to LogAnswer that affect the processing of a given question.

Improved Syntactic-Semantic Parsing of the Question

The linguistic analysis of the question obviously profits from the adjustments of the WOCADI parser to administrative texts as well. In particular, references to (sections of) legal documents in a question are treated in a way consistent with the treatment of these constructions in the corresponding answer paragraphs. Similarly, the additional tokenizers used for segmenting the texts are also applied to the question in order to generate a matching retrieval query.

Refinement of Question Classification

The question classification of LogAnswer was extended to recognize the new question categories PROCEDURE, REASON, and PURPOSE introduced by ResPubliQA. Rules that cover some special cases of factoid questions (e.g. questions asking for a theme/topic and questions asking for preconditions/modalities) were also added. Moreover, the improvement of the question classification involved the inclusion of new rules for existing question types.
For example, LogAnswer now supports additional ways of expressing definition questions. Overall, the number of classification rules increased from 127 to 165. The refinement of the question classification rules was based on a total of 1285 test cases, including translations of all questions from the ResPubliQA 2009 development set. Note that there was no time for adapting the background knowledge of LogAnswer to the new question types (for example by adding logical rules that link various ways of expressing reasons or purposes). Thus the only effect of the new classification rules is the recognition of the expected answer type, and the possible elimination of expressions like 'Warum' (why) or 'Was ist der Grund' (What is the reason) that do not contribute anything to the meaning of the question beyond specifying the question category. The resulting core query and the expected answer type then form the basis for retrieving potential answer paragraphs.

Querying the Enriched Index

The retrieval step profits from all improvements described in the subsection on document analysis and indexing. Since many of the validation features used by LogAnswer are still sentence-based, the sentence-level index was queried for each question in order to fetch the logical representation of 100 candidate sentences.8 For experiments on the effect of paragraph-level and document-level indexing, the 200 best paragraphs and the 200 best documents for each question were also retrieved.

8 LogAnswer is normally configured to retrieve 200 candidate sentences, but in the ResPubliQA runs only 100 were retrieved by mistake.

Changes in the Computed Features

In the experiments, the validation features already described in [2] were used, with some refinements concerning the way in which the features are computed. In particular, the descriptors and the found answer types provided by the coreference resolution of pronouns are now included in features that depend on the matching of descriptors or of the expected vs. found answer types. Moreover, the features have been generalized to retrieved sentences with a failed parse.

Improved Estimation of Validation Scores

One of the main lessons from QA@CLEF 2008 concerning the first LogAnswer prototype was the inadequacy of the earlier ML approach for determining validation scores. After analysing the problem, we came up with a new solution based on rank-optimizing decision trees; see [2] for a description of the new method and some experimental results. As observed in [6], switching from the earlier ML approach to the new models yielded a 50% gain in the accuracy of LogAnswer on the QA@CLEF 2008 test set for German. The same models based on kMRR-optimizing decision trees for k = 3 were also used for generating the ResPubliQA runs of LogAnswer.9 The resulting evaluation scores based on the evidence from individual sentences are then aggregated as described in [4].

Optimization of the c@1 Score

The main evaluation metric of ResPubliQA, the c@1 score10, rewards QA systems that validate their answers and prefer not answering over presenting a wrong answer.
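For reference, c@1 gives full credit for correct answers and partial credit for unanswered questions, proportional to the accuracy on the answered ones. The sketch below computes the score and shows how an acceptance threshold could be tuned on a development set; it is an illustration of the procedure described next, not the actual LogAnswer implementation.

```python
# Sketch of c@1 and of threshold selection on a development set
# (illustrative only; helper names are not taken from LogAnswer).

def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1 = (n_correct + n_unanswered * n_correct / n_total) / n_total."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total

def tune_threshold(dev_results, candidate_thresholds):
    """Pick the acceptance threshold that maximizes c@1 on the dev set.

    dev_results: list of (validation_score, is_correct) pairs, one per
    question, for the top-ranked candidate paragraph."""
    n = len(dev_results)
    best_theta, best_score = None, -1.0
    for theta in candidate_thresholds:
        answered = [(s, ok) for s, ok in dev_results if s >= theta]
        n_correct = sum(1 for _, ok in answered if ok)
        n_unanswered = n - len(answered)
        score = c_at_1(n_correct, n_unanswered, n)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score

# Example usage: sweep thresholds between 0.00 and 0.50 in steps of 0.01.
# tune_threshold(dev_results, [i / 100 for i in range(0, 51)])
```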
In order to push the c@1 score of LogAnswer, a threshold was applied to the validation score of the best answer paragraph. The idea is that results with a low validation score should rather be dropped, since their probability of being correct is so low that showing these results would reduce the c@1 score of LogAnswer. The threshold for accepting the best answer, or refusing to answer if the aggregated score falls below the threshold, was chosen such as to optimize the c@1 score of LogAnswer on the ResPubliQA development set. To this end, the ResPubliQA 2009 development set was translated into German, and LogAnswer was run on the translated questions. The subsequent determination of the optimum threshold resulted in θ = 0.08 being chosen, achieving a c@1 score of 0.58 on the training set.11 Once a retrieved sentence with top rank is evaluated better than the acceptance threshold, the corresponding paragraph that contains the sentence is determined and returned as the final result of LogAnswer for the question of interest.

9 Note that these models were obtained from a training set with annotations of LogAnswer results for the QA@CLEF 2007 and 2008 questions. We did not try to learn special models for ResPubliQA based on the ResPubliQA development set, since annotating results from the JRC Acquis corpus seemed too difficult and tedious for a non-expert of EU administration.
10 see the official ResPubliQA guidelines at http://celct.isti.cnr.it/ResPubliQA/resources/guideLinesDoc/ResPubliQA_09_Final_Track_Guidelines_UPDATED-20-05.pdf
11 This result cannot be directly projected to the ResPubliQA test set, since the development set formed the basis for refining the question classification.

Adjustments of Resources and Background Knowledge

Compared to QA@CLEF 2008, there were few changes to the background knowledge of LogAnswer (see [2, 6]). Only 150 new synonyms were added. Apart from that, the logical rules and lexical-semantic relations that form the background knowledge of LogAnswer were kept stable. We have formalized a system of logical rules for treating idiomatic expressions and support verb constructions, but this extension was not yet integrated at the time of the ResPubliQA evaluation.

3 Results on the ResPubliQA 2009 Test Set for German

The results of LogAnswer in ResPubliQA 2009 and the results of the two official baseline runs are shown in Table 2.

Table 2: Results of LogAnswer in ResPubliQA. Note that #right cand. is the number of correct paragraphs on top-1 position before applying the acceptance threshold, and accuracy = #right cand./#questions.

run          #right cand.   accuracy   c@1 score
loga091dede  202            0.40       0.44
loga092dede  199            0.40       0.44
base091dede  174            0.35       0.35
base092dede  189            0.38       0.38

The first run, loga091dede, used the standard configuration of LogAnswer as described in the previous section, including the use of the logic prover for computing logic-based features. In the second run, loga092dede, the prover was deliberately switched off and only the 'shallow' features that do not depend on the results of logical processing were used for validation. The second run thus demonstrates the fallback performance of LogAnswer when no logic-based processing is possible. Both runs are based on the same results of the retrieval module using the sentence-level index. Considering the number of questions with a correct candidate paragraph on top position, the logic-based run loga091dede performed best, followed by the shallow LogAnswer run and then the baseline runs base092dede and base091dede.
According to a quick comparison of the runs using McNemar's test, both LogAnswer runs are significantly better than base091dede with respect to #right cand. (p < 0.05), while the difference with respect to base092dede is not significant. On the other hand, LogAnswer clearly outperforms both baselines with respect to the c@1 score of ResPubliQA, which also takes validation quality into account.

4 Error Analysis and Discussion

4.1 Strengths and Weaknesses of LogAnswer

In order to get a general impression of the strong and weak points of LogAnswer, we have prepared a breakdown of results by question category. As shown in Table 3, LogAnswer performed particularly well for FACTOID and REASON questions, with results clearly better than the average accuracy of 0.40 of both runs. The new PURPOSE question type performed only slightly worse than average. However, for PROCEDURE and DEFINITION questions the results are not satisfactory.

Table 3: Accuracy by question category (number of questions per category in parentheses)

Run          DEFINITION (95)  FACTOID (139)  PROCEDURE (79)  PURPOSE (94)  REASON (93)
loga091dede  0.168            0.547          0.291           0.362         0.570
loga092dede  0.137            0.554          0.291           0.362         0.559

There are several reasons for the disappointing results of LogAnswer for definition questions. First of all, LogAnswer is known to perform better for factoid questions anyway. This is because the training set used for learning the validation model of LogAnswer contains annotated results of LogAnswer for the QA@CLEF 2007 and QA@CLEF 2008 questions. These question sets include too few definition questions to allow successful application of our machine learning technique. As a result, the model for factoids (which was also used for questions of the new ResPubliQA types) is much better than the validation model used for definition questions.

Another factor is the distinction between definitions proper and references to definitions. It is quite common in regulations to define a concept by reference to a certain other document where the relevant definition can be found. An example is "Dauergrünland":

"Dauergrünland" im Sinne von Artikel 2 Absatz 2 der Verordnung (EG) Nr. 795/2004 der Kommission. ("Permanent pasture" shall mean "permanent pasture" within the meaning of Article 2 point (2) of Commission Regulation (EC) No 795/2004)

Since in regulations, ordinary definitions and definitions by reference serve the same purpose, it was not clear to us that definitions by reference would not be accepted as answers to a definition question. LogAnswer did not filter out such references to definitions, which resulted in several wrong answers.

The most important cause of failure with respect to definition questions, however, was the way in which definitions are expressed in the JRC corpus. A typical definition in a regulation looks like this:

Hopfenpulver: Das durch Mahlen des Hopfens gewonnene Erzeugnis, das alle natürlichen Bestandteile des Hopfens enthält. (Hop powder: the product obtained by milling the hops, containing all the natural elements thereof)

This domain-specific style of expressing definitions was not systematically recognized by LogAnswer. This had catastrophic consequences for the results of LogAnswer on definition questions, because the retrieval queries for definition questions are expressed in such a way that only sentences containing a recognized definition are returned.
Therefore many of the definitions of interest were skipped altogether, simply because this particular way of defining a concept was not recognized as expressing a definition. The obvious solution is to make the requirement that retrieved sentences contain a recognized definition an optional rather than obligatory part of the retrieval query. In addition, more ways of expressing definitions should be recognized.

The poor performance of LogAnswer for PROCEDURE questions reflects the difficulty of recognizing sentences that express procedures in the documents, as needed for guiding retrieval to the relevant sentences. Compared to the recognition of sentences that express reasons or purposes, we found this task much harder for procedures. It also happened several times that LogAnswer returned a reference to a procedure instead of the description of the procedure itself as an answer. Since we did not anticipate that this kind of result would be judged incorrect, we did not add a filter that eliminates such answers by reference.

A breakdown of the results for FACTOID questions by their expected answer type is shown in Table 4. Questions for country names were classified either LOCATION or ORG(ANIZATION), depending on the question. Questions of the OTHER and OBJECT types were lumped together since LogAnswer does not internally distinguish these types. Due to the small numbers of questions for some of the answer types, it is hard to interpret the results, but it appears that LogAnswer worked especially well for ORGANIZATION and LOCATION questions.

Table 4: Accuracy by expected answer type for the FACTOID category (number of questions per type in parentheses)

Run          COUNT (3)  LOCATION (8)  MEASURE (16)  ORG (14)  OTHER (80)  PERSON (3)  TIME (16)
loga091dede  0.33       0.75          0.56          0.71      0.51        1.00        0.44
loga092dede  0.33       1.00          0.56          0.71      0.50        1.00        0.44

4.2 Effectiveness of Individual Improvements

Success of Linguistic Analysis

We have already shown in Table 1 how the improvements of WOCADI have affected the parse rate for documents in the JRC corpus. But the number of sentences in the corpus with a full parse (or even a partial parse) is still low, and this has motivated our focus on developing fallback solutions for LogAnswer that also work for non-parseable sentences. Fortunately, the questions in the ResPubliQA test set for German were much easier to parse than the administrative documents in the JRC corpus: the WOCADI parser was able to generate a full parse for 450 questions and a chunk parse for 32 questions, so the full parse rate was 90% and the partial parse rate (including chunk parses) was 96.4%. Thus, the success rate of linguistic analysis for the questions in the ResPubliQA test set was very high. This is important since the question classification depends on the availability of a parse of the question. Note that 17 questions in the test set contain typographic or grammatical errors. The full parse rate for these ill-formed questions was only 48% and the partial parse rate was 65%, which demonstrates a clear negative effect of such errors on the success of parsing.

Recognition of References to Legal Documents

The ResPubliQA test set contained 15 questions with references to legal documents that should be found by our n-gram based recognizer for such references. In fact, 13 of these expressions were correctly identified, while two expressions were not recognized due to gaps in the training data. The ResPubliQA test set further contained four questions with sloppy, abbreviated references to regulations, e.g. question 146, 'Warum sollte 821/68 aufgenommen werden?' (Why should 821/68 be adopted?). Obviously the interpretation of 821/68 as a reference to a regulation is highly domain specific.
Since LogAnswer is supposed to work in arbitrary domains, it cannot be expected to treat this case correctly. However, apart from such abbreviated references, which demand a special solution limited to JRC Acquis, the recognition rate of LogAnswer for references to legal documents was satisfactory. The positive effect of a correctly recognized reference is that the parser has a better chance of analyzing the question, in which case the proper interpretation of the reference to the document is inserted into the generated semantic representation. Moreover, since the recognized document names are indexed, retrieval will be guided to the proper results when the complex name is recognized as a single token.

Use of Additional Tokenizers

The special treatment of references to legal documents is only effective for parseable sentences. However, some of these references are also covered by the additional tokenizers that have been integrated into LogAnswer. For example, the GermanAnalyzer of Lucene, which serves as one of the auxiliary tokenizers, correctly analyzes 821/68 as consisting of a single token. When applied to the questions in the ResPubliQA 2009 test set, these tokenizers contributed tokens not found by WOCADI for 21 questions. Specifically, the auxiliary tokenizers produced useful tokens for all questions involving references to legal documents, including the four questions that contain abbreviated references to regulations. The benefit of analyzing 821/68 as a single token is, again, the increased precision of the retrieval step compared to using a conjunction of the two descriptors 821 and 68 in the retrieval query.

Effectiveness of Changes to the Retrieval Module

The most substantial change to the retrieval subsystem of LogAnswer that we introduced for ResPubliQA was the inclusion of sentences with a failed or poor parse into the index. Considering the 202 correct top-ranked paragraphs that were found in the loga091dede run, we notice that only 49 of these answers were based on the retrieval of a sentence of the paragraph with a full parse, while 106 correct answers were based on a retrieved sentence with a chunk parse (incomplete parse), and 47 correct answers were based on a retrieved sentence with a failed parse. A similar picture arises for loga092dede, where 56 correct answers were based on a retrieved sentence with a full parse, 100 answers on a sentence with a chunk parse, and 43 correct answers on the retrieval of a sentence with a failed parse. This clearly demonstrates that extending the index beyond sentences with a full parse was essential for the success of LogAnswer in the ResPubliQA task.

When we checked the baseline results and Gold standard results12 for German, we noticed that the subset of JRC Acquis that we used for generating the LogAnswer runs differs from the JRC Acquis subset that can now be downloaded from the ResPubliQA web page, most likely due to a version change that escaped our attention. As a result, 74 documents of the current subset are missing in the index of LogAnswer. This difference in the considered subset of JRC Acquis resulted in the loss of up to four possible correct answers, which are present in the Gold standard or the baseline runs but not represented in the index of LogAnswer.

12 see http://celct.isti.cnr.it/ResPubliQA/index.php?page=Pages/downloads.php
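To illustrate the interplay of the tokenizers discussed above ('Use of Additional Tokenizers'), the following sketch keeps an auxiliary token for indexing only if it is not contained in any token of the primary tokenization. The tokenizer outputs are hypothetical stand-ins, not the behavior of the WOCADI or Lucene APIs.

```python
# Sketch of the token-merging heuristic (hypothetical tokenizer outputs).

def merge_tokens(primary_tokens, auxiliary_tokens):
    """Add auxiliary tokens to the index terms unless they are already
    contained in some token of the primary (parser-based) tokenization."""
    index_terms = list(primary_tokens)
    for tok in auxiliary_tokens:
        if not any(tok in big for big in primary_tokens):
            index_terms.append(tok)
    return index_terms

# Example with the abbreviated regulation reference from the text: suppose
# the primary tokenization splits '821/68' into two descriptors, while an
# auxiliary tokenizer keeps it as one token, which is then added for indexing.
primary = ["Warum", "sollte", "821", "68", "aufgenommen", "werden"]
auxiliary = ["821/68"]
print(merge_tokens(primary, auxiliary))
# '821/68' is appended, since no primary token contains it.
```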
Success Rate of Question Classification

The question classification plays an important part in LogAnswer: it not only decides which phrases in a retrieved snippet can potentially answer the question, but also affects the retrieval process, since possible matches with the question categories and expected answer types are also indexed. In order to assess the reliability of the classification rules of LogAnswer and their coverage of the new question categories, we have determined the success rate of the question classification of LogAnswer, as shown in Table 5. Note that the correctness of the recognized question category and (for factoid questions) also the correct recognition of the expected answer type was checked. Results for the subset of questions of a given category that have a full parse are also shown. These results are especially instructive since the classification rules operate on the parse of a question. Therefore the rules should work reliably on questions with a full parse (but not necessarily for the remaining questions).

Table 5: Success rate of question classification (class-all is the classification rate for arbitrary questions and class-fp the classification rate for questions with a full parse)

Category    #questions  class-all  #full parse  class-fp
DEFINITION  95          85.3%      93           87.1%
REASON      93          73.3%      82           85.4%
FACTOID     139         70.5%      117          76.9%
PURPOSE     94          67.0%      86           72.1%
PROCEDURE   79          20.3%      72           22.2%
(total)     500         65.6%      450          70.9%

The table shows that the question classification works as expected for DEFINITION questions, REASON questions and FACTOID questions. While LogAnswer achieved an acceptable (but average) recognition rate for PURPOSE questions, the recognition rate for PROCEDURE questions was very low. These findings for PURPOSE and PROCEDURE questions can be attributed to a few missing trigger words that control the recognition of these types. For example, 'Zielvorstellung' (objective) was not included in the list of PURPOSE triggers, and 'Verfahren' (process) was not included in the list of PROCEDURE triggers. Another problem was posed by nominal compounds of trigger words, such as Arbeitsverfahren (working procedure) or Hauptaufgabe (main task). Both problems are easy to fix: it is sufficient to add the few missing trigger words, and to allow nominal compounds that modify a known trigger word as additional trigger words for the corresponding question category.

Effect of Correct Question Classification on Results

Since ResPubliQA does not require exact answer phrases to be extracted, one may ask whether the recognition of question categories and the identification of the expected answer type are still essential for finding correct answers. We have checked this dependency for the loga091dede run. We found that for all question categories except definition questions, the accuracy of results was better for questions that were classified correctly. However, the observed difference between the accuracy for correctly classified and misclassified questions of a given category never exceeded 6%. A surprising result was obtained for definition questions, where the accuracy for the 14 misclassified questions was 0.36, while for the 81 definition questions that were correctly classified, the accuracy was only 0.14. This once again points to a problem in the processing of definition questions.
The main difference in the treatment of both cases is the form of the retrieval query: if the question is recognized as a definition question, then an obligatory condition is added to the retrieval query that cuts off all sentences except those known to contain a definition. On the other hand, if a definition question is not recognized as such, then this obligatory requirement is skipped. This suggests that the obligatory condition should be dropped, or turned into an optional part of the retrieval query. Further experiments are needed in order to determine the most suitable approach.

Selection of Acceptance Threshold

When generating the runs for ResPubliQA, a threshold of θ = 0.08 was used for cutting off poor answers with a low validation score. In retrospect, we can say that the optimum threshold for loga091dede would have been θ = 0.11, resulting in a c@1 score of 0.45 instead of 0.44. For loga092dede, the optimum threshold would have been θ = 0.09. This threshold also yields a c@1 score of 0.44 after rounding to two significant digits. These findings confirm that the method for determining thresholds for accepted results (by choosing the threshold that maximizes the c@1 score on the development set) was effective. The threshold θ = 0.08 determined in this way was close to the best choices, and it resulted in c@1 scores that were very close to the theoretical optima.

Table 6: Experimental results using paragraph-level and document-level indexing

run            #right cand.   accuracy
irScore_ps     205            0.41
irScore_s      202            0.40
irScore_dps    198            0.40
irScore_p      196            0.40
irScore_dp     191            0.39
irScore_ds     190            0.38
irScore_d      136            0.28

4.3 Experiments with Paragraph-Level and Document-Level Indexing

One of the features used for determining the validation score of a retrieved sentence is the original retrieval score of the Lucene-based retrieval module of LogAnswer. In order to assess the potential benefit of paragraph-level and document-level indexing, we have prepared additional experiments based on different choices for the corresponding irScore feature. Suppose that c is a retrieved candidate sentence. Then the following variants have been tried: irScore_s(c) (the original retrieval score on the sentence level), irScore_p(c) (the retrieval score of the paragraph that contains sentence c), irScore_d(c) (the retrieval score of the document that contains c), and also the following combinations based on the arithmetic mean: irScore_ps(c) = 1/2 irScore_p(c) + 1/2 irScore_s(c), irScore_ds(c) = 1/2 irScore_d(c) + 1/2 irScore_s(c), irScore_dp(c) = 1/2 irScore_d(c) + 1/2 irScore_p(c), and finally irScore_dps(c) = 1/3 (irScore_d(c) + irScore_p(c) + irScore_s(c)). The corresponding results of LogAnswer are shown in Table 6; note that the irScore_s configuration corresponds to loga091dede. As witnessed by the poor results for irScore_d compared to the other configurations, the system obviously needs either sentence-level or paragraph-level information in order to be able to select correct answer paragraphs (this was not clear in advance because LogAnswer also uses other sentence-level features). The results for the remaining configurations are very similar and do not justify a clear preference for a specific choice. In order to better exploit the information available on the paragraph and document level, we will therefore experiment with further changes to LogAnswer.
This will involve the incorporation of intersentential information in other validation features, and a retraining of the validation model for the resulting system configurations.

5 Conclusion

The paper has described the current setup of the LogAnswer QA system and the changes that were made for ResPubliQA 2009. A detailed analysis of the results of LogAnswer in the ResPubliQA task has shown that most improvements were effective, but it has also revealed a problem in the treatment of definition questions and gaps in the classification rules for PURPOSE and PROCEDURE questions that will now be fixed. With its accuracy of 0.40 and c@1 score of 0.44, LogAnswer scored better than the official baseline runs of ResPubliQA for German.

The LogAnswer prototype is also available online13; in actual use, the system generally presents the five top-ranked results for the given question instead of a single result. In order to assess the usefulness of LogAnswer on the ResPubliQA test set under these more realistic conditions, we have annotated the five top-ranked paragraphs for each question. We then determined the MRR (mean reciprocal rank), cutting off after the first five answers, and the number of questions for which the system presents at least one correct result in the top-five list of answers shown to the user. For loga091dede an MRR of 0.48 was obtained.14 Moreover, 60% of the questions are answered by one of the paragraphs in the top-five list. If we ignore, for the moment, the definition questions that were not adequately handled by LogAnswer, then the system presents at least one correct result for two out of three questions. Perhaps a tool with these characteristics will already be useful for searching information in administrative texts.

13 see http://www.loganswer.de/, with German Wikipedia as the corpus
14 Results for loga092dede are very similar.

References

[1] Ulrich Furbach, Ingo Glöckner, Hermann Helbig, and Björn Pelzer. LogAnswer - A Deduction-Based Question Answering System. In Automated Reasoning (IJCAR 2008), Lecture Notes in Computer Science, pages 139–146. Springer, 2008.
[2] Ulrich Furbach, Ingo Glöckner, and Björn Pelzer. An application of automated reasoning in natural language question answering. AI Communications, 2009. (to appear).
[3] Ingo Glöckner. Towards logic-based question answering under time constraints. In Proc. of the 2008 IAENG Int. Conf. on Artificial Intelligence and Applications (ICAIA-08), pages 13–18, Hong Kong, 2008.
[4] Ingo Glöckner. University of Hagen at QA@CLEF 2008: Answer validation exercise. In Peters et al. [10].
[5] Ingo Glöckner and Björn Pelzer. Exploring robustness enhancements for logic-based passage filtering. In Knowledge Based Intelligent Information and Engineering Systems (Proc. of KES2008, Part I), LNAI 5117, pages 606–614. Springer, 2008.
[6] Ingo Glöckner and Björn Pelzer. Combining logic and machine learning for answering questions. In Peters et al. [11]. (to appear).
[7] Sven Hartrumpf. Hybrid Disambiguation in Natural Language Analysis. Der Andere Verlag, Osnabrück, Germany, 2003.
[8] Hermann Helbig. Knowledge Representation and the Semantics of Natural Language. Springer, 2006.
[9] Björn Pelzer and Christoph Wernhard. System Description: E-KRHyper. In Automated Deduction - CADE-21, Proceedings, pages 508–513, 2007.
[10] Carol Peters, Thomas Deselaers, Nicola Ferro, Julio Gonzalo, Gareth J.F. Jones, Mikko Kurimo, Thomas Mandl, Anselmo Peñas, and Vivien Petras, editors.
Working Notes for the CLEF 2008 Workshop, Aarhus, Denmark, September 2008.
[11] Carol Peters, Thomas Deselaers, Nicola Ferro, Julio Gonzalo, Gareth J.F. Jones, Mikko Kurimo, Thomas Mandl, Anselmo Peñas, and Vivien Petras, editors. Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17–19, Revised Selected Papers, LNCS. Springer, Heidelberg, 2009. (to appear).