Combining Theorem Proving with Natural Language Processing Björn Pelzer1 and Ingo Glöckner2 1 Department of Computer Science, Artificial Intelligence Research Group University of Koblenz-Landau, Universitätsstr. 1, 56070 Koblenz bpelzer@uni-koblenz.de 2 Intelligent Information and Communication Systems Group (IICS), University of Hagen, 59084 Hagen, Germany ingo.gloeckner@fernuni-hagen.de Abstract. The LogAnswer system is an application of automated rea- soning to the field of open domain question answering, which aims at finding answers to natural language questions regarding arbitrary top- ics. In our system we have integrated an automated theorem prover in a framework of natural language processing tools to allow for deductive reasoning over an extensive knowledge base derived from textual sources. For this purpose we had to intertwine two opposing approaches: on the one hand formal logic with its precision but brittleness, and on the other hand, machine learning applied to shallow linguistic features, which are robust but less precise. In the paper we present implementation details and discuss obstacles and their proposed solutions. 1 Introduction Question answering (QA) systems generate natural language (NL) answers in response to NL questions, using a large collection of textual documents.3 Simple factual questions can be answered using only information retrieval and shallow linguistic methods like named entity recognition [1]. More advanced cases, like questions involving a temporal description, call for deduction based question answering which can provide support for temporal reasoning and other natural language related inferences [2]. Complete QA systems which integrate question answering and logical reasoning are [3,4,5]. The junctures of logic and answer validation are also addressed in research on recognizing textual entailment [6,7]. Our LogAnswer system uses first order logic to represent an extensive knowl- edge base, and a combination of NL processing tools and an automated theorem prover to derive answers within a few seconds, i.e. in a time frame appropriate for ad-hoc question answering on the web. The fields of automated reasoning and natural language processing (NLP) differ greatly in their methodologies. A 3 A survey on the progress in question answering technology is provided by the QA track of the TREC conference series (see http://trec.nist.gov) and by the CLEF workshops (see http://www.clef-campaign.org/). 71 Fig. 1. Screenshot of the LogAnswer prototype. The system is available online at www.loganswer.de. theorem prover generally implements a sound and complete deduction calcu- lus which can produce complex proofs using hundreds of inference steps, and it operates on a set of clauses or formulas where consistency or lack thereof is crit- ical for the result. By contrast, natural language is ambiguous, textual sources may be imperfect, and as a consequence a knowledge base derived from such sources cannot be expected to be consistent. Moreover, it is not feasible to pro- vide a complete formalization of the background knowledge used by persons in understanding natural language. A practical NLP approach must take this into account and employ robustness-enhancing methods to overcome the flaws and deliver useful results. Merging the reasoning depth of an automated theorem prover with the ro- bustness of NL processing presents a number of difficulties. We will provide a short overview of our LogAnswer system and then address the obstacles and solutions in detail. 2 Description of the LogAnswer System LogAnswer is a QA-system supporting the German language. A full system de- scription is presented in [8]. LogAnswer operates on a knowledge base consisting of general semantic background rules as well as factual knowledge derived from the German Wikipedia and a corpus of newspaper articles. This knowledge is represented by semantic networks in the MultiNet formalism [9]. The user inter- acts with LogAnswer via the user interface, where a NL question can be entered into a search box, as shown in Figure 1. The query is then processed in several stages. The WOCADI parser module [10] translates the query into a MultiNet 72 representation. The formalized text passages from the MultiNet knowledge base are then searched for the query terms, and the matching network fragments are returned as the basis for generating answer candidates. Each retrieved fragment corresponds to a text passage which may contain an answer to the question. Currently, 200 MultiNet passage representations are retrieved for each question. Shallow linguistic methods provide additional filtering, further cutting down the number of candidate passages. At this stage the deduction-based processing begins. The automated reason- ing component of LogAnswer is E-KRHyper [11], an automated theorem prover for first order logic with equality, based on the hyper tableaux calculus [12,13]. E-KRHyper is an in-house system, developed for embedding in knowledge rep- resentation applications. It has been equipped with several features, commands and modes of operation for this purpose. Our familiarity with the system allows us to perform deep modifications when required. Because of our experience in adapting this prover to numerous systems in the past, it was a natural choice for a reasoner in LogAnswer. Its performance is roughly comparable to Otter [14], a system which generally serves as a benchmark among automated theorem provers. The query and the MultiNet candidate passages are converted into first order logic representations. For each answer candidate the theorem prover E-KRHyper attempts to refute the negated query representation in conjunction with the background knowledge rules and the respective candidate. A successful refutation indicates that an answer has been found, and the queried information is extracted from the refutational proof. If answers are found for multiple candidate passages, then they are ranked according to features extracted from logic processing and from shallow linguistic analysis (e.g. lexical overlap). The five best answers are presented to the user. Depending on user choice, the answers are given only in the form of snippets from the textual sources or as precise answers together with the snippets providing context. The quality score shown with each result is an estimate of the correct- ness probability of the answer determined by a machine learning approach. 3 Knowledge Representation MultiNet (Multilayered Extended Semantic Networks) [9] is a formalism for knowledge representation via semantic networks, which is particularly suited for the meaning representation of natural language. Characteristic of MultiNet is its stable inventory of pre-defined relations (edge labels), which made possible the long-term development of a computational lexicon based on MultiNet [15], and the introduction of so-called layer attributes. These attributes are used in multi-dimensional node descriptions which serve to capture quantification and other aspects of meaning that cannot be expressed using the relational means of a semantic network. The formalism is generally independent of any particular natural language, although the tools for translating NL into MultiNet are only available for German 73 (and in a rudimentary form for English) at this time.4 The knowledge base of LogAnswer was therefore derived from textual sources in German, namely the German Wikipedia and a corpus of newspaper articles. All in all about 12 million sentences have been translated into semantic networks, which are stored using Scheme syntax. For the deduction-based processing in LogAnswer this knowledge base is fur- ther translated into first-order logic, stored in the TPTP format [16], a standard among automated theorem provers. The MultiNet nodes and attributed arcs can be translated in a fairly straightforward manner into relations and constants, and the same holds for the logical MultiNet rules which express the semantics of words and the logical properties of the MultiNet relations. However, it should be noted that MultiNet goes beyond the expressivity of pure first-order logic and TPTP. For example, MultiNet networks can contain generalized quantifiers like ‘most’ or ‘almost all’. These aspects of the MultiNet representation are lost in the translation to logical expressions. In fact, our current translation from Multi- Net to logic results in Horn logic (plus arithmetic expressions). It is planned to translate into a more powerful logic including equality in order to better capture the actual meaning of the natural language sentences. In order to allow such an extension, the E-KRHyper prover for full first-order logic with equality was chosen, which also supports arithmetic expressions. 4 Query Representation For our purposes of logic-based question answering, the NL question must be translated into a conjunctive list of query literals. Synonyms are normalized by replacing all lexical concepts with canonical synset representatives. If the ques- tion asks for specific information, then this is represented by a special F OCU S variable. In a successful proof this variable will be instantiated with the desired information from the knowledge base. For example, Rudy Giuliani war Bürgermeister welcher US-Stadt?5 translates into the following logical query: attr(X1 , X2 ), attr(X1 , X3 ), val(X2 , rudy.0), sub(X2 , vorname.1.1), val(X3 , giuliani.0), sub(X3 , nachname.1.1), sub(F OCU S, usstadt.1.1), attch(F OCU S, X1 ), sub(X1 , bürgermeister.1.1) 5 Deduction-based Query Processing Each of the candidate passages retrieved from the knowledge base may contain an answer to the query, so a separate proof attempt is made for each candidate. 4 In order to adapt the system to another language, a computational lexicon and a parser for generating MultiNet representations from expressions in that language must be provided. While the logical core of LogAnswer can remain the same, the lexical-semantic relations (synonyms, nominalizations etc.) used by LogAnswer must also be adapted to the language of interest. 5 Rudy Giuliani was the mayor of which city in the USA? 74 The proof is done by refutation, with the logical query representation being treated as a negated conjecture. E-KRHyper operates on a clause normal form (CNF) representation and converts its first order input accordingly. To continue our example, E-KRHyper uses the following representation of the query: ¬ attr(X1 , X2 ) ∨ ¬ attr(X1 , X3 ) ∨ ¬ val(X2 , rudy.0) ∨ ¬ sub(X2 , vorname.1.1)∨ ¬ val(X3 , giuliani.0) ∨ ¬ sub(X3 , nachname.1.1) ∨ ¬ sub(F OCU S, usstadt.1.1)∨ ¬ attch(F OCU S, X1 ) ∨ ¬ sub(X1 , bürgermeister.1.1) . The prover keeps track of the F OCU S variable throughout all clause trans- formations and inference steps, so that its binding can be extracted from a proof even if it has been renamed. As its derivation E-KRHyper builds a hyper tableau in the form of a literal tree, using the hyper extension inference: if the negative literals of a clause unify with complementary literals from a tableau branch, then the positive literals of the clause are added as leaves. A branch is closed once it is found to contain a contradiction. In a derivation for LogAnswer this is the case when all the negative literals from the query unify with the branch; given that the query has no positive literals, the branch gets closed. The term bound to the F OCU S variable in the unifying substitution used in the refutational proof then represents the queried information. The current logical background knowledge consists of Horn formulas only, but with the ongoing translation of the MultiNet knowledge the logical rules will eventually contain non-Horn formulas as well. In a hyper tableaux derivation these can lead to tableau branching, with multiple closed branches and thus multiple closing unifiers. In such a case E-KRHyper extracts all the bindings for the F OCU S variable from the proof and presents them as different answers to the main LogAnswer system. 6 Ensuring Robustness A knowledge base derived from textual sources is bound to have imperfections. Furthermore, the logical background knowledge provided to the prover will never completely cover the actual background knowledge of a person who reads the text. Finally, the candidate passages are not guaranteed to contain an answer to the query, since they have only been selected by relatively simple filtering. For these reasons it is not certain that E-KRHyper can find a proof within an answer candidate, even if the corresponding NL source actually contains the queried information. Given that LogAnswer is supposed to provide answers in a short amount of time, the maximum time slot dedicated to a single proof attempt is constrained severely to ensure that all answer candidates can be tested. However, while a time limit ensures that the prover will not work indefinitely on one futile candidate when answers would be readily available in others, it does not help against missing an answer due to minor mismatches. When the formulas being reasoned upon have all been derived from imperfect textual sources and by 75 imperfect tools for linguistic analysis, then formal logic may be too rigorous in demanding a perfect proof for an answer, and small compromises may actually be acceptable.6 For this reason E-KRHyper is embedded in a relaxation loop: if no proof is found within a time limit, then the query is relaxed by dropping a query literal and restarting the prover with the shortened query.7 In theory this can be continued until a proof of the simplified query succeeds, but since LogAnswer aims to produce useful answers the loop will be stopped before all literals are skipped. Also, rather than skipping a random query literal, the derivation progress during the failed refutation attempt can be used to guide the choice of which literal to drop. Given a negated query clause ¬Q1 ∨· · ·∨¬Qn , E-KRHyper tries to unify the query literals with fresh variants B1 , . . . , Bn of complementary branch literals from the current tableau branch b by testing the query literals from left to right. The unifying substitution is extended in every step such that starting with the empty substitution σ0 , σk is a substitution with Qi σk = Bi , ∀i ∈ {1 . . . k}, with k ∈ {1 . . . n}. If one query literal ¬Ql fails to find a matching partner in the branch which would allow an extension of the current substitution σl−1 to σl , then the remaining literals ¬Ql+1 , . . . , ¬Qn are not tested under σl−1 . Instead E-KRHyper generates a partial result. A partial result is represented by a triple ({Q1 , . . . , Ql−1 }, σl−1 , {Ql , . . . , Qn }), consisting of the list of success- fully unified query literals, the unifying substitution, and the list of query literals that were not unified, the first of which being the one that failed, whereas the remaining were not tested at all. If E-KRHyper fails to find a proof within the time limit, then all partial results generated so far are returned to the main Log- Answer system. One of the best partial results is selected (i.e. one of the partial results with the highest number of refuted query literals), the failed literal ¬Ql is removed from the negated query clause and E-KRHyper restarts its derivation with the new query clause ¬Q1 ∨ · · · ∨ ¬Ql−1 ∨ ¬Ql+1 ∨ · · · ∨ ¬Qn . When two partial results have the same number of failed literals, then additional criteria are used for selecting the best partial result: (a) partial results which provide a binding to the F OCU S variable are considered better than partial results which do not bind the queried variable (this criterion is important since the system can only generate answers when the F OCU S variable has been bound); and (b) a binding of the F OCU S variable to a constant is preferable to a binding of the F OCU S variable to a complex term (at the moment, the system is unable to generate answers for skolem terms, so answer extraction will only succeed when the queried variable is bound to a constant which directly represents a discourse entity). 6 A training set is used to learn a useful interpretation of results for failed proofs [17]. 7 See Sect. 8 for an example. Another system which uses relaxation for achieving more robustness in logic-based QA is COGEX [3]. 76 7 A Theorem Prover as a Reasoning Server At the time of this writing LogAnswer is deployed as a web-based QA-system with an interface analogous to that of a typical search engine. This usage places certain demands on the performance of LogAnswer which must be able to re- spond to numerous queries from multiple users in a short time. The reasoning stage within E-KRHyper can easily be the most time-consuming phase in the processing of a query, even with relaxation. Cutting back on operations here is crucial for maintaining responsiveness. Fortunately there are several opportuni- ties for such measures. To summarize, the clause input for a single proof attempt consists of: – background knowledge: 10,200 CNF clauses, – query: the negated query clause, on average 8 literals before relaxation, – answer candidate: the logic representation of the candidate passage, circa 230 unit clauses.8 E-KRHyper maintains this clause set with the help of several dicrimination- tree indexes [18]. The time required for the construction of these extensive tree data structures can exceed the allotment for the reasoning phase. Repeating this for each proof attempt would be prohibitive, in particular when considering that there may be m answer candidates to check for a single query, with n relaxation steps for each candidate, resulting in m × n proof attempts for each query. However, the clause sets and their indexing trees differ only very slightly between any two reasoning tasks. Only the current query and answer candi- date can change between two proof attempts. The background knowledge, which comprises approximately 97% of every clause set, remains the same. Therefore LogAnswer does not restart E-KRHyper for each reasoning task. Instead the prover is started once and provided with the background knowledge, so that the construction of the bulk of the indexing structures is only done once during this initialization. The task-specific clauses are then added and retracted again as needed. This is done using the mechanism of layered discrimination-trees: the task-specific clauses are not actually added to the same tree structures as the background knowledge. Instead additional trees (the layers) will be created to store the new clauses. Index reading operations access all layers, whereas writ- ing operations only use the latest layer intended for new clauses. Once a proof attempt is completed, the index layers with the now unnecessary clauses are simply discarded. This avoids a lengthy extraction of these clauses from a single, shared tree. There is even less difference between the clause sets used in two consecutive relaxation steps. The only change here is the removal of one query literal. This allows us to reuse even more clauses. Given an (interrupted) hyper tableaux derivation, all derived literals which were added to the initial branch before the 8 The problem set used for determining these numbers consists of 1806 query/candidate passage combinations for the CLEF 2007 questions for German. 77 first branch split can be treated as lemmas. Since the current logic representation of the knowledge base is Horn only, the set of lemmas actually corresponds to all derived branch literals, although this is bound to change in the future. The lemmas are then kept for the next relaxation step. Thus a repetition of inferences is avoided, and even if a relaxation step is not successful in finding an answer, it still makes use of its limited time slot to add new lemmas. This way we combine relaxation with incremental reasoning [19]. 8 Prover Result Evaluation The input for E-KRHyper is selected and updated by the LogAnswer main sys- tem, and the operation of the prover is also directed and controlled as described above. Given that many supporting text passages will be analysed for a query, the results of E-KRHyper must be subjected to further evaluation as well in order to select those answers that are most suitable for the user. For this we apply a machine learning approach which combines both logic- based and shallow syntactic features [17]: whenever E-KRHyper terminates with a refutation, then the respective text passage is regarded as containing an answer to the query. All such passages are ranked by an aggregated score, computed from logic-based criteria regarding the respective proofs, like the number of relaxation steps required, as well as from shallow syntactic features, like the relative pro- portion of lexical concepts and numerals in the question which find a match in the candidate passage. If the user has opted to receive answers in the form of text snippets, then the five best passages are selected for presentation. On the other hand, if precise answers are desired, then further processing is necessary. The queried information is extracted directly from a refutational proof as the binding of the F OCU S variable. The context of this instantiating answer value within the semantic network underlying the candidate passage is used to phrase the actual answer that can be presented to the user. Continuing our running example we take a look at one of the candidate pas- sages for which E-KRHyper will terminate: Hinter der Anklage stand der spätere Bürgermeister von New York, Rudolph Giuliani.9 The logical representation of the passage is shown here (actually, only the fragment of the representation which models the relational structure): hinter(c221, c210) ∧ sub(c220, nachname.1.1) ∧ val(c220, giuliani.0) ∧ sub(c219, vorname.1.1) ∧ val(c219, rudolph.0) ∧ prop(c218, spät.1.1) ∧ attr(c218, c220) ∧ attr(c218, c219) ∧ sub(c218, bürgermeister.1.1) ∧ val(c216, new york.0) ∧ sub(c216, name.1.1) ∧ sub(c215, stadt.1.1) ∧ attch(c215, c218) ∧ attr(c215, c216) ∧ subs(c211, stehen.1.1) ∧ loc(c211, c221) ∧ scar(c211, c218) ∧ sub(c210, anklage.1.1) A full proof of the query fails since the system is lacking knowledge that Rudy is a short form of Rudolph. Moreover, the fact that New York is a US city is not known to the system. Therefore two query literals cannot be proved, viz val(X2 , rudy.0) 9 Responsible for the charges was the future mayor of New York, Rudolph Giuliani. 78 and sub(F OCU S, usstadt.1.1). After two relaxation steps, which skip these two literals, a proof is found with a constant c215 bound to the F OCU S variable, which corresponds to an entity mentioned in the text. The information about the position of the entity in the text string is then used to extract the answer New York. Some answers generated this way are discarded immediately. This includes trivial answers which return the term from the query (Who is Virginia Kelley? - Virginia Kelley) and non-informative answers (the mother instead of the mother of Bill Clinton). Of those answers passing these sanity checks the five best in the aforementioned ranking are selected and then displayed together with the supporting passages. 9 Conclusions In this paper we have explored an application of automated reasoning to open domain question answering. With the ever growing amounts and availability of digitally stored information the utilization of this data is becoming an important but difficult task. Question answering systems strive to find specific information in a significantly more goal-directed manner than search engines. This requires deep reasoning over the semantics of stored data. Our approaches bridge the gaps between the precision yet brittleness of deduction and the robustness but limited accuracy of shallow linguistic techniques. This is achieved by combin- ing the results of both levels using machine learning. We have shown how the difficulties of a theorem prover dealing with imperfect NL-derived data can be solved by embedding the deduction component in a robust knowledge processing framework, which uses as feedback loop to gradually relax the logical query. The proposed approach to ensuring robustness has been evaluated in [17]. The task was that of identifying the correct answer passages in a set of 12,337 retrieved candidate passages. A baseline system using shallow features but no logical reasoning achieved an F-score of 42.7% in this experiment (other criteria for filtering quality were also studied). When combining shallow features and strict proofs, the F-score increased by 7.3%. Adding relaxation achieved another 7.9% improvement of the F-score. The best results were obtained in a configu- ration which allowed three relaxation steps. This experiment demonstrates that the performance of our QA system profits from the automated reasoning capa- bilities of the logic prover and from the proposed relaxation technique. Another benefit of the logic-based approach is that the answer bindings determined by the prover provide the basis for answer extraction. In the future we intend to further improve the translation of the MultiNet formalism into first-order logic, enabling us to fully exploit the expressivity of the semantic networks in combination with automated reasoning. References 1. Prager, J., Brown, E., Coden, A., Radev, D.: Question-answering by predictive annotation. In: SIGIR ’00: Proceedings of the 23rd Annual International ACM 79 SIGIR Conference on Research and Development in Information Retrieval, New York, NY, ACM Press (2000) 184–191 2. Moldovan, D., Bowden, M., Tatu, M.: A temporally-enhanced PowerAnswer in TREC 2006. In: Proc. of TREC-2006, Gaithersburg, MD (2006) 3. Moldovan, D., Clark, C., Harabagiu, S., Maiorano, S.: COGEX: A logic prover for question answering. In: Proc. of NAACL-HLT 2003. Volume 1., Morristown, NJ (2003) 87–93 4. Saias, J., Quaresma, P.: The Senso question answering approach to Portuguese QA@CLEF-2007. In: Working Notes for the CLEF 2007 Workshop, Budapest, Hungary (2007) 5. Glöckner, I., Hartrumpf, S., Leveling, J.: Logical validation, answer merging and witness selection: A study in multi-stream question answering. In: Proc. of RIAO- 07, Pittsburgh (2007) 6. Bos, J., Markert, K.: When logical inference helps determining textual entailment (and when it doesn’t). In: Proc. of 2nd PASCAL RTE Challenge Workshop. (2006) 7. Bobrow, D., Condoravdi, C., Crouch, R., de Paiva, V., Kaplan, R., Karttunen, L., King, T., Zaenen, A.: A basic logic for textual inference. In: Proceedings of the AAAI Workshop on Inference for Textual Question Answering, Pittsburgh, PA (Jul 2005) 8. Furbach, U., Glöckner, I., Helbig, H., Pelzer, B.: LogAnswer - A Deduction-Based Question Answering System. In: IJCAR 2008 - 4th International Joint Conference on Automated Reasoning, Sydney, Australia, 10th - 15th August, 2008, Proceed- ings, to appear. Lecture Notes in Computer Science, Springer (2008) 9. Helbig, H.: Knowledge Representation and the Semantics of Natural Language. Springer (2006) 10. Hartrumpf, S.: Hybrid Disambiguation in Natural Language Analysis. Der Andere Verlag, Osnabrück, Germany (2003) 11. Pelzer, B., Wernhard, C.: System Description: E-KRHyper. In: Automated De- duction - CADE-21, Proceedings. (2007) 508–513 12. Baumgartner, P., Furbach, U., Niemelä, I.: Hyper Tableaux. In: JELIA’96, Pro- ceedings. (1996) 1–17 13. Baumgartner, P., Furbach, U., Pelzer, B.: Hyper Tableaux with Equality. In: Automated Deduction - CADE-21, Proceedings. (2007) 14. McCune, W.: OTTER 3.3 Reference Manual. Argonne National Laboratory, Ar- gonne, Illinois (2003) 15. Hartrumpf, S., Helbig, H., Osswald, R.: The semantically based computer lexicon HaGenLex. Traitement automatique des langues 44(2) (2003) 81–105 16. Sutcliffe, G., Suttner, C.: The TPTP Problem Library: CNF Release v1.2.1. Jour- nal of Automated Reasoning 21(2) (1998) 177–203 17. Glöckner, I., Pelzer, B.: Exploring robustness enhancements for logic-based passage filtering. In: KES2008, Proceedings, to appear. Lecture Notes in Computer Science, Springer (2008) 18. McCune, W.: Experiments with Discrimination-Tree Indexing and Path Indexing for Term Retrieval. Journal of Automated Reasoning 9(2) (1992) 147–167 19. Beckert, B., Pape, C.: Incremental theory reasoning methods for semantic tableaux. In: Proceedings, 5th International Workshop on Theorem Proving with Analytic Tableaux and Related Methods. Lecture Notes in Computer Science, Springer (1996) 80