Three Stories on Automated Reasoning for Natural Language Understanding

Johan Bos
University of Rome "La Sapienza"
bos@di.uniroma1.it

Abstract

Three recent applications of computerised reasoning in natural language processing are presented. The first is a text understanding system developed in the late 1990s, the second is a spoken-dialogue interface to a mobile robot and automated home, and the third is a system that determines textual entailment. In all of these applications, off-the-shelf tools for reasoning with first-order logic (theorem provers as well as model builders) are employed to assist in natural language understanding. This overview is not only an attempt to identify the added value of computerised reasoning in natural language understanding, but also to point out the limitations of the first-order inference techniques currently used in natural language processing.

Introduction

Since the mid 1990s I have been using tools from automated deduction (mainly theorem provers and model builders) to solve problems in natural language understanding. I have done this both from the perspective of computational linguistics (testing and improving a linguistic theory by means of a computational implementation) and from that of natural language processing (using inference in applications such as question answering and textual entailment). In this paper I will write about some recent projects that I was involved in, with the aim of convincing researchers — both from the natural language engineering and the automated deduction communities — that computerised reasoning can be successfully applied to natural language understanding.

Right — what's new about this claim? Isn't it obvious that one needs some form of computerised reasoning to model natural language understanding? Indeed, in my opinion it is. However, it is astonishing how few tools from automated reasoning have made it into real applications. Why is that? I think there are a couple of reasons.

First of all, a rather high level of interdisciplinarity is required. One needs to know not only about linguistics (and in particular (formal) semantics), but also about natural language processing, knowledge representation, and automated inference. As a matter of fact, not many researchers match this profile, and looking back to the 1980s and 1990s, there seem to be fewer and fewer researchers interested in taking on the pursuit. Or as Steve Pulman put it at the recent ICoS (Inference in Computational Semantics) conference: "I feel like an endangered species."

Secondly, there is an enormous gap between formal linguistic theory and practical implementation. There is a vast amount of formal linguistic theory on the semantics of natural language. However interesting most of it is, it does not always lead directly to a computational implementation. Many of the phenomena that are accounted for in formal theory are quite rare in natural data, so a natural language engineer won't lose much by ignoring them. But more crucially, from an automated reasoning point of view, almost all semantic phenomena are formalised in higher-order logic (a trend set by no one less than Richard Montague), which is, as is well known, a computational nightmare.

Third, it is not trendy to use theorem proving in natural language processing.
In contrast, stochastic approaches dominate the field, and having been successfully applied to speech recognition and syntactic processing, it won't take long before statistics plays a major role in semantic processing as well. Having said that, time and time again I am surprised by the opinion that using theorem proving in natural language understanding is the "traditional approach." What tradition? OK – it was tried in the 1970s and 1980s, but it never really got to work well, for reasons that should not surprise us: theorem proving had not yet matured into the state we know today, and, trivially, computers lacked the memory and speed to perform the computations required by inference. We have only just started to understand the limitations and opportunities of computerised reasoning in natural language understanding.

After this introduction I will present three applications demonstrating successful use of first-order inference tools. The first is a text-understanding system that calculates presuppositions of sentences and performs consistency and informativeness checks on texts. The second is a spoken dialogue system, interfaced to a mobile robot and an automated home environment, that uses theorem proving and model building for planning its linguistic and non-linguistic actions. The third is a system for recognising textual entailment. For each of these applications I will discuss the reasoning tools used, how they were used, what added value they had, and what their limitations were.

1 Presupposition Projection

In the mid 1990s I started to implement tools for the semantic analysis of English texts, as part of my thesis work at the University of the Saarland in Saarbrücken, Germany. One of the aims was to follow linguistic theory as closely as possible and see how much of it could be implemented straight away. I adopted Discourse Representation Theory (DRT), initially developed by Hans Kamp, because it accounted for a wide range of linguistic phenomena in a unified framework, and, crucially, had a model-theoretic semantics [KR93]. In particular I was interested in modelling presupposition projection, and I followed a recent proposal by Rob van der Sandt, whose theory of presupposition was cast in DRT [VdS92].

My implementation efforts resulted in the doris system (Figure 1). It had reasonable grammar coverage (substantially more than a toy grammar, but certainly not reaching the level of today's wide-coverage grammars). It parsed English sentences and computed an underspecified discourse representation of the input sentence, followed by resolving ambiguities. The linguistic phenomena covered included pronouns, quantifier and negation scope, and presuppositions [Bos01].

Figure 1: Screenshot of the DORIS system. In the upper window users can type or select an English sentence. The lower window provides several parameters, for instance the selection of various theorem provers.

Presuppositions are propositions taken for granted by the speaker, and "triggered" by words or phrases. Rather informally, p presupposes q if both p and ¬p entail q. For instance, the phrase "Mia's husband" presupposes that Mia is married, and that Mia is a woman, because "Jody likes Mia's husband" and its negation "Jody doesn't like Mia's husband" both entail that Mia is married and is a woman.
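This negation test translates directly into a pair of theorem-proving tasks. The following is a minimal sketch, not the actual doris implementation: the entails() callable is a hypothetical stand-in for a call to a first-order prover, and the predicate symbols in the comments are illustrative only.

```python
# Sketch of the presupposition test: p presupposes q if both p and the
# negation of p entail q. entails(premise, conclusion) is a hypothetical
# wrapper around a first-order theorem prover (e.g. it could try to prove
# premise -> conclusion); formulas are plain strings here.

def presupposes(p: str, q: str, entails) -> bool:
    """True if q is entailed both by p and by the negation of p."""
    return entails(p, q) and entails(f"~({p})", q)

# Illustrative usage (symbols made up for the example):
#   p = "likes(jody, husband_of(mia))"   # "Jody likes Mia's husband"
#   q = "married(mia)"                   # "Mia is married"
# Given suitable background axioms relating husband_of to married,
# both prover calls succeed, so p presupposes q.
```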
The problem with presuppositions is that they are sometimes neutralised by the linguistic context, and it is quite hard to pin down exactly when they are and when they are not, especially in sentences that contain some form of implication, negation, or disjunction. Consider, for instance, the following three sentences (the relevant presupposition trigger in each case is the phrase "her husband"):

(a) If Mia has a date with Vincent, then her husband is out of town.
(b) If Mia has a husband, then her husband is out of town.
(c) If Mia is married, then her husband is out of town.

Here (a) presupposes that Mia is married, but (b) and (c) do not. Van der Sandt's theory explained this by constructing the various possibilities of positioning the presupposition, and then checking whether these were acceptable by imposing acceptability constraints upon them: consistency as well as informativeness of the resulting text, both at the global and the local level of discourse. For (c), there are two positions where the presupposition can "land", yielding the following paraphrases (with the landed presupposition spelled out):

(c-1) Mia has a husband. If Mia is married, then her husband is out of town.
(c-2) If Mia is married/has a husband, then her husband is out of town.

However, in the first paraphrase (c-1) the antecedent of the conditional violates the constraint of (local) informativeness: if Mia has a husband, then the fact that she is married is not new information. In the second paraphrase (c-2) all acceptability constraints are satisfied. As a consequence, (c-1) is rejected, and (c-2) is accepted, as a possible interpretation.

Despite the adequacy of the theory's predictions, there was still a problem: the acceptability constraints required logical inference. But how could you implement this? Even though DRT was an established semantic theory backed up with a model theory, there were no (efficient) theorem provers available that could reason with the semantic representations employed by DRT. Discussions with Patrick Blackburn and Michael Kohlhase (both in Saarbrücken at the time) developed the idea of using first-order theorem provers to implement Van der Sandt's acceptability constraints. As the core of DRT is a first-order language, it turned out to be pretty straightforward to translate the DRT representations into ordinary first-order formula syntax, something that first-order theorem provers could digest.

Soon after, contacts were made with Hans de Nivelle, whose theorem prover bliksem was among the first to be put to the test [DN98]. And it looked promising: bliksem could handle most of the problems given to it in reasonable time. However, as some natural language examples could generate hundreds of consistency-checking tasks (due to a combinatorial explosion of linguistic ambiguities), it took a long time before bliksem had dealt with them all. Michael Kohlhase and Andreas Franke came to the rescue by offering MathWeb, a web-based inference service [FK99]. MathWeb farmed out a set of inference problems to different machines using a common software bus. Using the internet and many machines around the world (I recall that there were machines running in Edinburgh, Budapest, Saarbrücken, and Sydney, among other sites), MathWeb could basically be viewed as a parallel supercomputer. (This sounds perhaps quite ordinary right now, but at the time it was a sensation.)
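To make these proof obligations concrete: consistency and informativeness each reduce to a single call to a first-order prover. The sketch below is a hedged reconstruction, not doris's actual interface; prover.proves() is a hypothetical wrapper, and "$false" follows the TPTP convention for falsity.

```python
# Van der Sandt's acceptability constraints as first-order proof tasks.
# 'context' is the translation of the preceding discourse, 'phi' the
# translation of the text with the presupposition resolved at one
# candidate landing site. prover.proves(premises, conclusion) is a
# hypothetical call to a theorem prover such as bliksem.

def consistent(context, phi, prover):
    # Consistent: no contradiction may be derivable from context & phi.
    return not prover.proves(premises=[context, phi], conclusion="$false")

def informative(context, phi, prover):
    # Informative: phi must not already follow from the context,
    # otherwise the resolved reading adds no new information.
    return not prover.proves(premises=[context], conclusion=phi)

def acceptable(context, phi, prover):
    return consistent(context, phi, prover) and informative(context, phi, prover)

# Each candidate landing site yields one phi; readings that fail either
# check, as (c-1) fails local informativeness, are discarded.
```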
To cut a long story short, doris was interfaced directly to MathWeb, and many different theorem provers for first-order logic were added: spass [WAB+99], fdpll [Bau00], otter [MP96], and vampire [RV02].

In sum, the doris system demonstrated that first-order inference could play an interesting role in natural language processing, albeit with limitations [BBKdN01]. It generated a new application area for automated deduction (in fact, some of the problems generated by doris made it into the TPTP collection, thanks to Geoff Sutcliffe), and it opened a whole new vista of research in computational semantics. (Incidentally, it also helped to precisely formulate the acceptability constraints of Van der Sandt's theory of presupposition projection.)

So, what were these limitations? Scalability was one of them. A theorem prover would do well on a couple of sentences, but — not surprisingly, given the computational properties of first-order logic — it would simply choke on larger texts. Linguistic coverage was another. Some linguistic phenomena require richer semantic representations and therefore pose harder problems (for instance, tense and aspect require a richer sortal hierarchy, cardinal expressions require counting, and plurals require elements of set theory).

The last version of doris was released in 2001 [Bos01]. Although it was an important step in the development of computational semantics, its limited grammatical coverage and unfocussed application domain left it without a future. At the time I thought that it would take at least twenty years to develop a parser that achieved both wide coverage and syntactic representations detailed enough to construct meaningful semantic representations (I was, fortunately, very wrong! See Section 3). In order to reach a new level of sophistication in computational semantics, I instead focussed on small domains, in which the grammatical coverage and the necessary background knowledge could be specified a priori. Human-computer dialogue systems turned out to be the killer application.

2 Spoken Dialogue Systems

At the University of Edinburgh I was involved in developing a spoken dialogue system interfaced with a (real) robot. The robot was an RWI Magellan Pro, with sonars, infrared sensors, bumpers, and a video camera. It had an on-board computer connected to the local network via a wireless LAN interface. The robot moved about in the basement of Buccleuch Place, and people could direct it, ask it questions, or provide it with new information, all via speech. A typical conversation could be:

Human: Robot?
Robot: Yes?
Human: Where are you?
Robot: I am in the hallway.
Human: OK. Go to the rest room!

Figure 2: An early version of Godot the talking robot with the roboticist Tetsushi Oka (6 Buccleuch Place, Edinburgh, 2001).

Such dialogues were relatively straightforward to model with the then state-of-the-art in human-machine dialogue. Yet the robot was still "semantically challenged": it had no means to draw inferences. What I aimed to do was to use components of the doris system to give the robot the means to do consistency checking, answer questions, and calculate its next actions. In particular, I was interested in letting the robot react to inconsistent or obvious information, envisioning dialogues such as:

Human: Where are you?
Robot: I am in the hallway.
Human: You are in my office.
Robot: No, that's not true!
Human: You are in the hallway.
Robot: Yes, I know.

I knew this was feasible because of the doris experience: theorem provers can easily handle this amount of information, and the background knowledge, given the limited domain and environment, was easy to compute and maintain. And so it turned out to be: using spass as the theorem prover, the robot checked whether each assertion of the user was consistent with its current state and its knowledge of the dialogue.

Having implemented this, I became interested in using first-order theorem proving for planning the actions of a directive, primarily from a semantic point of view. I considered commands that involved negation, disjunction, or quantification. Examples of utterances that I had in mind included:

(a) Turn on a light.
(b) Switch on every light.
(c) Switch on every light in the hallway.
(d) Turn off every light except the light in the office.
(e) Go to the kitchen or the rest room.

In (a), the robot had to turn on a light that was currently switched off (and of course it had to complain when all the lights were already on). In (b), it had to turn on all lights that were currently off (some of the lights could be on already). In (c), it should only consider lights in the hallway (restrictive quantification). In (d), it should consider all lights minus those in the office (negation). In (e), it should go either to the kitchen or to the rest room (disjunction).

This was hard to do with a theorem prover. It seemed a natural task for a finite model builder, though. Via the doris system I had already come into contact with Karsten Konrad's model builder kimba [KW99], but I had never used model builders for anything other than checking satisfiability. I started using Bill McCune's mace because it searched for models by iterating over the domain size, generating models that were generally both domain-minimal and minimal in the extensions of the predicates [McC98]. Model building was integrated in such a way that the minimal models produced by mace were used to determine the actions that the robot had to perform [BO02]; a sketch of this idea follows at the end of this section. The robot in action was regularly demonstrated to visitors at Buccleuch Place (Figure 2) as well as to the general public at the Scottish Museum in Edinburgh (Figure 3).

Of course there were the usual limitations. As I was using a first-order language with possible worlds to model the semantics of actions, I was forced to erase the dialogue memory after every second utterance in order to keep the response time acceptable. Also, the number of different objects in the domain was very limited (given the background axioms of the robot, mace produced models in reasonable time only up to domain size 20). Nevertheless, the overall system was impressive, and showed what one can do with general-purpose, off-the-shelf, first-order inference tools in a practical system.

Figure 3: Godot the talking robot at the Scottish Museum (Edinburgh, 2003).
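The sketch promised above shows one way minimal models can drive action selection. It is a hedged reconstruction, not necessarily the exact scheme of [BO02]: build_minimal_model() stands in for a call to mace, and models are represented simply as sets of ground atoms.

```python
# Hedged sketch: deriving robot actions from minimal models. A model is
# a set of ground atoms (strings); build_minimal_model(axioms) is a
# hypothetical wrapper around a model builder such as mace.

def actions_for(background_axioms, command_formula, build_minimal_model):
    """Compare the minimal model of the current state with the minimal
    model that additionally satisfies the command; the difference tells
    the robot what to change in the world."""
    current = build_minimal_model(background_axioms)
    goal = build_minimal_model(background_axioms + [command_formula])
    # Atoms true in the goal model but not in the current one correspond
    # to actions still to be performed.
    return goal - current

# Illustrative run for "Switch on every light":
#   background axioms describe  light(l1), light(l2), on(l1)
#   command formula is          all X. (light(X) -> on(X))
# The goal model contains on(l1) and on(l2); the difference {on(l2)}
# means the robot switches on exactly the lights currently off.
```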
3 Recognising Textual Entailment

The rapid developments in statistical parsing were of course of interest to the semanticist. Yet most of these parsers produced syntactic derivations that were unsuitable for producing semantic representations in a systematic, principled way. It is not an exaggeration to say that the release in 2004 of a statistical wide-coverage parser for CCG (Combinatory Categorial Grammar) was a breakthrough for computational semantics. This CCG parser, implemented by Stephen Clark and James Curran [CC04], and trained on an annotated treebank developed by Julia Hockenmaier and Mark Steedman [HS02], had the best of both worlds: it achieved wide coverage on texts, and it produced very detailed syntactic derivations. Because of the correspondence between syntactic and semantic rules in CCG, this framework was the ideal setting for doing semantics.

Because CCG is a heavily lexicalised theory, it has a large number of lexical categories and very few rules. In order to translate the output of the parser (a CCG derivation) into a DRT representation (my main aim, in order to reuse the existing tools that I had developed for DRT and inference), I coded a lambda-expression for each of the ca. 400 syntactic categories known to the parser. Using the lambda-calculus to produce DRT representations, we simply had a parser that translated newspaper texts into semantic representations, with a coverage of around 95% [BCS+04, Bos05]. This was a great starting point for doing computerised reasoning for natural language on a wider scale.

In the same year the first challenge on recognising textual entailment (RTE) was organised. This is basically a competition for implemented systems to detect whether one (small) text entails another (small) text. To give an impression of the task, here are an example of a positive and an example of a negative entailment pair (where T is the text and H the hypothesis, following RTE terminology):

Example: 115 (TRUE)
T: On Friday evening, a car bomb exploded outside a Shiite mosque in Iskandariyah, 30 miles south of the capital, killing seven people and wounding 10, doctors said on condition of anonymity.
H: A bomb exploded outside a mosque.

Example: 117 (FALSE)
T: The release of its report led to calls for a complete ivory trade ban, and at the seventh conference in 1989, the African Elephant was moved to appendix one of the treaty.
H: The ban on ivory trade has been effective in protecting the elephant from extinction.

In the RTE challenge a participating system is given a set of entailment pairs and has to decide, for each T-H pair, whether T entails H or not. This turned out to be a very hard task. The baseline (random guessing) already gives a score of 50%, as half of the dataset corresponds to true entailments, and the other half to false ones. The best systems in the RTE-1 campaign achieved a score approaching 60%.

With the CCG parser and semantics at my disposal, I decided to implement a system that used logical inference to approach the RTE challenge. The overall idea, given the available tools, was straightforward: produce a DRT representation for T and H, translate these to first-order logic, and then use an off-the-shelf prover to check whether T′ → H′ is a theorem. We used vampire as the theorem prover [RV02], motivated by its performance at the recent CASCs.

At first, the performance was limited. The system performed quite well on cases such as 115 above, but unfortunately the RTE challenge doesn't contain many of these. Most of the examples require a lot of background knowledge to draw the correct inference. I used WordNet (an electronic dictionary) to compute some of these background axioms automatically. Yet there were still knowledge gaps. It would be nice to have a theorem prover that was able to say that it "almost found a proof", instead of just saying "yes" or "(probably) no".
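Stripped of the linguistic machinery, the core decision is a single validity check. The sketch below is hedged: prover.valid() is a hypothetical wrapper around a prover such as vampire, and the formulas are assumed to arrive already translated from the DRT representations.

```python
# Hedged sketch of the proof-based RTE decision. t_fol and h_fol are the
# first-order translations T' and H' of text and hypothesis; background
# is a list of axioms (e.g. derived from WordNet). prover.valid(formula)
# is a hypothetical call asking a theorem prover whether the formula is
# a first-order theorem.

def proves_entailment(t_fol, h_fol, background, prover):
    premises = " & ".join(background + [t_fol])
    return prover.valid(f"(({premises}) -> ({h_fol}))")

# A positive answer is reliable; a negative answer only means that no
# proof was found within the resources given, which motivates the
# model-building estimate discussed next.
```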
Back to model building. Finite model builders such as mace [McC98] say more than just "yes" or "no": they also give you a finite model, if there is one. As this model is a clearly defined mathematical entity, it is ideal for simulating the "almost found a proof" scenario. That is, by generating minimal models for T′ and for T′ ∧ H′, we hypothesised that comparing the number of entities in the two models would give us a useful handle on estimating entailment: if the difference is relatively small, it is likely that T entails H; otherwise it is not. (To deal with negation, one also has to calculate the models for ¬T′ and ¬(T′ ∧ H′) and compare the model sizes. To deal with disjunction, one has to do this for all minimal models of T and H — but as of now it is unclear to me how to implement the latter in a neat way.)

This turned out to work well — not as well as one would hope (because it is hard to get the right background knowledge), but significantly better than the baseline. We used standard machine learning techniques to estimate the thresholds on the model sizes, using both the size of the domain and the number of instantiated relations in the model as features. It turned out that mace was quite slow for longer sentences. We tried paradox [CS03], which is faster, but does not always return minimal models (with respect to the predicate extensions). To overcome this, we used paradox to calculate the domain size, and then called mace to generate the model given that domain size.

Figure 4: Example output screen for recognising textual entailment using the theorem prover vampire and the model builders mace and paradox.
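The following sketch makes the model-size heuristic explicit. It is hedged: build_model() stands in for the paradox-plus-mace pipeline just described and is assumed to return an object with a domain and a set of instantiated ground atoms; the feature names are illustrative.

```python
# Hedged sketch of the "almost found a proof" estimate via minimal
# models. build_model(formulas) is a hypothetical wrapper around the
# model builders, returning a model with .domain (set of entities) and
# .atoms (set of instantiated ground atoms).

def model_size_features(t_fol, h_fol, build_model):
    m_t = build_model([t_fol])            # minimal model of T'
    m_th = build_model([t_fol, h_fol])    # minimal model of T' & H'
    # If H adds little to T, the two models are nearly the same size.
    return {
        "domain_growth": len(m_th.domain) - len(m_t.domain),
        "relation_growth": len(m_th.atoms) - len(m_t.atoms),
    }

# A classifier trained on RTE data thresholds these features: small
# growth suggests that T entails H, large growth suggests that it does
# not. (As noted above, negation requires repeating the comparison for
# the models of ~T' and ~(T' & H').)
```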
Conclusion

First-order inference tools, such as automated theorem provers and model builders, can be successfully used in natural language understanding applications. Obviously there are limitations, but in many interesting applications these limitations play a subordinate role.

Whether computerised (logical) reasoning will ever become part of mainstream research in natural language processing is questionable, though. We will have to see to what extent statistical approaches (currently dominating computational linguistics) can be applied to natural language understanding tasks. Meanwhile, collaboration between researchers working in automated deduction and computational linguistics should be stimulated, to get a better understanding of the boundaries of applying automated reasoning to natural language understanding.

References

[Bau00] Peter Baumgartner. FDPLL – A First-Order Davis-Putnam-Logemann-Loveland Procedure. In David McAllester, editor, CADE-17 – The 17th International Conference on Automated Deduction, volume 1831 of Lecture Notes in Artificial Intelligence, pages 200–219. Springer, 2000.

[BBKdN01] Patrick Blackburn, Johan Bos, Michael Kohlhase, and Hans de Nivelle. Inference and Computational Semantics. In Harry Bunt, Reinhard Muskens, and Elias Thijsse, editors, Computing Meaning Vol. 2, pages 11–28. Kluwer, 2001.

[BCS+04] J. Bos, S. Clark, M. Steedman, J.R. Curran, and J. Hockenmaier. Wide-Coverage Semantic Representations from a CCG Parser. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), Geneva, Switzerland, 2004.

[BO02] Johan Bos and Tetsushi Oka. An Inference-based Approach to Dialogue System Design. In Shu-Chuan Tseng, editor, COLING 2002. Proceedings of the 19th International Conference on Computational Linguistics, pages 113–119, Taipei, Taiwan, 2002.

[Bos01] Johan Bos. DORIS 2001: Underspecification, Resolution and Inference for Discourse Representation Structures. In Patrick Blackburn and Michael Kohlhase, editors, ICoS-3, Inference in Computational Semantics, pages 117–124, 2001.

[Bos05] Johan Bos. Towards Wide-Coverage Semantic Interpretation. In Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6), pages 42–53, 2005.

[CC04] S. Clark and J.R. Curran. Parsing the WSJ using CCG and Log-Linear Models. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), Barcelona, Spain, 2004.

[CS03] K. Claessen and N. Sörensson. New Techniques that Improve MACE-style Model Finding. In Model Computation – Principles, Algorithms, Applications (CADE-19 Workshop), Miami, Florida, USA, 2003.

[DN98] Hans de Nivelle. A Resolution Decision Procedure for the Guarded Fragment. In Automated Deduction – CADE-15, 15th International Conference on Automated Deduction, pages 191–204. Springer-Verlag, Berlin Heidelberg, 1998.

[FK99] Andreas Franke and Michael Kohlhase. System Description: MathWeb, an Agent-based Communication Layer for Distributed Automated Theorem Proving. In CADE'99, 1999.

[HS02] J. Hockenmaier and M. Steedman. Generative Models for Statistical Parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 2002.

[KR93] H. Kamp and U. Reyle. From Discourse to Logic: An Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and DRT. Kluwer, Dordrecht, 1993.

[KW99] Karsten Konrad and D.A. Wolfram. System Description: Kimba, a Model Generator for Many-valued First-order Logics. In 16th International Conference on Automated Deduction (CADE-16), 1999.

[McC98] W. McCune. Automatic Proofs and Counterexamples for Some Ortholattice Identities. Information Processing Letters, 65(6):285–291, 1998.

[MP96] W. McCune and R. Padmanabhan. Automated Deduction in Equational Logic and Cubic Curves. Lecture Notes in Computer Science (AI subseries). Springer-Verlag, 1996.

[RV02] A. Riazanov and A. Voronkov. The Design and Implementation of Vampire. AI Communications, 15(2–3), 2002.

[VdS92] R.A. van der Sandt. Presupposition Projection as Anaphora Resolution. Journal of Semantics, 9:333–377, 1992.

[WAB+99] Christoph Weidenbach, Bijan Afshordel, Uwe Brahm, Christian Cohrs, Thorsten Engel, Enno Keen, Christian Theobalt, and Dalibor Topic. System Description: SPASS Version 1.0.0. In Harald Ganzinger, editor, 16th International Conference on Automated Deduction, CADE-16, volume 1632 of LNAI, pages 314–318. Springer-Verlag, Berlin, 1999.