On Datasets for Evaluating Architectures for Learning to Reason

Naveen Sundar Govindarajulu,¹ Jean-Claude Paquin, Shreya Banerjee, Paul Mayol, Selmer Bringsjord∗
Rensselaer Polytechnic Institute
110 8th Street, Troy, NY 12180, USA
¹ naveensundarg@gmail.com

∗ ONR. Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

In our poster, we will introduce new datasets in propositional logic and first-order logic that can be used for learning to reason, and present some initial results on systems that use this data.

Introduction

There is growing research interest in incorporating learning in reasoning systems. Such efforts fall largely into two different areas that we term Area I and Area II. We give a quick overview of what a reasoning system does before describing these two areas.

In general, a reasoning system can be modeled as a search through some space. This search usually relies on a number of hand-written heuristics. Theorem provers make this quite explicit, as one can specify these heuristics as an end user. For instance, in first-order resolution theorem provers, the goal is to find a sequence of resolution operations, starting from an initial set of clauses C, that results in an empty clause. At any point in the search, the prover has to choose a set of clauses from the overall set of clauses it has derived. Theorem provers use heuristics such as the size of a clause, the complexity of a clause, the age of a clause, etc. to choose a clause.

Efforts in Area I, such as (Kaliszyk, Chollet, and Szegedy 2017), revolve around selecting or computing an appropriate set of heuristics using some form of learning, while not tampering with the rest of the search process. Efforts in Area II aim to learn, from scratch, a function that does the entire search. While there has been quite significant progress in Area I, there has been very little progress in Area II. We feel that one of the main reasons for this state of affairs is that there is no standard dataset that can be leveraged for Area II. Datasets for Area I are built with learning heuristics as a goal and are not ideal for Area II systems.

In our poster, we will discuss new datasets in propositional logic and first-order logic that can be used for learning to reason, and present some initial results based on this data.

Why New Datasets?

Existing datasets that can be used for learning to reason are either too complex or do not show much variation in the samples. For example, the Mizar repository (Naumowicz and Kornilowicz 2009) has more than 50,000 reasoning problems (theorems in a first-order logic) and has been used in Area I, but these problems are too complex to be useful in bootstrapping a learning system from scratch. On the other hand, simpler toy datasets such as the deduction task in bAbI (Weston et al. 2015) do not show that much variation. For example, Figures 1a and 1b show answers and full proofs for two different questions in the bAbI dataset, represented in the Slate proof assistant system (Bringsjord et al. 2008). Both proofs can be obtained by applying the resolution inference rule twice. Moreover, the proofs are structurally similar and can be generated by switching constant symbols in a first-order proof. In fact, the entire dataset in the bAbI deduction task follows this one single proof pattern. Ideally, we want a dataset that has problems of varying levels of complexity.

(a) bAbI 1: Emily is a cat. Cats are afraid of wolves. What is emily afraid of?
(b) bAbI 2: Winona is a sheep. Sheep are afraid of mice. What is winona afraid of?

Figure 1: bAbI problems

The Problem

We present an abstract version of the problem we seek to solve. Assume that we have a formal logic F ≡ ⟨L, I, ⊥⟩, where L is the language of the formal logic, I is the inference system, and ⊥ ∈ L. Any set of formulae Γ in the language L can be consistent, Γ ⊬_I ⊥, or inconsistent, Γ ⊢_I ⊥. Any reasoning problem Γ ⊢ φ can be posed as a consistency problem Γ + ¬φ ⊢_I ⊥ if certain conditions C1, C2 are satisfied by I (Boolos, Burgess, and Jeffrey 2003).

C1 The deduction theorem: Γ + φ ⊢ ψ ⇒ Γ ⊢ φ → ψ.
C2 The law of excluded middle: {} ⊢ φ ∨ ¬φ.

In fact, resolution-based theorem provers such as Vampire (Kovács and Voronkov 2013) function in this manner, searching for a proof of ⊥ from Γ + ¬φ in order to prove Γ ⊢ φ. Therefore, reasoning in logics satisfying C1 and C2 can be reduced to consistency checking.

We pose the learning problem as a standard classification task. Let con(Γ) ∈ {0, 1} denote whether Γ is consistent, con(Γ) = 1, or not, con(Γ) = 0. Given training data D_train of sentences and their consistency information, {⟨Γ_i, con(Γ_i)⟩ | 1 ≤ i ≤ n}, the goal is to learn a function that approximates con and is evaluated on a test set D_test.

Preview: Data and Data Generation

We look at two different formal logics: the propositional calculus and first-order logic. We randomly generate sentences and their consistency labels. The generation process is slightly different for the two logics.

Propositional Logic

For propositional logic, we generate formulae in conjunctive normal form. Each formula φ is a disjunction l_1 ∨ … ∨ l_n, where each literal l is an atom P or its negation ¬P. We generate a random set of formulae Γ by randomly generating u clauses c_1, c_2, …, c_u, where each clause has v random literals drawn from a set of w atomic propositions. For each such random set of sentences, we run a theorem prover to check whether the set is consistent or not. Using this method, we have three distinct datasets for propositional logic, with u, v, w = 3; u, v, w = 4; and u, v, w = 5.

First-order Logic

Due to the expressivity of first-order logic, naïve random generation of formulae can quickly lead to very difficult problems, or to degenerate problems that do not have real-world analogs but are also difficult to solve. For instance, biconditionals such as φ ↔ φ, where φ has two or more nested quantifiers, can cause some state-of-the-art theorem provers to run for an unbounded amount of time. To address this, we use a sorted first-order logic with a given set of relation, function, and constant symbols, along with certain complexity constraints. The sorts prevent us from generating nonsensical formulae. Given a sorted signature σ, we generate a certain number of unique formulae Γ and apply first-order natural deduction inference rules until we have a proof ρ of a certain complexity. One such problem and a corresponding proof in an imaginary mechanical domain is given below. See Figure 2.

Sample FOL Problem

1. If gear x is connected to gear y and gear y is connected to gear z, then gear x is connected to gear z.
2. If gear x and gear y are connected, and gear x is broken, gear y is broken too.
3. Gear 1 is connected to gear 2.
4. Gear 2 is connected to gear 3.
5. Gear 3 is enclosed by box 1.
6. All boxes are connected.
7. Lever 1 either fixes or breaks all gears.
8. Lever 1 breaks gear 1.
9. If x encloses y and y is broken, x is broken.

Question: Is box 20 broken? Answer: Yes

Figure 2: Proof for the Sample FOL Problem. A resolution proof for the sample FOL problem given above.
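The reduction in The Problem section, from an entailment question Γ ⊢ φ to the consistency question for Γ + ¬φ, can be illustrated for the propositional case. The following is a minimal sketch only: the function names are illustrative, formulae are modeled as Python predicates over truth assignments, and consistency is decided by brute-force enumeration rather than by a theorem prover.

```python
from itertools import product

def inconsistent(formulas, atoms):
    """Gamma |- bottom iff no truth assignment over `atoms`
    satisfies every formula in `formulas`."""
    return not any(
        all(f(dict(zip(atoms, vals))) for f in formulas)
        for vals in product([True, False], repeat=len(atoms))
    )

def entails(gamma, phi, atoms):
    """Gamma |- phi, reduced to: Gamma + (not phi) is inconsistent."""
    return inconsistent(gamma + [lambda a: not phi(a)], atoms)

# Example: {P -> Q, P} |- Q holds; {} |- P does not.
modus_ponens = entails(
    [lambda a: (not a["P"]) or a["Q"], lambda a: a["P"]],
    lambda a: a["Q"],
    ["P", "Q"],
)
```

For small atom sets this enumeration suffices; the paper's framing applies the same reduction with a real prover standing in for the consistency check.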
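The propositional generation procedure, sampling u clauses of v random literals over w atoms and labeling each set with con(Γ), can be sketched as below. This is not the generator used for the datasets: the function names are assumptions, and for the small w used here a brute-force check over all 2^w assignments stands in for the theorem prover that labels each sample.

```python
import itertools
import random

def random_clause(v, w, rng):
    """One clause: v random literals over atoms P_0..P_{w-1}.
    A literal is a pair (atom_index, polarity)."""
    return [(rng.randrange(w), rng.choice([True, False])) for _ in range(v)]

def consistent(clauses, w):
    """con(Gamma): 1 if some truth assignment satisfies every clause."""
    for assignment in itertools.product([True, False], repeat=w):
        if all(any(assignment[a] == pol for a, pol in clause)
               for clause in clauses):
            return 1
    return 0

def generate_dataset(n, u, v, w, seed=0):
    """n labeled samples: pairs (Gamma as u clauses, con(Gamma))."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        gamma = [random_clause(v, w, rng) for _ in range(u)]
        data.append((gamma, consistent(gamma, w)))
    return data

# e.g. the u, v, w = 3 setting described above:
samples = generate_dataset(100, 3, 3, 3, seed=0)
```

With u, v, w this small, both consistent and inconsistent sets occur with non-trivial frequency, which is what makes the labels usable for the classification task posed earlier.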
References

Boolos, G. S.; Burgess, J. P.; and Jeffrey, R. C. 2003. Computability and Logic (Fifth Edition). Cambridge, UK: Cambridge University Press.

Bringsjord, S.; Taylor, J.; Shilliday, A.; Clark, M.; and Arkoudas, K. 2008. Slate: An Argument-Centered Intelligent Assistant to Human Reasoners. In Grasso, F.; Green, N.; Kibble, R.; and Reed, C., eds., Proceedings of the 8th International Workshop on Computational Models of Natural Argument (CMNA 8), 1–10. Patras, Greece: University of Patras.

Kaliszyk, C.; Chollet, F.; and Szegedy, C. 2017. HolStep: A Machine Learning Dataset for Higher-Order Logic Theorem Proving. arXiv preprint arXiv:1703.00426.

Kovács, L., and Voronkov, A. 2013. First-Order Theorem Proving and Vampire. In International Conference on Computer Aided Verification, 1–35. Springer.

Naumowicz, A., and Kornilowicz, A. 2009. A Brief Overview of Mizar. In Berghofer, S.; Nipkow, T.; Urban, C.; and Wenzel, M., eds., Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science (LNCS), 67–72. Berlin: Springer.

Weston, J.; Bordes, A.; Chopra, S.; Rush, A. M.; van Merriënboer, B.; Joulin, A.; and Mikolov, T. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698.