Propositional Reasoning via Neural Transformer Language Models

Anthony Tomasic, Oscar J. Romero, John Zimmerman and Aaron Steinfeld
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, 15213, USA

Abstract
This paper presents exploratory work on how the key components of transformer-based neural language models (the self-attention mechanism and large-scale pre-trained models) can be leveraged to perform context retrieval, symbol manipulation, and propositional reasoning. We fine-tuned off-the-shelf GPT-2 and GPT-3 language models to simulate the propositional logic resolution of non-recursive rules that combine conjunction, disjunction, and negation connectives. Next, we conducted a series of experiments and evaluated the performance of different syntactic representations. The promising results show that a transformer model can (i) generalize an inference algorithm for propositional resolution, and (ii) learn to solve simple instances of view updates (a well-studied problem in the field of databases) by leveraging the propositional reasoning framework.

Keywords
transformer neural language models, propositional reasoning, view updates problem

NeSy 2022, 16th International Workshop on Neural-Symbolic Learning and Reasoning, Cumberland Lodge, Windsor, UK
Email: tomasic@andrew.cmu.edu (A. Tomasic); oscarr@andrew.cmu.edu (O. J. Romero); johnz@andrew.cmu.edu (J. Zimmerman); steinfeld@cmu.edu (A. Steinfeld)
Web: https://www.cs.cmu.edu/~tomasic/ (A. Tomasic); http://www.cs.cmu.edu/~oscarr/ (O. J. Romero); https://design.cmu.edu/people/faculty/john-zimmerman/ (J. Zimmerman); http://www.cs.cmu.edu/~astein/ (A. Steinfeld)
ORCID: 0000-0001-7864-0364 (A. Tomasic); 0000-0002-4470-8209 (O. J. Romero); 0000-0001-5299-8157 (J. Zimmerman); 0000-0003-2274-0053 (A. Steinfeld)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Transformer models continue to be the preferred model for many natural language processing (NLP) tasks, e.g., machine translation, natural language generation, summarization, etc. In general, the larger the transformer model, the larger the context that captures the reasoning of a task, resulting in better performance. This scaling relationship between model size and performance has led to a race to construct ever larger models [1]. However, symbolic reasoning is not directly available in transformer models [2]. For example, a common NLP problem that end-to-end transformer-based applications have yet to deal with is abstraction over concepts. To circumvent this issue, delexicalization/lexicalization techniques (i.e., the replacement of specific slot values by their corresponding generic slot tokens to allow learning value-independent parameters) need to be applied by a human designer [3, 4]. This manipulation of de/lexicalized terms can be considered a simple form of symbolic manipulation and reasoning. More generally, the lack of symbolic reasoning functionality has led researchers to extend deep neural network models to directly represent symbolic reasoning [5, 6, 7, 8, 9].

Figure 1: Pipeline for fine-tuning the transformer model (Prolog program -> compile -> dataset -> fine-tune GPT-2/GPT-3 -> fine-tuned model -> evaluate model). A Prolog program generates training data that is used to fine-tune both GPT-2 and GPT-3 transformer models for propositional resolution.
In this paper, we consider a complementary approach in which we demonstrate the capability of transformer models to directly simulate reasoning in propositional logic. We use this reasoning to implement a simple form of belief revision [10], namely propositional view updates [11]. The main contributions of this paper are: (i) a data representation of conjunction, disjunction, and negation in symbolic logic designed to represent propositional resolution; (ii) a methodology for the generation of fine-tuning examples; (iii) a fine-tuned transformer model that learns the underlying symbolic inference algorithm for propositional resolution; and (iv) an experimental analysis of the performance of the fine-tuned model. In addition, we explore the implications of these contributions in two ways: (i) we study the impact of the transformer model size and structure on performance; and (ii) we apply this work to propositional view updates.

Transformer models have been demonstrated to generate semi-structured representations [12] that combine structured data, API calls, and natural language comprehension and synthesis. This paper points the way to extending transformer models to perform belief revision over these semi-structured representations. To illustrate a direct application of our contribution, consider the following conversational interaction extracted from the logs of our prior work [12]:

    User: I want to take a bus to Carnegie Mellon University
    System: I have 4 buses for you... The bus 41 leaves at 4:33pm and costs $5.50. There is 1 transfer. You'll need to take a connection with bus 61C...

The system response is only partially correct: the bus and transfer information is correct, but the price information is wrong. The system has calculated the price of the trip as the sum of each leg, but this sum is incorrect since transfers to another bus are free. Enabling a conversational system with view updates would permit, for example, the conversation to proceed with "User: No, the price information is wrong" and then start an (admittedly complex) dialog to correct the sum.

In this paper, we adopt conventional logic programming syntax, semantics, and terminology for programs, rules, terms, goals, and resolution for non-recursive propositional programs.

2. Methodology

Our objective is to fine-tune the off-the-shelf transformer models GPT-2 [13] and GPT-3 [1] to simulate the propositional logic resolution of rules (see Figure 1). For instance, consider the following example (written in Prolog syntax):

    1) p :- q,r.     % rule: p <- q and r       5) q,r.       % resolution...
    2) q :- true.    % fact: q <- true          6) true,r.
    3) r :- true.    % fact: r <- true          7) r.
    4) ?- p.         % query: is p true?        8) true.

The example is arranged to state the program in terms of facts and rules (lines 1-3), then a goal (line 4), and then the resolution of the goal (lines 5-8). Resolution proceeds by alternating between two types of transformations: rule retrieval and goal rewriting.
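The following minimal Python sketch (ours, not the paper's generator, which is written in Prolog; the function name resolve and the dictionary encoding of the program are illustrative assumptions) makes the two alternating transformations explicit for the purely conjunctive case: the leftmost goal is selected, rewritten with its definition, and true/false goals are removed.

    # Minimal sketch (not the paper's implementation) of conjunctive resolution:
    # rule retrieval alternates with goal rewriting until only true/false remain.
    def resolve(program, query):
        """program: proposition -> body, e.g. {"p": ["q", "r"], "q": [True], "r": [True]}.
        Returns the trace of goal states, mirroring lines 5-8 above."""
        goals = list(program[query])                      # rule retrieval for the query (line 5)
        trace = [list(goals)]
        while any(g not in (True, False) for g in goals):
            goal = goals[0]                               # selection rule: leftmost goal
            if goal is True:
                goals = goals[1:]                         # true goals are removed
            elif goal is False:
                return trace + [[False]]                  # this chain of reasoning ends with false
            else:
                goals = list(program[goal]) + goals[1:]   # goal rewriting via its definition
            trace.append(list(goals))
        return trace

    example = {"p": ["q", "r"], "q": [True], "r": [True]}
    print(resolve(example, "p"))   # [['q', 'r'], [True, 'r'], ['r'], [True]]

The returned trace corresponds to lines 5-8 of the example; the encoding described next linearizes exactly this kind of trace into a token sequence.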
The logic proof above can be encoded as a sequence of tokens (which serves as the input to the GPT models) as follows:

    <bos> (q :- true) (r :- true) (p :- q,r) <soq> p <sop> (p :- q,r) (q :- true) true (r :- true) true <eos>

where <bos> (beginning of sentence), <soq> (start of query), <sop> (start of proof), and <eos> (end of sentence) are special tokens, which serve as segment indicators. During fine-tuning, the entire sequence is presented as an example, while during testing, the sub-sequence <bos> ... <sop> is used to prompt the model and the remaining sub-sequence serves as the gold label.
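As a concrete illustration (ours, not the authors' code; the helper names encode_example and split_prompt, and the literal spelling of the special tokens, are assumptions), the sketch below assembles one fine-tuning sequence and shows the test-time split into prompt and gold label.

    # Sketch of serializing one example with the segment tokens described above,
    # plus the test-time split into prompt (<bos> ... <sop>) and gold label.
    BOS, SOQ, SOP, EOS = "<bos>", "<soq>", "<sop>", "<eos>"

    def encode_example(program_clauses, query, proof_steps):
        # e.g. program_clauses = ["(q :- true)", "(r :- true)", "(p :- q,r)"],
        #      query = "p",
        #      proof_steps = ["(p :- q,r)", "(q :- true)", "true", "(r :- true)", "true"]
        return " ".join([BOS, *program_clauses, SOQ, query, SOP, *proof_steps, EOS])

    def split_prompt(sequence):
        # Everything up to and including <sop> prompts the model; the rest is the gold label.
        head, tail = sequence.split(SOP, 1)
        return head + SOP, tail.strip()

    seq = encode_example(["(q :- true)", "(r :- true)", "(p :- q,r)"], "p",
                         ["(p :- q,r)", "(q :- true)", "true", "(r :- true)", "true"])
    prompt, gold = split_prompt(seq)
    # prompt: "<bos> (q :- true) (r :- true) (p :- q,r) <soq> p <sop>"
    # gold:   "(p :- q,r) (q :- true) true (r :- true) true <eos>"

Randomizing the order of clauses and the choice of symbol names, as discussed next, happens before this serialization step.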
In addition to fine-tuning resolution, propositional programs contain some symmetries that must be accounted for when generating examples: (i) the ordering of terms and clauses is not relevant; and (ii) clause bodies are composed with the conjunction (,), disjunction (;), and negation (not) operators. These symmetries require that the examples randomize certain symbols so that the transformer learner does not incorrectly model sequential aspects of the problem. In addition, the algorithm of propositional resolution must be reflected in the fine-tuning data: (i) the state of the resolution is represented by a sequence of goals; (ii) selection rule: the leftmost goal is selected and replaced with its definition; (iii) true and false goals are removed when they appear; and (iv) resolution ends when no goals remain.

Three remaining issues must be settled to complete the specification of the representation: (i) a goal without a corresponding clause requires some interpretation; we eliminate this case by requiring all programs to have clauses for every goal, so that every chain of reasoning ends with either true or false; (ii) disjunction can be represented in two ways: via multiple rules with the same head and via the semicolon disjunction operator; and (iii) resolution is designed to determine the truth of a goal, and thus the search terminates with the first false or when all goals end with true. However, as discussed below, the resolution of view update reasoning requires the exploration of all proof paths. We explore this termination condition in the experiments.

In the context of relational and deductive databases, view update refers to the problem of translating update requests against a view into update requests against the base data [11]. View updates are a simple form of belief revision reasoning [10] that restricts the update to be the head of a rule and the translation of the update to be facts (i.e., the body of the rule). A translation is an update to the program that makes the view update true. So, given the program:

    1) p :- q;r.     3) q :- false.     5) s :- false.
    2) p :- s.       4) r :- true.

the view update insert p has two translations: insert q (i.e., replace q :- false with q :- true) and insert s. Note that insert r is redundant and thus not part of the translation. View update translations are naturally ambiguous since there are many ways to translate an update. A simple method for reducing ambiguity annotates terms in a rule to indicate a preference for a particular update [14, 15]. Annotating the second rule implies that insert s is the preferred translation. To compute all possible view update translations, the approach we use here explores all solutions to the goal p without early termination and then reverse engineers the proof trees to generate translations by converting false facts to true ones, or vice versa. This same reasoning extends to deletion (each true proof must be falsified) and negation (flip between insertion and deletion). The experiments below reflect these conditions.

For each example, a generator program (written in Prolog; refer to Appendix A for further details) creates a logic program by randomly drawing the names of terms from a fixed vocabulary, randomly ordering the clauses, and randomly choosing among the operators in a syntactically valid way. The generator program then randomly chooses a rule head as the initial goal of the program. Finally, the generator program executes resolution on the generated logic program and the initial goal, and the resulting step-by-step execution is presented as a training example for the model to learn to perform resolution (Figure 1).

3. Experiment Framework

We used the open-source Hugging Face implementation of the GPT-2 transformer architecture (https://github.com/huggingface/transformers), medium size, with default values for the Adam learning rate (5e-5), the Adam optimizer epsilon (1e-8), and the batch size (4). To generate coherent and syntactically valid outputs [16], we set the temperature (0.5), top-p nucleus sampling (0.95), and top-k sampling (50) parameters using grid search. We ran 12 experiments in which the transformer model was fine-tuned and tested to perform resolution for conjunction, disjunction, and negation; Table 1 lists the parameters of each experiment. For each experiment, we randomly generated 3 datasets of increasing size (1K, 10K, and 100K examples) with a 75/25 train/test split. We reported average accuracy (for an exact match with the test example) and an error analysis.

Table 1: Parameters for the experiments. conj_tr: # of conjuncts in the train dataset; conj_ts: # of conjuncts in the test dataset; disj: # of disjuncts; neg: # of negations; rules: # of rules per example; expl: "yes" = explore all branches, "no" = early termination; ctx: carry context.

    Experiment     conj_tr  conj_ts  disj  neg  rules  expl  ctx
    Conjexplore    1-5      1-5      0     0    3      yes   no
    Conjearly      1-5      1-5      0     0    3      no    no
    Conjctx        1-5      1-5      0     0    3      no    yes
    Conjintrpl     10       1-10     0     0    3      no    no
    Conjextrpl     10       11-12    0     0    3      no    no
    Conj10g        1-10     1-10     0     0    3      no    no
    Conjgpt3_in    10       1-10     0     0    3      no    no
    Conjgpt3_ex    10       11-12    0     0    3      no    no
    Disjexplore    1-5      1-5      1-5   0    3      yes   no
    Disjearly      1-5      1-5      1-5   0    3      no    no
    Negbacktrack   1-3      1-3      1-3   1-3  3      no    no
    Viewupdate     1-3      1-3      1-3   1-3  3      yes   yes
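For concreteness, the following sketch shows how such a fine-tuning run could be set up with the Hugging Face library and the hyperparameters listed above (Adam learning rate 5e-5, epsilon 1e-8, batch size 4). It is our illustration under stated assumptions, not the authors' pipeline: the dataset file propositional_proofs.txt, the <pad> token, and the single-epoch loop are hypothetical details.

    # Illustrative fine-tuning setup (not the paper's exact code) for GPT-2 medium
    # on serialized proofs, one "<bos> ... <soq> ... <sop> ... <eos>" sequence per line.
    import torch
    from torch.utils.data import DataLoader
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
    tokenizer.add_special_tokens({"pad_token": "<pad>",
                                  "additional_special_tokens": ["<bos>", "<soq>", "<sop>", "<eos>"]})
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
    model.resize_token_embeddings(len(tokenizer))      # account for the added tokens

    examples = open("propositional_proofs.txt").read().splitlines()   # hypothetical file

    def collate(batch):
        enc = tokenizer(batch, padding=True, return_tensors="pt")
        labels = enc["input_ids"].clone()              # causal LM objective over the whole sequence
        labels[enc["attention_mask"] == 0] = -100      # ignore padding in the loss
        enc["labels"] = labels
        return enc

    loader = DataLoader(examples, batch_size=4, shuffle=True, collate_fn=collate)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, eps=1e-8)

    model.train()
    for batch in loader:                               # one epoch shown for brevity
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Because the entire sequence (program, query, and proof) is presented during fine-tuning, the loss in this sketch is taken over every non-padding token rather than only over the proof segment.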
Conjexplore: in this experiment, the transformer model learns to explore all the paths of the search space defined by a rule clause, where conjunction is the only logical operator used to connect the rule's goals (see the parameters in Table 1). The representation used is as follows:

    (t :- true) (c :- false) (l :- false) ... (e :- l,c,r,t) ... e
    (e :- l,c,r,t) (l :- false) false. (c :- false) false. (r :- false) false. (t :- true) true.

Note that the model must remember the context of the derivation. By "context" we mean the remaining goals to be resolved (e.g., the sequence of goals l,c,r,t is the derivation context).

Conjearly: this experiment is similar to Conjexplore, except that the reasoning process stops when a false term is found (early termination), that is, it stops at (l :- false) false.

Conjctx: similar to experiment Conjearly, except that the context of the derivation is carried along at each step. We aimed to determine whether this representation improves the performance of the system by repeatedly presenting relevant stimuli to the transformer's attention mechanism. The following representation was used (the context is enclosed in square brackets):

    ... e (e :- l,c,r,t).
    ((l :- false),[c,r,t]) false,[c,r,t].
    ((c :- false),[r,t]) false,[r,t].
    ((r :- false),[t]) false,[t].
    ((t :- true))

Conj10g: a copy of experiment Conjearly, except that we increased the number of goals per rule (1-10) for both the train and test datasets. This experiment serves as a baseline for comparison against experiments Conjintrpl and Conjextrpl.

Conjintrpl: we evaluated whether the transformer model could interpolate the resolution of clauses by training the model on 10 goals and evaluating it on tests with fewer than 10 goals.

Conjextrpl: like experiment Conj10g, except that we trained the transformer model on rules with 10 goals and tested it on rules with 11-12 goals as a study of extrapolation.

Conjgpt3_in and Conjgpt3_ex: we compared the performance of the GPT-2 and GPT-3 models by running experiments Conjintrpl and Conjextrpl on a GPT-3 fine-tuned model (https://openai.com/api/).

Disjexplore: in this experiment, the transformer model is fine-tuned to learn the procedural semantics of resolution given a specific arrangement of goals connected by conjunctions and/or disjunctions. For instance, given the facts (g :- true) (j :- false) (q :- true) and a rule with head v, the 4 possible derivations are:

    1. (v :- g,j,q)
       [g,j,q] [true,j,q] [j,q] [false,q] [q] [false]
    2. (v :- g;j,q)
       [[g,[true]]];[[j,q],[false,q]]
    3. (v :- g,(j;q))
       [g,(j;q)] [true,(j;q)] [[j],[false]];[[q],[true]]
    4. (v :- g;j;q)
       [[g],[true]]; [[q],[true]] [[[j],[false]];[[q],[true]]]

Disjearly: this experiment is similar in nature to Disjexplore, except that early termination is triggered when a false term is found on a conjunction branch.

Negbacktrack: in this experiment, we added the negation operator to the training examples, so the model must learn backtracking, resolution, and negation. Consider the following example:

    (r :- false) (z :- true) ... (q :- not r) (p :- not q;r,z) ...
    p
    [p :- not q;r,z]
    [[not (q :- not r)]]
    [[not not r]]
    [[not not (r :- false)]]
    [[not not false]]
    [[not true]]
    [[false]]
    [r,z]
    [false,z]
    [fail]

In the example above, the transformer model first tries to satisfy the leftmost goal of p, that is, not q. Since q is a rule, the model reads the rule (q :- not r) and retrieves its value, not (r :- false). Then the double negation is applied to the goal r and, since the goal sequence is not satisfied, the model "backtracks" to the next branch [r,z]. At this point, the model fails to satisfy the goal, given that r is false and no more branches remain. Note that the transformer model does not have any explicit notion of backtracking.

Viewupdate: in this experiment, for each rule, we randomly marked one or more goals as the preferred derivation paths using the keywords ins (insertion) and del (deletion). We fine-tuned a model to understand this labeling and to explore each derivation path while carrying along a context (i.e., the results from previously resolved derivation paths). A simple case would be:

    (q :- false) (g :- true) (s :- true) ... (d :- ins q,ins not g,s) ...
    ins d
    [ins [(d :- ins q,ins not g,s)]]
    [ins [ins q,ins not g,s]]
    [ins [ins (q :- false),ins not g,s]]
    [ins [ins q false,ins not g,s]]
    [ins q][ins [ins not g,s]]
    [ins q][ins [del g,s]]
    [ins q][ins [del (g :- true),s]]
    [ins q][ins [del g true,s]]
    [ins q,del g][ins [s]]
    [insert q,delete g]

In this example, only the marked goals q and not g are resolved, while the unmarked goal s is skipped. As mentioned in Section 2, goals resolved to false can be inserted, whereas goals resolved to true can be deleted; these temporary results are added to the context (enclosed in square brackets at the beginning of each line). Note that negation flips between deletion and insertion. Finally, the difference between del and delete, and between ins and insert, is to distinguish temporary reasoning during the derivation from the final answer.
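Before turning to the results, the sketch below illustrates how a fine-tuned checkpoint could be prompted with the <bos> ... <sop> prefix and scored by exact match, using the decoding parameters reported above (temperature 0.5, top-p 0.95, top-k 50). The checkpoint name, the max_new_tokens budget, and the helper functions are our own illustrative assumptions, not the authors' evaluation code.

    # Illustrative evaluation loop: prompt with "<bos> ... <sop>" and compare the
    # sampled continuation against the gold proof by exact string match.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("finetuned-gpt2-resolution")   # hypothetical path
    model = GPT2LMHeadModel.from_pretrained("finetuned-gpt2-resolution").eval()

    def predict_proof(prompt):
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, do_sample=True, temperature=0.5,
                                 top_p=0.95, top_k=50, max_new_tokens=256,
                                 eos_token_id=tokenizer.convert_tokens_to_ids("<eos>"),
                                 pad_token_id=tokenizer.pad_token_id)
        continuation = out[0][inputs["input_ids"].shape[1]:]      # drop the prompt tokens
        return tokenizer.decode(continuation, skip_special_tokens=False).strip()

    def exact_match_accuracy(test_pairs):
        # test_pairs: list of (prompt, gold) pairs, where each prompt ends with <sop>.
        hits = sum(predict_proof(prompt) == gold for prompt, gold in test_pairs)
        return hits / len(test_pairs)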
4. Results and Discussion

The results of the experiments are shown in Table 2. We also report the results of a post-hoc error analysis in which three main types of errors were identified: (i) syntactic errors (e.g., incomplete sequences, misplaced tokens, etc.); (ii) redundancy errors (e.g., the model generating unnecessary tokens); and (iii) transcription errors (e.g., incorrectly flipping true/false and ins/del values during resolution, etc.). Some incorrect examples fall into multiple error types.

Table 2: Average accuracy results for the experiments, by dataset size.

    Experiment     1K     10K    100K
    Conjexplore    0.55   0.88   1.00
    Conjearly      0.66   0.93   1.00
    Conjctx        1.00   1.00   1.00
    Conj10g        0.69   1.00   1.00
    Conjintrpl     0.50   0.57   0.68
    Conjextrpl     0.68   0.96   0.99
    Conjgpt3_in    1.00   N/A    N/A
    Conjgpt3_ex    1.00   N/A    N/A
    Disjexplore    0.27   0.54   0.98
    Disjearly      0.31   1.00   1.00
    Negbacktrack   0.33   1.00   1.00
    Viewupdate     0.17   0.90   1.00

Table 3: Per-experiment error analysis, expressed as the percentage of incorrect examples in each error category, by dataset size (1K / 10K / 100K).

    Experiment     syntactic        redundancy       transcription
                   1K   10K  100K   1K   10K  100K   1K   10K  100K
    Conjexplore    66   45   0      60   38   0      50   13   0
    Conjearly      0    0    0      100  100  0      0    0    0
    Conjctx        0    0    0      0    0    0      0    0    0
    Conj10g        100  0    0      0    0    0      11   0    0
    Conjintrpl     100  100  100    1    1    1      16   19   14
    Conjextrpl     99   100  100    0    0    0      7    0    0
    Conjgpt3_in    97   –    –      84   –    –      4    –    –
    Conjgpt3_ex    0    –    –      0    –    –      0    –    –
    Disjexplore    73   69   56     47   35   32     56   14   17
    Disjearly      1    0    0      95   0    0      13   0    0
    Negbacktrack   100  0    0      0    0    0      74   0    0
    Viewupdate     99   99   0      0    1    0      55   41   0

Considered together, experiments Conjexplore, Conjearly, and Conjctx show that the generation of specific training examples successfully trains the GPT-2 model to perform symbolic reasoning over conjunctive predicates. In particular, carrying along the context of the resolution as it proceeds greatly reduces the number of examples needed to fine-tune the model. The most common type of error in Conjexplore is syntactic (55% avg.), followed by redundancy (48% avg.), whereas Conjearly completely eliminates both syntactic and transcription errors (see Table 3).

Experiments Conj10g, Conjintrpl, and Conjextrpl explore the flexibility and brittleness of the transformer model. Comparing Conj10g with Conjearly, the only difference is that the former dataset contains more goals, yet the performance is nearly the same. One could interpret this result as the model being relatively insensitive to the number of goals; however, the error analysis reveals that the more goals there are to process, the more syntactic errors and the fewer redundancy errors occur. After closer examination, we realized that Conjearly is more prone to redundancy errors when the last of the three rules is the one being queried. Given that Conj10g processes more rules, the probability of querying the last defined rule is much lower. Contrasting Conjintrpl with Conj10g, the results show that the transformer model cannot easily interpolate to resolution with fewer than 10 goals, even with 100K examples, if a broader range of training data is not provided.
Surprisingly, when comparing Conjextrpl with Conj10g, the model performs better at extrapolating to a few more goals. Although syntactic errors are predominant in both Conjintrpl and Conjextrpl, the interpolation experiments show that the model mainly struggles with the correct insertion of the <eos> token and the placement of connectors.

The next two results, Conjgpt3_in and Conjgpt3_ex, briefly explore the impact of GPT-3. Both models show much better performance than their GPT-2 counterparts, as is to be expected, since scaling the number of parameters of transformer models generally leads to improved performance. For interpolation, note that GPT-3 (1K) performs better than GPT-2 (100K), which corresponds to a 32% improvement in performance using only a tiny fraction (1/100) of the amount of data used by GPT-2. For extrapolation, Conjgpt3_ex likewise achieved perfect accuracy. These results demonstrate the capability of large pre-trained models such as GPT-3 to generalize an algorithm for symbolic reasoning using only a small amount of training data.

For disjunction, the interpretation of Disjexplore is a little tricky since the problem to solve is larger, i.e., there are more logical connectives involved. Given the poor performance, tracking disjunction state, even with the use of context, is clearly a challenge. We attribute the good performance of Disjearly simply to the fact that most examples terminate early (67%), thus requiring the model to track less state, and to the fact that shorter (failed) proofs are probabilistically less likely to contain an error. While Disjexplore contains errors of all three identified types, Disjearly mostly produces redundancy errors (due to the tendency of the model to generate longer sequences after the early-termination step).

Negbacktrack provides evidence that a more complex logic can be learned with very little penalty. Negbacktrack mostly presents syntactic and transcription errors (i.e., missing brackets and tokens during backtracking, and incorrectly resolving nested negations, respectively).

Experiment Viewupdate demonstrates that simple belief revision reasoning is also possible within transformer models, although many additional examples are required. The error analysis reveals that the most common types of errors are syntactic errors (e.g., missing ins and del keywords in the context, and empty brackets at the end of the resolution, for instance [] instead of [insert q,delete g]) and transcription errors (flipping ins to del and vice versa).

Finally, overall, the experiments demonstrate that a transformer model that is fine-tuned on data with a proper representation is able to learn and generalize the underlying inference algorithm to simulate propositional logic resolution, regardless of the length and complexity of the derivation proofs (see Appendix B for further details).

5. Related Work

Integration of symbolic and neural architectures is an active area of research. One strategy compiles symbolic reasoning into a neural network [17, 18, 19], where the network is constructed such that different predicates/rules can be semantically related to each other. While this approach requires that the construction of neural networks for each rule be integrated with the application at the network level, in our work, reasoning is directly available at the model evaluation level.
More generally, a classification of different systems focuses on the structure and reasoning of the underlying network [19]. For example, one category of the classification distinguishes symbolic from sub-symbolic systems. Transformer models are sub-symbolic since each lexical term is associated with an embedding. We use training data to essentially coerce the model into treating some sub-symbolic terms as symbolic terms, i.e., propositions. Generically, the methodology of this paper injects symbolic reasoning by translating a logic program into training data for the transformer model. In effect, the lexical tokens of the logic program are blended with other lexical tokens provided in the training data, allowing for a mixing of different forms of reasoning. This representation trick is effectively a form of integration of a symbolic language with an inductive bias [20]. Notably, our experiments show that transformers can effectively learn symbolic language-based representations and reasoning.

A neuro-symbolic approach that is similar in nature to ours proposes an RNN-based iterative network that is trained end-to-end to perform reasoning over logic programs [21]. From the implementation perspective, we take advantage of the capabilities of transformers over RNNs: transformers use positional embeddings and a self-attention mechanism, which allow our model to better capture the relationships between the different elements (symbols, logical connectives, and rules) of both the input and output sequences. This makes our model less prone to losing track during step-by-step resolution.

Logical Neural Networks [22] propose a model that performs logical reasoning by creating 1-to-1 mappings between neurons and the elements of logical formulae, where the weights of the neurons are constrained to act as AND or OR gates. This work differs from ours mainly because it uses fixed-length representations that constrain the number of symbols/connectors that can be used as well as the expressiveness of the resolution. As we demonstrated, our transformer model can flexibly adapt to variable-length inputs (context and query) and outputs (resolution).

Another line of work proposes Logic-Integrated Neural Networks [23, 24], a framework that dynamically constructs its neural architecture based on a logical expression and then performs propositional reasoning. Although it supports compositionality, the computational cost increases exponentially as the number of expressions grows. In contrast, our model achieves high performance without resorting to restructuring the network architecture or training on massive amounts of data (only fine-tuning was needed), resulting in much lower engineering costs.

Although somewhat tangential, [25] provide an experimental evaluation of how transformer models perform on mathematical reasoning. Although this study did not cover logical reasoning, it highlights the capability of transformer models to interpret free-form questions, perform mathematical reasoning, and generate the correct answers. Unlike our approach, this work only considers direct responses: for instance, given the question "Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r", the transformer model directly outputs "4" without intermediate steps. In contrast, our model uses step-by-step resolution, which can further be used for validation and, more importantly, for transparency, interpretability, and explainability purposes.
6. Conclusion and Future Work

In this paper we describe a method to train transformer models in the steps of propositional logic resolution through examples. We systematically explore different possible representations of propositional clauses as data and measure the performance impact. The end result is a reasoning system for conjunction, disjunction, and negation. We then apply this system to computing view updates (a form of belief revision reasoning). For future work, one step is to extend the techniques described here to more powerful logic systems [26]; the principal extension is the addition of data that trains the steps involved in unification. Unification is a context-specific substitution of symbols, similar to the types of substitutions performed in this paper. Finally, we end this paper with a call for researchers to examine a neuro-symbolic reasoning system with respect to its expressive power to simulate other reasoning systems. As this paper shows, a (low-cost) simulation provides a powerful argument to prefer one system over another.

References

[1] T. Brown, B. Mann, N. Ryder, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[2] J. W. Rae, S. Borgeaud, et al., Scaling language models: Methods, analysis & insights from training Gopher, CoRR abs/2112.11446 (2021). URL: https://arxiv.org/abs/2112.11446.
[3] D. Ham, J.-G. Lee, Y. Jang, et al., End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2, in: Proceedings of the 58th ACL Conference, 2020, pp. 583–592. URL: https://aclanthology.org/2020.acl-main.54. doi:10.18653/v1/2020.acl-main.54.
[4] P. Budzianowski, I. Vulić, Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems, in: Proceedings of the 3rd Workshop on Neural Generation and Translation, 2019, pp. 15–22.
[5] A. Santoro, A. Lampinen, K. Mathewson, T. Lillicrap, D. Raposo, Symbolic behaviour in artificial intelligence, arXiv preprint arXiv:2102.03406 (2021).
[6] A. d. Garcez, L. C. Lamb, Neurosymbolic AI: the 3rd wave, arXiv preprint arXiv:2012.05876 (2020).
[7] M. Garnelo, M. Shanahan, Reconciling deep learning with symbolic artificial intelligence: representing objects and relations, Current Opinion in Behavioral Sciences 29 (2019) 17–23.
[8] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, S. J. Gershman, Building machines that learn and think like people, Behavioral and Brain Sciences 40 (2017).
[9] G. Marcus, The next decade in AI: Four steps towards robust artificial intelligence, arXiv preprint arXiv:2002.06177 (2020).
[10] P. Gärdenfors, Belief Revision, Cambridge University Press, 2003.
[11] A. C. Kakas, P. Mancarella, Database updates through abduction, in: VLDB, volume 90, 1990, pp. 650–661.
[12] O. J. Romero, A. Wang, J. Zimmerman, A. Steinfeld, A. Tomasic, A task-oriented dialogue architecture via transformer neural language models and symbolic injection, in: SIGDIAL, 2021, pp. 438–444.
[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2018. URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[14] A. Tomasic, View update translation via deduction and annotation, in: International Conference on Database Theory, Springer, 1988, pp. 338–352.
[15] A. Tomasic, Determining correct view update translations via query containment, in: International Conference on Logic Programming Workshop on Deductive Databases, Stanford InfoLab, 1994, pp. 75–83. URL: http://ilpubs.stanford.edu:8090/53/.
[16] S. Welleck, I. Kulikov, J. Kim, et al., Consistency of a recurrent language model with respect to incomplete decoding, arXiv preprint arXiv:2002.02492 (2020).
[17] P. Minervini, M. Bošnjak, T. Rocktäschel, et al., Differentiable reasoning on large knowledge bases and natural language, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 5182–5190.
[18] S. N. Tran, Compositional neural logic programming, in: 30th International Joint Conference on Artificial Intelligence, 2021, pp. 3059–3066.
[19] L. De Raedt, S. Dumančić, R. Manhaeve, G. Marra, From statistical relational to neural symbolic artificial intelligence: a survey, in: IJCAI-20, 2021, pp. 89–94.
[20] H. Geffner, Target languages (vs. inductive biases) for learning to act and plan, arXiv preprint arXiv:2109.07195 (2021).
[21] N. Cingillioglu, A. Russo, DeepLogic: Towards end-to-end differentiable logical reasoning, CEUR Workshop Proceedings, 2019. URL: http://hdl.handle.net/10044/1/71107.
[22] R. Riegel, A. G. Gray, F. P. S. Luus, et al., Logical neural networks, CoRR abs/2006.13155 (2020). URL: https://arxiv.org/abs/2006.13155.
[23] S. Shi, H. Chen, W. Ma, J. Mao, M. Zhang, Y. Zhang, Neural logic reasoning, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1365–1374. URL: https://doi.org/10.1145/3340531.3411949.
[24] S. Shi, H. Chen, M. Zhang, Y. Zhang, Neural logic networks, CoRR abs/1910.08629 (2019). URL: http://arxiv.org/abs/1910.08629.
[25] D. Saxton, E. Grefenstette, F. Hill, P. Kohli, Analysing mathematical reasoning abilities of neural models, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 2019, pp. 127–132.
[26] W. Van Laer, L. De Raedt, How to upgrade propositional learners to first order logic: A case study, in: Relational Data Mining, Springer, 2001, pp. 235–261.

A. Dataset Generation Using Prolog

The code snippet below shows the Prolog program used to generate the examples. For simplicity, we only present the program for conjunction/disjunction generation.

    facts([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z]).

    % fact generation: mk_facts(K, Facts0, RemainingFacts, KB) builds K random fact clauses
    mk_facts(0, Facts0, Facts0, []).
    mk_facts(K, Facts0, Facts2, [NewRule0|NewRule1]) :-
        pick(Fact, Facts0, Facts1),
        mk_fact_rule(Fact, NewRule0),
        K1 is K-1,
        mk_facts(K1, Facts1, Facts2, NewRule1).

    % each fact is randomly assigned true (when X > 0.7) or false
    mk_fact_rule(Fact, (Fact :- Body)) :-
        random(X),
        (X > 0.7) -> Body = true ; Body = false.

    % pick(X, L1, L2): remove a random element X from L1, leaving L2
    pick(X, L1, L2) :-
        length(L1, Length),
        Length1 is Length+1,
        random(1, Length1, N),
        nth(N, L1, X),
        delete(L1, X, L2).

    prog0(KB, R) :-
        facts(F),
        mk_facts(12, F, R, KB).

    % rule generation: each rule body draws P terms from the knowledge base
    mk_rules(0, _, _, R, R, []).
    mk_rules(K, P, KB, R0, R2, [NewRule|NewKB]) :-
        pick(Head, R0, R1),
        mk_rule(P, Head, KB, NewRule),
        K1 is K-1,
        mk_rules(K1, P, KB, R1, R2, NewKB).

    mk_rules_rand(0, _, R, R, []).
    mk_rules_rand(K, KB, R0, R2, [NewRule|NewKB]) :-
        pick(Head, R0, R1),
        random(4, 6, RN),
        mk_rule(RN, Head, KB, NewRule),
        K1 is K-1,
        mk_rules_rand(K1, KB, R1, R2, NewKB).

    mk_rule(P, Head, KB, (Head :- Body)) :-
        mk_rule_body(P, KB, Body).
    mk_rule_body(1, KB, Term) :-
        pick((Term :- _), KB, _).
    mk_rule_body(P, KB0, TermBody) :-
        pick((Term :- _), KB0, KB1),
        P1 is P-1,
        mk_rule_body(P1, KB1, Body),
        mk_rule_body_body(Term, Body, TermBody).

    % connect two sub-bodies with conjunction or disjunction, chosen at random
    mk_rule_body_body(Term, Body, TermBody) :-
        random(X),
        (X < 0.5) -> TermBody = (Term, Body) ; TermBody = (Term; Body).

    prog1(Query, KB, Rules) :-
        prog0(KB, R),
        mk_rules_rand(3, KB, R, _, Rules),
        pick((Query :- _), Rules, _).

    %%% the meta-interpreter (mi) - thanks to Markus Triska (triska@metalevel.at)
    clause_list((g(X)), [X]) :- X \= (_,_).
    clause_list((g(H), T), [H|T1]) :- clause_list(T, T1).
    clause_list((g(H); g(T)), [(H; T)]).

    % mi_backtrace is the core prover
    mi_defaulty_better(true, true).
    mi_defaulty_better(false, false).
    mi_defaulty_better((A,B), (BA,BB)) :-
        mi_defaulty_better(A, BA),
        mi_defaulty_better(B, BB).
    mi_defaulty_better((A;B), (BA;BB)) :-
        mi_defaulty_better(A, BA),
        mi_defaulty_better(B, BB).
    mi_defaulty_better(G, g(G)) :-
        G \= false, G \= true, G \= (_,_), G \= (_;_).

    mi_clause(Goal, KB, Body0, Body) :-
        member((Goal :- Body0), KB),
        mi_defaulty_better(Body0, Body).

    % mi_tree/3 collects the proof trail for a goal
    mi_tree(true, _, [true]).
    mi_tree(false, _, [false]).
    mi_tree((A,B), KB, T2) :-
        mi_tree(A, KB, T0),
        mi_tree(B, KB, T1),
        append(T0, T1, T2).
    mi_tree((A;B), KB, T2) :-
        mi_tree(A, KB, T0),
        mi_tree(B, KB, T1),
        T2 = [(T0; T1)].
    mi_tree(g(G), KB, T1) :-
        mi_clause(G, KB, Body, G_Body),
        mi_tree(G_Body, KB, T0),
        append([(G :- Body)], T0, T1).

    new_conjunct_env(B, AEnv, ANewEnv) :-
        clause_list(B, BList),
        new_conjunct_env_(BList, AEnv, ANewEnv).
    new_conjunct_env_(_, [], []).
    new_conjunct_env_(BList, [H|T0], [ABList|T1]) :-
        append(H, BList, ABList),
        new_conjunct_env_(BList, T0, T1).

    mi_backtrace(G, KB, Trail) :-
        nonvar(G),
        mi_tree(g(G), KB, Trail).

    prog2(Q, NewKB, Trail) :-
        prog1(Q, KB, Rules),
        append(KB, Rules, NewKB),
        mi_backtrace(Q, NewKB, Trail).

    strip_g(g(X), X).

    % dump the knowledge base, query, and proof to the dataset file
    dump_kb(_, []).
    dump_kb(File, [H|T]) :-
        write_ds(File, H),
        dump_kb(File, T).

    dump_query(File, Query) :-
        write_ds(File, Query).

    dump_proof(_, []).
    dump_proof(File, [[false|T0]|_]) :-
        write_ds(File, [false|T0]).
    dump_proof(File, [H0|T]) :-
        write_ds(File, H0),
        dump_proof(File, T).

    write_ds(File, Text) :-
        open(File, append, Stream),
        write(Stream, Text),
        nl(Stream),
        close(Stream).

    % prog3/0,1,2: generate K examples and append them to the dataset file
    prog3 :- prog3(5).
    prog3(K) :- prog3('data/disjunction.txt', K).
    prog3(_, 0).
    prog3(File, K) :-
        prog2(Query, KB, Trail),
        write_ds(File, ''),
        write_ds(File, KB),
        write_ds(File, ''),
        write_ds(File, Query),
        write_ds(File, ''),
        dump_pair(File, KB, Trail),
        write_ds(File, ''),
        K1 is K-1,
        prog3(File, K1).

    dump_pair(_, _, []).
    dump_pair(File, KB, [H|T]) :-
        write_ds(File, KB),
        write_ds(File, H),
        dump_pair(File, KB, T).

B. Generated Propositional Resolution for View-Updates Reasoning

Below we provide some propositional proofs for the view-updates reasoning experiment, generated by the transformer model, which illustrate its capability to generalize the underlying inference algorithm and to produce sequences of resolution steps of different lengths and levels of complexity.

B.1. Proof 1

    (i :- true) (f :- false) (l :- false) (u :- true) (q :- false) (a :- true) (z :- false) (m :- true)
    (c :- true) (n :- false) (s :- false) (b :- true) (h :- n;f,m) (j :- b,ins c,s) (w :- not m,ins c,i)
    ins w
    [ins [(w :- not m,ins c,i)]]
    [ins [not m,ins c,i]]
    [ins [ins c,i]]
    [ins [ins (c :- true),i]]
    [ins [ins c true,i]]
    [ins [i]]
    []
B.2. Proof 2

    (y :- false) (o :- false) (u :- true) (p :- true) (a :- true) (x :- true) (n :- true) (k :- false)
    (w :- true) (m :- false) (s :- false) (l :- false) (c :- ins y,(ins l;ins not p)) (d :- u;w,s) (b :- s,(o;k))
    ins c
    [ins [(c :- ins y,(ins l;ins not p))]]
    [ins [ins y,(ins l;ins not p)]]
    [ins [ins (y :- false),(ins l;ins not p)]]
    [ins [ins y false,(ins l;ins not p)]]
    [[ins y][ins [ins l;ins not p]]
    [ins y][ins [ins l]]
    [ins y][ins [ins (l :- false)]]
    [ins y][ins [ins l false]]]
    [ins y,ins l][ins [ins not p]]
    [ins y,ins l][ins [del p]]
    [ins y,ins l][ins [del (p :- true)]]
    [ins y,ins l][ins [del p true]]
    [ins y,ins l,del p]
    [insert y,insert l,delete p]