<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Tomasic</string-name>
          <email>tomasic@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar J. Romero</string-name>
          <email>oscarr@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Zimmerman</string-name>
          <email>johnz@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aaron Steinfeld</string-name>
          <email>steinfeld@cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>5000 Forbes Avenue, Pittsburgh, 15213</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
<p>This paper presents exploratory work on how the key components of transformer-based neural language models (the self-attention mechanism and large-scale pre-trained models) can be leveraged to perform context retrieval, symbol manipulation, and propositional reasoning. We fine-tuned off-the-shelf GPT-2 and GPT-3 language models to simulate the propositional logic resolution of non-recursive rules that combine conjunction, disjunction, and negation connectives. We then conducted a series of experiments and evaluated the performance of different syntactic representations. The promising results show that a transformer model can (i) generalize an inference algorithm for propositional resolution, and (ii) learn to solve simple instances of view updates (a well-studied problem in the field of databases) by leveraging the propositional reasoning framework. Keywords: transformer neural language models, propositional reasoning, view updates problem. NeSy 2022, 16th International Workshop on Neural-Symbolic Learning and Reasoning, Cumberland Lodge, Windsor, UK. https://www.cs.cmu.edu/~tomasic/ (A. Tomasic); http://www.cs.cmu.edu/~oscarr/ (O. J. Romero); https://design.cmu.edu/people/faculty/john-zimmerman/ (J. Zimmerman); http://www.cs.cmu.edu/~astein/ (A. Steinfeld). ORCID: 0000-0001-7864-0364 (A. Tomasic); 0000-0002-4470-8209 (O. J. Romero); 0000-0001-5299-8157 (J. Zimmerman); 0000-0003-2274-0053 (A. Steinfeld).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Transformer models continue to be the preferred model for many natural language processing
(NLP) tasks, e.g., machine translation, natural language generation, summarization, etc. In general, the
larger the transformer model, the larger the context that captures the reasoning of a task,
resulting in better performance. This scaling relationship between model size and
performance has led to a race to construct ever larger models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        However, symbolic reasoning is not directly available in transformer models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For example,
a common NLP problem that end-to-end transformer-based applications have yet to deal with
is abstraction over concepts. To circumvent this issue, delexicalization/lexicalization techniques
(i.e., the replacement of specific slot values by their corresponding generic slot tokens to allow
learning value-independent parameters) need to be used by a human designer [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. This
manipulation of de/lexicalized terms can be considered a simple form of symbolic manipulation
and reasoning. More generally, the lack of symbolic reasoning functionality has lead researchers
nEvelop-O
LGOBE
      </p>
      <p>https://www.cs.cmu.edu/~tomasic/ (A. Tomasic); http://www.cs.cmu.edu/~oscarr/ (O. J. Romero);
Prolog Program:
facts([a,b,c…]).
prog0(KB, R)
:facts(F),
mk_facts(F, R, KB).
...</p>
      <p>Dataset
&lt;bos&gt;
(a:-true) (b:-false)
(p :- (a,b);q
...
&lt;eos&gt;</p>
      <p>Fine-tune
Transformer</p>
      <p>Model
GPT-2/
GPT-3</p>
      <p>Finetuned
model
to extend the deep neural network models to directly represent symbolic reasoning [5, 6, 7, 8, 9].</p>
      <p>In this paper, we consider a complementary approach where we demonstrate the capability of
transformer models to directly simulate reasoning in propositional logic. We use this reasoning
to implement a simple form of belief revision [10], namely propositional view updates [11].</p>
      <p>The main contributions of this paper are: (i) A data representation of conjunction, disjunction
and negation symbolic logic designed to represent propositional resolution; (ii) A methodology
for generation of fine-tuning examples; (iii) A fine-tuned transformer model that learns the
underlying symbolic inference algorithm for propositional resolution; and (iv) An experimental
analysis of the performance of the fine-tuned model. In addition, we explore the implications of
these contributions in two ways: (i) We study the impact of the transformer model size and
structure on performance; and (ii) We apply this work to propositional view updates.</p>
      <p>Transformer models have been demonstrated to generate semi-structured representations [12]
that combine structured data, API calls, and natural language comprehension and synthesis.
This paper points the way to extend transformer models to perform belief revision over these
semi-structured representations.</p>
      <p>To illustrate a direct application of our contribution, consider the following conversational
interaction extracted from the logs of our prior work [12]:
User: I want to take a bus to Carnegie Mellon University
System: I have 4 buses for you... The bus 41 leaves at 4:33pm and costs $5.50. There is 1 transfer. You'll need to take a connection with bus 61C...</p>
<p>The system response is only partially correct: the bus information and transfer are correct, but
the price information is wrong. The system has calculated the price of the trip as the sum of each
leg, but this sum is incorrect since transfers to another bus are free. Enabling a conversational
system with view updates would permit, for example, the conversation to proceed as “User: No,
the price information is wrong” and then start an (admittedly complex) dialog to correct the sum.</p>
      <p>In this paper, we adopt conventional logic programming syntax, semantics, and terminology
for programs, rules, terms, goals, and resolution for non-recursive propositional programs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
Our objective is to fine-tune off-the-shelf transformer models GPT-2 [13] and GPT-3 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to
simulate propositional logic resolution of rules (see Figure 1). For instance, consider the
following example (written in Prolog syntax).
1) p :- q,r.
2) q :- true.
3) r :- true.
4) ?- p.
5) (p :- q,r)    % retrieve the rule for goal p
6) (q :- true)   % goal q rewrites to true
7) (r :- true)   % goal r rewrites to true
8) true.         % no goals remain; the proof succeeds
      </p>
      <p>% rule: p ← q ∧ r
% fact: q ← true
% fact: r ← true
% goal: p ?</p>
      <p>The example is arranged to state the program in terms of facts and rules (lines 1-3), then a
goal (line 4), and then the resolution of the goal (lines 5-8). Resolution proceeds by alternating
between two types of transformations: rule retrieval and goal rewriting. The logic proof above
can be encoded as a sequence of tokens (that serve as the input to GPT models) as follows:
&lt;bos&gt; (q:-true) (r:-true) (p:-q,r) &lt;soq&gt; p
&lt;sop&gt; (p:-q,r) (q:-true) true (r:-true) true &lt;eos&gt;</p>
      <p>Where &lt;bos&gt; (beginning of sentence), &lt;soq&gt; (start of query), &lt;sop&gt; (start of proof), and &lt;eos&gt;
(end of sentence) are special tokens, which serve as segment indicators. During fine-tuning, the
entire sequence is presented as an example, while during testing, the sub-sequence &lt;bos&gt;...&lt;sop&gt;
is used to prompt the model and the remaining sub-sequence serves as a gold label.</p>
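      <p>The encoding and the prompt/gold split can be sketched as follows (a minimal Python illustration; the helper names are ours, not the authors' code, and the special tokens are assembled from character codes):
```python
LT, GT = chr(60), chr(62)  # angle-bracket delimiters of the special tokens
BOS, SOQ, SOP, EOS = (LT + t + GT for t in ("bos", "soq", "sop", "eos"))

def encode_example(clauses, query, proof_steps):
    """Render one fine-tuning example in the paper's token format."""
    return " ".join([BOS, *clauses, SOQ, query, SOP, *proof_steps, EOS])

def split_for_testing(sequence):
    """Prompt the model with bos...sop; the remaining tokens are the gold label."""
    prefix, _, rest = sequence.partition(SOP)
    return prefix + SOP, rest.strip()

seq = encode_example(["(q:-true)", "(r:-true)", "(p:-q,r)"], "p",
                     ["(p:-q,r)", "(q:-true)", "true", "(r:-true)", "true"])
prompt, gold = split_for_testing(seq)
# gold is "(p:-q,r) (q:-true) true (r:-true) true" followed by the end token
```
During fine-tuning the full `seq` would be presented; during testing only `prompt` would be fed to the model and the generation compared against `gold`.</p>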
      <p>In addition to fine-tuning resolution, propositional programs contain some symmetries that
must be accounted for when generating examples: (i) The ordering of terms and clauses is not
relevant; and (ii) Clause bodies are composed with conjunction (,), disjunction (;), and negation
(not) operators. These symmetries require that the examples randomize certain symbols so that
the transformer learner does not incorrectly model sequential aspects of the problem.</p>
      <p>In addition, the algorithm of propositional resolution must be reflected in the fine-tuning
data: (i) The state of the resolution is represented by a sequence of goals; (ii) Selection Rule: the
leftmost goal is selected and replaced with its definition; (iii) true and false goals are removed
when they appear; and (iv) Resolution ends when no goals remain.</p>
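      <p>For the conjunctive fragment, the four conditions above amount to a small interpreter (our own Python sketch; the authors generate their proofs with a Prolog program):
```python
def resolve(program, goal):
    """Trace leftmost-goal resolution for the conjunctive fragment.

    `program` maps each proposition to its definition: True, False,
    or a list of sub-goals (a conjunction). Returns (value, trace),
    where the trace records each rule retrieval / goal rewrite.
    """
    goals = [goal]                  # (i) state: the sequence of goals
    trace = []
    while goals:                    # (iv) stop when no goals remain
        g = goals.pop(0)            # (ii) selection rule: leftmost goal
        body = program[g]           # every goal is required to have a clause
        trace.append((g, body))
        if body is True:            # (iii) true goals are removed
            continue
        if body is False:           # a false goal fails the conjunction
            return False, trace
        goals = list(body) + goals  # replace the goal with its definition
    return True, trace

# the worked example: p :- q,r.  q :- true.  r :- true.  ?- p.
prog = {"p": ["q", "r"], "q": True, "r": True}
value, trace = resolve(prog, "p")
# the trace mirrors the encoded proof: (p:-q,r) (q:-true) true (r:-true) true
```
</p>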
      <p>Three remaining issues must be settled to complete the specification of the representation:
(i) A goal without a corresponding clause requires some interpretation. We eliminate this case
by requiring all programs to have clauses for any goal. Thus, every chain of reasoning ends
with either true or false; (ii) Disjunction can be represented in two ways: via multiple rules
with the same head or via the semicolon disjunction operator; and (iii) Resolution is designed to
determine the truth of a goal and thus the search terminates with the first false or all goals end
with true. However, as discussed below, the resolution of view update reasoning requires the
exploration of all proof paths. We explore this termination condition in the experiments.</p>
      <p>In the context of relational and deductive databases, view update refers to the problem of
translating update requests against a view into update requests against the base data [11]. View
updates are a simple form of belief revision reasoning [10] that restricts the update to be the head
of a rule and the translation of the update to be facts (i.e., the body of the rule). A translation is
an update to the program that makes the view update true. So, given the program:
1) p :- q;r.
2) p :- s.
3) q :- false.
4) r :- true.
5) s :- false.</p>
      <p>The view update insert p has two translations: insert q (i.e., replace q :- false with
q :- true) and insert s. Note that insert r is redundant and thus not part of the
translation. View update translations are naturally ambiguous since there are many ways to
translate an update. A simple method for reducing ambiguity annotates terms in a rule to indicate a
preference for a particular update [14, 15]. Annotating the second rule implies that insert s
is the preferred translation. To compute all possible view update translations, the approach
we use here explores all solutions to the goal p without early termination and then reverse
engineers the proof trees to generate translations by converting false facts to true ones, or vice
versa. The same reasoning extends to deletion (each true proof must be falsified) and negation
(a flip between insertion and deletion). The experiments below reflect these conditions.</p>
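      <p>The translation procedure can be sketched as follows (our own Python re-implementation of the idea for the negation-free, non-recursive case; the function and structure names are assumptions):
```python
def insert_translations(program, goal):
    """Enumerate sets of insertions that make `goal` derivable.

    `program` maps a proposition either to a boolean fact or to a
    list of alternative rule bodies (disjunction); each body is a
    list of sub-goals (conjunction). Every proof path contributes
    the set of false facts it rests on; an empty set means the path
    is already true, so that translation is redundant.
    """
    def paths(g):
        defn = program[g]
        if isinstance(defn, bool):
            yield set() if defn else {g}   # flip a false fact to true
            return
        for body in defn:                  # one proof path per rule body
            yield from conj_paths(body)

    def conj_paths(body):
        if not body:
            yield set()
            return
        for head_flips in paths(body[0]):
            for rest_flips in conj_paths(body[1:]):
                yield head_flips | rest_flips

    return {frozenset(f) for f in paths(goal) if f}

# 1) p:-q;r.  2) p:-s.  3) q:-false.  4) r:-true.  5) s:-false.
prog = {"p": [["q"], ["r"], ["s"]], "q": False, "r": True, "s": False}
result = insert_translations(prog, "p")
# insert q or insert s; the path through r is already true, so it
# yields the empty (redundant) translation, which is dropped
```
</p>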
      <p>For each example, a generator program (written in Prolog – refer to Appendix A for further
details) creates a logic program by randomly drawing the names of terms from a fixed vocabulary,
randomly ordering the clauses, and randomly choosing among the operators in a syntactically
valid way. The generator program then randomly chooses a rule head as the initial goal of the
program. Finally, the generator program executes resolution on the generated logic program
and the initial goal, and the resulting step-by-step execution is presented as a training example
for the model to learn to perform resolution (Figure 1).</p>
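      <p>A rough Python equivalent of this generation loop is shown below (the authors' generator is written in Prolog; the names and parameters here are illustrative, and only the conjunctive case is sketched):
```python
import random

def generate_program(num_facts=4, max_goals=4,
                     vocab="abcdefghijklmnopqrstuvwxyz"):
    """Randomly draw a conjunctive logic program plus a query.

    Mirrors the generator described in the text: term names are
    drawn from a fixed vocabulary and the clause order is shuffled
    so the model cannot exploit sequential regularities.
    """
    names = random.sample(list(vocab), num_facts + 1)
    head, fact_names = names[0], names[1:]
    program = {f: random.choice([True, False]) for f in fact_names}
    body = random.sample(fact_names,
                         random.randint(1, min(max_goals, num_facts)))
    clauses = [f"({f}:-{str(v).lower()})" for f, v in program.items()]
    random.shuffle(clauses)               # clause order is not meaningful
    clauses.append(f"({head}:-{','.join(body)})")
    program[head] = body                  # the rule defining the query
    return program, clauses, head

random.seed(0)                            # reproducible example
prog, clauses, query = generate_program()
```
Executing resolution on the generated program and query then yields the step-by-step proof that is serialized into a training example.</p>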
    </sec>
    <sec id="sec-3">
      <title>3. Experiment Framework</title>
      <p>We used the open-source implementation of GPT-2 transformer architecture1 (medium size) with
default values for Adam learning rate (5e-5), epsilon for Adam optimizer (1e-8), and batch size
(4). To generate coherent and syntactically valid outputs [16], we set parameters temperature
(0.5), top-p nucleus sampling (0.95) and top-k sampling (50) using grid search.</p>
      <p>We ran 12 experiments where the transformer model was fine-tuned and tested to perform
resolution for conjunction, disjunction, and negation. For each experiment, we randomly
generated 3 datasets of increasing size (1K, 10K, and 100K examples) with a 75/25 train/test split.
We reported average accuracy (for an exact match with the test example) and an error analysis.</p>
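      <p>The exact-match metric, as we read it, amounts to the following sketch (the helper name is hypothetical):
```python
def exact_match_accuracy(predictions, gold_labels):
    """Fraction of generated proofs identical, token for token, to gold."""
    hits = sum(p.split() == g.split() for p, g in zip(predictions, gold_labels))
    return hits / len(gold_labels)

acc = exact_match_accuracy(
    ["(p:-q,r) (q:-true) true (r:-true) true",   # exact match
     "(p:-q,r) (q:-true) true"],                 # truncated proof: no match
    ["(p:-q,r) (q:-true) true (r:-true) true"] * 2)
# acc == 0.5
```
</p>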
      <p>Conjexplore: in this experiment, the transformer model learns to explore all the paths of the
search space defined by a rule clause, where conjunction is the only logical operator used to
connect the rule’s goals (see parameters in Table 1). The representation used is as follows:
&lt;bos&gt; (t:-true) (c:-false) (l:-false) .. (e:-l,c,r,t) ... &lt;soq&gt; e
&lt;sop&gt; (e:-l,c,r,t) (l:-false) false. (c:-false) false. (r:-false) false.
(t:-true) true. &lt;eos&gt;</p>
      <p>1https://github.com/huggingface/transformers</p>
      <p>Note that the model must remember the context of the derivation. By “context” we mean the
remaining goals to be resolved (e.g., the sequence of goals l,c,r,t is the derivation context).</p>
      <p>Conjearly: this experiment is similar to Conjexplore, except that the reasoning process stops
when a false term is found (early termination), that is, stopping at (l:-false) false.</p>
      <p>Conjctx: similar to experiment Conjearly, except that the context of the derivation is carried
at each step. We aimed to determine whether this representation improves the performance of
the system by repeatedly presenting relevant stimuli to the transformer’s attention mechanism.
The following representation was used (the context is enclosed in square brackets):
&lt;bos&gt; ... &lt;soq&gt; e &lt;sop&gt; (e:-l,c,r,t). ((l:-false),[c,r,t]) false,[c,r,t].
((c:-false),[r,t]) false,[r,t]. ((r:-false),[t]) false,[t]. ((t:-true)) &lt;eos&gt;</p>
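      <p>This context-carrying encoding can be emitted mechanically; the sketch below is our own helper (the rendering of the final, context-free step is approximate):
```python
def encode_with_context(rule_head, body, facts):
    """Render a Conjctx-style proof, carrying the remaining goals
    in square brackets at every step."""
    steps = [f"({rule_head}:-{','.join(body)})."]
    for i, goal in enumerate(body):
        rest = body[i + 1:]
        ctx = f",[{','.join(rest)}]" if rest else ""
        val = str(facts[goal]).lower()   # True/False to true/false
        steps.append(f"(({goal}:-{val}){ctx}) {val}{ctx}.")
    return " ".join(steps)

enc = encode_with_context("e", ["l", "c", "r", "t"],
                          {"l": False, "c": False, "r": False, "t": True})
# "(e:-l,c,r,t). ((l:-false),[c,r,t]) false,[c,r,t]. ..."
```
</p>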
      <p>Conj10g: a copy of experiment Conjearly except that we increased the number of goals per
rule (1-10) for both train and test datasets. This experiment serves as a baseline for comparison
against experiments Conjintrpl and Conjextrpl.</p>
      <p>Conjintrpl: We evaluated whether the transformer model could interpolate the resolution of
clauses by training the model on 10 goals and evaluating it on tests with less than 10 goals.</p>
      <p>Conjextrpl: like experiment Conj10g, except that we trained the transformer model on rules
with 10 goals and tested it on rules with 11-12 goals as a study of extrapolation.</p>
      <p>Conjgpt3_in and Conjgpt3_ex: We compared the performance of GPT-2 vs. GPT-3 by
running experiments Conjintrpl and Conjextrpl on a fine-tuned GPT-3 model2.</p>
      <p>Disjexplore: in this experiment, the transformer model is fine-tuned to learn the procedural
semantics for resolution given a specific arrangement of goals connected by conjunctions and/or
disjunctions. For instance, given the facts (g:-true) (j:-false) (q:-true) and a rule v,
the 4 possible derivations are:
1. (v:-g,j,q)
[g,j,q] [true,j,q] [j,q] [false,q] [q] [false]
2. (v:-g;j,q)
[[g,[true]]] ; [[j,q],[false,q]]
[[q],[true]]
3. (v:-g,(j;q))
[g,(j;q)] [true,(j;q)]
[[j],[false]] ; [[q],[true]]
4. (v:-g;j;q)
[[g],[true]] ;
[[[j],[false]] ; [[q],[true]]]</p>
      <p>Disjearly: this experiment is similar in nature to Disjexplore, except that an early termination
is triggered when a false term is found on a conjunction branch.</p>
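      <p>The two termination regimes can be contrasted with a small evaluator over nested conjunctions and disjunctions (our own sketch, not the paper's token-level encoding):
```python
def eval_goal(expr, facts, early=False, steps=None):
    """Evaluate a goal over nested conjunctions/disjunctions.

    `expr` is a proposition name, ("and", l, r), or ("or", l, r);
    `facts` maps names to booleans. With early=True a false conjunct
    ends the branch (as in Disjearly); otherwise every branch is
    visited (as in Disjexplore). `steps` records each fact lookup.
    """
    if steps is None:
        steps = []
    if isinstance(expr, str):
        steps.append((expr, facts[expr]))
        return facts[expr], steps
    op, left, right = expr
    lval, _ = eval_goal(left, facts, early, steps)
    if op == "and" and early and not lval:
        return False, steps            # early termination on a false conjunct
    rval, _ = eval_goal(right, facts, early, steps)
    return (lval and rval) if op == "and" else (lval or rval), steps

facts = {"g": True, "j": False, "q": True}
conj = ("and", "g", ("and", "j", "q"))         # derivation 1: v :- g,j,q
v_full, s_full = eval_goal(conj, facts)                 # visits g, j, q
v_early, s_early = eval_goal(conj, facts, early=True)   # stops after j
```
Full exploration visits all three facts; early termination stops after the false conjunct j, so the model has less state to track.</p>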
      <p>Negbacktrack: in this experiment, we added the negation operator to the training examples, so
the model must learn backtracking, resolution, and negation. Consider the following example:
&lt;bos&gt; (r:-false) (z:-true) ...
(q:-not r) (p:-not q;r,z) ...
&lt;soq&gt; p &lt;sop&gt;
[p:-not q;r,z]
[[not (q:-not r)]]</p>
      <p>[Table 1: parameters for experiments Conjexplore, Conjearly, Conjctx, and Conj10g across dataset sizes 1K, 10K, and 100K.]</p>
      <p>In the example above, the transformer model first tries to satisfy the leftmost goal of p, that
is, not q. Since q is a rule, the model reads the rule (q:-not r) and retrieves its value
not (r:-false). Then, double negation is applied to goal r, and since the goal sequence is
not satisfied, the model “backtracks” to the next branch [r,z]. At this point, the model fails
to satisfy the goal given that r is false and no more branches remain. Note that the transformer
model has no explicit notion of backtracking.</p>
      <p>Viewupdate: in this experiment, for each rule, we randomly marked one or more goals as the
preferred derivation paths using the keywords ins (insertion) and del (deletion). We fine-tuned
a model to understand this labeling and to explore each derivation path while carrying along a
context (i.e., the results from previously resolved derivation paths). A simple case would be:
&lt;bos&gt; (q:-false) (g:-true) (s:-true) ...
(d:-ins q, ins not g, s) ...
&lt;soq&gt; ins d &lt;sop&gt;
[ins [(d:-ins q, ins not g, s)]]
[ins [ins q, ins not g, s]]
[ins [ins (q:-false), ins not g, s]]
[ins [ins q false, ins not g, s]]
[ins q] [ins [ins not g, s]]
[ins q] [ins [del g, s]]
[ins q] [ins [del (g:-true), s]]
[ins q] [ins [del g true, s]]
[ins q, del g] [ins [s]]
[insert q, delete g] &lt;eos&gt;</p>
      <p>In this example, only the marked goals q and not g are resolved, while the unmarked goal s is skipped.
As mentioned in Section 2, goals resolved to false can be inserted whereas goals resolved to
true can be deleted; those temporary results are then added to the context (the text enclosed in
square brackets) at the beginning of each line. Note that negation flips between deletion and
insertion. Finally, the difference between del and delete, and ins and insert, is to distinguish
temporary reasoning during the derivation from the final answer.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The results of the experiments are shown in Table 2. Also, we report the results of a
posthoc error analysis where three main types of errors were identified: (i) Syntactic error (e.g.
incomplete sequences, misplaced tokens, etc.); (ii) Redundancy errors (e.g., the model generating
unnecessary tokens); and (iii) Transcription errors (e.g., incorrectly flipping true/false and
ins/del values during resolution, etc.) Some experiment results fall into multiple error types.</p>
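      <p>As an illustration, the three error types could be detected along the following lines (the concrete checks are our assumptions about the analysis, not the authors' exact criteria):
```python
EOS = chr(60) + "eos" + chr(62)   # the end-of-sentence token

def classify_errors(predicted, gold):
    """Bucket a generated proof into the paper's three error types."""
    errors = set()
    p_tok, g_tok = predicted.split(), gold.split()
    if predicted.count("(") != predicted.count(")") \
            or not predicted.rstrip().endswith(EOS):
        errors.add("syntactic")       # incomplete or malformed sequence
    if len(p_tok) > len(g_tok) and p_tok[:len(g_tok)] == g_tok:
        errors.add("redundancy")      # extra tokens after a correct proof
    flips = {"true": "false", "false": "true", "ins": "del", "del": "ins"}
    if any(flips.get(p) == g for p, g in zip(p_tok, g_tok)):
        errors.add("transcription")   # flipped truth values or ins/del
    return errors

errs = classify_errors("(q:-false) false " + EOS, "(q:-true) true " + EOS)
# a flipped truth value is classified as a transcription error
```
</p>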
      <p>Considered together, experiments Conjexplore, Conjearly, and Conjctx show that the generation
of specific training examples successfully trains the GPT-2 model to perform symbolic reasoning
over conjunctive predicates. In particular, carrying along the context of the resolution as it proceeds
greatly reduces the number of examples needed to fine-tune the model. The most common
type of error in Conjexplore is syntactic (55% avg.) followed by redundancy (48% avg.), whereas
Conjearly completely eliminates both syntactic and transcription errors (see Table 3).</p>
      <p>Experiments Conj10g, Conjintrpl, and Conjextrpl explore the flexibility and brittleness of the
transformer model. Comparing Conj10g with Conjearly, the only difference is that the former
dataset contains more goals, but the performance is nearly the same. One could interpret this
result as the model being relatively insensitive to the number of goals; however, the error analysis
reveals that the more goals there are to process, the more syntactic errors and the fewer redundancy errors occur.
After close examination, we realized that Conjearly is more prone to redundancy errors when
the last of the three rules is the one to be queried. Given that Conj10g processes more rules, the
probability of querying the last defined rule is much lower.</p>
      <p>Contrasting Conjintrpl with Conj10g, the results show that the transformer model cannot
easily interpolate to resolution with less than 10 goals even with 100K examples, if a broader
range of training data is not provided. Surprisingly, when comparing Conjextrpl with Conj10g,
the model performs better at extrapolating to a few more goals. Although syntactic errors are
predominant in both Conjintrpl and Conjextrpl, interpolation experiments show that the model
mainly struggles with the correct insertion of the &lt;eos&gt; token and the placement of connectors.</p>
      <p>The next two results, Conjgpt3_in and Conjgpt3_ex, briefly explore the impact of GPT-3. Both
models show much better performance than their GPT-2 counterparts, as is to be
expected since scaling the number of parameters of transformer models generally leads to
improved performance. For interpolation, note that GPT-3 (1K) performs better than GPT-2
(100K), which corresponds to a 32% improvement in performance using only a tiny fraction
(1/100) of the amount of data used by GPT-2. On the other hand, Conjgpt3_in achieved perfect
accuracy. These results demonstrate the capability of large pre-trained models such as GPT-3 to
generalize an algorithm for symbolic reasoning using only a small amount of training data.</p>
      <p>For disjunction, interpretation of Disjexplore is a little tricky since the problem to solve is
larger, i.e., more logical connectives are involved. Given the poor performance, tracking
disjunction state, even with the use of context, is clearly a challenge. We attribute the good
performance of Disjearly simply to the fact that most examples terminate
early (67%), thus requiring the model to track less state, and to the fact that shorter (failed) proofs are
probabilistically less likely to contain an error. While Disjexplore contains errors of all three
identified types, Disjearly mostly produces redundancy errors (due to the model's tendency
to generate longer sequences after the early-termination step).</p>
      <p>Negbacktrack provides evidence that a more complex logic can be learned with very little
penalty. Negbacktrack mostly presents syntactic and transcription errors (i.e., missing brackets
and tokens during backtracking, and incorrectly resolving nested negations, respectively).</p>
      <p>Experiment Viewupdate demonstrates that simple belief revision reasoning is also possible
within transformer models, although many additional examples are required. The error analysis
reveals that the most common types of errors are syntactic errors (e.g., missing ins and del
keywords from the context, and empty brackets at the end of the resolution, for instance [] instead
of [insert q, delete g]) and transcription errors (flipping ins to del and vice versa).</p>
      <p>Finally, overall, the experiments demonstrate that a transformer model, when fine-tuned
on data that uses a proper representation, is able to learn and generalize the underlying inference
algorithm to simulate propositional logic resolution regardless of the length and complexity
of the derivation proofs (see Appendix B for further details).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>Integration of symbolic and neural architectures is an active area of research. One strategy
compiles symbolic reasoning into a neural network [17, 18, 19], where the network is constructed
such that different predicates/rules can be semantically related to each other. While this approach
requires that the construction of neural networks for each rule must be integrated with the
application at the network level, in our work, reasoning is directly available at the model
evaluation level. More generally, a classification of different systems focuses on the structure
and reasoning of the underlying network [19]. For example, one category of the classification
distinguishes symbolic from sub-symbolic systems. Transformer models are sub-symbolic since
each lexical term is associated with an embedding. We use training data to essentially coerce
the model into treating some sub-symbolic terms as symbolic terms, i.e., propositions.
Generically, the methodology of this paper injects symbolic reasoning by translating a logic
program into training data for the transformer model. In effect, the lexical tokens of the logic
program are blended with other lexical tokens provided in the training data, allowing for
a mixing of different forms of reasoning. This representation trick is effectively a form of
integration of symbolic language with inductive bias [20]. Notably, our experiments show that
transformers can effectively learn symbolic language-based representations and reasoning.</p>
      <p>A neuro-symbolic approach that is similar in nature to ours, proposes an RNN-based iterative
network that is trained end-to-end to perform reasoning over logic programs [21]. From the
implementation perspective, we take advantage of transformers’ capabilities over RNN networks,
that is, transformers use positional embeddings and a self-attention mechanism, which allow
our model to better capture the relationships between the different elements (symbols, logical
connectives, and rules) of both the input and output sequences. This makes our model less
prone to losing track during step-by-step resolution.</p>
      <p>Logical Neural Networks [22] propose a model that performs logical reasoning by creating
1-to-1 mappings between neurons and the elements of logical formulae, where the weights of
neurons are constrained to act as AND or OR gates. This work differs from ours mainly because
it uses fixed-length representations that constrain the number of symbols/connectors to be used
as well as the expressibility of the resolution. As we demonstrated, our transformer model can
flexibly adapt to variable-length inputs (context and query) and outputs (resolution).</p>
      <p>
        Another line of work proposes Logic-Integrated Neural Networks [
        <xref ref-type="bibr" rid="ref5">23, 24</xref>
        ], a framework
that dynamically constructs its neural architecture based on a logical expression and then
performs propositional reasoning. Although it supports compositionality, the computational cost
increases exponentially as the number of expressions grows. In contrast, our model achieves high
performance without resorting to restructuring the network architecture or training on massive
amounts of data (only fine-tuning was needed), resulting in much lower engineering costs.
      </p>
      <p>
        Although somewhat tangential, [
        <xref ref-type="bibr" rid="ref6">25</xref>
        ] provide an experimental evaluation of how transformer
models perform on mathematical reasoning. Although this study did not cover logical reasoning,
it highlights the capability of transformer models to interpret free-form questions, perform
mathematical reasoning, and generate correct answers. Unlike our approach, this work
only considers direct responses; for instance, given the question “Solve -42*r + 27*c = -1167 and
130*r + 4*c = 372 for r” the transformer model directly outputs “4” without intermediate steps.
In contrast, our model uses step-by-step resolution, which can be further used for validation
and, more importantly, for transparency, interpretability, and explainability purposes.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>
        In this paper we describe a method to train transformer models in the steps of propositional logic
resolution through examples. We systematically explore different possible representations of
propositional clauses as data and measure the performance impact. The end result is a reasoning
system for conjunction, disjunction, and negation. We then apply this system to computing
view updates (a form of belief revision reasoning). For future work, one step is to extend the
techniques described here to more powerful logic systems [
        <xref ref-type="bibr" rid="ref7">26</xref>
        ] - the principal extension is the
addition of data that trains the steps involved in unification. Unification is a context-specific
substitution of symbols, similar to the types of substitutions performed in this paper. Finally,
we end this paper with a call for researchers to examine a neuro-symbolic reasoning system
with respect to its expressive power to simulate other reasoning systems. As this paper shows,
a (low-cost) simulation provides a powerful argument to prefer one system over another.
pretrained language models for task-oriented dialogue systems, in: Proceedings of the 3rd
Workshop on Neural Generation and Translation, 2019, pp. 15–22.
[5] A. Santoro, A. Lampinen, K. Mathewson, T. Lillicrap, D. Raposo, Symbolic behaviour in
artificial intelligence, arXiv preprint arXiv:2102.03406 (2021).
[6] A. d. Garcez, L. C. Lamb, Neurosymbolic ai: the 3rd wave, arXiv preprint arXiv:2012.05876
(2020).
[7] M. Garnelo, M. Shanahan, Reconciling deep learning with symbolic artificial intelligence:
representing objects and relations, Current Opinion in Behavioral Sciences 29 (2019)
17–23.
[8] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, S. J. Gershman, Building machines that learn
and think like people, Behavioral and brain sciences 40 (2017).
[9] G. Marcus, The next decade in ai: Four steps towards robust artificial intelligence, arXiv
preprint arXiv:2002.06177 (2020).
[10] P. Gärdenfors, Belief Revision, Cambridge University Press, 2003.
[11] A. C. Kakas, P. Mancarella, Database updates through abduction., in: VLDB, volume 90,
1990, pp. 650–661.
[12] O. J. Romero, A. Wang, J. Zimmerman, A. Steinfeld, A. Tomasic, A task-oriented dialogue
architecture via transformer neural language models and symbolic injection, in: SIGDIAL,
2021, pp. 438–444.
[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
unsupervised multitask learners, 2019.
https://d4mucfpksywv.cloudfront.net/better-languagemodels/language-models.pdf.
[14] A. Tomasic, View update translation via deduction and annotation, in: International
      </p>
      <p>Conference on Database Theory, Springer, 1988, pp. 338–352.
[15] A. Tomasic, Determining correct view update translations via query containment, in:
International Conference on Logic Programming Workshop on Deductive Databases,
Stanford InfoLab, 1994, pp. 75–83. URL: http://ilpubs.stanford.edu:8090/53/.
[16] S. Welleck, I. Kulikov, J. Kim, et al., Consistency of a recurrent language model with respect
to incomplete decoding, arXiv preprint arXiv:2002.02492 (2020).
[17] P. Minervini, M. Bošnjak, T. Rocktäschel, et al., Differentiable reasoning on large
knowledge bases and natural language, in: Proceedings of the AAAI conference on artificial
intelligence, volume 34, 2020, pp. 5182–5190.
[18] S. N. Tran, Compositional Neural Logic Programming, in: 30th International Joint</p>
      <p>Conference on Artificial Intelligence, 2021, pp. 3059–3066.
[19] L. De Raedt, S. Dumančić, R. Manhaeve, G. Marra, From Statistical Relational to Neural</p>
      <p>Symbolic Artificial Intelligence: a Survey, in: IJCAI-20, 2021, pp. 89–94.
[20] H. Geffner, Target languages (vs. inductive biases) for learning to act and plan, arXiv
preprint arXiv:2109.07195 (2021).
[21] N. Cingillioglu, A. Russo, Deeplogic: Towards end-to-end differentiable logical reasoning,</p>
      <p>CEUR Workshop Proceedings, 2019. URL: http://hdl.handle.net/10044/1/71107.
[22] R. Riegel, A. G. Gray, F. P. S. Luus, et al., Logical neural networks, CoRR abs/2006.13155
(2020). URL: https://arxiv.org/abs/2006.13155. arXiv:2006.13155.
[23] S. Shi, H. Chen, W. Ma, J. Mao, M. Zhang, Y. Zhang, Neural Logic Reasoning, Association
for Computing Machinery, New York, NY, USA, 2020, pp. 1365–1374. URL: https://doi.org/10.1145/3340531.3411949.</p>
    </sec>
    <sec id="sec-7">
<title>A. Dataset generation using Prolog</title>
<p>The code snippet below shows the Prolog logic program used to generate the training examples. For
simplicity, we present only the clauses for conjunction/disjunction generation.
facts([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z]).

% fact generation
mk_facts(0, Facts0, Facts0, []).
mk_facts(K, Facts0, Facts2, [NewRule0|NewRule1]) :-
    pick(Fact, Facts0, Facts1),
    mk_fact_rule(Fact, NewRule0),
    K1 is K - 1,
    mk_facts(K1, Facts1, Facts2, NewRule1).

mk_fact_rule(Fact, (Fact :- Body)) :-
    random(X),
    ( X &gt; 0.7 -&gt; Body = true ; Body = false ).

pick(X, L1, L2) :-
    length(L1, Length),
    Length1 is Length + 1,
    random(1, Length1, N),
    nth(N, L1, X),
    delete(L1, X, L2).

prog0(KB, R) :- facts(F), mk_facts(12, F, R, KB).

% rule generation
mk_rules(0, _, _, R, R, []).
mk_rules(K, P, KB, R0, R2, [NewRule|NewKB]) :-
    pick(Head, R0, R1),
    mk_rule(P, Head, KB, NewRule),
    K1 is K - 1,
    mk_rules(K1, P, KB, R1, R2, NewKB).

mk_rules_rand(0, _, R, R, []).
mk_rules_rand(K, KB, R0, R2, [NewRule|NewKB]) :-
    pick(Head, R0, R1),
    random(4, 6, RN),
    mk_rule(RN, Head, KB, NewRule),
    K1 is K - 1,
    mk_rules_rand(K1, KB, R1, R2, NewKB).

mk_rule(P, Head, KB, (Head :- Body)) :-
    mk_rule_body(P, KB, Body).

mk_rule_body(1, KB, Term) :-
    pick((Term :- _), KB, _).
mk_rule_body(P, KB0, TermBody) :-
    pick((Term :- _), KB0, KB1),
    P1 is P - 1,
    mk_rule_body(P1, KB1, Body),
    mk_rule_body_body(Term, Body, TermBody).

mk_rule_body_body(Term, Body, TermBody) :-
    random(X),
    ( X &lt; 0.5 -&gt; TermBody = (Term, Body) ; TermBody = (Term ; Body) ).

prog1(Query, KB, Rules) :-
    prog0(KB, R),
    mk_rules_rand(3, KB, R, _, Rules),
    pick((Query :- _), Rules, _).

%%% the meta-interpreter - mi - Thanks to Markus Triska (triska@metalevel.at)
clause_list(g(X), [X]) :- X \= (_,_).
clause_list((g(H), T), [H|T1]) :- clause_list(T, T1).
clause_list((g(H) ; g(T)), [(H ; T)]).

% mi_backtrace is the core prover
mi_defaulty_better(true, true).
mi_defaulty_better(false, false).
mi_defaulty_better((A, B), (BA, BB)) :-
    mi_defaulty_better(A, BA),
    mi_defaulty_better(B, BB).
mi_defaulty_better((A ; B), (BA ; BB)) :-
    mi_defaulty_better(A, BA),
    mi_defaulty_better(B, BB).

prog2(Q, NewKB, Trail) :-
    prog1(Q, KB, Rules),
    append(KB, Rules, NewKB),
    mi_backtrace(Q, NewKB, Trail).

strip_g(g(X), X).

dump_kb(_, []).
dump_kb(File, [H|T]) :-
    write_ds(File, H),
    dump_kb(File, T).

dump_query(File, Query) :-
    write_ds(File, Query).

dump_proof(_, []).
dump_proof(File, [[false|T0]|_]) :-
    write_ds(File, [false|T0]).
dump_proof(File, [H0|T]) :-
    write_ds(File, H0),
    dump_proof(File, T).

write_ds(File, Text) :-
    open(File, append, Stream),
    write(Stream, Text), nl(Stream),
    close(Stream).</p>
    </sec>
    <sec id="sec-8">
      <title>B. Generated Propositional Resolution for View-Updates</title>
    </sec>
    <sec id="sec-9">
      <title>Reasoning</title>
      <p>Below we provide propositional proofs, generated by the transformer model, for the view-updates
reasoning experiment. They illustrate the model's capability to generalize the underlying
inference algorithm, producing sequences of resolution steps with different lengths and
different levels of complexity.</p>
      <p>B.1. Proof 1
B.2. Proof 2
&lt;bos&gt;
(y :- false)
(o :- false)
(u :- true)
(p :- true)
(a :- true)
(x :- true)
(n :- true)
(k :- false)
(w :- true)</p>
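      <p>Each fact line above assigns a truth value to a proposition. As a rough illustration of the semantics such sequences encode (our own sketch, not the paper's code; the parser and evaluator names are hypothetical), the snippet below reads fact lines of this form and evaluates a conjunctive/disjunctive rule body against them, i.e., the ground truth the transformer must reproduce.</p>

```python
import re

def parse_facts(proof_text):
    """Parse lines like '(y :- false)' into a truth assignment."""
    env = {}
    for atom, val in re.findall(r"\((\w+) :- (true|false)\)", proof_text):
        env[atom] = (val == "true")
    return env

def eval_body(expr, env):
    """Evaluate a rule body such as '(u , (y ; p))' under env,
    where ',' is conjunction and ';' is disjunction."""
    expr = expr.strip()
    if expr in ("true", "false"):
        return expr == "true"
    if expr in env:
        return env[expr]
    if expr.startswith("(") and expr.endswith(")"):
        inner = expr[1:-1]
        depth = 0
        # split at the first top-level connective
        for i, ch in enumerate(inner):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and ch in ",;":
                lhs, rhs = inner[:i], inner[i + 1:]
                if ch == ",":
                    return eval_body(lhs, env) and eval_body(rhs, env)
                return eval_body(lhs, env) or eval_body(rhs, env)
        return eval_body(inner, env)  # redundant parentheses
    raise ValueError(f"unparsable body: {expr!r}")

proof = """
(y :- false)
(o :- false)
(u :- true)
(p :- true)
"""
env = parse_facts(proof)
print(eval_body("(u , (y ; p))", env))  # prints: True
```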
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          , et al.,
          <article-title>Scaling language models: Methods, analysis &amp; insights from training gopher</article-title>
          ,
          <source>CoRR abs/2112.11446</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2112.11446. arXiv:2112.11446.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jang</surname>
          </string-name>
          , et al.,
          <article-title>End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2</article-title>
          , in:
          <source>Proceedings of the 58th ACL Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>592</lpage>
          . URL: https://aclanthology.org/2020.acl-main.54. doi:10.18653/v1/2020.acl-main.54.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Budzianowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <article-title>Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems</article-title>
          , in:
          <source>Proceedings of the 3rd Workshop on Neural Generation and Translation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Neural logic networks</article-title>
          , CoRR abs/1910.08629 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.08629. arXiv:1910.08629.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Saxton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <article-title>Analysing mathematical reasoning abilities of neural models</article-title>
          ,
          <source>in: 7th International Conference on Learning Representations, ICLR</source>
          <year>2019</year>
          ,
          New Orleans, LA, USA,
          <year>2019</year>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>W.</given-names>
            <surname>Van Laer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <article-title>How to upgrade propositional learners to first order logic: A case study</article-title>
          ,
          <source>in: Relational data mining</source>
          , Springer,
          <year>2001</year>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>