<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Tomasic</string-name>
          <email>tomasic@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar J. Romero</string-name>
          <email>oscarr@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Zimmerman</string-name>
          <email>johnz@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aaron Steinfeld</string-name>
          <email>steinfeld@cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>5000 Forbes Avenue, Pittsburgh, 15213</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
<p>This paper presents exploratory work on how the key components of transformer-based neural language models (the self-attention mechanism and large-scale pre-trained models) can be leveraged to perform context retrieval, symbol manipulation, and propositional reasoning. We fine-tuned off-the-shelf GPT-2 and GPT-3 language models to simulate the propositional logic resolution of non-recursive rules that combine conjunction, disjunction, and negation connectives. We then conducted a series of experiments and evaluated the performance of different syntactic representations. The promising results show that a transformer model can (i) generalize an inference algorithm for propositional resolution, and (ii) learn to solve simple instances of view updates (a well-studied problem in the field of databases) by leveraging the propositional reasoning framework. Keywords: transformer neural language models, propositional reasoning, view updates problem. NeSy 2022, 16th International Workshop on Neural-Symbolic Learning and Reasoning, Cumberland Lodge, Windsor, UK. https://www.cs.cmu.edu/~tomasic/ (A. Tomasic); http://www.cs.cmu.edu/~oscarr/ (O. J. Romero); https://design.cmu.edu/people/faculty/john-zimmerman/ (J. Zimmerman); http://www.cs.cmu.edu/~astein/ (A. Steinfeld). ORCID: 0000-0001-7864-0364 (A. Tomasic); 0000-0002-4470-8209 (O. J. Romero); 0000-0001-5299-8157 (J. Zimmerman); 0000-0003-2274-0053 (A. Steinfeld).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Transformer models continue to be the preferred model for many natural language processing
(NLP) tasks, e.g., machine translation, natural language generation, summarization, etc. In general, the
larger the transformer model, the larger the context that captures the reasoning of a task,
resulting in better performance. This scaling relationship between model size and
performance has led to a race to construct ever larger models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        However, symbolic reasoning is not directly available in transformer models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For example,
a common NLP problem that end-to-end transformer-based applications have yet to deal with
is abstraction over concepts. To circumvent this issue, delexicalization/lexicalization techniques
(i.e., the replacement of specific slot values by their corresponding generic slot tokens to allow
learning value-independent parameters) need to be used by a human designer [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. This
manipulation of de/lexicalized terms can be considered a simple form of symbolic manipulation
and reasoning. More generally, the lack of symbolic reasoning functionality has lead researchers
nEvelop-O
LGOBE
      </p>
      <p>https://www.cs.cmu.edu/~tomasic/ (A. Tomasic); http://www.cs.cmu.edu/~oscarr/ (O. J. Romero);
Prolog Program:
facts([a,b,c…]).
prog0(KB, R)
:facts(F),
mk_facts(F, R, KB).
...</p>
      <p>Dataset
&lt;bos&gt;
(a:-true) (b:-false)
(p :- (a,b);q
...
&lt;eos&gt;</p>
      <p>Fine-tune
Transformer</p>
      <p>Model
GPT-2/
GPT-3</p>
      <p>Finetuned
model
to extend the deep neural network models to directly represent symbolic reasoning [5, 6, 7, 8, 9].</p>
      <p>In this paper, we consider a complementary approach where we demonstrate the capability of
transformer models to directly simulate reasoning in propositional logic. We use this reasoning
to implement a simple form of belief revision [10], namely propositional view updates [11].</p>
      <p>The main contributions of this paper are: (i) A data representation of conjunction, disjunction
and negation symbolic logic designed to represent propositional resolution; (ii) A methodology
for generation of fine-tuning examples; (iii) A fine-tuned transformer model that learns the
underlying symbolic inference algorithm for propositional resolution; and (iv) An experimental
analysis of the performance of the fine-tuned model. In addition, we explore the implications of
these contributions in two ways: (i) We study the impact of the transformer model size and
structure on performance; and (ii) We apply this work to propositional view updates.</p>
      <p>Transformer models have been demonstrated to generate semi-structured representations [12]
that combine structured data, API calls, and natural language comprehension and synthesis.
This paper points the way to extend transformer models to perform belief revision over these
semi-structured representations.</p>
      <p>To illustrate a direct application of our contribution, consider the following conversational
interaction extracted from the logs of our prior work [12]:
User: I want to take a bus to Carnegie Mellon University
System: I have 4 buses for you... The bus 41 leaves at 4:33pm and costs $5.50. There is 1 transfer. You'll need to take a connection with bus 61C...</p>
<p>The system response is only partially correct: the bus information and transfer are correct, but
the price information is wrong. The system has calculated the price of the trip as the sum of each
leg, but this sum is incorrect since transfers to another bus are free. Enabling a conversational
system with view updates would permit, for example, the conversation to proceed as “User: No,
the price information is wrong” and then start an (admittedly complex) dialog to correct the sum.</p>
      <p>In this paper, we adopt conventional logic programming syntax, semantics, and terminology
for programs, rules, terms, goals, and resolution for non-recursive propositional programs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
Our objective is to fine-tune off-the-shelf transformer models GPT-2 [13] and GPT-3 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to
simulate propositional logic resolution of rules (see Figure 1). For instance, consider the
following example (written in Prolog syntax).
1) p :- q,r.
2) q :- true.
3) r :- true.
4) ?- p.
5) (p :- q,r)    % retrieve the rule for goal p
6) (q :- true)   % goal q rewrites to true
7) (r :- true)   % goal r rewrites to true
8) true.         % no goals remain; the proof succeeds
      </p>
      <p>% rule: p ← q ∧ r
% fact: q ← true
% fact: r ← true
% goal: p ?</p>
      <p>The example is arranged to state the program in terms of facts and rules (lines 1-3), then a
goal (line 4), and then the resolution of the goal (lines 5-8). Resolution proceeds by alternating
between two types of transformations: rule retrieval and goal rewriting. The logic proof above
can be encoded as a sequence of tokens (that serve as the input to GPT models) as follows:
&lt;bos&gt; (q:-true) (r:-true) (p:-q,r) &lt;soq&gt; p
&lt;sop&gt; (p:-q,r) (q:-true) true (r:-true) true &lt;eos&gt;</p>
      <p>Where &lt;bos&gt; (beginning of sentence), &lt;soq&gt; (start of query), &lt;sop&gt; (start of proof), and &lt;eos&gt;
(end of sentence) are special tokens, which serve as segment indicators. During fine-tuning, the
entire sequence is presented as an example, while during testing, the sub-sequence &lt;bos&gt;...&lt;sop&gt;
is used to prompt the model and the remaining sub-sequence serves as a gold label.</p>
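      <p>The encoding and the prompt/gold split can be sketched as follows (a minimal Python illustration; the helper names are ours, not the authors' code, and the special tokens are assembled from character codes):
```python
LT, GT = chr(60), chr(62)  # angle-bracket delimiters of the special tokens
BOS, SOQ, SOP, EOS = (LT + t + GT for t in ("bos", "soq", "sop", "eos"))

def encode_example(clauses, query, proof_steps):
    """Render one fine-tuning example in the paper's token format."""
    return " ".join([BOS, *clauses, SOQ, query, SOP, *proof_steps, EOS])

def split_for_testing(sequence):
    """Prompt the model with bos...sop; the remaining tokens are the gold label."""
    prefix, _, rest = sequence.partition(SOP)
    return prefix + SOP, rest.strip()

seq = encode_example(["(q:-true)", "(r:-true)", "(p:-q,r)"], "p",
                     ["(p:-q,r)", "(q:-true)", "true", "(r:-true)", "true"])
prompt, gold = split_for_testing(seq)
# gold is "(p:-q,r) (q:-true) true (r:-true) true" followed by the end token
```
During fine-tuning the full `seq` would be presented; during testing only `prompt` would be fed to the model and the generation compared against `gold`.</p>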
      <p>In addition to fine-tuning resolution, propositional programs contain some symmetries that
must be accounted for when generating examples: (i) The ordering of terms and clauses is not
relevant; and (ii) Clause bodies are composed with conjunction (,), disjunction (;), and negation
(not) operators. These symmetries require that the examples randomize certain symbols so that
the transformer learner does not incorrectly model sequential aspects of the problem.</p>
      <p>In addition, the algorithm of propositional resolution must be reflected in the fine-tuning
data: (i) The state of the resolution is represented by a sequence of goals; (ii) Selection Rule: the
leftmost goal is selected and replaced with its definition; (iii) true and false goals are removed
when they appear; and (iv) Resolution ends when no goals remain.</p>
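      <p>For the conjunctive fragment, the four conditions above amount to a small interpreter (our own Python sketch; the authors generate their proofs with a Prolog program):
```python
def resolve(program, goal):
    """Trace leftmost-goal resolution for the conjunctive fragment.

    `program` maps each proposition to its definition: True, False,
    or a list of sub-goals (a conjunction). Returns (value, trace),
    where the trace records each rule retrieval / goal rewrite.
    """
    goals = [goal]                  # (i) state: the sequence of goals
    trace = []
    while goals:                    # (iv) stop when no goals remain
        g = goals.pop(0)            # (ii) selection rule: leftmost goal
        body = program[g]           # every goal is required to have a clause
        trace.append((g, body))
        if body is True:            # (iii) true goals are removed
            continue
        if body is False:           # a false goal fails the conjunction
            return False, trace
        goals = list(body) + goals  # replace the goal with its definition
    return True, trace

# the worked example: p :- q,r.  q :- true.  r :- true.  ?- p.
prog = {"p": ["q", "r"], "q": True, "r": True}
value, trace = resolve(prog, "p")
# the trace mirrors the encoded proof: (p:-q,r) (q:-true) true (r:-true) true
```
</p>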
      <p>Three remaining issues must be settled to complete the specification of the representation:
(i) A goal without a corresponding clause requires some interpretation. We eliminate this case
by requiring all programs to have clauses for any goal. Thus, every chain of reasoning ends
with either true or false; (ii) Disjunction can be represented in two ways: via multiple rules
with the same head or via the semicolon disjunction operator; and (iii) Resolution is designed to
determine the truth of a goal and thus the search terminates with the first false or all goals end
with true. However, as discussed below, the resolution of view update reasoning requires the
exploration of all proof paths. We explore this termination condition in the experiments.</p>
      <p>In the context of relational and deductive databases, view update refers to the problem of
translating update requests against a view into update requests against the base data [11]. View
updates are a simple form of belief revision reasoning [10] that restricts the update to be the head
of a rule and the translation of the update to be facts (i.e., the body of the rule). A translation is
an update to the program that makes the view update true. So, given the program:
1) p :- q;r.
2) p :- s.
3) q :- false.
4) r :- true.
5) s :- false.</p>
      <p>The view update insert p has two translations: insert q (i.e., replace q :- false with
q :- true) and insert s. Note that insert r is redundant and thus not part of the
translation. View update translations are naturally ambiguous since there are many ways to
translate an update. A simple method for reducing ambiguity annotates terms in a rule to indicate a
preference for a particular update [14, 15]. Annotating the second rule implies that insert s
is the preferred translation. To compute all possible view update translations, the approach
we use here explores all solutions to the goal p without early termination and then reverse
engineers the proof trees to generate translations by converting false facts to true ones, or vice
versa. The same reasoning extends to deletion (each true proof must be falsified) and negation
(a flip between insertion and deletion). The experiments below reflect these conditions.</p>
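      <p>The translation procedure can be sketched as follows (our own Python re-implementation of the idea for the negation-free, non-recursive case; the function and structure names are assumptions):
```python
def insert_translations(program, goal):
    """Enumerate sets of insertions that make `goal` derivable.

    `program` maps a proposition either to a boolean fact or to a
    list of alternative rule bodies (disjunction); each body is a
    list of sub-goals (conjunction). Every proof path contributes
    the set of false facts it rests on; an empty set means the path
    is already true, so that translation is redundant.
    """
    def paths(g):
        defn = program[g]
        if isinstance(defn, bool):
            yield set() if defn else {g}   # flip a false fact to true
            return
        for body in defn:                  # one proof path per rule body
            yield from conj_paths(body)

    def conj_paths(body):
        if not body:
            yield set()
            return
        for head_flips in paths(body[0]):
            for rest_flips in conj_paths(body[1:]):
                yield head_flips | rest_flips

    return {frozenset(f) for f in paths(goal) if f}

# 1) p:-q;r.  2) p:-s.  3) q:-false.  4) r:-true.  5) s:-false.
prog = {"p": [["q"], ["r"], ["s"]], "q": False, "r": True, "s": False}
result = insert_translations(prog, "p")
# insert q or insert s; the path through r is already true, so it
# yields the empty (redundant) translation, which is dropped
```
</p>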
      <p>For each example, a generator program (written in Prolog – refer to Appendix A for further
details) creates a logic program by randomly drawing the names of terms from a fixed vocabulary,
randomly ordering the clauses, and randomly choosing among the operators in a syntactically
valid way. The generator program then randomly chooses a rule head as the initial goal of the
program. Finally, the generator program executes resolution on the generated logic program
and the initial goal, and the resulting step-by-step execution is presented as a training example
for the model to learn to perform resolution (Figure 1).</p>
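      <p>A rough Python equivalent of this generation loop is shown below (the authors' generator is written in Prolog; the names and parameters here are illustrative, and only the conjunctive case is sketched):
```python
import random

def generate_program(num_facts=4, max_goals=4,
                     vocab="abcdefghijklmnopqrstuvwxyz"):
    """Randomly draw a conjunctive logic program plus a query.

    Mirrors the generator described in the text: term names are
    drawn from a fixed vocabulary and the clause order is shuffled
    so the model cannot exploit sequential regularities.
    """
    names = random.sample(list(vocab), num_facts + 1)
    head, fact_names = names[0], names[1:]
    program = {f: random.choice([True, False]) for f in fact_names}
    body = random.sample(fact_names,
                         random.randint(1, min(max_goals, num_facts)))
    clauses = [f"({f}:-{str(v).lower()})" for f, v in program.items()]
    random.shuffle(clauses)               # clause order is not meaningful
    clauses.append(f"({head}:-{','.join(body)})")
    program[head] = body                  # the rule defining the query
    return program, clauses, head

random.seed(0)                            # reproducible example
prog, clauses, query = generate_program()
```
Executing resolution on the generated program and query then yields the step-by-step proof that is serialized into a training example.</p>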
    </sec>
    <sec id="sec-3">
      <title>3. Experiment Framework</title>
      <p>We used the open-source implementation of GPT-2 transformer architecture1 (medium size) with
default values for Adam learning rate (5e-5), epsilon for Adam optimizer (1e-8), and batch size
(4). To generate coherent and syntactically valid outputs [16], we set parameters temperature
(0.5), top-p nucleus sampling (0.95) and top-k sampling (50) using grid search.</p>
      <p>We ran 12 experiments where the transformer model was fine-tuned and tested to perform
resolution for conjunction, disjunction, and negation. For each experiment, we randomly
generated 3 datasets of increasing size (1K, 10K, and 100K examples) with a 75/25 train/test split.
We reported average accuracy (for an exact match with the test example) and an error analysis.</p>
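      <p>The exact-match metric, as we read it, amounts to the following sketch (the helper name is hypothetical):
```python
def exact_match_accuracy(predictions, gold_labels):
    """Fraction of generated proofs identical, token for token, to gold."""
    hits = sum(p.split() == g.split() for p, g in zip(predictions, gold_labels))
    return hits / len(gold_labels)

acc = exact_match_accuracy(
    ["(p:-q,r) (q:-true) true (r:-true) true",   # exact match
     "(p:-q,r) (q:-true) true"],                 # truncated proof: no match
    ["(p:-q,r) (q:-true) true (r:-true) true"] * 2)
# acc == 0.5
```
</p>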
      <p>Conjexplore: in this experiment, the transformer model learns to explore all the paths of the
search space defined by a rule clause, where conjunction is the only logical operator used to
connect the rule’s goals (see parameters in Table 1). The representation used is as follows:
&lt;bos&gt; (t:-true) (c:-false) (l:-false) .. (e:-l,c,r,t) ... &lt;soq&gt; e
&lt;sop&gt; (e:-l,c,r,t) (l:-false) false. (c:-false) false. (r:-false) false.
(t:-true) true. &lt;eos&gt;</p>
      <p>1https://github.com/huggingface/transformers</p>
      <p>Note that the model must remember the context of the derivation. By “context” we mean the
remaining goals to be resolved (e.g., the sequence of goals l,c,r,t is the derivation context).</p>
      <p>Conjearly: this experiment is similar to Conjexplore, except that the reasoning process stops
when a false term is found (early termination), that is, stopping at (l:-false) false.</p>
      <p>Conjctx: similar to experiment Conjearly, except that the context of the derivation is carried
at each step. We aimed to determine whether this representation improves the performance of
the system by repeatedly presenting relevant stimuli to the transformer’s attention mechanism.
The following representation was used (the context is enclosed in square brackets):
&lt;bos&gt; ... &lt;soq&gt; e &lt;sop&gt; (e:-l,c,r,t). ((l:-false),[c,r,t]) false,[c,r,t].
((c:-false),[r,t]) false,[r,t]. ((r:-false),[t]) false,[t]. ((t:-true)) &lt;eos&gt;</p>
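      <p>This context-carrying encoding can be emitted mechanically; the sketch below is our own helper (the rendering of the final, context-free step is approximate):
```python
def encode_with_context(rule_head, body, facts):
    """Render a Conjctx-style proof, carrying the remaining goals
    in square brackets at every step."""
    steps = [f"({rule_head}:-{','.join(body)})."]
    for i, goal in enumerate(body):
        rest = body[i + 1:]
        ctx = f",[{','.join(rest)}]" if rest else ""
        val = str(facts[goal]).lower()   # True/False to true/false
        steps.append(f"(({goal}:-{val}){ctx}) {val}{ctx}.")
    return " ".join(steps)

enc = encode_with_context("e", ["l", "c", "r", "t"],
                          {"l": False, "c": False, "r": False, "t": True})
# "(e:-l,c,r,t). ((l:-false),[c,r,t]) false,[c,r,t]. ..."
```
</p>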
      <p>Conj10g: a copy of experiment Conjearly except that we increased the number of goals per
rule (1-10) for both train and test datasets. This experiment serves as a baseline for comparison
against experiments Conjintrpl and Conjextrpl.</p>
      <p>Conjintrpl: We evaluated whether the transformer model could interpolate the resolution of
clauses by training the model on 10 goals and evaluating it on tests with less than 10 goals.</p>
      <p>Conjextrpl: like experiment Conj10g, except that we trained the transformer model on rules
with 10 goals and tested it on rules with 11-12 goals as a study of extrapolation.</p>
      <p>Conjgpt3_in and Conjgpt3_ex: We compared the performance of GPT-2 vs. GPT-3 by
running experiments Conjintrpl and Conjextrpl on a fine-tuned GPT-3 model2.</p>
      <p>Disjexplore: in this experiment, the transformer model is fine-tuned to learn the procedural
semantics for resolution given a specific arrangement of goals connected by conjunctions and/or
disjunctions. For instance, given the facts (g:-true) (j:-false) (q:-true) and a rule v,
the 4 possible derivations are:
1. (v:-g,j,q)
[g,j,q] [true,j,q] [j,q] [false,q] [q] [false]
2. (v:-g;j,q)
[[g,[true]]] ; [[j,q],[false,q]]
[[q],[true]]
3. (v:-g,(j;q))
[g,(j;q)] [true,(j;q)]
[[j],[false]] ; [[q],[true]]
4. (v:-g;j;q)
[[g],[true]] ;
[[[j],[false]] ; [[q],[true]]]</p>
      <p>Disjearly: this experiment is similar in nature to Disjexplore, except that an early termination
is triggered when a false term is found on a conjunction branch.</p>
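      <p>The two termination regimes can be contrasted with a small evaluator over nested conjunctions and disjunctions (our own sketch, not the paper's token-level encoding):
```python
def eval_goal(expr, facts, early=False, steps=None):
    """Evaluate a goal over nested conjunctions/disjunctions.

    `expr` is a proposition name, ("and", l, r), or ("or", l, r);
    `facts` maps names to booleans. With early=True a false conjunct
    ends the branch (as in Disjearly); otherwise every branch is
    visited (as in Disjexplore). `steps` records each fact lookup.
    """
    if steps is None:
        steps = []
    if isinstance(expr, str):
        steps.append((expr, facts[expr]))
        return facts[expr], steps
    op, left, right = expr
    lval, _ = eval_goal(left, facts, early, steps)
    if op == "and" and early and not lval:
        return False, steps            # early termination on a false conjunct
    rval, _ = eval_goal(right, facts, early, steps)
    return (lval and rval) if op == "and" else (lval or rval), steps

facts = {"g": True, "j": False, "q": True}
conj = ("and", "g", ("and", "j", "q"))         # derivation 1: v :- g,j,q
v_full, s_full = eval_goal(conj, facts)                 # visits g, j, q
v_early, s_early = eval_goal(conj, facts, early=True)   # stops after j
```
Full exploration visits all three facts; early termination stops after the false conjunct j, so the model has less state to track.</p>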
      <p>Negbacktrack: in this experiment, we added the negation operator to the training examples, so
the model must learn backtracking, resolution, and negation. Consider the following example:
&lt;bos&gt; (r:-false) (z:-true) ...
(q:-not r) (p:-not q;r,z) ...
&lt;soq&gt; p &lt;sop&gt;
[p:-not q;r,z]
[[not (q:-not r)]]</p>
      <p>[Table 1: parameters for experiments Conjexplore, Conjearly, Conjctx, and Conj10g across dataset sizes 1K, 10K, and 100K.]</p>
      <p>In the example above, the transformer model first tries to satisfy the leftmost goal of p, that
is, not q. Since q is a rule, the model reads the rule (q:-not r) and retrieves its value
not (r:-false). Then, double negation is applied to goal r, and since the goal sequence is
not satisfied, the model “backtracks” to the next branch [r,z]. At this point, the model fails
to satisfy the goal given that r is false and no more branches remain. Note that the transformer
model has no explicit notion of backtracking.</p>
      <p>Viewupdate: in this experiment, for each rule, we randomly marked one or more goals as the
preferred derivation paths using the keywords ins (insertion) and del (deletion). We fine-tuned
a model to understand this labeling and to explore each derivation path while carrying along a
context (i.e., the results from previously resolved derivation paths). A simple case would be:
&lt;bos&gt; (q:-false) (g:-true) (s:-true) ...
(d:-ins q, ins not g, s) ...
&lt;soq&gt; ins d &lt;sop&gt;
[ins [(d:-ins q, ins not g, s)]]
[ins [ins q, ins not g, s]]
[ins [ins (q:-false), ins not g, s]]
[ins [ins q false, ins not g, s]]
[ins q] [ins [ins not g, s]]
[ins q] [ins [del g, s]]
[ins q] [ins [del (g:-true), s]]
[ins q] [ins [del g true, s]]
[ins q, del g] [ins [s]]
[insert q, delete g] &lt;eos&gt;</p>
      <p>In this example, only the marked goals q and not g are resolved, while the unmarked goal s is skipped.
As mentioned in Section 2, goals resolved to false can be inserted whereas goals resolved to
true can be deleted; those temporary results are then added to the context (the text enclosed in
square brackets) at the beginning of each line. Note that negation flips between deletion and
insertion. Finally, the difference between del and delete, and ins and insert, is to distinguish
temporary reasoning during the derivation from the final answer.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The results of the experiments are shown in Table 2. Also, we report the results of a
posthoc error analysis where three main types of errors were identified: (i) Syntactic error (e.g.
incomplete sequences, misplaced tokens, etc.); (ii) Redundancy errors (e.g., the model generating
unnecessary tokens); and (iii) Transcription errors (e.g., incorrectly flipping true/false and
ins/del values during resolution, etc.) Some experiment results fall into multiple error types.</p>
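      <p>As an illustration, the three error types could be detected along the following lines (the concrete checks are our assumptions about the analysis, not the authors' exact criteria):
```python
EOS = chr(60) + "eos" + chr(62)   # the end-of-sentence token

def classify_errors(predicted, gold):
    """Bucket a generated proof into the paper's three error types."""
    errors = set()
    p_tok, g_tok = predicted.split(), gold.split()
    if predicted.count("(") != predicted.count(")") \
            or not predicted.rstrip().endswith(EOS):
        errors.add("syntactic")       # incomplete or malformed sequence
    if len(p_tok) > len(g_tok) and p_tok[:len(g_tok)] == g_tok:
        errors.add("redundancy")      # extra tokens after a correct proof
    flips = {"true": "false", "false": "true", "ins": "del", "del": "ins"}
    if any(flips.get(p) == g for p, g in zip(p_tok, g_tok)):
        errors.add("transcription")   # flipped truth values or ins/del
    return errors

errs = classify_errors("(q:-false) false " + EOS, "(q:-true) true " + EOS)
# a flipped truth value is classified as a transcription error
```
</p>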
      <p>Considered together, experiments Conjexplore, Conjearly, and Conjctx show that the generation
of specific training examples successfully trains the GPT-2 model to perform symbolic reasoning
over conjunctive predicates. In particular, carrying along the context of the resolution as it proceeds
greatly reduces the number of examples needed to fine-tune the model. The most common
type of error in Conjexplore is syntactic (55% avg.) followed by redundancy (48% avg.), whereas
Conjearly completely eliminates both syntactic and transcription errors (see Table 3).</p>
      <p>Experiments Conj10g, Conjintrpl, and Conjextrpl explore the flexibility and brittleness of the
transformer model. Comparing Conj10g with Conjearly, the only difference is that the former
dataset contains more goals, but the performance is nearly the same. One could interpret this
result as the model being relatively insensitive to the number of goals; however, the error analysis
reveals that the more goals there are to process, the more syntactic errors and the fewer redundancy errors occur.
After close examination, we realized that Conjearly is more prone to redundancy errors when
the last of the three rules is the one to be queried. Given that Conj10g processes more rules, the
probability of querying the last defined rule is much lower.</p>
      <p>Contrasting Conjintrpl with Conj10g, the results show that the transformer model cannot
easily interpolate to resolution with less than 10 goals even with 100K examples, if a broader
range of training data is not provided. Surprisingly, when comparing Conjextrpl with Conj10g,
the model performs better at extrapolating to a few more goals. Although syntactic errors are
predominant in both Conjintrpl and Conjextrpl, interpolation experiments show that the model
mainly struggles with the correct insertion of the &lt;eos&gt; token and the placement of connectors.</p>
      <p>The next two results, Conjgpt3_in and Conjgpt3_ex, briefly explore the impact of GPT-3. Both
models show much better performance than their GPT-2 counterparts, as is to be
expected since scaling the number of parameters of transformer models generally leads to
improved performance. For interpolation, note that GPT-3 (1K) performs better than GPT-2
(100K), which corresponds to a 32% improvement in performance using only a tiny fraction
(1/100) of the amount of data used by GPT-2. On the other hand, Conjgpt3_in achieved perfect
accuracy. These results demonstrate the capability of large pre-trained models such as GPT-3 to
generalize an algorithm for symbolic reasoning using only a small amount of training data.</p>
      <p>For disjunction, interpretation of Disjexplore is a little tricky since the problem to solve is
larger, i.e., more logical connectives are involved. Given the poor performance, tracking
disjunction state, even with the use of context, is clearly a challenge. We attribute the good
performance of Disjearly simply to the fact that most examples terminate
early (67%), thus requiring the model to track less state, and to the fact that shorter (failed) proofs are
probabilistically less likely to contain an error. While Disjexplore contains errors of all three
identified types, Disjearly mostly produces redundancy errors (due to the model's tendency
to generate longer sequences after the early-termination step).</p>
      <p>Negbacktrack provides evidence that a more complex logic can be learned with very little
penalty. Negbacktrack mostly presents syntactic and transcription errors (i.e., missing brackets
and tokens during backtracking, and incorrectly resolving nested negations, respectively).</p>
      <p>Experiment Viewupdate demonstrates that simple belief revision reasoning is also possible
within transformer models, although many additional examples are required. The error analysis
reveals that the most common types of errors are syntactic errors (e.g., missing ins and del
keywords from the context, and empty brackets at the end of the resolution, for instance [] instead
of [insert q, delete g]) and transcription errors (flipping ins to del and vice versa).</p>
      <p>Finally, overall, the experiments demonstrate that a transformer model, when fine-tuned
on data that uses a proper representation, is able to learn and generalize the underlying inference
algorithm to simulate propositional logic resolution regardless of the length and complexity
of the derivation proofs (see Appendix B for further details).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>Integration of symbolic and neural architectures is an active area of research. One strategy
compiles symbolic reasoning into a neural network [17, 18, 19], where the network is constructed
such that different predicates/rules can be semantically related to each other. While this approach
requires that the construction of neural networks for each rule must be integrated with the
application at the network level, in our work, reasoning is directly available at the model
evaluation level. More generally, a classification of different systems focuses on the structure
and reasoning of the underlying network [19]. For example, one category of the classification
distinguishes symbolic from sub-symbolic systems. Transformer models are sub-symbolic since
each lexical term is associated with an embedding. We use training data to essentially coerce
the model into treating some sub-symbolic terms as symbolic terms, i.e., propositions.
Generically, the methodology of this paper injects symbolic reasoning by translating a logic
program into training data for the transformer model. In effect, the lexical tokens of the logic
program are blended with other lexical tokens provided in the training data, allowing for
a mixing of different forms of reasoning. This representation trick is effectively a form of
integration of symbolic language with inductive bias [20]. Notably, our experiments show that
transformers can effectively learn symbolic language-based representations and reasoning.</p>
      <p>A neuro-symbolic approach that is similar in nature to ours, proposes an RNN-based iterative
network that is trained end-to-end to perform reasoning over logic programs [21]. From the
implementation perspective, we take advantage of transformers’ capabilities over RNN networks,
that is, transformers use positional embeddings and a self-attention mechanism, which allow
our model to better capture the relationships between the different elements (symbols, logical
connectives, and rules) of both the input and output sequences. This makes our model less
prone to losing track during step-by-step resolution.</p>
      <p>Logical Neural Networks [22] propose a model that performs logical reasoning by creating
1-to-1 mappings between neurons and the elements of logical formulae, where the weights of
neurons are constrained to act as AND or OR gates. This work differs from ours mainly because
it uses fixed-length representations that constrain the number of symbols/connectors to be used
as well as the expressibility of the resolution. As we demonstrated, our transformer model can
flexibly adapt to variable-length inputs (context and query) and outputs (resolution).</p>
      <p>
        Another line of work proposes Logic-Integrated Neural Networks [
        <xref ref-type="bibr" rid="ref5">23, 24</xref>
        ], a framework
that dynamically constructs its neural architecture based on a logical expression and then
performs propositional reasoning. Although it supports compositionality, the computational cost
increases exponentially as the number of expressions grows. In contrast, our model achieves high
performance without resorting to restructuring the network architecture or training on massive
amounts of data (only fine-tuning was needed), resulting in much lower engineering costs.
      </p>
      <p>
        Although somewhat tangential, [
        <xref ref-type="bibr" rid="ref6">25</xref>
        ] provide an experimental evaluation of how transformer
models perform on mathematical reasoning. Although this study did not cover logical reasoning,
it highlights the capability of transformer models to interpret free-form questions, perform
mathematical reasoning, and generate correct answers. Unlike our approach, this work
only considers direct responses; for instance, given the question “Solve -42*r + 27*c = -1167 and
130*r + 4*c = 372 for r” the transformer model directly outputs “4” without intermediate steps.
In contrast, our model uses step-by-step resolution, which can be further used for validation
and, more importantly, for transparency, interpretability, and explainability purposes.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>
        In this paper we describe a method to train transformer models in the steps of propositional logic
resolution through examples. We systematically explore different possible representations of
propositional clauses as data and measure the performance impact. The end result is a reasoning
system for conjunction, disjunction, and negation. We then apply this system to computing
view updates (a form of belief revision reasoning). For future work, one step is to extend the
techniques described here to more powerful logic systems [
        <xref ref-type="bibr" rid="ref7">26</xref>
        ] - the principal extension is the
addition of data that trains the steps involved in unification. Unification is a context-specific
substitution of symbols, similar to the types of substitutions performed in this paper. Finally,
we end this paper with a call for researchers to examine a neuro-symbolic reasoning system
with respect to its expressive power to simulate other reasoning systems. As this paper shows,
a (low-cost) simulation provides a powerful argument to prefer one system over another.
pretrained language models for task-oriented dialogue systems, in: Proceedings of the 3rd
Workshop on Neural Generation and Translation, 2019, pp. 15–22.
[5] A. Santoro, A. Lampinen, K. Mathewson, T. Lillicrap, D. Raposo, Symbolic behaviour in
artificial intelligence, arXiv preprint arXiv:2102.03406 (2021).
[6] A. d. Garcez, L. C. Lamb, Neurosymbolic ai: the 3rd wave, arXiv preprint arXiv:2012.05876
(2020).
[7] M. Garnelo, M. Shanahan, Reconciling deep learning with symbolic artificial intelligence:
representing objects and relations, Current Opinion in Behavioral Sciences 29 (2019)
17–23.
[8] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, S. J. Gershman, Building machines that learn
and think like people, Behavioral and brain sciences 40 (2017).
[9] G. Marcus, The next decade in ai: Four steps towards robust artificial intelligence, arXiv
preprint arXiv:2002.06177 (2020).
[10] P. Gärdenfors, Belief Revision, Cambridge University Press, 2003.
[11] A. C. Kakas, P. Mancarella, Database updates through abduction., in: VLDB, volume 90,
1990, pp. 650–661.
[12] O. J. Romero, A. Wang, J. Zimmerman, A. Steinfeld, A. Tomasic, A task-oriented dialogue
architecture via transformer neural language models and symbolic injection, in: SIGDIAL,
2021, pp. 438–444.
[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
unsupervised multitask learners, 2019.
https://d4mucfpksywv.cloudfront.net/better-languagemodels/language-models.pdf.
[14] A. Tomasic, View update translation via deduction and annotation, in: International
      </p>
      <p>Conference on Database Theory, Springer, 1988, pp. 338–352.
[15] A. Tomasic, Determining correct view update translations via query containment, in:
International Conference on Logic Programming Workshop on Deductive Databases,
Stanford InfoLab, 1994, pp. 75–83. URL: http://ilpubs.stanford.edu:8090/53/.
[16] S. Welleck, I. Kulikov, J. Kim, et al., Consistency of a recurrent language model with respect
to incomplete decoding, arXiv preprint arXiv:2002.02492 (2020).
[17] P. Minervini, M. Bošnjak, T. Rocktäschel, et al., Differentiable reasoning on large
knowledge bases and natural language, in: Proceedings of the AAAI conference on artificial
intelligence, volume 34, 2020, pp. 5182–5190.
[18] S. N. Tran, Compositional Neural Logic Programming, in: 30th International Joint</p>
      <p>Conference on Artificial Intelligence, 2021, pp. 3059–3066.
[19] L. De Raedt, S. Dumančić, R. Manhaeve, G. Marra, From Statistical Relational to Neural</p>
      <p>Symbolic Artificial Intelligence: a Survey, in: IJCAI-20, 2021, pp. 89–94.
[20] H. Geffner, Target languages (vs. inductive biases) for learning to act and plan, arXiv
preprint arXiv:2109.07195 (2021).
[21] N. Cingillioglu, A. Russo, Deeplogic: Towards end-to-end differentiable logical reasoning,</p>
      <p>CEUR Workshop Proceedings, 2019. URL: http://hdl.handle.net/10044/1/71107.
[22] R. Riegel, A. G. Gray, F. P. S. Luus, et al., Logical neural networks, CoRR abs/2006.13155
(2020). URL: https://arxiv.org/abs/2006.13155. arXiv:2006.13155.
[23] S. Shi, H. Chen, W. Ma, J. Mao, M. Zhang, Y. Zhang, Neural Logic Reasoning, Association
for Computing Machinery, New York, NY, USA, 2020, pp. 1365–1374. URL: https://doi.org/10.1145/3340531.3411949.</p>
    </sec>
    <sec id="sec-7">
<title>A. Dataset generation using Prolog</title>
<p>The code snippet below shows the Prolog logic program used to generate the training examples. For
simplicity, we present only the clauses for conjunction/disjunction generation.
facts([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z]).

% fact generation
mk_facts(0, Facts0, Facts0, []).
mk_facts(K, Facts0, Facts2, [NewRule0|NewRule1]) :-
    pick(Fact, Facts0, Facts1),
    mk_fact_rule(Fact, NewRule0),
    K1 is K - 1,
    mk_facts(K1, Facts1, Facts2, NewRule1).

mk_fact_rule(Fact, (Fact :- Body)) :-
    random(X),
    ( X &gt; 0.7 -&gt; Body = true ; Body = false ).

pick(X, L1, L2) :-
    length(L1, Length),
    Length1 is Length + 1,
    random(1, Length1, N),
    nth(N, L1, X),
    delete(L1, X, L2).

prog0(KB, R) :- facts(F), mk_facts(12, F, R, KB).

% rule generation
mk_rules(0, _, _, R, R, []).
mk_rules(K, P, KB, R0, R2, [NewRule|NewKB]) :-
    pick(Head, R0, R1),
    mk_rule(P, Head, KB, NewRule),
    K1 is K - 1,
    mk_rules(K1, P, KB, R1, R2, NewKB).

mk_rules_rand(0, _, R, R, []).
mk_rules_rand(K, KB, R0, R2, [NewRule|NewKB]) :-
    pick(Head, R0, R1),
    random(4, 6, RN),
    mk_rule(RN, Head, KB, NewRule),
    K1 is K - 1,
    mk_rules_rand(K1, KB, R1, R2, NewKB).

mk_rule(P, Head, KB, (Head :- Body)) :-
    mk_rule_body(P, KB, Body).

mk_rule_body(1, KB, Term) :-
    pick((Term :- _), KB, _).
mk_rule_body(P, KB0, TermBody) :-
    pick((Term :- _), KB0, KB1),
    P1 is P - 1,
    mk_rule_body(P1, KB1, Body),
    mk_rule_body_body(Term, Body, TermBody).

mk_rule_body_body(Term, Body, TermBody) :-
    random(X),
    ( X &lt; 0.5 -&gt; TermBody = (Term, Body) ; TermBody = (Term ; Body) ).

prog1(Query, KB, Rules) :-
    prog0(KB, R),
    mk_rules_rand(3, KB, R, _, Rules),
    pick((Query :- _), Rules, _).

%%% the meta-interpreter - mi - Thanks to Markus Triska (triska@metalevel.at)
clause_list(g(X), [X]) :- X \= (_,_).
clause_list((g(H), T), [H|T1]) :- clause_list(T, T1).
clause_list((g(H) ; g(T)), [(H ; T)]).

% mi_backtrace is the core prover
mi_defaulty_better(true, true).
mi_defaulty_better(false, false).
mi_defaulty_better((A, B), (BA, BB)) :-
    mi_defaulty_better(A, BA),
    mi_defaulty_better(B, BB).
mi_defaulty_better((A ; B), (BA ; BB)) :-
    mi_defaulty_better(A, BA),
    mi_defaulty_better(B, BB).

prog2(Q, NewKB, Trail) :-
    prog1(Q, KB, Rules),
    append(KB, Rules, NewKB),
    mi_backtrace(Q, NewKB, Trail).

strip_g(g(X), X).

dump_kb(_, []).
dump_kb(File, [H|T]) :-
    write_ds(File, H),
    dump_kb(File, T).

dump_query(File, Query) :-
    write_ds(File, Query).

dump_proof(_, []).
dump_proof(File, [[false|T0]|_]) :-
    write_ds(File, [false|T0]).
dump_proof(File, [H0|T]) :-
    write_ds(File, H0),
    dump_proof(File, T).

write_ds(File, Text) :-
    open(File, append, Stream),
    write(Stream, Text), nl(Stream),
    close(Stream).</p>
    </sec>
    <sec id="sec-8">
      <title>B. Generated Propositional Resolution for View-Updates</title>
    </sec>
    <sec id="sec-9">
      <title>Reasoning</title>
      <p>Below we provide propositional proofs, generated by the transformer model, for the view-updates
reasoning experiment. They illustrate the model's capability to generalize the underlying
inference algorithm, producing sequences of resolution steps with different lengths and
different levels of complexity.</p>
      <p>B.1. Proof 1
B.2. Proof 2
&lt;bos&gt;
(y :- false)
(o :- false)
(u :- true)
(p :- true)
(a :- true)
(x :- true)
(n :- true)
(k :- false)
(w :- true)</p>
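      <p>Each fact line above assigns a truth value to a proposition. As a rough illustration of the semantics such sequences encode (our own sketch, not the paper's code; the parser and evaluator names are hypothetical), the snippet below reads fact lines of this form and evaluates a conjunctive/disjunctive rule body against them, i.e., the ground truth the transformer must reproduce.</p>

```python
import re

def parse_facts(proof_text):
    """Parse lines like '(y :- false)' into a truth assignment."""
    env = {}
    for atom, val in re.findall(r"\((\w+) :- (true|false)\)", proof_text):
        env[atom] = (val == "true")
    return env

def eval_body(expr, env):
    """Evaluate a rule body such as '(u , (y ; p))' under env,
    where ',' is conjunction and ';' is disjunction."""
    expr = expr.strip()
    if expr in ("true", "false"):
        return expr == "true"
    if expr in env:
        return env[expr]
    if expr.startswith("(") and expr.endswith(")"):
        inner = expr[1:-1]
        depth = 0
        # split at the first top-level connective
        for i, ch in enumerate(inner):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and ch in ",;":
                lhs, rhs = inner[:i], inner[i + 1:]
                if ch == ",":
                    return eval_body(lhs, env) and eval_body(rhs, env)
                return eval_body(lhs, env) or eval_body(rhs, env)
        return eval_body(inner, env)  # redundant parentheses
    raise ValueError(f"unparsable body: {expr!r}")

proof = """
(y :- false)
(o :- false)
(u :- true)
(p :- true)
"""
env = parse_facts(proof)
print(eval_body("(u , (y ; p))", env))  # prints: True
```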
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          , et al.,
          <article-title>Scaling language models: Methods, analysis &amp; insights from training gopher</article-title>
          ,
          <source>CoRR abs/2112.11446</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2112.11446. arXiv:2112.11446.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jang</surname>
          </string-name>
          , et al.,
          <article-title>End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2</article-title>
          , in:
          <source>Proceedings of the 58th ACL Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>592</lpage>
          . URL: https://aclanthology.org/2020.acl-main.54. doi:10.18653/v1/2020.acl-main.54.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Budzianowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <article-title>Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems</article-title>
          , in:
          <source>Proceedings of the 3rd Workshop on Neural Generation and Translation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Neural logic networks</article-title>
          , CoRR abs/1910.08629 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.08629. arXiv:1910.08629.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Saxton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <article-title>Analysing mathematical reasoning abilities of neural models</article-title>
          ,
          <source>in: 7th International Conference on Learning Representations, ICLR</source>
          <year>2019</year>
          ,
          New Orleans, LA, USA,
          <year>2019</year>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>W.</given-names>
            <surname>Van Laer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <article-title>How to upgrade propositional learners to first order logic: A case study</article-title>
          ,
          <source>in: Relational data mining</source>
          , Springer,
          <year>2001</year>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>