<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>OVERLAY</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Formal Logical Reasoning With Transformers and Their Place on the Chomsky Hierarchy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Reifberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ludwig-Maximilians-Universität München and Technical University of Munich</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>6</volume>
      <fpage>28</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Building on recent interest in the capability of the transformer architecture to learn symbolic reasoning, I present results from training deep learning architectures on a classical natural deduction calculus in propositional logic. For tests on novel derivations within the length of the training derivations, the transformer significantly outperforms encoder-decoder LSTMs, despite contrary previous results for test derivations longer than those seen during training. Users want deep learning models like GPT to tell them the truth, and not only empirical truths like Munich being the capital of Bavaria. They also expect them to infer, deterministically from provided information, logical conclusions such as “Not A” from “If A then B” and “Not B.” However, deep learning models have to emulate this deductive process via probabilistic induction: learning the most probable next words or characters given some inference rule. The crucial question is how well they can do this and, if so, why. To make this inquiry, I test an LSTM encoder-decoder and a transformer trained on making derivations in a classical propositional logic. To accomplish the task, the architecture of choice must be at least context-free on the Chomsky hierarchy, as the language of propositional logic is context-free (for an explanation, see [1]). First, in Section 2 I introduce the logic and then elaborate on recent results in Section 3. On this basis, I introduce my experimental setup and its results in Sections 4 and 5 and discuss them in Section 6.</p>
      </abstract>
      <kwd-group>
        <kwd>deep learning</kwd>
        <kwd>logic</kwd>
        <kwd>symbolic reasoning</kwd>
        <kwd>transformer</kwd>
        <kwd>Chomsky hierarchy</kwd>
        <kwd>formal grammars</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. The Logic</title>
      <p>For my inquiry, I consider propositional logic. There, propositions such as the content of ‘If Bob is in Munich, then Bob is in Bavaria’ are represented in a symbolic language as, for example, ‘(p → q)’. By applying deductive rules, it can then be derived that ‘If Bob is not in Bavaria, then Bob is not in Munich,’ as ‘(¬q → ¬p)’. More formally, this paper refers to a language ℒ:</p>
      <sec id="sec-2-1">
        <title>Definition 1 (Alphabet of ℒ)</title>
        <p>The alphabet of ℒ is:
1. logical connectives: ¬, ∧, ∨, →,
2. propositional variables: p1, . . . , pn,
3. auxiliary signs: ), (.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Definition 2 (ℒ-formulas)</title>
        <p>The set of ℒ-formulas is defined inductively by:
1. Each propositional variable p ∈ {p1, . . . , pn} is an ℒ-formula.
2. If A is an ℒ-formula, then ¬A is an ℒ-formula.
3. If A and B are ℒ-formulas, then (A ∧ B), (A ∨ B), and (A → B) are ℒ-formulas.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Definition 3 (Classical logic, CL)</title>
        <p>In the following rules of inference, ‘A, B, C, . . .’ represent ℒ-formulas, Greek capital letters represent sets of at least one ℒ-formula, and ‘⊢’ indicates that the formulas to its right can be derived from those to its left. Classical Logic (CL) is the set of rules of inference:
1. (⊥E) ⊥ ⊢ A
2. (¬E) A, ¬A ⊢ ⊥
3. (¬I) If Γ, A ⊢ ⊥, then Γ ⊢ ¬A
4. (∧I) A, B ⊢ A ∧ B
5. (∧E) A ∧ B ⊢ A and A ∧ B ⊢ B
6. (→E) A, A → B ⊢ B
7. (→I) If Γ, A ⊢ B, then Γ ⊢ A → B
8. (∨I) A ⊢ A ∨ B and B ⊢ A ∨ B
9. (∨E) If Γ, A ⊢ C and Δ, B ⊢ C, then Γ, Δ, A ∨ B ⊢ C
10. (DNE) ¬¬A ⊢ A</p>
      </sec>
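      <p>Definitions 1 and 2 can be operationalized as a small recognizer. The following sketch is my own illustration, not part of the paper's implementation; it assumes single-letter propositional variables and checks well-formedness by recursive descent, exactly the kind of context-free recognition that becomes relevant in Section 3.</p>
      <preformat>
```python
# Recursive-descent recognizer for L-formulas (own illustration, assuming
# single-letter propositional variables such as p, q, r).
CONNECTIVES = {"∧", "∨", "→"}

def is_formula(s):
    """Return True iff s is a well-formed L-formula per Definition 2."""
    ok, rest = _parse(s)
    return ok and rest == ""

def _parse(s):
    """Consume one formula from the front of s; return (success, remainder)."""
    if s == "":
        return False, s
    head = s[0]
    if head.isalpha():            # clause 1: a propositional variable
        return True, s[1:]
    if head == "¬":               # clause 2: negation of a formula
        return _parse(s[1:])
    if head == "(":               # clause 3: a bracketed binary formula
        ok, rest = _parse(s[1:])
        if not ok or rest == "" or rest[0] not in CONNECTIVES:
            return False, s
        ok, rest = _parse(rest[1:])
        if not ok or rest == "" or rest[0] != ")":
            return False, s
        return True, rest[1:]
    return False, s
```
      </preformat>
      <p>For example, is_formula("(p→q)") holds, while bracket-language strings of the kind discussed in Section 5.2 are rejected.</p>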
    </sec>
    <sec id="sec-3">
      <title>3. Transformers on the Chomsky Hierarchy</title>
      <p>
        Previous theoretical results show that simple recurrent neural networks and transformers are Turing complete
[
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ], which is to say that they can simulate Turing machines and should, therefore, be capable of producing
any of the grammars on the Chomsky hierarchy. The practical relevance of these theoretical results is,
however, questionable, as shown in a recent paper by researchers from Google DeepMind [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
authors used synthetic data in which the production rules of a given level of the hierarchy were necessary and
sufficient to produce correct expressions. These were sets of sequence-to-sequence tasks that could only
be solved by automata at a minimum level of the hierarchy, and during testing the output sequences had to
be longer than those seen during training. With these tasks, they tried to test experimentally which type of
automaton different architectures correspond to. They show that RNNs and LSTMs cannot produce
languages higher on the hierarchy except when augmented with a stack or tape memory, and that transformers
cannot even recognize certain regular languages.
      </p>
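      <p>The length-generalization regime described above can be illustrated with a minimal sketch. This is my own toy example, not their benchmark: balanced-bracket strings require an unbounded counter to recognize, placing the language above the regular level, and the test split contains only strings strictly longer than any seen in training.</p>
      <preformat>
```python
import random

def balanced(n):
    """Generate a random balanced bracket string with exactly n pairs."""
    s, opened, closed = [], 0, 0
    while closed != n:
        if opened == n or (opened > closed and random.random() > 0.5):
            s.append(")")
            closed += 1
        else:
            s.append("(")
            opened += 1
    return "".join(s)

def is_balanced(s):
    """Recognition needs a counter, so no finite automaton handles all lengths."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth == -1:          # a closing bracket with nothing open
            return False
    return depth == 0

def length_split(max_train, max_test):
    """Training strings up to max_train pairs; test strings strictly longer."""
    train = [balanced(n) for n in range(1, max_train + 1)]
    test = [balanced(n) for n in range(max_train + 1, max_test + 1)]
    return train, test
```
      </preformat>
      <p>A model scoring well on the test half of length_split(10, 20) would have generalized beyond the training lengths, which is precisely the regime where transformers reportedly struggle.</p>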
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>The plan is to generate a set of premises and conclusions in ℒ, where the conclusions follow logically
from the premises by CL. Different architectures are then trained on these premise-conclusion pairs.
After each training epoch, model accuracy and loss are evaluated on premise-conclusion pairs not
seen during training, such as the simple ‘p, q, (p ∧ q)’, where ‘p’ and ‘q’ are the premises and ‘(p ∧ q)’ is the
conclusion.</p>
      <p>Because the exact derivations in the test data have not been seen during training, success in learning
these would imply that the underlying derivation rules that generated them were learned. In other
words: if, for the training input ‘p ∧ q, q → r, r’, the model had learned to produce the ground truth
‘p ∧ q, q → r, q, r’, then I reason that it had learned the underlying rules ∧E and →E.
The GitHub repository with the implementation and the produced sample formulas analyzed in Section 5.2 can be found
at https://github.com/stereifberger/master-s-thesis-finished.</p>
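      <p>A toy sketch of such forward rule application, covering only conjunction elimination and modus ponens on formulas encoded as nested tuples. This is my own illustration; the linked repository contains the actual implementation.</p>
      <preformat>
```python
import random

# Formulas as nested tuples: ("var", "p"), ("and", A, B), ("imp", A, B).

def step(fs):
    """Apply (∧E) and (→E) once to a set of formulas; return only the new ones."""
    new = set()
    for f in fs:
        if f[0] == "and":                 # (∧E): from A ∧ B derive A and B
            new.add(f[1])
            new.add(f[2])
    for f in fs:
        if f[0] == "imp" and f[1] in fs:  # (→E): from A and A → B derive B
            new.add(f[2])
    return new - set(fs)

def make_pair(premises, depth=2):
    """Build a premise-conclusion pair: pick a formula derivable in depth steps."""
    known = set(premises)
    for _ in range(depth):
        known |= step(known)
    candidates = list(known - set(premises))
    return premises, random.choice(candidates) if candidates else None
```
      </preformat>
      <p>From premises p ∧ q and q → r, two rounds of step yield p, q, and r as candidate conclusions.</p>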
      <p>All computed paths in the training and test dataset have a length of one or two derived formulas.
In this output space, there are, for example, the following derivations that correspond to the above
premise-conclusion pair: ‘p, q, (p ∧ q)’ and ‘p, q, A, (p ∧ q)’, where ‘A’ denotes any formula that can be
derived from ‘p’ and ‘q’ via rule application. Therefore, there may be several correct derivations for
each premise-conclusion pair, whereas in supervised learning, models are typically trained on a single
ground-truth entry.</p>
      <p>Therefore, I introduce a custom loss calculation. Model outputs are compared to all correct derivations,
and the Euclidean distance is used to find the correct derivation closest to the given model output:</p>
      <p>d(x, y) = √(Σi (xi − yi)²)</p>
      <p>This closest derivation is taken as ground truth. With it, the cross-entropy loss is calculated, which
is standard for sequence-to-sequence tasks. For the models, I consider three encoder-decoder
architectures. As a baseline, a feedforward encoder-decoder with no hidden layers is tested. As
competitive architectures, LSTM encoder-decoders and the transformer architecture are tested. I test
them under different architectural specifications for the number of hidden layers in both the encoder
and decoder.</p>
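      <p>This loss computation can be sketched in a few lines of numpy. This is my own reconstruction of the described procedure, not the repository's code; it assumes model outputs are per-token probability distributions and correct derivations are integer token sequences.</p>
      <preformat>
```python
import numpy as np

def nearest_target(output_probs, correct_derivations):
    """Pick the correct derivation closest to the model output.

    output_probs: array of shape (seq_len, vocab) with token distributions.
    correct_derivations: list of integer arrays of shape (seq_len,).
    """
    dists = []
    for deriv in correct_derivations:
        one_hot = np.eye(output_probs.shape[1])[deriv]   # tokens to one-hot rows
        dists.append(np.sqrt(((output_probs - one_hot) ** 2).sum()))
    return correct_derivations[int(np.argmin(dists))]

def cross_entropy(output_probs, target_tokens, eps=1e-12):
    """Mean negative log-likelihood of the chosen ground-truth sequence."""
    picked = output_probs[np.arange(len(target_tokens)), target_tokens]
    return float(-np.log(picked + eps).mean())
```
      </preformat>
      <p>The derivation returned by nearest_target then serves as the target for the standard cross-entropy term.</p>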
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The results of the two competitive models are rather different from the feed-forward network baseline.
This shows both in their learning curves under varied architectural specifications and in the logical
derivations they could produce.</p>
      <p>5.1. Train and Test Results</p>
      <p>The LSTM encoder-decoder shows a learning curve (Figure 1a) with both the training and test loss
consistently going down, with a mean loss of .340 between epochs 40 and 50. Training and test accuracy
steadily increased, reaching a mean accuracy of .870 at epoch 50. The transformer improves on this
even further: as Figure 1b shows, it learns much faster than the LSTM, reaching a training and test
accuracy of .90 after only three epochs.</p>
      <p>Figure 1: Learning curves of (a) the LSTM encoder-decoder and (b) the transformer.</p>
      <p>Next, the architectures’ robustness to architectural changes is tested, first for different
depths in the decoder and then in the encoder. Figure 2 shows the test loss and test accuracy of LSTM
and transformer models with the encoder layers fixed to one and the decoder layers varied between one
and four. (All results are documented in https://github.com/stereifberger/master-s-thesis-finished/blob/new-cleaned-branch/results.pdf.)</p>
      <p>Only for the LSTM is there a significant difference when increasing the number of decoder layers,
which worsens performance. For accuracy, there is no significant difference. Because accuracy is
the main performance measure, the architectures seem robust to changes in decoder depth. A direct
comparison of loss and accuracy also shows that the transformer outperforms the LSTM in how
quickly it reduces the test loss and gains in test accuracy.</p>
      <p>Figure 2: (a) Test loss and (b) test accuracy of the models.</p>
      <p>In the second robustness test, the decoder layers are fixed to three and the encoder layers varied, with
results plotted in Figure 3. The difference in performance is significant for both architectures for the
loss but not for accuracy, except for the three-encoder-layer LSTM. This indicates that in formula encoding,
some relevant information is retrieved by the additional encoder layers. However, the difference in
performance is very small. In terms of effect size, the most important factor seems to be the choice of architecture,
where the transformer outperforms the LSTM under all conditions.</p>
      <p>Figure 3: (a) Test loss of the models. (b) Test accuracy history.</p>
      <p>5.2. Produced Formulas</p>
      <p>For each of the three architectures, with one encoder and three decoder layers, 50 random derivations
were sampled. In the case of the transformer, there were four attention heads in the encoder and decoder.
Derivations were produced for the test data, and I checked them for correctness by hand.</p>
      <p>The feedforward network baseline could only learn a simple bracket language. Here, none of the 50 outputs
were derivations; neither the premises nor the conclusions were correct. The bracket language had
some similarities with ℒ-formulas: brackets were opened and closed in balanced numbers,
with logical connectives and propositional variables within. These were not yet well-formed formulas,
but they showed the feedforward network’s ability to learn structure, even though it had no recurrence.
One example of these bracket expressions is:
Input: , ( ∧ ( → )) ⊢ ( ∧ ( → ( ∧ ( → ))))
Output: ((→ ∧)(∧))(((( ∧ ( ∧ ))))</p>
      <p>The LSTM, beyond the simple bracket regularities, output well-formed formulas about half the time:
in 46% of cases, outputs were sequences consisting only of well-formed ℒ-formulas. Of the other
54% of non-well-formed outputs, 75.6% again contained expressions of a bracket language like the
ones the feedforward network produced. However, in all cases of well-formed formulas, propositional
variables were exchanged, as in:
Input: ( → ),  ⊢ (( → ) ∧ ( ∨ ( → )))
Output: ( → ), , ( ∨ ( → )), (( → ) ∧ ( ∨ ( → )))</p>
      <p>There was no observable regularity to the replacement, as if the trained model could not preserve the
correct propositional variables. Still, in 40% of output derivations, the premises matched the premises in
the input up to the replacement of propositional variables by random ones (the same propositional variables
were replaced by different ones in the output, indicating no systematicity). Here, the output derivations
were correct up to the random replacement of variables.</p>
      <p>The transformer managed to reproduce the input premises in the output derivations in all 50 cases.
Conclusions matched in 86% of cases, and there was no replacement of variables as in the LSTM.
70% of derivations were correct, for example:
Input: , ((¬) ∧ ) ⊢ ( ∨ ( ∨ ))
Output: , ((¬) ∧ ), ( ∨ ), ( ∨ ( ∨ ))</p>
      <p>Errors in derivations took several forms. The most common, in 46.6% of incorrect derivations, was the
model falling back to a simple bracket language. In 26.6% of incorrect derivations, some propositional
variable was deleted at some step in the derivation. In 20% of incorrect derivations, the wrong derivation
rule was used. At this sample size, no regularity was observable between the derivation rules that needed
to be applied to derive a conclusion and the errors that occurred. This could, however, be the basis for
further inquiry: the rules that need to be applied could be written to the test dataset, and the correlation
between these rules and low accuracy for a given output could then be measured.</p>
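      <p>Such a follow-up could be sketched as follows. This is hypothetical code: it assumes each test entry is annotated with the set of rules applied in its ground-truth derivation, which the current dataset does not yet record.</p>
      <preformat>
```python
from collections import defaultdict

def accuracy_by_rule(records):
    """Mean output accuracy per inference rule.

    records: iterable of (rules_applied, correct) pairs,
    e.g. ({"∧E", "→E"}, True) for one test derivation.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for rules, correct in records:
        for rule in rules:
            totals[rule] += 1
            hits[rule] += int(correct)
    return {rule: hits[rule] / totals[rule] for rule in totals}
```
      </preformat>
      <p>Rules with markedly low per-rule accuracy would then be candidates for the error sources discussed above.</p>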
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>
        The feedforward network learned a bracket language in which brackets are opened and closed in a balanced manner.
For this, a context-free grammar is necessary, because a regular grammar would not be capable
of keeping track of the brackets already opened. Given that all the architectures considered were
trained into models that could produce context-free languages, it can be said that they correspond at least
to push-down automata. This seems to go against the findings of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which were that
transformers have problems even with regular tasks. However, the differences between their definition
of rule-following and mine must be considered here. Because, for them, rule-following amounts to
deriving longer strings with given rules, they tested the models on deriving longer strings than they
had to derive during training. In my case, training and test outputs were sampled from the same dataset;
they differed only with respect to the input, the premise-conclusion pairs. Thus, there is no
contradiction between their results and mine.
      </p>
      <p>
        The main reason [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] named for the bad performance of the transformer architecture was its
permutation invariance, which is only mitigated by positional encoding. However, given the surveyed
literature and my results, this seems to be a problem only for deriving longer strings. And even there, the
generality of the results by Delétang et al. can be called into question, as their meta-learning challenge
also incorporated the derivation of longer strings. Even so, positional encoding appears to preserve syntactic
structural information better than recurrence in LSTMs, even though the transformer
used the same feedforward networks as the baseline.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>McCracken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Reilly</surname>
          </string-name>
          ,
          <article-title>Backus-Naur form (BNF)</article-title>
          , John Wiley and Sons Ltd., GBR,
          <year>2003</year>
          , p.
          <fpage>129</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gilroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <article-title>Recurrent neural networks as weighted language recognizers</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marinkovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barceló</surname>
          </string-name>
          ,
          <article-title>On the turing completeness of modern neural network architectures</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barceló</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marinkovic</surname>
          </string-name>
          ,
          <article-title>Attention is Turing-complete</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Siegelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sontag</surname>
          </string-name>
          ,
          <article-title>Analog computation via neural nets</article-title>
          ,
          <source>Theoretical Computer Science</source>
          <volume>131</volume>
          (
          <year>1994</year>
          )
          <fpage>331</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Delétang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ruoss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grau-Moya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Genewein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Wenliang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Catt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cundy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <source>Neural networks and the Chomsky hierarchy</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>