<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neuro-Symbolic Reasoning Shortcuts: Mitigation Strategies and their Limitations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Marconato</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Teso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Passerini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Mind/Brain Sciences (CIMeC), University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science (DI), University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Engineering and Information Science (DISI), University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Neuro-symbolic predictors learn a mapping from sub-symbolic inputs to higher-level concepts and then carry out (probabilistic) logical inference on this intermediate representation. This setup offers clear advantages in terms of consistency with symbolic prior knowledge, and is often believed to provide interpretability benefits in that, by virtue of complying with the knowledge, the learned concepts can be better understood by human stakeholders. However, it was recently shown that this setup is affected by reasoning shortcuts whereby predictions attain high accuracy by leveraging concepts with unintended semantics [1, 2], yielding poor out-of-distribution performance and compromising interpretability. In this short paper, we establish a formal link between reasoning shortcuts and the optima of the loss function, and identify situations in which reasoning shortcuts can arise. Based on this, we discuss limitations of natural mitigation strategies such as reconstruction and concept supervision.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Neuro-symbolic (NeSy) integration of learning and reasoning is a key challenge in AI. NeSy
predictors achieve integration by learning a neural network mapping low-level representations
(e.g., MNIST images) to high-level symbolic concepts (e.g., digits), and then predicting a label
(e.g., the sum) by reasoning over concepts and prior knowledge [3]. Most works on the topic
focus on how to best integrate knowledge into the loop, cf. [4]. The issue of concept quality
is, however, generally neglected. Loosely speaking, the consensus is that knowledge ensures
that high-quality concepts are learned and that any issues with them should be viewed as “learning artifacts”.</p>
      <p>This is not the case. Recently, Li et al. [2] and Marconato et al. [1] have shown that NeSy
predictors can learn reasoning shortcuts (RSs), that is, mappings from inputs to concepts that
yield high accuracy on the training set by predicting the wrong concepts. While RSs, by
definition, do not hinder the model’s accuracy on the training task, they prevent identification
of concepts with the “right” semantics, and as such compromise generalization beyond the
training distribution and interpretability [1]. As an example, consider MNIST Addition [3]. Here,
the model has to determine the sum of two MNIST digits, under the constraint that the sum
is correct. Writing ⟨n⟩ for an image of the digit n, given the examples “⟨0⟩ + ⟨1⟩ = 1” and “⟨0⟩ + ⟨2⟩ = 2”, there exist two alternative
solutions: the intended one (⟨0⟩ → 0, ⟨1⟩ → 1, ⟨2⟩ → 2) and an RS (⟨0⟩ → 1, ⟨1⟩ → 0, ⟨2⟩ → 1). Both
of them ensure the sum is correct, but only one of them captures the correct semantics.</p>
      <p>[Figure 1. Panels (a), (b), (c); variables G, C, X, Y; axis label: “Ground truth (G)”.]</p>
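      <p>The MNIST Addition example can be checked mechanically. The following is a minimal sketch, assuming the two training examples pair an image of 0 with an image of 1 (sum 1) and an image of 0 with an image of 2 (sum 2). The string identifiers standing in for images and the helper names <monospace>fits</monospace> and <monospace>concept_accuracy</monospace> are hypothetical and not part of any library.</p>
      <preformat><![CDATA[
```python
# Digit images are stood in for by string identifiers (hypothetical names).
# Assumed training set: (image_a, image_b) -> observed sum.
train = [(("img0", "img1"), 1), (("img0", "img2"), 2)]

intended = {"img0": 0, "img1": 1, "img2": 2}  # correct semantics
shortcut = {"img0": 1, "img1": 0, "img2": 1}  # reasoning shortcut


def fits(concept_map, examples):
    """True if the predicted digits sum to the label on every example."""
    return all(concept_map[a] + concept_map[b] == s for (a, b), s in examples)


def concept_accuracy(concept_map, truth):
    """Fraction of images mapped to their ground-truth digit."""
    return sum(concept_map[k] == truth[k] for k in truth) / len(truth)


# Both solutions explain the training labels perfectly ...
print(fits(intended, train), fits(shortcut, train))
# ... but the shortcut gets every concept wrong.
print(concept_accuracy(shortcut, intended))

# An unseen combination exposes the shortcut: the true sum is 1 + 2 = 3.
ood = [(("img1", "img2"), 3)]
print(fits(intended, ood), fits(shortcut, ood))
```
]]></preformat>
      <p>Under these assumptions, both mappings fit the training labels, while only the intended one predicts the correct sum for the held-out pair.</p>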
      <p>This raises the question: under what conditions do reasoning shortcuts appear, and what strategies
can be used to mitigate them? In this short paper, we outline answers to these questions. First,
we go beyond existing works and show how to count the number of RSs affecting a NeSy
prediction task. Based on this result, we show that, in the general case, it is impossible to identify
the correct concepts from label supervision only. We also consider two mitigation strategies,
namely reconstruction and concept supervision, and study their effects and limitations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Neuro-symbolic task construction</title>
      <p>
        We consider a NeSy prediction task where, given sub-symbolic inputs X, the goal is to infer
one or more labels Y ∈ {0, 1}<sup>ℓ</sup> consistent with a given propositional formula K encoding prior
knowledge. We focus on DeepProbLog [3], a representative and sound framework for such
tasks. From a probabilistic perspective, DeepProbLog: (i) extracts k concepts C ∈ {0, 1}<sup>k</sup> from
an input X via a neural network p<sub>θ</sub>(C | X), and (ii) models the distribution over the labels Y as a
uniform u<sub>K</sub>(y | c) = 1{(c, y) ⊨ K}. The label distribution is obtained by marginalizing over C:
        <disp-formula>
          <label>(1)</label>
          <tex-math>p_\theta(y \mid x; K) = \sum_{c} u_K(y \mid c) \, p_\theta(c \mid x)</tex-math>
        </disp-formula>
DeepProbLog is then trained via maximum likelihood.
      </p>
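      <p>To make Eq. (1) concrete, the sketch below marginalizes a concept distribution through the knowledge for a single input. It is an illustration only and not the DeepProbLog implementation: the function <monospace>label_distribution</monospace>, the XOR knowledge over two binary concepts, and the concept marginals 0.9 and 0.2 are all assumptions made for this example. The indicator is used directly as the label distribution, which is valid here because the knowledge entails exactly one label for each concept vector.</p>
      <preformat><![CDATA[
```python
from itertools import product


def label_distribution(p_c_given_x, knowledge, labels):
    """Eq. (1): p(y | x; K) = sum_c 1{(c, y) |= K} * p(c | x)."""
    return {
        y: sum(p for c, p in p_c_given_x.items() if knowledge(c, y))
        for y in labels
    }


def xor_knowledge(c, y):
    """Hypothetical knowledge K: the label is the XOR of two concepts."""
    return (c[0] ^ c[1]) == y


# Hypothetical network output for one input x: a factorized distribution
# with p(c1 = 1 | x) = 0.9 and p(c2 = 1 | x) = 0.2.
marginals = (0.9, 0.2)
p_c = {}
for c in product((0, 1), repeat=len(marginals)):
    prob = 1.0
    for bit, m in zip(c, marginals):
        prob *= m if bit == 1 else 1.0 - m
    p_c[c] = prob

p_y = label_distribution(p_c, xor_knowledge, labels=(0, 1))
print(p_y)
```
]]></preformat>
      <p>Here the mass assigned to label 1 is the total probability of the concept vectors (0, 1) and (1, 0), i.e., 0.1 · 0.2 + 0.9 · 0.8 = 0.74.</p>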
      <p>
        In order to understand when doing so recovers concepts C with the “correct semantics”, we
have to first define the unobserved generative mechanism underlying the training data whose
concepts we wish to identify. Motivated by work on identifiability in (causal) representation
learning [5, 6, 7, 8], we assume there exist k ground-truth concepts G ∈ {0, 1}<sup>k</sup> spanning a
space 𝒢, and that the examples (X, Y) = (f(G), h(G)) are generated by an invertible function
f : 𝒢 → 𝒳 ⊂ ℝ<sup>d</sup> and a surjective function h : 𝒢 → 𝒴, with |𝒴| ≤ |𝒢|. Here, h plays
the role of the ground-truth reasoning module that infers the label Y from the ground-truth
concepts G according to K, while f generates the observations themselves.<fn><label>1</label><p>Due to space constraints, we assume X depends on G only. In practice, it might also depend on additional “stylistic” factors of variation (e.g., font) [9]. Our results apply to this more complex case with minimal modifications.</p></fn> Cf. Fig. 1 for an
illustration. In the next sections, we will show how maximum likelihood training can recover
the mechanism h ∘ f<sup>−1</sup>, but not the ground-truth mapping from inputs to concepts f<sup>−1</sup>, i.e.,
the “correct semantics”.
      </p>
    </sec>
    <sec>
      <title>3. Reasoning shortcuts and mitigation strategies</title>
      <p>
        We consider training points (x, y) ∈ 𝒟<sub>X,Y</sub>, each originated by corresponding ground-truth
concepts g ∈ 𝒟<sub>G</sub>.<fn><label>2</label><p>We assume the training examples are noiseless and cover all possible combinations of ground-truth factors G, as even this “ideal” setting admits RSs.</p></fn> Our starting point is the log-likelihood, which constitutes the objective of
training:
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathcal{L}(\theta) := \sum_{(x, y) \in \mathcal{D}_{X,Y}} \log p_\theta(y \mid x; K) \equiv \sum_{g \in \mathcal{D}_G} \log p_\theta\big(h(g) \mid f(g); K\big)</tex-math>
        </disp-formula>
Notice that all optima of Eq. (2) satisfy p<sub>θ</sub>(y | x; K) = 1 for all examples. By Eq. (1), this entails
that any c ∼ p<sub>θ</sub>(c | x) must satisfy the knowledge K, that is, (c, y) ⊨ K (see [1, Theorem 3.2]).
How many alternative distributions p<sub>θ</sub>(c | g) := p<sub>θ</sub>(c | f(g)) attain maximum likelihood?
Since p<sub>θ</sub> is a neural network, there may be infinitely many, yet all of them except one are
RSs. This is sufficient to show that RSs cannot be discriminated from the ground-truth concept
distribution based on likelihood alone [1].
      </p>
      <p>
        Importantly, it turns out all optimal distributions p<sub>θ</sub>(c | x) are convex combinations of the
deterministic optima (det-opts), that is, those distributions p<sub>θ</sub>(c | g) mapping each g to a unique
c with probability one. If the likelihood admits a single det-opt, this is also the only solution
and, by construction, it recovers the ground-truth concepts. RSs arise when there are two
or more det-opts. How many det-opts are there? Let 𝒞<sub>y</sub> = {c : (c, y) ⊨ K} be the set of
c’s that K assigns to label y. Notice that if p<sub>θ</sub>(c | g) attains maximum likelihood, then any
c ∼ p<sub>θ</sub>(c | g) falls within 𝒞<sub>h(g)</sub>. In this sense, a det-opt implicitly maps each vector g ∈ 𝒢 to
a vector c ∈ 𝒞<sub>h(g)</sub>. This gives us a mechanism to count det-opts: for each g there are exactly
|𝒞<sub>h(g)</sub>| vectors c that it can be mapped to and, since |𝒞<sub>y</sub>| ground-truth vectors g share each label y, the number of det-opts for Eq. (2) is:
        <disp-formula>
          <label>(3)</label>
          <tex-math>\#\text{det-opts}(\mathcal{L}) = \prod_{y \in \mathcal{Y}} |\mathcal{C}_y|^{|\mathcal{C}_y|}</tex-math>
        </disp-formula>
As a consequence, the ground-truth concepts can only be retrieved if |𝒞<sub>y</sub>| = 1 for all y, i.e., each label y
can be deduced from a unique c. This is seldom the case in NeSy tasks, meaning that maximizing
the likelihood of the labels Y cannot rule out RSs in general.
      </p>
      <p>In the following, we discuss two natural mitigation strategies and their impact in reducing
the total number of det-opts.</p>
      <p>
        Reconstruction is insufficient. Given that the likelihood is incapable of discriminating between intended
and RS solutions, one option is to augment it with a term encouraging the learned concepts C to
capture the information necessary to reconstruct the input X, for instance:
        <disp-formula>
          <label>(4)</label>
          <tex-math>\mathcal{R}(\theta, \psi) = \sum_{x \in \mathcal{D}_X} \Big[ \sum_{c} p_\theta(c \mid x) \log p_\psi(x \mid c) \Big] \equiv \sum_{g \in \mathcal{D}_G} \Big[ \sum_{c} p_\theta(c \mid g) \log p_\psi(g \mid c) \Big]</tex-math>
        </disp-formula>
      </p>
      <p>
        Here, p<sub>ψ</sub>(x | c) is the distribution output by a neural decoder with parameters ψ, and we
introduced p<sub>ψ</sub>(g | c) := p<sub>ψ</sub>(f(g) | c). The optima of Eq. (4) must satisfy p<sub>ψ</sub>(g | c) = 1 for
all c ∼ p<sub>θ</sub>(c | g). In other words, restricting again to det-opts for the encoder, the only det-opts
that ensure perfect reconstruction are those mapping distinct g’s to distinct c’s, i.e., that ensure
the encoder is injective. How many such det-opts are there? Notice that these det-opts can be
enumerated by taking each g ∈ 𝒢 in turn and mapping it to an arbitrary c in 𝒞<sub>h(g)</sub> without
replacement (to ensure injectivity), until all g’s have been mapped. This entails that the number
of det-opts, under perfect reconstruction, becomes:
        <disp-formula>
          <label>(5)</label>
          <tex-math>\#\text{det-opts}(\mathcal{L} + \mathcal{R}) = \prod_{y \in \mathcal{Y}} |\mathcal{C}_y|!</tex-math>
        </disp-formula>
Once again, unless |𝒞<sub>y</sub>| = 1 for all y’s, there are multiple possible solutions, most of which
are RSs. In other words, adding a reconstruction term can be insufficient to completely rule out
learning reasoning shortcuts.
      </p>
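      <p>The counts in Eq. (3) and Eq. (5) can be checked by exhaustive enumeration on a task small enough to list every deterministic encoder. The sketch below is an illustration under stated assumptions: it uses a hypothetical two-bit XOR task (so that each label is entailed by two concept vectors), identifies the concept space with the ground-truth space, and uses helper names of our own choosing.</p>
      <preformat><![CDATA[
```python
from itertools import product
from math import factorial, prod

K_BITS = 2
SPACE = list(product((0, 1), repeat=K_BITS))  # used for both g and c


def h(g):
    """Hypothetical ground-truth reasoning: XOR of the two bits."""
    return g[0] ^ g[1]


def brute_force_counts():
    """Enumerate every deterministic map g -> c and count the optimal ones."""
    likelihood_optimal = 0
    also_injective = 0
    for images in product(SPACE, repeat=len(SPACE)):
        alpha = dict(zip(SPACE, images))
        # Optimal for the label likelihood: alpha(g) must entail h(g).
        if all(h(alpha[g]) == h(g) for g in SPACE):
            likelihood_optimal += 1
            # Perfect reconstruction additionally needs an injective map.
            if len(set(images)) == len(SPACE):
                also_injective += 1
    return likelihood_optimal, also_injective


def closed_form_counts():
    """Evaluate Eq. (3) and Eq. (5) from the sizes of the sets C_y."""
    sizes = {}
    for c in SPACE:
        sizes[h(c)] = sizes.get(h(c), 0) + 1
    eq3 = prod(n ** n for n in sizes.values())
    eq5 = prod(factorial(n) for n in sizes.values())
    return eq3, eq5


print(brute_force_counts(), closed_form_counts())
```
]]></preformat>
      <p>For this toy task both routes give 16 det-opts for the likelihood alone and 4 once injectivity is imposed, so reconstruction removes most, but not all, of the shortcuts.</p>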
      <p>
        The effect of concept supervision. Next, we consider a scenario where concept supervision
is provided (for all concepts) for at least some examples (x, g) ∈ 𝒟<sub>X,G</sub>. For simplicity, we consider the L<sub>2</sub>
loss for fitting the supervision, which we denote 𝒮:
        <disp-formula>
          <label>(6)</label>
          <tex-math>\mathcal{S}(\theta) \propto \sum_{(x, g) \in \mathcal{D}_{X,G}} \sum_{c} \| c - g \|^2 \, p_\theta(c \mid x) \equiv \sum_{g \in \mathcal{D}_G} \sum_{c} \| c - g \|^2 \, p_\theta(c \mid g)</tex-math>
        </disp-formula>
The only concept distributions p<sub>θ</sub>(c | g) minimizing Eq. (6) are those that allocate all probability
mass to the annotated concepts. Now, let ν<sub>y</sub> be the number of vectors c ∈ 𝒞<sub>y</sub> for which we have
supervision g, for a total of |𝒟<sub>G</sub>| = ∑<sub>y</sub> ν<sub>y</sub>. The situation is analogous to Eq. (3) and Eq. (5),
except that now, for exactly ν<sub>y</sub> vectors in each 𝒞<sub>y</sub>, the correspondence between g and c is fixed by the supervision,
leaving the remaining |𝒞<sub>y</sub>| − ν<sub>y</sub> vectors unconstrained. This gives:
        <disp-formula>
          <label>(7)</label>
          <tex-math>\#\text{det-opts}(\mathcal{L} + \mathcal{S}) = \prod_{y \in \mathcal{Y}} |\mathcal{C}_y|^{|\mathcal{C}_y| - \nu_y}, \qquad \#\text{det-opts}(\mathcal{L} + \mathcal{R} + \mathcal{S}) = \prod_{y \in \mathcal{Y}} (|\mathcal{C}_y| - \nu_y)!</tex-math>
        </disp-formula>
Here, the first expression counts how many det-opts optimize both the label likelihood and the concept
supervision, and the second one those optimizing the likelihood, reconstruction and concept
supervision. This shows that providing concept supervision can dramatically reduce the number of
det-opts, but also that a substantial amount of it is necessary to rule out all RSs.
      </p>
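      <p>The two counts in Eq. (7) are easy to tabulate as a function of the amount of supervision. The following sketch evaluates them for a task with two labels, each entailed by four concept vectors (as in three-bit parity). The function name <monospace>det_opts_with_supervision</monospace> and the choice of supervising the same number of vectors per label are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from math import factorial, prod


def det_opts_with_supervision(class_sizes, supervised, reconstruction=False):
    """Closed-form counts of Eq. (7).

    class_sizes[y] is |C_y| and supervised[y] is nu_y, the number of
    concept vectors with label y for which supervision is available.
    """
    free = {}
    for y, size in class_sizes.items():
        nu = supervised.get(y, 0)
        if not 0 <= nu <= size:
            raise ValueError(f"nu_y must lie in [0, {size}] for label {y}")
        free[y] = size - nu  # vectors left unconstrained by supervision
    if reconstruction:
        # Injective maps: the free vectors can only be permuted.
        return prod(factorial(m) for m in free.values())
    # Otherwise each free vector may go to any of the |C_y| candidates.
    return prod(class_sizes[y] ** m for y, m in free.items())


# Two labels, each entailed by 4 concept vectors.
parity_sizes = {0: 4, 1: 4}
for nu in range(5):
    sup = {0: nu, 1: nu}
    print(nu,
          det_opts_with_supervision(parity_sizes, sup),
          det_opts_with_supervision(parity_sizes, sup, reconstruction=True))
```
]]></preformat>
      <p>With no supervision the counts reduce to Eq. (3) and Eq. (5). Without reconstruction, a single det-opt remains only once all four vectors per label are supervised, whereas with reconstruction three per label already suffice, since the last vector is then determined by injectivity.</p>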
    </sec>
    <sec id="sec-3">
      <title>4. Empirical Verification</title>
      <p>
        We outline a toy experiment showing how reasoning shortcuts affect even a simple NeSy
task. Let g = (g<sub>1</sub>, g<sub>2</sub>, g<sub>3</sub>) be three bits and consider the task of predicting their parity, that is,
y = g<sub>1</sub> ⊕ g<sub>2</sub> ⊕ g<sub>3</sub>. Each label y ∈ {0, 1} can be deduced from 4 possible concept vectors g. We
train two MLPs, one encoding g directly into p<sub>θ</sub>(c | g), and another decoding c into p<sub>ψ</sub>(g | c).
Labels are predicted as per Eq. (1). Given the problem at hand, the total number of det-opts given
by Eq. (3) is #det-opts(ℒ) = 4<sup>4</sup> · 4<sup>4</sup> = 65,536, and that given by Eq. (5) is #det-opts(ℒ + ℛ) = 4! · 4! = 576.
Empirically, without concept supervision the model picks up reasoning
shortcuts to solve the task. Fig. 1 shows two such RSs, both optimal, obtained by our model
when optimizing (b) only the likelihood, and (c) both the likelihood and the reconstruction term.
In both cases, the solutions fail to recover the ground-truth concepts.
      </p>
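      <p>To illustrate what such shortcuts can look like on the parity task, the sketch below evaluates two hand-constructed deterministic encoders. They are our own examples and not the solutions learned in the experiment or shown in Fig. 1: <monospace>collapse</monospace> is a non-injective map that is optimal for the likelihood only, while <monospace>rotate</monospace> is an injective map that would also allow perfect reconstruction.</p>
      <preformat><![CDATA[
```python
from itertools import product

G = list(product((0, 1), repeat=3))  # all ground-truth concept vectors


def parity(v):
    return v[0] ^ v[1] ^ v[2]


def collapse(g):
    """Non-injective shortcut: store the parity in the first bit."""
    return (parity(g), 0, 0)


def rotate(g):
    """Injective shortcut: cyclically shift the three bits."""
    return (g[1], g[2], g[0])


def report(alpha):
    """Label accuracy, concept accuracy and injectivity of a map g -> c."""
    label_acc = sum(parity(alpha(g)) == parity(g) for g in G) / len(G)
    concept_acc = sum(alpha(g) == g for g in G) / len(G)
    injective = len({alpha(g) for g in G}) == len(G)
    return label_acc, concept_acc, injective


print(report(collapse))
print(report(rotate))
```
]]></preformat>
      <p>Both encoders predict every label correctly, yet each agrees with the ground-truth concepts on only two of the eight inputs.</p>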
      <p>Conclusion. Taken together, our results show that the ground-truth concepts are hard, if not
impossible, to recover empirically, and that two natural mitigation strategies do not completely
address the problem. In particular, the amount of concept supervision required grows linearly
with the number of possible concept combinations. We envisage that well-tuned strategies based on
targeted concept supervision, combined with additional restrictions on the model itself (specifically,
disentanglement between concepts [10]), will likely facilitate (provable) identification
of the ground-truth concepts. This is left to future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Marconato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bontempo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ficarra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Calderara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Teso</surname>
          </string-name>
          ,
          <article-title>Neuro-symbolic continual learning: Knowledge, reasoning shortcuts and concept rehearsal</article-title>
          ,
          <source>arXiv preprint arXiv:2302.01242</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jian</surname>
          </string-name>
          , et al.,
          <article-title>Learning with logical constraints but without shortcut satisfaction</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumancic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kimmig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <article-title>DeepProbLog: Neural Probabilistic Logic Programming</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumančić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marra</surname>
          </string-name>
          ,
          <article-title>From statistical relational to neural-symbolic artificial intelligence</article-title>
          ,
          <source>in: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4943</fpage>
          -
          <lpage>4950</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Locatello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Raetsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <article-title>Challenging common assumptions in the unsupervised learning of disentangled representations</article-title>
          ,
          <source>in: ICML</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Locatello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Toward causal representation learning</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Khemakhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hyvarinen</surname>
          </string-name>
          ,
          <article-title>Variational autoencoders and nonlinear ICA: A unifying framework</article-title>
          ,
          <source>in: AISTATS</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Syrgkanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mitliagkas</surname>
          </string-name>
          ,
          <article-title>Towards efficient representation identification in supervised learning</article-title>
          ,
          <source>in: Conference on Causal Learning and Reasoning</source>
          , PMLR,
          <year>2022</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>von Kügelgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gresele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Brendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Besserve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Locatello</surname>
          </string-name>
          ,
          <article-title>Self-supervised learning with data augmentations provably isolates content from style</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Suter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Miladinovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <article-title>Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6056</fpage>
          -
          <lpage>6065</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>