
Neuro-Symbolic Reasoning Shortcuts: Mitigation Strategies and their Limitations

Emanuele Marconato (1, 2), Stefano Teso (0, 2), Andrea Passerini (2)

(0) Centre for Mind/Brain Sciences (CIMeC), University of Trento, Italy
(1) Department of Computer Science (DI), University of Pisa, Italy
(2) Department of Engineering and Information Science (DISI), University of Trento, Italy

Neuro-symbolic predictors learn a mapping from sub-symbolic inputs to higher-level concepts and then carry out (probabilistic) logical inference on this intermediate representation. This setup offers clear advantages in terms of consistency with symbolic prior knowledge, and is often believed to provide interpretability benefits in that, by virtue of complying with the knowledge, the learned concepts can be better understood by human stakeholders. However, it was recently shown that this setup is affected by reasoning shortcuts, whereby predictions attain high accuracy by leveraging concepts with unintended semantics [1, 2], yielding poor out-of-distribution performance and compromising interpretability. In this short paper, we establish a formal link between reasoning shortcuts and the optima of the loss function, and identify situations in which reasoning shortcuts can arise. Based on this, we discuss limitations of natural mitigation strategies such as reconstruction and concept supervision.

1. Introduction

Neuro-symbolic (NeSy) integration of learning and reasoning is a key challenge in AI. NeSy predictors achieve integration by learning a neural network mapping low-level representations (e.g., MNIST images) to high-level symbolic concepts (e.g., digits), and then predicting a label (e.g., the sum) by reasoning over concepts and prior knowledge [3]. Most works on the topic focus on how to best integrate knowledge into the loop, cf. [4]. The issue of concept quality is, however, generally neglected. Loosely speaking, the consensus is that knowledge ensures that high-quality concepts are learned, and that any issues with them should be viewed as “learning artifacts”.

This is not the case. Recently, Li et al. [2] and Marconato et al. [1] have shown that NeSy predictors can learn reasoning shortcuts (RSs), that is, mappings from inputs to concepts that yield high accuracy on the training set by predicting the wrong concepts. While RSs, by definition, do not hinder the model’s accuracy on the training task, they prevent identification of concepts with the “right” semantics, and as such compromise generalization beyond the training distribution and interpretability [1]. As an example, consider MNIST Addition [3]. Here, the model has to determine the sum of two MNIST digits, under the constraint that the sum is correct. Writing [d] for an MNIST image of the digit d, and given the examples “[0] + [1] = 1” and “[0] + [2] = 2”, there exist two alternative solutions: the intended one ([0] → 0, [1] → 1, [2] → 2) and an RS ([0] → 1, [1] → 0, [2] → 1). Both of them ensure the sum is correct, but only one of them captures the correct semantics.

[Figure 1: panels (a), (b), and (c), with node labels X, C, and Y and the axis label “Ground truth (G)”.]
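The two solutions of the MNIST Addition example can be checked by brute force. The sketch below is illustrative only: the identifiers `img0`, `img1`, and `img2` are hypothetical names for the three distinct digit images, and the search simply enumerates every assignment of digits 0-9 to them that reproduces both observed sums.

```python
from itertools import product

# Two training examples, written as (image_a, image_b, observed_sum).
# "img0", "img1", "img2" are hypothetical names for the three distinct images.
EXAMPLES = [("img0", "img1", 1), ("img0", "img2", 2)]
IMAGES = ["img0", "img1", "img2"]
DIGITS = range(10)


def consistent_assignments(examples, images, digits):
    """Return every image -> digit map that reproduces all observed sums."""
    solutions = []
    for values in product(digits, repeat=len(images)):
        assignment = dict(zip(images, values))
        if all(assignment[a] + assignment[b] == s for a, b, s in examples):
            solutions.append(assignment)
    return solutions


solutions = consistent_assignments(EXAMPLES, IMAGES, DIGITS)
for solution in solutions:
    print(solution)
```

Only two assignments survive: the intended one and the shortcut that maps the three images to 1, 0, and 1. The label supervision alone gives no reason to prefer either.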

This raises the question: under what conditions do reasoning shortcuts appear, and what strategies can be used to mitigate them? In this short paper, we outline answers to these questions. First, we go beyond existing works and show how to count the number of RSs affecting a NeSy prediction task. Based on this result, we show that, in the general case, it is impossible to identify the correct concepts from label supervision only. We also consider two mitigation strategies, namely reconstruction and concept supervision, and study their effects and limitations.

2. Neuro-symbolic task construction

We consider a NeSy prediction task where, given sub-symbolic inputs $X$, the goal is to infer one or more labels $Y \in \{0, 1\}^\ell$ consistent with a given propositional formula $K$ encoding prior knowledge. We focus on DeepProbLog [3], a representative and sound framework for such tasks. From a probabilistic perspective, DeepProbLog: (i) extracts concepts $C \in \{0, 1\}^k$ from an input $X$ via a neural network $p_\theta(C \mid X)$, and (ii) models the distribution over the labels $Y$ as a uniform distribution $u_K(y \mid c) = 1\{(y, c) \models K\}$. The label distribution is obtained by marginalizing over $C$:

$$p_\theta(y \mid x; K) = \sum_c u_K(y \mid c) \, p_\theta(c \mid x) \tag{1}$$

DeepProbLog is then trained via maximum likelihood.
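As a minimal sketch of the marginalization in Eq. (1), the snippet below computes the label distribution for a toy task. The knowledge (the label is the XOR of two concept bits) and the concept distribution are assumptions chosen for illustration; they are not taken from DeepProbLog or from the experiments in this paper.

```python
def knowledge_holds(y, c):
    """Toy knowledge K (an assumption): the label is the XOR of two concept bits."""
    return y == (c[0] ^ c[1])


def label_distribution(p_c_given_x, labels=(0, 1)):
    """Eq. (1): p(y | x; K) = sum over c of 1{(y, c) |= K} * p(c | x)."""
    return {
        y: sum(p for c, p in p_c_given_x.items() if knowledge_holds(y, c))
        for y in labels
    }


# A hypothetical concept distribution p(c | x) over c in {0, 1}^2.
p_c_given_x = {(0, 0): 0.125, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.125}
p_y = label_distribution(p_c_given_x)
print(p_y)  # {0: 0.25, 1: 0.75}

# A deterministic concept distribution puts all label mass on a single y.
p_y_det = label_distribution({(1, 0): 1.0})
print(p_y_det)  # {0: 0, 1: 1.0}
```

The second call illustrates why the label likelihood is maximal as soon as all concept mass lies on vectors consistent with the observed label, regardless of which consistent vector is chosen.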

In order to understand when doing so recovers concepts $C$ with the “correct semantics”, we first have to define the unobserved generative mechanism underlying the training data whose concepts we wish to identify. Motivated by work on identifiability in (causal) representation learning [5, 6, 7, 8], we assume there exist ground-truth concepts $G \in \{0, 1\}^k$ spanning a space $\mathcal{G}$, and that the examples $(X, Y) = (f(G), h(G))$ are generated by an invertible function $f : \mathcal{G} \to \mathcal{X} \subset \mathbb{R}^d$ and a surjective function $h : \mathcal{G} \to \mathcal{Y}$, with $|\mathcal{Y}| \le |\mathcal{G}|$. Here, $h$ plays the role of the ground-truth reasoning module that infers the label $Y$ from the ground-truth concepts $G$ according to $K$, while $f$ generates the observations themselves.¹ Cf. Fig. 1 for an illustration. In the next sections, we will show how maximum likelihood training can recover the mechanism $h \circ f^{-1}$, but not the ground-truth mapping from inputs to concepts $f^{-1}$, i.e., the “correct semantics”.

¹ Due to space constraints, we assume $X$ depends on $G$ only. In practice, it might also depend on additional “stylistic” factors of variation (e.g., font) [9]. Our results apply to this more complex case with minimal modifications.

3. Reasoning shortcuts and mitigation strategies

We consider training points $(x, y) \in \mathcal{D}_{X,Y}$, each originated by corresponding ground-truth concepts $g \in \mathcal{D}_G$.² Our starting point is the log-likelihood, which constitutes the objective of training:

$$\mathcal{L}(\theta) := \sum_{(x, y) \in \mathcal{D}_{X,Y}} \log p_\theta(y \mid x; K) \equiv \sum_{g \in \mathcal{D}_G} \log p_\theta(h(g) \mid f(g); K) \tag{2}$$

Notice that all optima of Eq. (2) satisfy $p_\theta(y \mid x; K) = 1$ for all examples. By Eq. (1), this entails that any $c \sim p_\theta(c \mid x)$ must satisfy the knowledge $K$, that is, $(y, c) \models K$ (see [1, Theorem 3.2]). How many alternative distributions $p_\theta(c \mid g) := p_\theta(c \mid f(g))$ attain maximum likelihood? Since $p_\theta$ is a neural network, there may be infinitely many, yet all of them except one are RSs. This is sufficient to show that RSs cannot be discriminated from the ground-truth concept distribution based on likelihood alone [1].

² We assume the training examples are noiseless and cover all possible combinations of ground-truth factors $G$, as even this “ideal” setting admits RSs.

Importantly, it turns out all optimal distributions $p_\theta(c \mid x)$ are convex combinations of the deterministic optima (det-opts), that is, those distributions $p_\theta(c \mid g)$ mapping each $g$ to a unique $c$ with probability one. If the likelihood admits a single det-opt, this is also the only solution and, by construction, it recovers the ground-truth concepts. RSs arise when there are two or more det-opts. How many det-opts are there? Let $\mathcal{C}_y = \{c : (y, c) \models K\}$ be the set of $c$’s that $K$ assigns to label $y$. Notice that if $p_\theta(c \mid g)$ attains maximum likelihood, then any $c \sim p_\theta(c \mid g)$ falls within $\mathcal{C}_{h(g)}$. In this sense, a det-opt implicitly maps each vector $g \in \mathcal{G}$ to a vector $c \in \mathcal{C}_{h(g)}$. This gives us a mechanism to count det-opts: for each $g$ there are exactly $|\mathcal{C}_{h(g)}|$ vectors $c$ that it can be mapped to and, as $G$ and $C$ range over the same space, each label $y$ is shared by exactly $|\mathcal{C}_y|$ vectors $g$, meaning that the number of det-opts for Eq. (2) is:

$$\#\text{det-opts}(\mathcal{L}) = \prod_{y \in \mathcal{Y}} |\mathcal{C}_y|^{|\mathcal{C}_y|} \tag{3}$$

As a consequence, the ground-truth concepts can only be retrieved if $|\mathcal{C}_y| = 1$ for all $y$, i.e., each label $y$ can be deduced from a unique $c$. This is seldom the case in NeSy tasks, meaning that maximizing the likelihood of the labels $Y$ cannot rule out RSs in general.

In the following, we discuss two natural mitigation strategies and their impact in reducing the total number of det-opts.

Reconstruction is insufficient. Given that the likelihood is incapable of discriminating between intended and RS solutions, one option is to augment it with a term encouraging the learned concepts $C$ to capture the information necessary to reconstruct the input $X$, for instance:

$$\mathcal{R}(\theta, \psi) = \sum_{x \in \mathcal{D}_X} \Big[ \sum_c p_\theta(c \mid x) \log p_\psi(x \mid c) \Big] \equiv \sum_{g \in \mathcal{D}_G} \Big[ \sum_c p_\theta(c \mid g) \log p_\psi(g \mid c) \Big] \tag{4}$$

Here, $p_\psi(x \mid c)$ is the distribution output by a neural decoder with parameters $\psi$, and we introduced $p_\psi(g \mid c) := p_\psi(f(g) \mid c)$. The optima of Eq. (4) must satisfy $p_\psi(g \mid c) = 1$ for all $c \sim p_\theta(c \mid g)$. In other words, restricting again to det-opts for the encoder, the only det-opts that ensure perfect reconstruction are those mapping distinct $g$’s to distinct $c$’s, i.e., those that ensure the encoder is injective. How many such det-opts are there? Notice that these det-opts can be enumerated by taking each $g \in \mathcal{G}$ in turn and mapping it to an arbitrary $c$ in $\mathcal{C}_{h(g)}$ without replacement (to ensure injectivity), until all $g$’s have been mapped. This entails that the number of det-opts under perfect reconstruction becomes:

$$\#\text{det-opts}(\mathcal{L} + \mathcal{R}) = \prod_{y \in \mathcal{Y}} |\mathcal{C}_y|! \tag{5}$$

Once again, unless $|\mathcal{C}_y| = 1$ for all $y$’s, there are multiple possible solutions, most of which are RSs. In other words, adding a reconstruction term can be insufficient to completely rule out learning reasoning shortcuts.
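The counts in Eq. (3) and Eq. (5) can be sanity-checked by exhaustive enumeration on a task small enough to list every deterministic map. The sketch below assumes a toy task that is not part of the paper: two concept bits, with the label given by their XOR, and with ground-truth and learned concepts ranging over the same space.

```python
from itertools import product
from math import factorial, prod

BITS = 2
SPACE = list(product((0, 1), repeat=BITS))  # shared by G and C (an assumption)


def label(v):
    """Toy reasoning module h (an assumption): XOR of the two bits."""
    return v[0] ^ v[1]


def count_det_opts(require_injective):
    """Brute-force count of deterministic maps g -> c with label(c) == label(g)."""
    count = 0
    for images in product(SPACE, repeat=len(SPACE)):  # all 4^4 = 256 maps
        if any(label(c) != label(g) for g, c in zip(SPACE, images)):
            continue  # some example violates the knowledge
        if require_injective and len(set(images)) < len(images):
            continue  # two g's collapse onto one c, so reconstruction fails
        count += 1
    return count


# |C_y| for each label y: the number of concept vectors consistent with y.
sizes = [sum(label(c) == y for c in SPACE) for y in (0, 1)]

eq3 = prod(n ** n for n in sizes)        # Eq. (3)
eq5 = prod(factorial(n) for n in sizes)  # Eq. (5)

print(count_det_opts(False), eq3)  # 16 16
print(count_det_opts(True), eq5)   # 4 4
```

Even on this tiny task, reconstruction reduces the number of deterministic optima from 16 to 4 but leaves three shortcuts alongside the intended solution.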

The effect of concept supervision. Next, we consider a scenario where concept supervision is provided (for all concepts) for at least some examples $(x, g) \in \mathcal{D}_{X,G}$. We consider the $L_2$ loss for fitting the supervision, for simplicity:

$$\mathcal{S}(\theta) \propto \sum_{(x, g) \in \mathcal{D}_{X,G}} \sum_c (c - g)^2 \, p_\theta(c \mid x) \equiv \sum_{g \in \mathcal{D}_G} \sum_c (c - g)^2 \, p_\theta(c \mid g) \tag{6}$$

The only concept distributions $p_\theta(c \mid g)$ minimizing Eq. (6) are those that allocate all probability mass to the annotated concepts. Now, let $\nu_y$ be the number of vectors $c \in \mathcal{C}_y$ for which we have supervision $g$, for a total of $|\mathcal{D}_G| = \sum_y \nu_y$ supervised vectors. The situation is analogous to Eq. (3) and Eq. (5), except that now for exactly $\nu_y$ vectors $c$ we know exactly what $g$ they should be mapped to, leaving the remaining $|\mathcal{C}_y| - \nu_y$ vectors dangling. This gives:

$$\#\text{det-opts}(\mathcal{L} + \mathcal{S}) = \prod_{y \in \mathcal{Y}} |\mathcal{C}_y|^{|\mathcal{C}_y| - \nu_y}, \qquad \#\text{det-opts}(\mathcal{L} + \mathcal{R} + \mathcal{S}) = \prod_{y \in \mathcal{Y}} (|\mathcal{C}_y| - \nu_y)! \tag{7}$$

Here, the first term counts how many det-opts optimize both the label likelihood and the concept supervision, and the second one those optimizing the likelihood, reconstruction and concept supervision. This shows that providing concept supervision can dramatically reduce the number of det-opts, but also that a substantial amount of it is necessary to rule out all RSs.
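The two counts in Eq. (7) can likewise be checked by enumeration. The sketch below assumes a three-bit parity task and a hypothetical annotation budget of three supervised concept vectors; a supervised vector is forced to map to itself, while every other vector may map to any concept vector with the same parity.

```python
from itertools import product
from math import factorial, prod

SPACE = list(product((0, 1), repeat=3))


def parity(v):
    return v[0] ^ v[1] ^ v[2]


def count_with_supervision(supervised, require_injective):
    """Count det-opts that map every supervised g to itself."""
    # Candidate images per g: itself if supervised, else any c with equal parity.
    options = [
        [g] if g in supervised else [c for c in SPACE if parity(c) == parity(g)]
        for g in SPACE
    ]
    count = 0
    for images in product(*options):
        if require_injective and len(set(images)) < len(images):
            continue  # not injective, so perfect reconstruction is impossible
        count += 1
    return count


# Hypothetical budget: two even-parity vectors and one odd-parity vector.
supervised = {(0, 0, 0), (0, 1, 1), (0, 0, 1)}

sizes = {y: sum(parity(c) == y for c in SPACE) for y in (0, 1)}    # |C_y|
nu = {y: sum(parity(g) == y for g in supervised) for y in (0, 1)}  # nu_y

eq7_first = prod(sizes[y] ** (sizes[y] - nu[y]) for y in (0, 1))
eq7_second = prod(factorial(sizes[y] - nu[y]) for y in (0, 1))

print(count_with_supervision(supervised, False), eq7_first)  # 1024 1024
print(count_with_supervision(supervised, True), eq7_second)  # 12 12
```

With all eight vectors supervised, `count_with_supervision(set(SPACE), False)` returns 1, which matches the observation that the label likelihood plus supervision only pins down the concepts once every combination is annotated.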

4. Empirical Verification

We outline a toy experiment showing how reasoning shortcuts affect even a simple NeSy task. Let $g = (g_1, g_2, g_3)$ be three bits and consider the task of predicting their parity, that is, $y = g_1 \oplus g_2 \oplus g_3$. Each label $y \in \{0, 1\}$ can be deduced from 4 possible concept vectors $g$. We train two MLPs, one encoding $g$ directly into $p_\theta(c \mid g)$, and another decoding $c$ into $p_\psi(g \mid c)$. Labels are predicted as per Eq. (1). Given the problem at hand, the total number of det-opts given by Eq. (3) is $\#\text{det-opts}(\mathcal{L}) = 4^4 \cdot 4^4 = 65536$, and that given by Eq. (5) is $\#\text{det-opts}(\mathcal{L} + \mathcal{R}) = 4! \cdot 4! = 576$. Empirically, without concept supervision the model picks up reasoning shortcuts to solve the task. Fig. 1 shows two such RSs, both optimal, obtained by our model when optimizing (b) only the likelihood, and (c) both the likelihood and the reconstruction term. In both cases, the solutions fail to recover the ground-truth concepts.
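To make the parity example concrete without training any network, the sketch below hand-writes two deterministic encoders that act as reasoning shortcuts. Both are hypothetical and chosen for illustration; they are not the solutions shown in Fig. 1. The first collapses several inputs onto the same concept vector, so it is optimal for the likelihood only, while the second is injective and would therefore also admit perfect reconstruction. The final loop enumerates all parity-preserving deterministic maps to confirm the two counts quoted above.

```python
from itertools import product

SPACE = list(product((0, 1), repeat=3))


def parity(v):
    return v[0] ^ v[1] ^ v[2]


def collapse(g):
    """Hypothetical shortcut: store the parity in the first bit, zero the rest."""
    return (parity(g), 0, 0)


def mix(g):
    """Hypothetical injective shortcut: an invertible XOR mixing of the bits."""
    g1, g2, g3 = g
    return (g1 ^ g2, g2 ^ g3, g2)


def report(encoder):
    """Return (label accuracy, concept accuracy, injectivity) of an encoder."""
    label_acc = sum(parity(encoder(g)) == parity(g) for g in SPACE) / len(SPACE)
    concept_acc = sum(encoder(g) == g for g in SPACE) / len(SPACE)
    injective = len({encoder(g) for g in SPACE}) == len(SPACE)
    return label_acc, concept_acc, injective


print(report(collapse))  # (1.0, 0.25, False)
print(report(mix))       # (1.0, 0.25, True)

# Enumerate every deterministic map that preserves the parity of each g.
options = [[c for c in SPACE if parity(c) == parity(g)] for g in SPACE]
n_all = n_injective = 0
for images in product(*options):
    n_all += 1
    if len(set(images)) == len(images):
        n_injective += 1

print(n_all, n_injective)  # 65536 576
```

Both encoders predict every label correctly while recovering the ground-truth concepts for only two of the eight inputs, which is the signature of a reasoning shortcut.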

Conclusion. Our results altogether show that the ground-truth concepts are hard, if not impossible, to recover empirically, and that two natural mitigation strategies do not completely address the problem. In particular, the amount of concept supervision required grows linearly with the number of possible concept combinations. We envisage that well-tuned strategies based on targeted concept supervision, combined with additional restrictions on the model itself (and specifically disentanglement between concepts [10]), will likely facilitate (provable) identification of the ground-truth concepts. This is left to future work.

[1] Marconato, Bontempo, Ficarra, Calderara, Passerini, Teso, Neuro-symbolic continual learning: Knowledge, reasoning shortcuts and concept rehearsal, arXiv preprint arXiv:2302.01242 (2023).

[2] Li, Liu, Yao, Xu, Chen, Ma, Jian, et al., Learning with logical constraints but without shortcut satisfaction, in: The Eleventh International Conference on Learning Representations, 2023.

[3] Manhaeve, Dumancic, Kimmig, Demeester, L. De Raedt, DeepProbLog: Neural Probabilistic Logic Programming, NeurIPS (2018).

[4] De Raedt, Dumančić, Manhaeve, G. Marra, From statistical relational to neural-symbolic artificial intelligence, in: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 4943-4950.

[5] Locatello, Bauer, Lucic, G. Raetsch, Gelly, Schölkopf, Bachem, Challenging common assumptions in the unsupervised learning of disentangled representations, in: ICML, 2019.

[6] Schölkopf, Locatello, Bauer, N. R. Ke, Kalchbrenner, Goyal, Bengio, Toward causal representation learning, Proceedings of the IEEE (2021).

[7] Khemakhem, Kingma, Monti, Hyvarinen, Variational autoencoders and nonlinear ICA: A unifying framework, in: AISTATS, 2020.

[8] Ahuja, Mahajan, Syrgkanis, I. Mitliagkas, Towards efficient representation identification in supervised learning, in: Conference on Causal Learning and Reasoning, PMLR, 2022, pp. 19-43.

[9] J. von Kügelgen, Sharma, Gresele, Brendel, Schölkopf, Besserve, Locatello, Self-supervised learning with data augmentations provably isolates content from style, in: NeurIPS, 2021.

[10] Suter, Miladinovic, Schölkopf, Bauer, Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness, in: International Conference on Machine Learning, PMLR, 2019, pp. 6056-6065.