Neuro-Symbolic Reasoning Shortcuts: Mitigation Strategies and their Limitations

Emanuele Marconato¹,², Stefano Teso¹,³ and Andrea Passerini¹
¹ Department of Engineering and Information Science (DISI), University of Trento, Italy
² Department of Computer Science (DI), University of Pisa, Italy
³ Centre for Mind/Brain Sciences (CIMeC), University of Trento, Italy

NeSy 2023, 17th International Workshop on Neural-Symbolic Learning and Reasoning, Certosa di Pontignano, Siena, Italy
emanuele.marconato@unitn.it (E. Marconato); stefano.teso@unitn.it (S. Teso); passerini@disi.unitn.it (A. Passerini)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Neuro-symbolic predictors learn a mapping from sub-symbolic inputs to higher-level concepts and then carry out (probabilistic) logical inference on this intermediate representation. This setup offers clear advantages in terms of consistency with symbolic prior knowledge, and is often believed to provide interpretability benefits in that – by virtue of complying with the knowledge – the learned concepts can be better understood by human stakeholders. However, it was recently shown that this setup is affected by reasoning shortcuts, whereby predictions attain high accuracy by leveraging concepts with unintended semantics [1, 2], yielding poor out-of-distribution performance and compromising interpretability. In this short paper, we establish a formal link between reasoning shortcuts and the optima of the loss function, and identify situations in which reasoning shortcuts can arise. Based on this, we discuss limitations of natural mitigation strategies such as reconstruction and concept supervision.

1. Introduction

Neuro-symbolic (NeSy) integration of learning and reasoning is a key challenge in AI. NeSy predictors achieve integration by learning a neural network that maps low-level representations (e.g., MNIST images) to high-level symbolic concepts (e.g., digits), and then predicting a label (e.g., the sum) by reasoning over concepts and prior knowledge [3]. Most works on the topic focus on how to best integrate knowledge into the loop, cf. [4]. The issue of concept quality is, however, generally neglected. Loosely speaking, the consensus is that knowledge ensures learning high-quality concepts, and that issues with these should be viewed as "learning artifacts". This is not the case.

Recently, Li et al. [2] and Marconato et al. [1] have shown that NeSy predictors can learn reasoning shortcuts (RSs), that is, mappings from inputs to concepts that yield high accuracy on the training set by predicting the wrong concepts. While RSs – by definition – do not hinder the model's accuracy on the training task, they prevent identification of concepts with the "right" semantics, and as such compromise generalization beyond the training distribution and interpretability [1]. As an example, consider MNIST Addition [3]. Here, the model has to determine the sum of two MNIST digits, under the constraint that the sum is correct. Given the examples "x₀ + x₁ = 1" and "x₀ + x₂ = 2", where x_d denotes an MNIST image of digit d, there exist two alternative solutions: the intended one (x₀ → 0, x₁ → 1, x₂ → 2) and a RS (x₀ → 1, x₁ → 0, x₂ → 1). Both of them ensure the sum is correct, but only one of them captures the correct semantics.
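As a quick sanity check of this example, the following Python snippet verifies that both mappings satisfy the knowledge on the two training examples; the identifiers x0, x1, x2 are illustrative placeholders for the three MNIST images, not part of the original setup.

```python
# Both the intended mapping and the reasoning shortcut reproduce the observed
# sums ("x0 + x1 = 1" and "x0 + x2 = 2"). The identifiers x0, x1, x2 stand in
# for the three MNIST images; they are illustrative placeholders only.

examples = [(("x0", "x1"), 1), (("x0", "x2"), 2)]  # (pair of images, observed sum)

intended = {"x0": 0, "x1": 1, "x2": 2}   # images mapped to their true digits
shortcut = {"x0": 1, "x1": 0, "x2": 1}   # a reasoning shortcut

for name, concepts in [("intended", intended), ("shortcut", shortcut)]:
    ok = all(concepts[a] + concepts[b] == y for (a, b), y in examples)
    print(f"{name}: satisfies the knowledge on all examples -> {ok}")  # True for both
```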
This begs the question: under what conditions do reasoning shortcuts appear, and what strategies can be used to mitigate them? In this short paper, we outline answers to these questions. First, we go beyond existing works and show how to count the number of RSs affecting a NeSy prediction task. Based on this result, we show that, in the general case, it is impossible to identify the correct concepts from label supervision only. We also consider two mitigation strategies, namely reconstruction and concept supervision, and study their effects and limitations.

Figure 1: (a) Graphical model of our setup: the black arrows encode the data generation process, the red arrow indicates the learned concept distribution, and the blue arrow the reasoning module. (b) Confusion matrices of (Boolean) concepts learned by DeepProbLog for our three-bit XOR task without any mitigation strategy, and (c) with a reconstruction term, cf. Eq. (4). In both cases the learned concepts are consistent with the knowledge, and in the second case they also manage to reconstruct the input. The confusion matrices immediately show that, despite this, the learned concepts are reasoning shortcuts.

2. Neuro-symbolic task construction

We consider a NeSy prediction task where, given sub-symbolic inputs X, the goal is to infer one or more labels Y ∈ {0, 1}^ℓ consistent with a given propositional formula K encoding prior knowledge. We focus on DeepProbLog [3], a representative and sound framework for such tasks. From a probabilistic perspective, DeepProbLog: (i) extracts 𝑘 concepts C ∈ {0, 1}^𝑘 from an input X via a neural network 𝑝𝜃(C | X), and (ii) models the distribution over the labels Y as uniform over the labels consistent with the knowledge, 𝑢K(y | c) ∝ 1{(y, c) |= K}. The label distribution is obtained by marginalizing over C:

    𝑝𝜃(y | x; K) = ∑_c 𝑢K(y | c) 𝑝𝜃(c | x)    (1)

DeepProbLog is then trained via maximum likelihood. In order to understand when doing so recovers concepts C with the "correct semantics", we first have to define the unobserved generative mechanism underlying the training data whose concepts we wish to identify. Motivated by work on identifiability in (causal) representation learning [5, 6, 7, 8], we assume there exist 𝑘 ground-truth concepts G ∈ {0, 1}^𝑘 spanning a space 𝒢, and that the examples (X, Y) = (𝑓(G), ℎ(G)) are generated by an invertible function 𝑓 : 𝒢 → 𝒳 ⊂ ℝ^𝑑 and a surjective function ℎ : 𝒢 → 𝒴, with |𝒴| ≤ |𝒢|. Here, ℎ plays the role of the ground-truth reasoning module that infers the label Y from the ground-truth concepts G according to K, while 𝑓 generates the observations themselves.¹ Cf. Fig. 1 for an illustration. In the next sections, we will show how maximum likelihood training can recover the mechanism ℎ ∘ 𝑓⁻¹, but not the ground-truth mapping from inputs to concepts 𝑓⁻¹, i.e., the "correct semantics".

¹ Due to space constraints, we assume X depends on G only. In practice, it might also depend on additional "stylistic" factors of variation (e.g., font) [9]. Our results apply to this more complex case with minimal modifications.
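As a concrete illustration of Eq. (1), the following minimal Python sketch computes the label distribution for MNIST Addition by brute-force marginalization over the concept vector. It is an illustrative re-implementation under our notation, not DeepProbLog's actual API; for readability the two concepts are categorical digits rather than Boolean vectors, and the concept probabilities are made-up stand-ins for the output of 𝑝𝜃(c | x).

```python
import itertools

def label_distribution(p_c1, p_c2):
    """p_c1, p_c2: length-10 lists with the per-digit probabilities predicted
    by the neural concept extractor for the two images."""
    p_y = [0.0] * 19                                        # possible sums: 0..18
    for c1, c2 in itertools.product(range(10), repeat=2):   # all concept vectors c
        y = c1 + c2                                         # K: the label is the sum of the digits
        p_y[y] += p_c1[c1] * p_c2[c2]                       # u_K(y | c) puts all mass on that y
    return p_y

# Made-up concept distributions, confident that the two images show 3 and 5.
p_c1 = [0.01] * 10; p_c1[3] = 0.91
p_c2 = [0.01] * 10; p_c2[5] = 0.91
p_y = label_distribution(p_c1, p_c2)
print(max(range(19), key=p_y.__getitem__))   # 8
```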
3. Reasoning shortcuts and mitigation strategies

We consider training points (x, y) ∈ 𝒟_{X,Y}, each originating from corresponding ground-truth concepts g ∈ 𝒟_G.² Our starting point is the log-likelihood, which constitutes the objective of training:

    ℒ(𝜃) := ∑_{(x,y)∈𝒟_{X,Y}} log 𝑝𝜃(y | x; K) ≡ ∑_{g∈𝒟_G} log 𝑝𝜃(ℎ(g) | 𝑓(g); K)    (2)

Notice that all optima of Eq. (2) satisfy 𝑝𝜃(y | x; K) = 1 for all examples. By Eq. (1), this entails that any c ∼ 𝑝𝜃(c | x) must satisfy the knowledge K, that is, (c, y) |= K (see [1, Theorem 3.2]). How many alternative distributions 𝑝𝜃(c | g) := 𝑝𝜃(c | 𝑓(g)) attain maximum likelihood?

² We assume the training examples are noiseless and cover all possible combinations of ground-truth factors G, as even this "ideal" setting admits RSs.

Since 𝑝𝜃 is a neural network, there may be infinitely many, yet all of them except one are RSs. This is sufficient to show that RSs cannot be discriminated from the ground-truth concept distribution based on likelihood alone [1]. Importantly, it turns out that all optimal distributions 𝑝𝜃(c | x) are convex combinations of the deterministic optima (det-opts), that is, those distributions 𝑝𝜃(c | g) mapping each g to a unique c with probability one. If the likelihood admits a single det-opt, this is also the only solution and – by construction – it recovers the ground-truth concepts. RSs arise when there are two or more det-opts.

How many det-opts are there? Let 𝑆_y = {c : (c, y) |= K} be the set of c's that K assigns to label y. Notice that if 𝑝𝜃(c | g) attains maximum likelihood, then any c ∼ 𝑝𝜃(c | g) falls within 𝑆_{ℎ(g)}. In this sense, a det-opt implicitly maps each vector g ∈ 𝒢 to a vector c ∈ 𝑆_{ℎ(g)}. This gives us a mechanism to count det-opts: for each g there are exactly |𝑆_{ℎ(g)}| vectors c that it can be mapped to, meaning that the number of det-opts for Eq. (2) is:

    #det-opts(ℒ) = ∏_{y∈𝒴} |𝑆_y|^{|𝑆_y|}    (3)

As a consequence, the ground-truth concepts can only be retrieved if |𝑆_y| = 1 for all y, i.e., if each label y can be deduced from a unique c. This is seldom the case in NeSy tasks, meaning that maximizing the likelihood of the labels Y cannot rule out RSs in general. In the following, we discuss two natural mitigation strategies and their impact in reducing the total number of det-opts.

Reconstruction is insufficient. Given that the likelihood is incapable of discriminating between intended and RS solutions, one option is to augment it with a term encouraging the learned concepts C to capture the information necessary to reconstruct the input X, for instance:

    ℛ(𝜃) = ∑_{x∈𝒟_X} [ ∑_c 𝑝𝜃(c | x) log 𝑝𝜓(x | c) ] ≡ ∑_{g∈𝒟_G} [ ∑_c 𝑝𝜃(c | g) log 𝑝𝜓(g | c) ]    (4)

Here, 𝑝𝜓(x | c) is the distribution output by a neural decoder with parameters 𝜓, and we introduced 𝑝𝜓(g | c) := 𝑝𝜓(𝑓(g) | c). The optima of Eq. (4) must satisfy 𝑝𝜓(g | c) = 1 for all c ∼ 𝑝𝜃(c | g). In other words, restricting again to det-opts for the encoder, the only det-opts that ensure perfect reconstruction are those mapping distinct g's to distinct c's, i.e., those that ensure the encoder is injective. How many such det-opts are there? Notice that these det-opts can be enumerated by taking each g ∈ 𝒢 in turn and mapping it to an arbitrary c in 𝑆_{ℎ(g)} without replacement (to ensure injectivity), until all g's have been mapped. This entails that the number of det-opts – under perfect reconstruction – becomes:

    #det-opts(ℒ + ℛ) = ∏_{y∈𝒴} |𝑆_y|!    (5)

Once again, unless |𝑆_y| = 1 for all y's, there are multiple possible solutions, most of which are RSs. In other words, adding a reconstruction term can be insufficient to completely rule out learning reasoning shortcuts.
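To make the counting concrete, the following sketch enumerates the sets 𝑆_y for a given knowledge (here the three-bit XOR task of Section 4, assuming the label is a function of the concepts) and evaluates Eq. (3) and Eq. (5). It is an illustrative helper under the paper's assumptions, not code from the paper.

```python
import itertools
from math import factorial, prod

# Count deterministic optima (det-opts) for a NeSy task whose knowledge maps a
# Boolean concept vector c to a label y, following Eqs. (3) and (5). Sketch
# under the paper's assumptions (noiseless data covering all of G).

def count_det_opts(knowledge, k):
    concepts = itertools.product([0, 1], repeat=k)
    S = {}                                      # S[y] = {c : (c, y) |= K}
    for c in concepts:
        S.setdefault(knowledge(c), []).append(c)
    with_label_sup = prod(len(S_y) ** len(S_y) for S_y in S.values())      # Eq. (3)
    with_reconstruction = prod(factorial(len(S_y)) for S_y in S.values())  # Eq. (5)
    return with_label_sup, with_reconstruction

parity = lambda c: c[0] ^ c[1] ^ c[2]           # three-bit XOR task of Section 4
print(count_det_opts(parity, k=3))              # (65536, 576) = (4^4 * 4^4, 4! * 4!)
```

Running it reproduces the det-opt counts reported for the toy task in Section 4.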
The effect of concept supervision. Next, we consider a scenario where concept supervision is provided (for all concepts) for at least some examples (x, g) ∈ 𝒟_{X,G}. For simplicity, we consider the 𝐿2 loss for fitting the supervision:

    𝒞(𝜃) ∝ ∑_{(x,g)∈𝒟_{X,G}} ∑_c (c − g)² 𝑝𝜃(c | x) ≡ ∑_{g∈𝒟_G} ∑_c (c − g)² 𝑝𝜃(c | g)    (6)

The only concept distributions 𝑝𝜃(c | g) minimizing Eq. (6) are those that allocate all probability mass to the annotated concepts. Now, let 𝜈_y be the number of vectors c ∈ 𝑆_y for which we have concept supervision, for a total of ∑_y 𝜈_y supervised concept vectors. The situation is analogous to Eq. (3) and Eq. (5), except that now for exactly 𝜈_y vectors we know exactly which c they should be mapped to, leaving the remaining |𝑆_y| − 𝜈_y vectors unconstrained. This gives:

    #det-opts(ℒ + 𝒞) = ∏_{y∈𝒴} |𝑆_y|^{|𝑆_y| − 𝜈_y},    #det-opts(ℒ + ℛ + 𝒞) = ∏_{y∈𝒴} (|𝑆_y| − 𝜈_y)!    (7)

Here, the first count gives the det-opts that optimize both the label likelihood and the concept supervision, and the second those that optimize the likelihood, the reconstruction term, and the concept supervision. This shows that providing concept supervision can dramatically reduce the number of det-opts, but also that a substantial amount of it is necessary to rule out all RSs.

4. Empirical Verification

We outline a toy experiment showing how reasoning shortcuts affect even a simple NeSy task. Let g = (𝑔₁, 𝑔₂, 𝑔₃) be three bits and consider the task of predicting their parity, that is, 𝑦 = 𝑔₁ ⊕ 𝑔₂ ⊕ 𝑔₃. Each label 𝑦 ∈ {0, 1} can be deduced from 4 possible concept vectors g. We train two MLPs, one directly encoding g into 𝑝𝜃(c | g), and another decoding c into 𝑝𝜓(g | c). Labels are predicted as per Eq. (1). For this problem, the total number of det-opts given by Eq. (3) is #det-opts(ℒ) = 4⁴ · 4⁴, and that given by Eq. (5) is #det-opts(ℒ + ℛ) = 4! · 4!. Empirically, without concept supervision the model does pick up reasoning shortcuts to solve the task. Fig. 1 shows two such RSs, both optimal, obtained by our model when optimizing (b) only the likelihood, and (c) both the likelihood and the reconstruction term. In both cases, the solutions fail to recover the ground-truth concepts.
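For reference, below is a minimal PyTorch sketch of the kind of predictor used in this experiment: an MLP encoder mapping g to independent Bernoulli concept probabilities, labels predicted by marginalization as in Eq. (1), and training on the label likelihood alone (the setting of Fig. 1(b)). The architecture and hyperparameters are illustrative choices, not the exact configuration behind Fig. 1; depending on the random seed, the learned map from g to c may or may not be the identity, i.e., it is often a reasoning shortcut.

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)

# All 8 ground-truth concept vectors g in {0,1}^3 and their parity labels.
G = torch.tensor(list(itertools.product([0, 1], repeat=3)), dtype=torch.float32)
Y = (G.sum(dim=1) % 2).long()                      # y = g1 xor g2 xor g3
C_ALL = G.clone()                                  # candidate concept vectors c

encoder = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.Adam(encoder.parameters(), lr=0.05)

def label_probs(g_batch):
    """p(y | g; K) = sum_c 1{parity(c) = y} * p_theta(c | g), as in Eq. (1)."""
    p_bits = torch.sigmoid(encoder(g_batch))       # independent Bernoulli concepts
    # Probability of each of the 8 candidate concept vectors under p_theta(c | g).
    p_c = torch.stack([
        (p_bits * c + (1 - p_bits) * (1 - c)).prod(dim=1) for c in C_ALL
    ], dim=1)                                      # shape (batch, 8)
    parity = (C_ALL.sum(dim=1) % 2).long()         # label entailed by each c
    p_y1 = p_c[:, parity == 1].sum(dim=1)
    return torch.stack([1 - p_y1, p_y1], dim=1)

for step in range(2000):                           # maximize the label likelihood only
    opt.zero_grad()
    loss = nn.functional.nll_loss(torch.log(label_probs(G) + 1e-9), Y)
    loss.backward()
    opt.step()

with torch.no_grad():
    c_pred = (torch.sigmoid(encoder(G)) > 0.5).int()
print(torch.column_stack([G.int(), c_pred]))       # left: g, right: most likely c
# Every row satisfies the knowledge (same parity), but the map g -> c need not
# be the identity: when it is not, the model has learned a reasoning shortcut.
```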
Conclusion. Our results altogether show that the ground-truth concepts are hard, if not impossible, to recover empirically, and that two natural mitigation strategies do not completely address the problem. In particular, the amount of concept supervision required grows linearly with the number of possible concept combinations. We envisage that well-tuned strategies based on targeted concept supervision, combined with additional restrictions on the model itself (and specifically disentanglement between concepts [10]), will likely facilitate (provable) identification of the ground-truth concepts. This is left to future work.

References

[1] E. Marconato, G. Bontempo, E. Ficarra, S. Calderara, A. Passerini, S. Teso, Neuro-symbolic continual learning: Knowledge, reasoning shortcuts and concept rehearsal, arXiv preprint arXiv:2302.01242 (2023).
[2] Z. Li, Z. Liu, Y. Yao, J. Xu, T. Chen, X. Ma, L. Jian, et al., Learning with logical constraints but without shortcut satisfaction, in: The Eleventh International Conference on Learning Representations, 2023.
[3] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. De Raedt, DeepProbLog: Neural probabilistic logic programming, in: NeurIPS, 2018.
[4] L. De Raedt, S. Dumančić, R. Manhaeve, G. Marra, From statistical relational to neural-symbolic artificial intelligence, in: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 4943–4950.
[5] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, O. Bachem, Challenging common assumptions in the unsupervised learning of disentangled representations, in: ICML, 2019.
[6] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, Y. Bengio, Toward causal representation learning, Proceedings of the IEEE (2021).
[7] I. Khemakhem, D. Kingma, R. Monti, A. Hyvarinen, Variational autoencoders and nonlinear ICA: A unifying framework, in: AISTATS, 2020.
[8] K. Ahuja, D. Mahajan, V. Syrgkanis, I. Mitliagkas, Towards efficient representation identification in supervised learning, in: Conference on Causal Learning and Reasoning, PMLR, 2022, pp. 19–43.
[9] J. von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, F. Locatello, Self-supervised learning with data augmentations provably isolates content from style, in: NeurIPS, 2021.
[10] R. Suter, D. Miladinovic, B. Schölkopf, S. Bauer, Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness, in: International Conference on Machine Learning, PMLR, 2019, pp. 6056–6065.