<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Computer Vision 115 (2015) 211-252. URL: https://doi.org/10.1007/s11263</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">2640-3498</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3375627.3375830</article-id>
      <title-group>
        <article-title>Attack logics, not outputs: Towards efficient robustification of deep neural networks by falsifying concept-based properties</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raik Dankworth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gesina Schwalbe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Lübeck</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>4</volume>
      <fpage>180</fpage>
      <lpage>186</lpage>
      <abstract>
        <p>Deep neural networks (NNs) for computer vision are vulnerable to adversarial attacks, i.e., minuscule malicious changes to inputs may induce unintuitive outputs. One key approach to verify and mitigate such robustness issues is to falsify expected output behavior. This allows, e.g., to locally prove security, or to (re)train NNs on obtained adversarial input examples. Due to the black-box nature of NNs, current attacks only falsify a class of the final output, such as flipping from stop_sign to ¬stop_sign. In this short position paper we generalize this to search for generally illogical behavior, as considered in NN verification: falsify constraints (concept-based properties) involving further human-interpretable concepts, like red ∧ octagonal → stop_sign. For this, an easy implementation of concept-based properties on already trained NNs is proposed using techniques from explainable artificial intelligence. Further, we sketch the theoretical proof that attacks on concept-based properties are expected to have a reduced search space compared to simple class falsification, while arguably being better aligned with intuitive robustness targets. As an outlook to this work in progress we hypothesize that this approach has the potential to efficiently and simultaneously improve logical compliance and robustness.</p>
      </abstract>
      <kwd-group>
        <kwd>Trustworthy AI</kwd>
        <kwd>Neural Network Verification</kwd>
        <kwd>Adversarial Attack</kwd>
        <kwd>Explainable Neural Network</kwd>
        <kwd>Concept-based XAI</kwd>
        <kwd>Computer Vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Neural Networks (NNs) excel in processing subsymbolic inputs like images, and are increasingly being considered for use in safety-critical domains [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This makes it crucial to ensure their robust and intuitive generalization, at least around known training cases. One tool to evaluate vulnerability to malicious attacks are Adversarial Attacks (AAs): These craft inputs that induce incorrect or unexpected predictions, using minimal modifications to a correctly handled input x with y = f(x) [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7 ref8">2, 3, 4, 5, 6, 7, 8</xref>
        ]. However, existing attacks solely focus on altering the model's final output, i.e., falsify ∀x′ ∈ Nbhd(x) : f(x′) = y for some neighborhood Nbhd(x) around x, like an ε-ball. This disregards whether the prediction still conforms to high-level, interpretable properties. Common examples of known properties are sufficient conditions, e.g., red(x) ∧ octagonal(x) =⇒ stop_sign(x) in traffic sign recognition from images; and necessary conditions, like ¬octagonal(x) =⇒ ¬stop_sign(x). More generally, rules involving unary predicates not available from the NN outputs are here called concept-based properties. Rich semantic rules are known to be well suited for runtime plausibility monitoring [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] and respective fixing of NN outputs [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. In particular, they do not constrain the local output to be correct, but the underlying general logical reasoning locally around the sample.
      </p>
      <p>One reason why falsification of such informative constraints is not considered for attack generation is that they require outputs for all involved predicates, not only the available final output, like stop_sign. These, however, might need a considerable amount of training data or hyperparameter tuning if added right away during training; or, even worse, not all properties and thus not all required concepts might be known at training time due to specification gaps or later domain transfer.</p>
      <p>7th International Workshop on Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis (OVERLAY 2025), October 26, 2025, Bologna, Italy. r.dankworth@uni-luebeck.de (R. Dankworth); gesina.schwalbe@uni-luebeck.de (G. Schwalbe). https://isp.uni-luebeck.de/staf/r-dankworth (R. Dankworth); https://isp.uni-luebeck.de/staf/g-schwalbe (G. Schwalbe). ORCID: 0009-0001-5617-2069 (R. Dankworth); 0000-0003-2690-2478 (G. Schwalbe)</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        The trick we now use here is that NNs automatically learn to encode task-related concepts in their
intermediate outputs. For example, when trained for stop_sign recognition, the NN may implicitly
learn to identify octagons, red color, and the stop_label. Post-hoc supervised concept-based
explainability methods [
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref16">13, 14, 15, 16</xref>
        ] can recover this information in a very sample-efficient manner
with minimal additions to the NN structure.
      </p>
      <p>Altogether, we propose and theoretically analyze a general AA goal, the Concept-based Property Attack (ConPAtt), that explicitly targets falsification of symbolic concept-based properties over nonsymbolic inputs. As we will show, our formulation offers a more general way to define both targeted and untargeted attacks. Furthermore, as opposed to classical attacks that purely change the output, our attack on ¬octagonal =⇒ ¬stop_sign can produce an image still classified as stop_sign, in which the octagonal concept is no longer recognized. This newly allows uncovering failure cases with semantically inconsistent yet possibly high-confidence predictions that are invisible to standard attacks. As we show, standard white-box attack techniques can still easily and efficiently be applied, producing meaningful attacks and a more constrained adversarial space compared to traditional AAs.
Contributions. Our main contributions are:
• We introduce ConPAtt, a general XAI-supported adversarial attack goal that targets concept-based properties rather than just NN outputs.
• We prove that ConPAtt generalizes both classical targeted and untargeted AA formulations, but with a same-sized or smaller adversarial space.
• We hypothesize several advantages of ConPAtts for certifying robustness and for adversarial retraining, posing the chance to efficiently improve both semantic consistency and robustness.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Adversarial Attacks. AAs generally search within the vicinity of an input sample x for minimally perturbed variants x̃ = x + δ that have a malicious effect on the NN's output [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The perturbations can be arbitrary (digital AAs, as considered here) [
        <xref ref-type="bibr" rid="ref6 ref7">18, 19, 20, 21, 6, 7</xref>
        ], or further constrained to realistic changes (physical AAs) [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">3, 4, 22, 23, 2, 5</xref>
        ]. However, the minimality often makes the changes invisible or difficult for humans to see. At the methodological level, black-box approaches only require access to NN inputs and outputs [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">23, 2, 4, 24, 25, 5</xref>
        ]. White-box attacks as considered here instead exploit NN model internals, such as the gradient, for a more efficient search [
        <xref ref-type="bibr" rid="ref3 ref6 ref7">18, 19, 21, 3, 6, 7</xref>
        ]. Generally, AAs can be seen as a subfield of NN verification that falsifies a continuity property [26, 27]. Thus, the usual search, reachability analysis, and, most prominently, optimization techniques are applicable to find or disprove adversarial examples [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Regarding types of specifications beyond continuity properties, approaches such as Scenic [28] and VerifAI [29] demonstrate how formal specifications can be used to generate and analyze simulation-based scenarios with symbolic inputs. In contrast, our approach targets AAs on non-symbolic image inputs, which prevents the direct use of such tools but similarly requires formal specifications.
      </p>
      <p>
        Concept-based Explainability. Concept-based explainability generally aims to associate human-interpretable concepts with representations in NN latent space [30, 31, 32]. This includes understanding which concepts are relevant to the decision and to what extent [
        <xref ref-type="bibr" rid="ref16">33, 16</xref>
        ], and how these can be accurately recognized in NNs [
        <xref ref-type="bibr" rid="ref14">14, 34</xref>
        ]. If concept definitions in the form of labeled samples are available at training time, ante-hoc approaches [35, 33, 36, 37, 38] can train individual neurons to activate for the concept. We here instead consider post-hoc approaches: These train a simple model to predict the concept of interest from an NN layer's activation [39]. Unlike single-neuron associations [40, 41] or complex models [42, 43], the linear models considered here [44, 39, 45, 46] pose a good tradeoff between capturing the entanglement of representations [44, 47], interpretability [39], and a favorably simple representation of the concept as a halfspace in the NN's latent space.
      </p>
      <p>XAI and Verification. Prior work has shown that concept-based explanation methods are vulnerable to adversarial attacks. Perturbations can mislead attribution [48] and concept-based tools [49, 50], and adversarial examples significantly alter the internal concept composition of NNs [49], confirming the general fragility of interpretability methods [51]. However, these studies target concepts in isolation, without considering their joint relation to model predictions.</p>
      <p>
        Beyond highlighting vulnerabilities, concept outputs have also been used for verification. Mangal et al. [52] employed vision-language models to check concept-based properties. While expressive, this approach relies on semantic similarity in multimodal embeddings (e.g., CLIP [53]), which can introduce linguistic ambiguity as well as imprecision for similar terms with small visual differences, e.g., circle versus octagon. Moreover, it is restricted to the latent space of a specific layer, although simple visual concepts may predominantly appear earlier and diminish in later layers. Cheng et al. [54] proposed specifications close to the output layer, but without decomposing them into underlying concepts and at the cost of an additional NN. Semantic losses [
        <xref ref-type="bibr" rid="ref12">55, 12</xref>
        ] like logic tensor networks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] suggest directly training concept-based rules into the network. These techniques, however, are only used for updating the NN, not for verification as done in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and not for AAs. Furthermore, they rely on concepts being direct outputs of the NN. Decoupling the verification even further from the NN's learned representations, and thus exacerbating training efforts, Xie et al. [56] trained completely separate NNs for predicting the concepts. Our work also directly addresses the relationship between concepts and model outputs, a perspective that has received little attention so far [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, similar to the verification and testing techniques from [
        <xref ref-type="bibr" rid="ref9">54, 9</xref>
        ], we suggest keeping training and verification efforts low by using faithful explainability techniques to access concept predictions, and we newly apply the setup to AAs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>Adversarial Attacks. Let x ∈ X be a real image, y ∈ Y its true label, and f : X → Y a NN. An AA seeks an adversarial example x_adv := x + δ ∈ X such that its output is (sufficiently) different from the original, and the perturbation δ is minimal with respect to an objective function d (usually the L1, L2, or L∞ norm on the input for digital attacks). Sufficient difference can be formulated in terms of a y-specific partition of the output set Y into a benign output set Y⁺ ⊂ Y with f(x) ∈ Y⁺, and a malicious one Y⁻ := Y ∖ Y⁺. The search for the minimum perturbation δ then is the optimization problem

argmin_δ d(δ)  s.t.  f(x + δ) ∈ Y⁻ .  (1)</p>
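      <p>The optimization problem above can be sketched with a simple projected sign-gradient search. The linear softmax classifier, step size, and L∞ projection below are illustrative assumptions for a self-contained toy example, not the paper's setup.</p>
      <p>
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pgd_untargeted(W, b, x, y, eps=1.5, step=0.3, iters=100):
    """Sketch of the optimization problem (untargeted case) for a linear
    softmax classifier: search a small perturbation delta, kept inside an
    eps L-inf ball, such that the true class y is no longer predicted,
    i.e. f(x + delta) lands in the malicious output set Y-."""
    delta = np.zeros_like(x)
    for _ in range(iters):
        p = softmax(W @ (x + delta) + b)
        if p.argmax() != y:          # reached the malicious set Y-
            break
        grad = W.T @ (p[y] * (np.eye(len(p))[y] - p))  # d p_y / d x
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
    return delta

# toy 2-class example: x is confidently classified as class 0
W = np.eye(2)
b = np.zeros(2)
x = np.array([2.0, 0.0])
delta = pgd_untargeted(W, b, x, y=0)
assert softmax(W @ (x + delta) + b).argmax() != 0   # prediction flipped
```
</p>
      <p>The sign step and fixed ε-ball mimic common white-box attacks; any other perturbation-minimizing search would fit the same malicious-set formulation.</p>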
      <p>
Adversarial attack strategies for classification are categorized as targeted or untargeted according to their choice of Y⁻: let conf_k : Y → [0, 1] denote the confidence assigned to class k, and t_k ∈ [0, 1] the threshold required to accept class k. In untargeted attacks, the goal is to reduce the confidence of the true class below its threshold, i.e., Y⁻ = {ŷ ∈ Y | conf_y(ŷ) &lt; t_y}. In contrast, targeted attacks aim to raise the confidence of an incorrect class y′ above its threshold, i.e., Y⁻ = {ŷ ∈ Y | conf_y′(ŷ) ≥ t_y′}.
Post-hoc Concept Extraction. Let C be a set of concepts (e.g., C = {red, octagonal}), and assume a possibly small classification dataset D_c = ((x, y_{x,c})) is available per concept c ∈ C. Further denote by f_{i→j} : A_i → A_j the NN part that maps from the i-th to the j-th layer. Through linear post-hoc concept extraction, additional concept outputs are added to the NN by attaching, for each c, a linear classification model g_c : A_{ℒ_c} → B = [0, 1] to the ℒ_c-th hidden layer, as illustrated in Figure 1. Keeping the NN's weights fixed, the weights of g_c are trained on pairs ((f_{0→ℒ_c}(x), y_{x,c})), such that c's concept function f_c = g_c ∘ f_{0→ℒ_c} : X → B correctly predicts the presence of the concept in an input image. Note that g_c being linear conveniently makes any subspace {a ∈ A_{ℒ_c} | g_c(a) &gt; t_c} an affine linear half-space. In the following, we denote by f_C = (f_c)_{c∈C} : X → B^C the complete prediction of all concepts, and by Z = Y × B^C the complete output set after attaching the concept outputs.
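</p>
      <p>The post-hoc linear concept extraction just described can be sketched as follows. The activations, dataset, and training loop are illustrative stand-ins (a numpy logistic regression plays the role of the linear probe g_c); a real setup would record activations of the frozen NN.</p>
      <p>
```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in layer activations f_{0->Lc}(x) of the frozen NN for a small
# concept dataset: 50 concept-positive and 50 concept-negative samples.
# The synthetic clusters only serve to make the sketch runnable.
pos = rng.normal(loc=+1.0, scale=0.5, size=(50, 8))
neg = rng.normal(loc=-1.0, scale=0.5, size=(50, 8))
A = np.vstack([pos, neg])
labels = np.concatenate([np.ones(50), np.zeros(50)])

def train_probe(A, labels, lr=0.5, iters=200):
    """Fit the linear concept model g_c(a) = sigmoid(w.a + b) while the NN
    stays frozen. Its decision region {a | g_c(a) > 0.5} is an affine
    halfspace in the latent space."""
    w, b = np.zeros(A.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A @ w + b)))
        w -= lr * A.T @ (p - labels) / len(labels)
        b -= lr * (p - labels).mean()
    return w, b

w, b = train_probe(A, labels)
pred = (A @ w + b) > 0.0            # g_c(a) > 0.5  <=>  w.a + b > 0
print("probe accuracy:", (pred == labels.astype(bool)).mean())
```
</p>
      <p>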
(Figure 1 sketch: input layer, hidden layers 1 to 3, and output layer, with a concept output attached to a hidden layer.)
T-Norm Fuzzy Logic. The standard Boolean logical connectives (and ∧, or ∨, not ¬) can only operate on binary truth values in B = {0, 1}. T-norm fuzzy logics extend the connectives to many-valued truth values in B = [0, 1] using a so-called t-norm ∧̃ : [0, 1] × [0, 1] → [0, 1] to replace the ∧. A valid t-norm must be monotonic, commutative and associative, have a neutral element (the 1), and match ∧ on Boolean values. Typical choices for a ∧̃ b are the Product (a · b), Łukasiewicz (max(0, a + b − 1)), and Gödel (min(a, b)) t-norms [57], since these form a generating system for all continuous t-norms. Given a ∧̃, the connectives ¬̃a := 1 − a and ∨̃, =⇒̃ : [0, 1]² → [0, 1] can be derived and maintain desirable properties, giving the resulting t-norm logic.
      </p>
      <p>
        Desirable properties for the use of t-norm logic with NN classification outputs are: (1) The NN typically produces a confidence prediction in [0, 1] instead of a Boolean value, which can be propagated by t-norm fuzzy logic to the confidence of entire logical expressions. (2) The classical piece-wise continuous t-norm connectives are also piece-wise differentiable, like ReLU activations of NNs, so they can directly be used in backpropagation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
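      <p>The three generating t-norms and the derived connectives can be written down directly. Below is a small plain-Python sketch (function names illustrative) evaluating the running example red ∧ octagonal =⇒ stop_sign on soft confidences.</p>
      <p>
```python
# The three generating t-norms on [0, 1].
def t_product(a, b):     return a * b
def t_lukasiewicz(a, b): return max(0.0, a + b - 1.0)
def t_goedel(a, b):      return min(a, b)

# Derived connectives: not, or (De Morgan dual), and implication.
def f_not(a):                      return 1.0 - a
def f_or(a, b, t=t_product):       return f_not(t(f_not(a), f_not(b)))
def f_implies(a, b, t=t_product):  return f_or(f_not(a), b, t)

# Fuzzy truth of  red AND octagonal => stop_sign  for soft confidences:
red, octagonal, stop_sign = 0.9, 0.8, 0.2
antecedent = t_product(red, octagonal)          # 0.72
print(f_implies(antecedent, stop_sign))         # ~0.424: rule largely breached
```
</p>
      <p>All three t-norms agree with Boolean ∧ on {0, 1}, so the same property evaluates identically on crisp predictions regardless of the chosen t-norm.</p>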
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>In this section we first define our new notion of concept-based AAs. Then we show that standard AAs are a special case, and that existing attack techniques can easily be adapted to our new attack.</p>
      <sec id="sec-4-1">
        <title>4.1. Concept-based Property Attacks</title>
        <p>The classical AA types from Section 3 can also be interpreted as special cases of property attacks, where class predictions are treated as logical literals. Using fuzzy logic (see Section 3), we can evaluate logical expressions over outputs using a function solve_P : Y → B that returns the truth value of a property P. A property attack falsifies a given property, i.e., Y⁻ = Y⁻_P = {ŷ ∈ Y | solve_{¬P}(ŷ)}. Untargeted and targeted attacks correspond to the properties P = y and P = ¬y′, respectively.</p>
        <p>This perspective allows adversarial examples to be crafted with higher-order conditions, e.g., enforcing both "dog" (d) and "cat" (c) simultaneously. The corresponding attacked property is its logical negation: P = ¬d ∨ ¬c.</p>
        <p>The point of view of property attacks can also be applied to NNs that are augmented with XAI techniques: the additional concept outputs can be used alongside the original task output of the NN to define property attacks, called Concept-based Property Attacks (ConPAtt). For denoting the properties we propose the following intuitive and convenient implication format generalizing our introductory examples (all logical expressions can be reformulated like this, see Lemma 1). Note that for simplicity we shorten c(x) to c, and (¬)c shorthands a possibly negated c.</p>
        <p>Lemma 1. Each logical expression P with two disjoint literal sets C and Y can be reformulated into a term of conjunctively linked implication terms where antecedents consist only of conjunctively linked, possibly negated literals of C, and consequents consist only of disjunctively linked, possibly negated literals of Y.</p>
        <p>Proof. Each logical expression can be reformulated into the conjunctive normal form P ≡ ⋀_{i&gt;0} (⋁_{c∈C_i⊆C} (¬)c ∨ ⋁_{y∈Y_i⊆Y} (¬)y). Let us introduce two additional variable families A_i, B_i that condense the disjunctive subformulas:

A_i := ¬ ⋁_{c∈C_i⊆C} (¬)c ≡ ⋀_{c∈C_i⊆C} ¬(¬)c ,  (2)
B_i := ⋁_{y∈Y_i⊆Y} (¬)y .  (3)

The subformulas can be replaced by these variables, and the whole logical expression P reformulates to P ≡ ⋀_{i&gt;0} (¬A_i ∨ B_i) ≡ ⋀_{i&gt;0} (A_i =⇒ B_i).</p>
        <p>Definition 1 (Concept-based property). A concept-based property P is a logical expression with two disjoint literal sets C (the concept literals) and Y (the task literals) in the form of conjunctively linked implication terms whose antecedents consist only of conjunctively linked, possibly negated concept literals and whose consequents consist only of disjunctively linked, possibly negated task literals:

P := ⋀_{i&gt;0} (A_i =⇒ B_i) , with A_i := ⋀_{c∈C_i⊆C} (¬)c and B_i := ⋁_{y∈Y_i⊆Y} (¬)y .  (4)</p>
        <p>Definition 2 (Concept-based Property Attack). Let solve_P : Z → B be the function to calculate the truth value of a concept-based property P which evaluates to true at an input x, and d a minimality measure for perturbations δ. A Concept-based Property Attack of P is the search for a d-minimal perturbation δ to an input x into an adversarial example x_adv = x + δ which falsifies P, i.e., lies in the malicious output set</p>
        <p>Z⁻ = Z⁻_P = {z ∈ Z | solve_{¬P}(z)} .</p>
        <p>Intuitively, a ConPAtt adversarial example x_adv to P = (⋀ c =⇒ ⋁ y), like red ∧ octagonal =⇒ stop_sign, causes the NN to predict all c as true and all y as false. This can happen if (1) some c is predicted true even though it should be false (e.g., red predicted true even though the change δ turned the sign gray), and/or (2) some y is predicted negative even though it should be positive (e.g., stop_sign flipped to false).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. ConPAtts as Generalized Adversarial Attacks</title>
        <p>Note that falsifying one implication term is enough to falsify a concept-based property; thus, it is sufficient to consider a single implication P = A =⇒ B for an attack. The set of adversarial example task outputs can be derived from this definition, i.e., Y⁻_P := {y ∈ Y | (y, c) ∈ Z⁻_P}. Furthermore:
Theorem 1. Standard targeted and untargeted AAs are special cases of ConPAtt.</p>
        <p>Proof. First note the two special cases of ConPAtt where only a single task literal is used:
1. Generalized untargeted AAs: A =⇒ y.</p>
        <p>2. Generalized targeted AAs: A =⇒ ¬y′.</p>
        <p>Standard un-/targeted AAs are, respectively, generalized un-/targeted AAs with A ≡ true, i.e., without concept restriction.</p>
        <p>A neat property of ConPAtts is that the search space is generally reduced compared to vanilla AAs:
Theorem 2. The task output spaces of adversarial examples for generalized untargeted/targeted AAs are smaller than or equal to those for standard untargeted/targeted AAs:</p>
        <p>Y⁻_{A =⇒ y} ⊆ Y⁻_y and Y⁻_{A =⇒ ¬y′} ⊆ Y⁻_{¬y′} .</p>
        <p>Proof. Let us first look at generalized untargeted AA properties like A =⇒ y. Each adversarial example must lack the class prediction y but requires the concept predictions A, i.e., it satisfies the property A ∧ ¬y. In contrast to that, standard untargeted AAs only require the misclassification of y, i.e., each adversarial example satisfies ¬y, and they also accept adversarial examples that do not additionally fulfill A. It follows that the valid output space of adversarial examples for generalized untargeted AAs Z⁻_{A =⇒ y} is smaller than or equal to that for standard untargeted AAs Z⁻_y, and likewise for their valid task output spaces, Y⁻_{A =⇒ y} ⊆ Y⁻_y.</p>
        <p>In this argument, it does not matter whether the adversarial examples are expected to produce a misclassification ¬y or a specific task output y′. Hence, the same relation also holds between generalized targeted AAs and standard targeted AAs, i.e., Y⁻_{A =⇒ ¬y′} ⊆ Y⁻_{¬y′}.</p>
        <p>ConPAtt Procedure. ConPAtt can easily be performed with any existing AA approach. The trick is to use the result of the (partially) differentiable fuzzy operation solve_P ∘ (f, f_C) : X → B instead of the output of the NN. This makes ConPAtt a targeted AA with the expected result False, or 0, for the adversarial examples.</p>
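        <p>This procedure can be sketched as a descent on the fuzzy truth value of P. The two sigmoid heads standing in for a concept probe and the task confidence, as well as the finite-difference gradient, are illustrative assumptions to keep the sketch dependency-free, not the paper's models.</p>
        <p>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear "concept" and "task" heads on a 2-d input
# (stand-ins for g_c and the NN's class confidence).
def concept(x):  return sigmoid(x[0])     # e.g. octagonal
def task(x):     return sigmoid(x[1])     # e.g. stop_sign

def property_truth(x):
    """Fuzzy truth of P = (concept => task) under the product t-norm:
    1 - c(x) * (1 - y(x)). A ConPAtt drives this toward 0 (False)."""
    return 1.0 - concept(x) * (1.0 - task(x))

def conpatt(x, step=0.1, iters=300, h=1e-4):
    """Sign-gradient descent on the property truth value; the gradient is
    taken by central finite differences for simplicity."""
    x = x.copy()
    for _ in range(iters):
        g = np.array([(property_truth(x + h * e) - property_truth(x - h * e)) / (2 * h)
                      for e in np.eye(len(x))])
        x -= step * np.sign(g)
    return x

x0 = np.array([2.0, 2.0])        # concept and task both confidently present
x_adv = conpatt(x0)
# the attack keeps the concept active while suppressing the task output
assert property_truth(x_adv) < 0.05 < property_truth(x0)
```
</p>
        <p>Note that, unlike a plain class attack, the descent must keep the concept confidence high while lowering the task confidence, exactly the A ∧ ¬y condition of the falsified implication.</p>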
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Outlook: ConPAtt for Adversarial Training</title>
      <p>In the following we discuss further what practical benefits we expect from this more general formulation
of attack goals, how this could be evaluated, and which challenges are still open.</p>
      <sec id="sec-5-1">
        <title>5.1. Hypothesized Benefits of ConPAtts</title>
        <p>We hypothesize that
• generalized (un-)targeted AAs with at least one concept reduce the search space for adversarial examples not only theoretically but also empirically, and
• the adversarial examples obtained via ConPAtt are particularly efficient for retraining because they are pinpointed adversarial examples with high information content.</p>
        <p>ConPAtts versus Standard AAs: To understand the above claims, one should first have a closer look at the vulnerabilities that can be exploited for a successful ConPAtt against a concept-based property P = (A =⇒ B). Standard AAs capture any case where the final output B is changed, regardless of whether this results in illogical behavior breaking P or not. Thus, standard AAs may primarily focus on turning off causally related early-layer concepts, i.e., falsifying A to falsify B; for example, falsifying red to cause a negative output of stop_sign. This is not sufficient for a ConPAtt on P, for which not only B must become false, but simultaneously A must remain true (cf. Theorem 2). It is therefore not guaranteed that one obtains the same results for ConPAtts against any of the following concept-based properties:
• P_B = (true =⇒ B), which is the standard AA against the output B,
• P_A = (¬A =⇒ false), which is the standard AA against the concept outputs, i.e., the attack flips any concept c in the conjunction A = ⋀ c to false, and
• P = (A =⇒ B), which is a generalized concept-based property attack.</p>
        <p>Whether the obtained adversarial examples are similar depends on whether it is easier to attack concepts, in which case falsifying P_A and P_B should yield similar results, or logics, in which case falsifying P_B and P are expected to yield similar results. Since concepts themselves represent noisy variables with imperfect accuracy, chances are high that attacking concepts generally is easier than attacking logics. Our ConPAtt framework provides the option to test and train on these different rules individually, and hence to distinguish in a more fine-grained manner between simply attacking the concepts or outputs, and truly attacking internal logics.</p>
        <p>
          Benefits of Targeting Logics: One reason for both of the claims lies on the semantic level: Human-defined properties typically encode important knowledge about the task at hand and thus should strengthen both the adherence to the properties and, indirectly, the actual main task of the network. Given that well-generalizing NNs typically adopt this knowledge to a large extent, the cases of logic breaches should be few but meaningful. This would make attacking logics especially beneficial for retraining purposes, similar to adversarial training [
          <xref ref-type="bibr" rid="ref17">17, 58</xref>
          ].
        </p>
        <p>Benefits for Computational Efficiency: ConPAtts also directly benefit from low integration overhead: (1) Preparation only requires cheap post-hoc concept extraction; (2) only very few additional operations (the g_c) are added that need backpropagation/-tracing if gradient-based attack methods are used; and (3) the beneficial formulation of concepts as half-spaces in latent space allows efficient reachability analysis with a substantial reduction in the search space, as illustrated in Figure 2 and sketched in Appendix A. Next steps should empirically test the attack success and the effect of retraining with adversarial examples of this approach.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Future Work: Evaluation and Challenges</title>
        <p>
          Planned Experimental Setting: We suggest evaluating several aspects to ensure a comprehensive assessment. As metrics, we consider both task performance and rule adherence, measured through accuracy and Intersection-over-Union (IoU) for task prediction as well as rule satisfaction. In addition, we track the success of adversarial attacks before retraining, as well as the effectiveness of defences and the accuracy of concepts after retraining. For evaluation, we draw on three established datasets: MNIST [59], GTSRB [60], and ImageNet [61]. The models include self-trained simple architectures for MNIST and GTSRB, as well as a range of widely used ImageNet classifiers: Inception-v3 [62], Inception-v4 [63], Inception-ResNet-v2 [63], ResNet-v2-101 [64], and the ensemble-adversarially trained variants Inc-v3ens3, Inc-v3ens4, and IncRes-v2ens [58]. For baselines, we rely on several state-of-the-art adversarial attack methods, namely SGM [19], VMI-FGSM and VNI-FGSM [21], L2T [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and BSR [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>The attacked concept-based properties reflect both simple and more complex relations. Examples include that class 1 implies the concept line, that classes 1 and 2 should never be predicted simultaneously (i.e., ¬1 ∨ ¬2), and that the concepts red, octagon, and stop_label together imply stop_sign.</p>
        <p>Challenges and Further Future Work: As explained above, it is expected that ConPAtts do not necessarily yield the same results as standard AAs that attack outputs or concepts. In addition to the above experiments, one could contrastively compare the results of the different attacks for insights into how large the gap truly is. However, a considerable challenge for the experimental evaluation is that retraining procedures may need to be adapted: (Adversarially) retraining with respect to the task output might accidentally destroy the post-hoc attached concept outputs. Countermeasures might be to freeze earlier NN parts up to the concept prediction, or to alternatingly or simultaneously retrain the NN and the concept predictors. Experiments must show how to balance the need for concept labels with concept accuracy during adversarial finetuning.</p>
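        <p>The rule-satisfaction metric mentioned above could, for instance, be computed as the fraction of samples whose predicted outputs satisfy the property. A minimal sketch with hypothetical predictions:</p>
        <p>
```python
# Rule-adherence sketch: fraction of samples on which a property
# P = (antecedent concepts => task label) holds for the model's outputs.
def rule_satisfaction(samples):
    """samples: list of (concept_preds: list[bool], task_pred: bool)."""
    holds = [(not all(cs)) or y for cs, y in samples]
    return sum(holds) / len(holds)

preds = [([True, True], True),    # property satisfied
         ([True, True], False),   # breached: antecedent true, task false
         ([False, True], False)]  # satisfied vacuously
print(rule_satisfaction(preds))   # 2/3
```
</p>
        <p>Tracking this number before and after adversarial finetuning would separate gains in rule adherence from plain accuracy gains.</p>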
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this position paper, we introduce a novel generalized adversarial attack goal: Instead of targeting a change in (respectively, falsification of) the output class, our attacks aim to falsify the compliance of the NN with prior symbolic knowledge on sufficient indicators for an output class. Standard AAs are shown to be a special case of our generalized formulation for concept-based properties. These properties also allow substantially reducing the expected search space of the AA with an increasing number of concepts. Moreover, we argue that concept-based properties provide a more natural and human-aligned target for AAs. This suggests that they might be particularly suited for NN robustification via adversarial model (re)training or runtime monitoring.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported through the junior research group project “chAI” funded by the German
Federal Ministry of Research, Technology and Space (BMFTR), grant no. 01IS24058. The authors are
solely responsible for the content of this publication.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT based on GPT-4o in order to improve the writing style. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Considerations for Reachability-based Search</title>
      <p>Existing reachability-based techniques conduct forward and/or backward passes through the NN to
trace or estimate regions of interest through the NN processing. We here show how the considered
concept-based properties give rise to a particularly efficient formulation of this approach: being
halfspaces in intermediate layers, the (negated) concepts have the potential to easily and substantially
reduce the adversarial space that one needs to keep track of halfway through the network, and they can
also be easily described in later layers, as sketched in Figure 2. In the following, this is illustrated for a
back-propagation approach for a simple generalized untargeted attack on a property (c_1 ∧ · · · ∧ c_n) ⟹ C. Recall
that a valid counterexample falsifying the property (c_1 ∧ · · · ∧ c_n) ⟹ C must fulfil c_1 ∧ · · · ∧ c_n ∧ ¬C.</p>
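      <p>As a minimal, illustrative sketch of this condition (the function name, score representation, and thresholds below are our own assumptions, not part of the formalism): a candidate input is a valid counterexample exactly when every concept score clears its threshold while the class score does not.</p>

```python
# Illustrative sketch (hypothetical names): checking the counterexample
# condition c_1 ∧ · · · ∧ c_n ∧ ¬C on concept scores and a class score.

def is_counterexample(concept_scores, class_score,
                      concept_threshold=0.5, class_threshold=0.5):
    """True iff all concepts are present but the class is not predicted."""
    all_concepts_hold = all(s >= concept_threshold for s in concept_scores)
    class_holds = class_score >= class_threshold
    return all_concepts_hold and not class_holds

# E.g. red and octagonal are detected, yet stop_sign is not predicted,
# so the property (red ∧ octagonal ⟹ stop_sign) is falsified:
print(is_counterexample([0.9, 0.8], 0.1))  # True
```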
      <p>
        Denote by f_{ℒ→ℒ′} : ℒ → ℒ′ the NN part mapping from layer ℒ to layer ℒ′, and by c_i^{(ℒ′→)} = c_i ∘ f_{ℒ′→ℒ_i} : ℒ′ → [0, 1] the function evaluating the presence of concept c_i in its layer ℒ_i for a latent vector
from the earlier layer ℒ′. Denote by ℒ_i the layer which was chosen for the embedding of concept
c_i, and let ℒ_1 be the earliest layer for which ℒ_1 = ℒ_i for some i. Re-indexing the subsequent layers as ℒ_1, …, ℒ_{m−1}, let ℒ_{m−1} be the final
representation layer before the output confidence prediction, lying m − 1 layers after ℒ_1. Let
H_i = { v ∈ ℒ_i | c_i^{(ℒ_i→)}(v) ≥ t_i } be the halfspace of the concept c_i in the concept’s layer ℒ_i, for a concept threshold t_i.
      </p>
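      <p>Concept embeddings of this kind are commonly realized as linear probes on latent vectors, so a membership test for H_i reduces to a thresholded dot product. A small sketch under that assumption (probe weights, latent vector, and threshold are illustrative):</p>

```python
import numpy as np

def in_concept_halfspace(v, w, t):
    """Test v ∈ H = { v | w·v ≥ t } for a linear concept probe with normal w."""
    return bool(np.dot(w, v) >= t)

w_red = np.array([1.0, -0.5, 0.0])  # hypothetical probe for concept "red"
v = np.array([0.8, 0.2, 0.4])       # latent vector in the concept's layer
print(in_concept_halfspace(v, w_red, t=0.5))  # True, since w·v = 0.7 ≥ 0.5
```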
      <p>Now we can reformulate the falsification as a search for a region in latent space:</p>
      <p>Lemma 2. A representation v = f_{→ℒ}(x) ∈ ℒ in layer ℒ of a valid counterexample x ∈ X to the
concept-based property (c_1 ∧ · · · ∧ c_n) ⟹ C must fulfil v ∈ ⋂_{i : ℒ ≤ ℒ_i} f_{ℒ→ℒ_i}⁻¹(H_i).</p>
      <p>While it is costly to determine each f_{ℒ→ℒ_i}⁻¹(H_i) independently, the concept-based property gives rise to a
recursive definition:</p>
      <p>Theorem 3. Recursively define the propagation of halfspace intersections through the NN as
P_{m−1} = ⋂_{i : ℒ_i = ℒ_{m−1}} H_i ,    P_k = f_{ℒ_k→ℒ_{k+1}}⁻¹(P_{k+1}) ∩ ⋂_{i : ℒ_i = ℒ_k} H_i .    (5)
Then for any counterexample x to the above concept-based property it must hold that f_{→ℒ_1}(x) ∈ P_1. P_1 can
be efficiently calculated using a single backward propagation through layers m − 1 to 1.</p>
      <p>Proof. The property inductively follows from the definition, noting that P_1 = ⋂_i f_{ℒ_1→ℒ_i}⁻¹(H_i) and the
constraint of considering ReLU networks.</p>
      <p>In particular, each propagation step only requires obtaining a polytope’s preimage for a single NN
layer operation and applying a cheap intersection of the resulting polytope with halfspaces. This makes
the first part of the search very efficient, promising a speedup compared to a full end-to-end search for
counterexamples directly in the input space.</p>
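      <p>For purely affine layers, the backward step of this recursion is explicit: the preimage of a polytope { v | A v ≤ c } under v = W u + b is again a polytope { u | (A W) u ≤ c − A b }, and intersecting with concept halfspaces merely appends constraint rows. The following sketch assumes affine layers only (it omits the piecewise-linear case handling a ReLU would require) and uses illustrative matrices:</p>

```python
import numpy as np

def preimage_affine(A, c, W, b):
    """Preimage of the polytope { v | A v ≤ c } under the affine layer
    v = W u + b, i.e. { u | (A W) u ≤ c − A b }."""
    return A @ W, c - A @ b

def intersect_halfspaces(A, c, normals, offsets):
    """Intersect { u | A u ≤ c } with halfspaces n_i · u ≤ o_i by
    appending constraint rows."""
    return np.vstack([A, normals]), np.concatenate([c, offsets])

# Hypothetical 2D example: pull the constraint v_0 ≤ 1 back through
# the layer v = diag(2, 1) u, then add a halfspace u_1 ≤ 0.5.
A, c = np.array([[1.0, 0.0]]), np.array([1.0])
W, b = np.array([[2.0, 0.0], [0.0, 1.0]]), np.zeros(2)
A1, c1 = preimage_affine(A, c, W, b)   # constraint becomes 2 u_0 ≤ 1
A2, c2 = intersect_halfspaces(A1, c1,
                              np.array([[0.0, 1.0]]), np.array([0.5]))
```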
      <p>The forward-propagation case is similar. Here, it can additionally be shown that the propagated
region always remains a connected polytope, since neither intersection with halfspaces nor the forward
pass through continuous layer operations changes this property.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rech</surname>
          </string-name>
          ,
          <article-title>Artificial Neural Networks for Space and Safety-Critical Applications: Reliability Issues and Potential Solutions</article-title>
          ,
          <source>IEEE Transactions on Nuclear Science</source>
          <volume>71</volume>
          (
          <year>2024</year>
          )
          <fpage>377</fpage>
          -
          <lpage>404</lpage>
          . URL: https://ieeexplore.ieee.org/abstract/document/10380628. doi:10.1109/TNS.2024.3349956.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Suryanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Larasati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>DTA: Physical Camouflage Attacks Using Differentiable Transformation Network</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>15305</fpage>
          -
          <lpage>15314</lpage>
          . URL: https://openaccess.thecvf.com/content/CVPR2022/html/Suryanto_DTA_Physical_Camouflage_Attacks_Using_Differentiable_Transformation_Network_CVPR_2022_paper.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>Physical-World Optical Adversarial Attacks on 3D Face Recognition</article-title>
          ,
          <year>2023</year>
          , pp.
          <fpage>24699</fpage>
          -
          <lpage>24708</lpage>
          . URL: https://openaccess.thecvf.com/content/CVPR2023/html/Li_Physical-World_Optical_Adversarial_Attacks_on_3D_Face_Recognition_CVPR_2023_paper.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tiliwalidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Adversarial Laser Spot: Robust and Covert Physical-World Attack to DNNs</article-title>
          , in:
          <source>Proceedings of The 14th Asian Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          . URL: https://proceedings.mlr.press/v189/hu23b.html, ISSN: 2640-3498.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving</article-title>
          ,
          <year>2024</year>
          , pp.
          <fpage>24452</fpage>
          -
          <lpage>24461</lpage>
          . URL: https://openaccess.thecvf.com/content/CVPR2024/html/Zheng_Physical_3D_Adversarial_Attacks_against_Monocular_Depth_Estimation_in_Autonomous_CVPR_2024_paper.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Learning to Transform Dynamically for Better Adversarial Transferability</article-title>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=k76ngWX9OR.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Boosting Adversarial Transferability by Block Shuffle and Rotation</article-title>
          , in: 2024
          <source>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>24336</fpage>
          -
          <lpage>24346</lpage>
          . URL: https://ieeexplore.ieee.org/abstract/document/10656871. doi:10.1109/CVPR52733.2024.02297, ISSN: 2575-7075.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <source>Boosting the Transferability of Adversarial Attack on Vision Transformer with Adaptive Token Tuning</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=sNz7tptCH6.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Schwalbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Enabling verification of deep neural networks in perception tasks using fuzzy logic and concept embeddings</article-title>
          ,
          <year>2022</year>
          . doi:10.48550/arXiv.2201.00572. arXiv:2201.00572.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stoian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cuzzolin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lukasiewicz</surname>
          </string-name>
          ,
          <article-title>ROAD-R: The Autonomous Driving Dataset with Logical Requirements</article-title>
          , in: IJCLR 2022 Workshops, Vienna, Austria,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ledaguenel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hudelot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khouadjia</surname>
          </string-name>
          ,
          <article-title>Improving Neural-based Classification with Logical Background Knowledge</article-title>
          ,
          <source>in: ECAI 2024 Workshop Proceedings</source>
          , arXiv, Santiago de Compostela, Spain,
          <year>2024</year>
          . arXiv:2402.13019.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Badreddine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Avila Garcez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spranger</surname>
          </string-name>
          ,
          <article-title>Logic Tensor Networks</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>303</volume>
          (
          <year>2022</year>
          )
          <fpage>103649</fpage>
          . doi:10.1016/j.artint.2021.103649.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <source>Network Dissection: Quantifying Interpretability of Deep Visual Representations</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6541</fpage>
          -
          <lpage>6549</lpage>
          . URL: https://openaccess.thecvf.com/content_cvpr_2017/html/Bau_Network_Dissection_Quantifying_CVPR_2017_paper.html.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fong</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Vedaldi,</surname>
          </string-name>
          <article-title>Net2Vec: Quantifying and Explaining How Concepts Are Encoded by Filters in Deep Neural Networks</article-title>
          ,
          <year>2018</year>
          , pp.
          <fpage>8730</fpage>
          -
          <lpage>8738</lpage>
          . URL: https://openaccess.thecvf.com/content_cvpr_2018/html/Fong_Net2Vec_Quantifying_and_CVPR_2018_paper.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Crabbé</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van der Schaar</surname>
          </string-name>
          ,
          <article-title>Concept Activation Regions: A Generalized Framework For Concept-Based Explanations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>2590</fpage>
          -
          <lpage>2607</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/11a7f429d75f9f8c6e9c630aeb6524b5-Abstract-Conference.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Oikarinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <source>CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=iPWiwWHc1V.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Khamaiseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bagagem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Alaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mancino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Alomari</surname>
          </string-name>
          ,
          <article-title>Adversarial Deep Learning: A Survey on Adversarial Attacks and Defense Mechanisms on Image Classification</article-title>
          ,
          <source>IEEE Access</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>102266</fpage>
          -
          <lpage>102291</lpage>
          . doi:10.1109/ACCESS.2022.3208131.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>