<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Critically Assessing the State of the Art in CPU-based Local Robustness Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthias König</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annelot W. Bosman</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Holger H. Hoos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan N. van Rijn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair for AI Methodology, RWTH Aachen University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leiden Institute of Advanced Computer Science, Leiden University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of British Columbia</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent research has proposed various methods to formally verify neural networks against minimal input perturbations. This type of verification is referred to as local robustness verification. The research area of local robustness verification is highly diverse, as verifiers rely on a multitude of techniques, including mixed integer programming and satisfiability modulo theories. At the same time, problem instances differ based on the network to be verified, the verification property and the specific network input. This gives rise to the question of which verification algorithm is most suitable to solve a given verification problem. To answer this question, we perform a systematic performance analysis of several CPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, each verified against 100 robustness properties. Notably, we show that there is no single best algorithm that dominates in performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. Furthermore, we confirm the notion that most algorithms only support ReLU-based networks, while other activation functions remain under-supported.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmark Analysis</kwd>
        <kwd>Neural Network Verification</kwd>
        <kwd>Adversarial Robustness</kwd>
        <kwd>Shapley Value</kwd>
        <kwd>Algorithm Portfolios</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, deep learning methods based on neural networks have been increasingly applied within various safety-critical domains and use contexts, ranging from manoeuvre advisory systems in unmanned aircraft to face recognition systems in mobile phones (see, e.g., Julian et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Furthermore, it is now well known that neural networks are vulnerable to adversarial examples [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where a given input is manipulated in such a way that it is misclassified by the network. In the case of image recognition tasks, the required perturbation, whether it is adversarially crafted or arises accidentally, can be so small that it remains virtually undetectable to the human eye. Against this background, much work has focused on developing methods to provide formal guarantees regarding the behaviour of a given neural network [
        <xref ref-type="bibr" rid="ref10 ref11 ref3 ref4 ref5 ref6 ref7 ref8 ref9">3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14</xref>
        ]. For instance, a network employed in autonomous driving for detecting traffic signs should always produce accurate predictions, even when the input is slightly perturbed; failing to do so could have fatal consequences. This specific type of assessment refers to local robustness verification, a broadly studied verification task, in which a network is systematically tested against various input perturbations under predefined norms, such as the ℓ∞ norm [15, 16].
      </p>
      <p>
        Neural network verification is a highly diverse research area, and existing methods rely on a broad range of techniques. At the same time, neural networks differ in terms of their architecture, such as the number of hidden layers and nodes, the type of non-linearities, e.g., ReLU, Sigmoid or Tanh, or the type of operations they employ, e.g., pooling or convolutional layers. This diversity, both in terms of verification approaches and neural network design, makes it challenging for researchers or practitioners to assess and decide which method is most suitable for verifying a given neural network [17]. This challenge is amplified by the fact that the neural network verification community does not (yet) use commonly agreed evaluation protocols, which makes it difficult to draw clear conclusions from the literature regarding the capabilities and performance of existing verifiers. More precisely, existing studies use different benchmarks and, so far, have not provided an in-depth performance comparison of a broad range of verification algorithms.
      </p>
      <p>SafeAI 2023: Workshop on Artificial Intelligence Safety, Feb 13–14, 2023, Washington, D.C., US @AAAI-23. * Corresponding author: M. König; a.w.bosman@liacs.leidenuniv.nl (A. W. Bosman); hh@aim.rwth-aachen.de (H. H. Hoos); j.n.van.rijn@liacs.leidenuniv.nl (J. N. van Rijn). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>Recently, a competition series has been initiated, in which several verifiers were applied to different benchmarks (i.e., networks, properties and datasets) and compared in terms of various performance measures, including the number of verified instances as well as running time [18]. While the results from these competitions have provided highly valuable insights into the general progress in neural network verification, several questions are left unanswered. For instance, it remains unclear why a verification system participated in certain competition categories but not in others, although this information would help in understanding the technical capabilities and limitations of a given verifier. Furthermore, aggregation of the results makes it difficult to assess in detail the strengths or weaknesses of verifiers on different instances. Instead, looking at aggregated results, one can easily get the impression that a single approach dominates ‘across the board’ – an assumption that is known to be inaccurate for other problems involving formal verification tasks; see, e.g., Xu et al. [19] or Kadioglu et al. [20] for SAT.</p>
      <p>In this work, we focus exclusively on local robustness verification in image classification against perturbations under the ℓ∞ norm. This scenario represents a widely studied verification task, with a large number of networks being publicly available and many verifiers providing off-the-shelf support. Notice that most verification tasks can be translated into local robustness verification queries [21]; we, therefore, believe that our findings are broadly applicable. Moreover, we seek to overcome the limitations of existing benchmarking approaches and shed light on previously unanswered questions with regard to the state of the art in local robustness verification.</p>
      <p>Our contributions are as follows: (i) We analyse the current state of practice in benchmarking verification algorithms; (ii) we perform a systematic benchmarking study of several, carefully chosen verification methods based on a newly assembled, large and diverse set of networks, including 38 CIFAR and 41 MNIST networks with different activation functions, each verified against 100 robustness properties, for which we consumed a total of 1.69 CPU years in running time; (iii) we present a categorisation of verification benchmarks based on verifier compatibilities with different layer types and operations; (iv) we quantify verifier performance in terms of the number of solved instances, running time, as well as marginal contribution and Shapley value, showing that top-performing verification algorithms strongly complement rather than consistently dominate each other in terms of performance – e.g., while the verifiers Neurify and Verinet achieved competitive performance on the CIFAR networks we considered, the former solved many instances unsolved by the latter and vice versa; (v) lastly, we provide a public repository containing all experimental data, along with the newly assembled network collection (https://github.com/ADA-research/nn-verification-assessment).</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Background</title>
      <p>Neural network verification methods seek to formally assess whether a trained neural network adheres to some predefined input-output property. In this study, we focus on local robustness properties. Given a trained neural network and a set of images as inputs, a robustness verification algorithm aims to verify whether or not there exist slight perturbations to an image that lead the network to predict an incorrect label. The maximum perturbation of each variable in the input, i.e., pixel in a given image, is predefined.</p>
      <p>Formal verification algorithms can be either complete or incomplete [22]. An algorithm that is incomplete does not guarantee to report a solution for every given example; however, incomplete verification algorithms are typically sound, which means they will report that a property holds only if the property actually holds. On the other hand, an algorithm that is sound and complete will correctly state that a property holds whenever it holds, when given sufficient resources to be run to completion. In this work, we focus on complete algorithms, as those arguably represent the most ambitious form of neural network verification, making them preferable over incomplete methods, especially in safety-critical applications.</p>
      <p>
        Early work on complete verification of neural networks utilised satisfiability modulo theories (SMT) solvers [
        <xref ref-type="bibr" rid="ref10">10, 23, 24, 25, 26</xref>
        ], which determine whether a set of logic constraints is satisfiable [27]. The resulting verification problems are NP-complete and challenging to solve in practice. Alternatively, it is possible to formulate the verification task as a constraint optimisation problem using mixed integer programming (MIP) [
        <xref ref-type="bibr" rid="ref4">4, 28, 29, 12</xref>
        ]. MIP solvers essentially optimise an objective function subject to a set of constraints. MIP problems, similar to SMT problems, can be challenging to solve and tend to be computationally expensive (in terms of CPU time and memory). To overcome the computational complexity of SMT and MIP, it has been proposed to use the well-known branch-and-bound algorithm [30] for solving the verification problem, regardless of whether it is modelled as MIP or SMT [
        <xref ref-type="bibr" rid="ref5 ref7">31, 5, 32, 7, 13</xref>
        ]. Besides these three popular approaches, recent work has studied symbolic interval propagation [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9, 33, 34</xref>
        ] and polyhedra [35, 36], zonotope [
        <xref ref-type="bibr" rid="ref12 ref8">8, 37</xref>
        ] and star-set abstraction [
        <xref ref-type="bibr" rid="ref13">38</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Common Practices in Benchmarking Neural Network Verifiers</title>
      <sec id="sec-4-1">
        <p>1 https://github.com/ADA-research/nn-verification-assessment</p>
        <p>Considering the diversity in neural network verification problems, it is quite natural to assume that a single best algorithm does not exist, i.e., a method that outperforms all others under all circumstances. It is still hard to identify to what extent a method contributes to the state of the art, mainly because verification methods are typically evaluated (i) on a small number of benchmarks, which have often been created for the sole purpose of evaluating the method at hand, and (ii) against baseline methods for which it is often unclear how they were chosen, leading to several methods claiming state-of-the-art performance without having been directly compared. We note that in the context of local robustness verification, a benchmark most often represents a neural network classifier trained on the MNIST or CIFAR-10 dataset, respectively.</p>
        <p>As previously mentioned, a competition series has been established, with the goal of providing an objective and fair comparison of the state-of-the-art methods in neural network verification, in terms of scalability and speed [18]. The VNN Competition was held in the years 2020, 2021 and 2022, yet with different protocols (e.g., for running experiments, scoring, etc.), benchmarks and participants. In this discussion, we focus on the 2021 edition, as the report from 2022 has not yet been published at the time of this writing.</p>
        <p>Within VNN 2021, a total of 9 benchmarks were considered, of which 7 represented test cases for local robustness verification of image classification networks. Benchmarks were proposed by the participants themselves and included a total of 11 CIFAR and 7 MNIST networks, which differed in terms of architecture components, such as non-linearities (e.g., ReLU, Tanh, Sigmoid) and layer operations (e.g., convolutional or pooling layers). Networks were trained on the CIFAR-10 and MNIST datasets, respectively. Moreover, each benchmark was composed of random image subsets, excluding images that were misclassified by the given network, along with varying perturbation radii.</p>
        <p>This competition overcame several of the previously reported limitations with regard to the evaluation of network verifiers. Most notably, it covered a relatively large and diverse set of neural networks. Moreover, thanks to the active participation from the community, 12 verification algorithms were included in the competition. At the same time, we see room for further research into the performance of neural network verifiers.</p>
        <p>Firstly, it is not clearly stated in the VNN competition report [18] why verifiers participated in certain competition categories, but not in others. While we assume this to be due to compatibility reasons, we could not find any formal explanation in the report, although this information could provide insight into relevant problem classes and suitable algorithms for solving these.</p>
        <p>Secondly, GPU-accelerated tools were directly compared to those that rely solely on CPU resources, which we argue does not give due credit to the structural differences between these classes of algorithms. GPU-accelerated approaches employ advanced parallelisation schemes, giving rise to substantially higher computational demands than those required by methods running on CPUs, especially when restricted to a single core. For example, an Amazon EC2 GPU-based instance with 32 CPU cores costs around 1.4 times more than an equivalent CPU instance (see https://aws.amazon.com/ec2/pricing/on-demand/).</p>
        <p>Lastly, the competition seeks to determine the current state of the art; however, the competition ranking and scores do not sufficiently quantify the extent to which an algorithm actually contributes to the state of the art. In other words, it is in the nature of competitions to determine a winner, at least implicitly suggesting that a single approach generally outperforms all competitors. However, some verification algorithms might have limited but distinct areas of strength, which cannot be identified through aggregated performance measures, such as the total number of verified instances. Although the competition report [18] shows that individual verifier performance differs among benchmarks, it remains unclear whether all algorithms solve the same set of instances in the given benchmark, or if they complement each other. Similarly, it does not reveal whether or not methods are correlated in their performance.</p>
        <p>4. Verification Algorithms under Assessment</p>
        <p>We consider 5 complete, CPU-based neural network verification algorithms; each of these was chosen because it fulfilled one of the following conditions: it was (i) ranked among the top five verification methods according to the 2021 VNN competition or (ii) supported by the recently published DNNV framework [21]. Altogether, we consider our set of algorithms to be representative of the developments in the area of complete neural network and, more specifically, local robustness verification.</p>
        <p>
          All methods were employed with their default hyperparameter settings, as they would likely be used by practitioners. We note that the performance of a verifier might improve if its hyperparameters were optimised specifically for the given benchmark; however, most verifiers have dozens of hyperparameters (or employ combinatorial solvers that come with their own, extensive set of hyperparameters), which makes this a non-trivial task, requiring additional expertise and resources (see, e.g., König et al. [
          <xref ref-type="bibr" rid="ref14">39</xref>
          ]). The verification algorithms we considered are the following:
        </p>
        <p>
          BaB. The algorithm proposed by Bunel et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] restates the verification problem as a global optimisation problem, which is then solved using branch-and-bound search. It further incorporates algorithmic improvements with regard to branching and bounding procedures, such as smart branching; i.e., before splitting, it computes fast bounds on each of the possible subdomains and chooses the one with the tightest bounds. We refer to this method as BaBSB for the remainder of this paper. BaBSB supports ReLU-based networks.
        </p>
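        <p>To illustrate how branch-and-bound with bound-guided branching operates, the following is a minimal, hypothetical sketch: a two-input toy function with naive interval lower bounds, where a subdomain is proven safe once its fast lower bound is positive, a concrete counterexample is sought at the midpoint, and the subdomain with the tightest bound is explored first. This is an illustration of the general scheme only, not the BaBSB implementation.</p>

```python
# Sketch of branch-and-bound verification with bound-guided ("smart")
# branching on a hypothetical toy model: verify that g(x) > 0 for all x
# in an input box, where bounds come from naive interval arithmetic.

def relu(v):
    return max(v, 0.0)

def g(x):
    # toy "network": two ReLU units over two inputs, fixed weights
    return relu(2.0 * x[0] - x[1]) + relu(x[1]) - 0.5

def lower_bound(box):
    # crude, sound interval lower bound of g over box = [(lo, hi), (lo, hi)]
    lo0, hi0 = box[0]
    lo1, hi1 = box[1]
    h1 = max(2.0 * lo0 - hi1, 0.0)   # lower bound of relu(2*x0 - x1)
    h2 = max(lo1, 0.0)               # lower bound of relu(x1)
    return h1 + h2 - 0.5

def split(box):
    # split the widest input dimension in half
    widths = [hi - lo for lo, hi in box]
    d = widths.index(max(widths))
    lo, hi = box[d]
    mid = (lo + hi) / 2.0
    left, right = list(box), list(box)
    left[d] = (lo, mid)
    right[d] = (mid, hi)
    return left, right

def verify(box, depth=0, max_depth=20):
    if lower_bound(box) > 0.0:
        return True                  # property proven on this subdomain
    mid = [(lo + hi) / 2.0 for lo, hi in box]
    if g(mid) <= 0.0:
        return False                 # concrete counterexample found
    if depth == max_depth:
        return False                 # give up at depth limit
    left, right = split(box)
    # "smart branching" flavour: explore the subdomain with the
    # tightest (largest) lower bound first
    first, second = sorted((left, right), key=lower_bound, reverse=True)
    return verify(first, depth + 1, max_depth) and verify(second, depth + 1, max_depth)

print(verify([(0.4, 0.6), (0.4, 0.6)]))   # property holds on this box
```

The same call on the box [0, 1] × [0, 1] returns False, since points such as (0, 0) violate the property.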
        <p>Marabou. The Marabou framework [23] employs
SMT solving techniques, specifically the lazy search
technique for handling non-linear constraints. Furthermore,
Marabou employs deduction techniques to obtain
information on the activation functions that can be used to
simplify them. The core of the SMT solver is
simplex-based, which means that the variable assignments are
made using the simplex algorithm. Marabou supports
ReLU and Sigmoid activations as well as MaxPooling
operations.</p>
        <p>Neurify. The verification algorithm proposed by
Wang et al. [33] relies on symbolic interval propagation
to create over-approximations, followed by a refinement
strategy based on symbolic gradient information. The
constraint refinement aims to tighten the bounds of the
approximation of activation functions. Neurify can
process networks containing ReLU activations.</p>
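        <p>The over-approximation step can be illustrated with plain (non-symbolic) interval bound propagation, a simpler relative of the symbolic intervals Neurify uses; the sketch below (hypothetical weights, not Neurify's code) soundly propagates elementwise bounds through affine and ReLU layers.</p>

```python
# Minimal sketch of interval bound propagation through a ReLU network;
# the weights below are hypothetical and chosen only for illustration.
import numpy as np

def propagate_interval(lo, hi, weights, biases):
    """Propagate elementwise input bounds [lo, hi] through affine + ReLU layers."""
    for k, (W, b) in enumerate(zip(weights, biases)):
        Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
        # sound affine bounds: positive weights take the same-side bound,
        # negative weights take the opposite-side bound
        new_lo = Wp @ lo + Wn @ hi + b
        new_hi = Wp @ hi + Wn @ lo + b
        if k + 1 != len(weights):        # ReLU on all hidden layers
            new_lo = np.maximum(new_lo, 0.0)
            new_hi = np.maximum(new_hi, 0.0)
        lo, hi = new_lo, new_hi
    return lo, hi

W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.array([-0.2])
lo, hi = propagate_interval(np.array([0.4, 0.4]), np.array([0.6, 0.6]), [W1, W2], [b1, b2])
print(lo, hi)
```

If the resulting lower bound on the output is already positive, the property is verified without refinement; otherwise a refinement step (as in Neurify) tightens the activation bounds.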
        <p>
          nnenum. The verifier proposed by Bak et al. [
          <xref ref-type="bibr" rid="ref13">38</xref>
          ]
utilises star sets to represent the values each layer of a
neural network can attain. By propagating these through
the network, it checks whether one or more of the star
sets results in an adversarial example. This verifier can
handle networks with ReLU activations.
        </p>
        <p>
          Verinet. The verifier developed by Henriksen and
Lomuscio [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] combines symbolic intervals with
gradient-based adversarial local search for finding
counterexamples. The authors further propose a splitting
heuristic for interval propagation based on the influence of a
given node on the bounds of the network output. Verinet
supports networks containing ReLU, Sigmoid and Tanh
activations.
        </p>
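        <p>The counterexample-search component can be sketched as follows; note that Verinet uses gradient-based search, whereas this hypothetical toy version probes each input coordinate at the boundary of the ℓ∞ ball instead, and the "network" is a stand-in score function.</p>

```python
# Simplified sketch of adversarial local search inside an L-infinity ball.
# Verinet guides this search with gradients; here we probe coordinates at
# the ball boundary instead. predict() is a hypothetical toy score:
# a non-positive value stands for a misclassification (counterexample).
import numpy as np

def predict(x):
    return 1.0 - 3.0 * x[0] + 0.5 * x[1]

def local_search(x0, eps, steps=50):
    """Search for x with max-norm distance to x0 at most eps and predict(x) <= 0."""
    x = x0.copy()
    for _ in range(steps):
        if predict(x) <= 0.0:
            return x                       # counterexample found
        best = x
        for i in range(len(x)):
            for delta in (-eps, eps):
                cand = x.copy()
                # stay inside the ball around x0 and the valid pixel range
                cand[i] = np.clip(x0[i] + delta, 0.0, 1.0)
                if predict(cand) < predict(best):
                    best = cand
        if np.array_equal(best, x):
            return None                    # local minimum, search failed
        x = best
    return x if predict(x) <= 0.0 else None

print(local_search(np.array([0.3, 0.5]), eps=0.012))
```

A failed search does not prove robustness; it only means the (incomplete) search step found no counterexample, and the verifier must fall back on its bounding and splitting machinery.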
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Setup for Empirical Evaluation</title>
      <p>In this work, we seek to provide a clearer picture of the
state of the art in neural network verification. More
specifically, we argue that the state of the art is not just
defined by a single verification algorithm, as there might
be verifiers which on their own perform poorly but still
make meaningful contributions by excelling on limited
instance subsets that are challenging for other
verification methods.</p>
      <p>
        In the following, we will present an overview of how
we set up our benchmark study, i.e., how we selected
problem instances and verification algorithms.
Furthermore, we will provide details on the software we used
and the execution environment in which our experiments
were carried out.
For our assessment, we compiled a high-quality set of
problem instances for local robustness verification.
Following best practices in other research areas, such as
optimisation [
        <xref ref-type="bibr" rid="ref15 ref16">40, 41</xref>
        ], the benchmark should be
representative and diverse, where the former refers to how well
the difficulty of the benchmark is aligned with that of
real-world instances from the same problem class, and
the latter means that the problem set should generally
contain problems with a wide range of difficulties.
      </p>
      <p>Overall, our benchmark is comprised of 79 image
classification networks, of which 38 are trained on the
CIFAR-10 dataset and 41 are trained on the MNIST dataset. To
ensure the representativeness of the problem set, all
networks were sampled from the neural network verification
literature, i.e., networks used in existing work on local
robustness verification and provided in public repositories;
in other words, the characteristics of the networks in
our benchmark are assumed to match those of networks
generally used for evaluating verification algorithms.</p>
      <p>We further want our problem set to be diverse.
Therefore, we ensured that the considered networks differ in
size, i.e., the number of hidden layers and nodes, as well
as the type of non-linearities (e.g., ReLU or Tanh) or layer
operations (e.g., pooling layers) they employ.</p>
      <p>
        For each network, we verified 100 local robustness
properties; more precisely, we sampled 100 images from
the dataset on which the network has been trained and
verified for local robustness with perturbation radius ε
set at 0.012. This perturbation radius was chosen to be
a median of values we found in literature: the smallest
value we found was 1/255 in the work by Li et al. [22],
whereas Botoeva et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or Wang et al. [13] utilised
larger values, such as 0.05.
      </p>
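      <p>Concretely, one such local robustness query can be constructed as an ℓ∞ ball of radius ε = 0.012 around a correctly classified image, with bounds clipped to the valid pixel range; a minimal sketch (hypothetical input, not our benchmark code):</p>

```python
# Constructing one local robustness query: elementwise input bounds for
# an L-infinity ball of radius eps around an image, clipped to [0, 1].
import numpy as np

def robustness_query(image, eps=0.012):
    lower = np.clip(image - eps, 0.0, 1.0)
    upper = np.clip(image + eps, 0.0, 1.0)
    # the verifier must show the predicted label is constant on [lower, upper]
    return lower, upper

image = np.full((28, 28), 0.5)     # hypothetical MNIST-sized input
lo, hi = robustness_query(image)
print(lo.min(), hi.max())
```

The clipping matters near the extremes of the pixel range: for a pixel at 0, the lower bound stays at 0 rather than becoming negative.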
      <p>Lastly, we split our benchmark set into diferent
categories based on verifier compatibilities. This means a
verifier is only applied to categories it is able to process.
The categories as well as the instance set size for each
category are shown in Table 1.</p>
      <sec id="sec-5-1">
        <title>5.2. Evaluation Metrics</title>
        <p>
          In order to assess the performance of the various
methods, we compute four performance metrics: the average
running time, the number of solved instances, the
relative marginal contribution and the Shapley value [
          <xref ref-type="bibr" rid="ref17">42</xref>
          ] of
each verifier. Although these metrics present aggregated
measures, they reflect algorithm performance on an
instance level and in relation to other methods included
in the comparison; a more detailed explanation will be
provided in the following paragraphs. Notice that we do
not penalise timeouts when computing average running
time; i.e., the maximum running time equals the given
time limit.
        </p>
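        <p>For illustration, the set-based score v(S), the marginal contribution and the Shapley value can be computed from a per-verifier solved matrix; the sketch below uses a hypothetical three-verifier, four-instance example, not our actual experimental data.</p>

```python
# Illustration of the portfolio-based metrics on a hypothetical example:
# solved[v][j] = 1 if verifier v solved instance j within the time budget.
from itertools import permutations
from math import factorial

solved = {"A": [1, 1, 0, 0], "B": [0, 1, 1, 0], "C": [0, 1, 0, 1]}
INSTANCES = range(4)

def score(subset):
    # v(S): number of instances solved by at least one verifier in S
    return sum(any(solved[v][j] for v in subset) for j in INSTANCES)

def marginal_contribution(v, verifiers):
    # Eq. (1): drop in total score when v is removed from the full set
    return score(verifiers) - score([u for u in verifiers if u != v])

def shapley(v, verifiers):
    # average marginal contribution over all joining orders, as in Eq. (2)
    total = 0
    for order in permutations(verifiers):
        pos = order.index(v)
        prefix = list(order[:pos])
        total += score(prefix + [v]) - score(prefix)
    return total / factorial(len(verifiers))

verifiers = list(solved)
for v in verifiers:
    print(v, score([v]), marginal_contribution(v, verifiers),
          round(shapley(v, verifiers), 3))
```

Note that the Shapley values sum to the score of the full portfolio, whereas marginal contributions alone need not; this is exactly why the two metrics complement each other.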
        <p>[Table 1: benchmark categories (ReLU, ReLU + MaxPool, Tanh, Sigmoid) and totals; the per-category instance counts are not preserved in this version of the text.]</p>
        <p>
          The marginal contribution is computed as follows. Define 𝒜 as a set of given verifiers and let v(𝒜) be the total score of set 𝒜. Here, the total score v(𝒜) consists of the number of instances verified by at least one verifier in set 𝒜. We compute the marginal contribution per algorithm to determine how much the total performance of all algorithms (in terms of solved instances) decreases when the given algorithm is removed from the set of all algorithms, assuming they were employed in a parallel algorithm portfolio. Such portfolios are sets of algorithms that are run in parallel on each given problem instance and complement each other in terms of performance across an instance set [
          <xref ref-type="bibr" rid="ref18">43</xref>
          ]. Formally, to determine the marginal contribution of any of the verifiers i in portfolio 𝒜, one needs to know the value of v(𝒜) and v(𝒜 ∖ {i}), where 𝒜 ∖ {i} is the portfolio minus verifier i. Thus, the marginal contribution of verifier i is expressed as
        </p>
        <p>MC_i = v(𝒜) − v(𝒜 ∖ {i})    (1)</p>
        <p>Following this terminology, we can define the number of solved instances by verifier i via a set consisting only of verifier i, Solved_i = v({i}) − v(∅). In other words, the number of solved instances employs a set of size one and the marginal contribution employs a set of all verifiers under consideration. The relative marginal contribution then represents the marginal contribution of a given verifier as a proportion of the sum of every method's absolute marginal contribution.</p>
        <p>Lastly, the Shapley value is the average marginal contribution of a verifier over all possible joining orders, where joining order refers to the order in which the verifiers are added to a parallel portfolio. This value complements the previous two metrics, as it does not assume a particular order in which algorithms are added to the portfolio. To be precise, the number of solved instances simply represents a joining order in which the considered algorithm comes first, whereas the marginal contribution metric assumes a joining order in which it comes last. However, using fixed orders, as is the case for the marginal contribution, might not reveal possible interactions between the given method and other algorithms; e.g., it might understate the importance of a single algorithm given the presence of another, highly correlated algorithm. In such a case, both algorithms would not be assigned any marginal contribution, even though one of them should be included in a potential portfolio. This is captured by the Shapley value: define n as the number of verifiers under consideration and Π as the set of all subsets of verifiers under consideration, including the empty set, where Π will be of size Σ_{k=0}^{n} n! / (k! · (n − k)!) = 2^n. Finally, define Π_i as all sets in which verifier i is present. The Shapley value of verifier i is then calculated as</p>
        <p>φ_i = (1/n!) · Σ_{S ∈ Π_i} (|S| − 1)! · (n − |S|)! · (v(S) − v(S ∖ {i}))    (2)</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Execution Environment and Software Used</title>
        <p>Our experiments were carried out on a cluster of machines equipped with Intel Xeon E5-2683 CPUs with 32 cores, 40 MB cache size and 94 GB RAM, running CentOS Linux 7. For each verification method, we limited the number of available CPU cores to a single core. Each query was given a time budget of 3 600 seconds and a memory budget of 3 GB. Generally, we executed the verification algorithms through the DNNV interface, version 0.4.8.</p>
        <p>DNNV is a framework that transforms a network and property specification into a unified format, which can then be solved by a given method [21]. More specifically, DNNV takes as input a network in the ONNX format, along with a property specification, and then translates the network and property to the input format required by the verifier. After running the verifier on the transformed problem, it reports whether the property was falsified (i.e., a counterexample was found) or proven to hold.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>by all verifiers, highlighting the structural diferences
between instances and the sensitivity of the verification
apIn the following, we provide an in-depth discussion of the proaches to those instances. That said, Neurify slightly
results obtained from our experiments. Table 1 shows the outperformed Verinet in terms of the number of solved
categories we devised, along with the resulting instance instances (915 vs 841 out of 2 500). Furthermore, Neurify
set sizes as well as information on which verifier has achieved a much larger relative marginal contribution
been employed for each category. than Verinet (0.75 vs 0.20), which means that the
for</p>
      <p>Table 2 reports the number of problem instances mer could solve a relatively large number of instances
solved by each verifier per network category, the rel- which could not be solved by the other methods.
Generative marginal contribution, the Shapley value and the ally, relative marginal contribution scores are much less
average running time. On ReLU-based MNIST networks, evenly distributed among verifiers when compared to
we found Verinet to be the best-performing verifier, solv- the MNIST dataset.
ing 1 799 out of 2 500 instances, while achieving a Shapley Figures 1a and 1b show an instance-level
comparivalue of 618. However, taking relative marginal contribu- son of the two best-performing algorithms (in terms of
tion into account, we found that Neurify achieved the Shapley value) in the ReLU category for each dataset. In
highest relative marginal contribution of 0.25 (compared Figure 1a, we see that on MNIST networks, both Verinet
to 0.16 for Verinet), indicating that it could verify a siz- and nnenum solved instances that the other one could
able fraction of instances on which other methods failed not solve within the given time budget. Concretely, when
to return a solution. Moreover, the relative marginal considering a parallel portfolio containing both
algocontribution scores show that each method could solve rithms, the number of solved instances slightly increases
a sizeable fraction of instances unsolved by any of the to 1 817 out of 2 500 (vs 1 799 solved by Verinet and 1 754
other methods. solved by nnenum alone), while supplied with similar</p>
      <p>On ReLU-based CIFAR networks, it should first be CPU resources (i.e., 1 800 CPU seconds per verifier, which
noted that no verification problem instance can be solved represents half of the total given time budget of 3 600
1.0
0.8
d
e
v
l
o
s
se0.6
c
n
a
ts
n
if
o0.4
n
o
it
c
ra
F
0.2
10 2
1.0
0.8
d
e
v
l
o
s
se0.6
c
n
a
ts
n
if
o0.4
n
o
it
c
ra
F
0.2</p>
      <p>Figures 1a and 1b show an instance-level comparison of the two best-performing algorithms (in terms of Shapley value) in the ReLU category for each dataset. In Figure 1a, we see that on MNIST networks, both Verinet and nnenum solved instances that the other one could not solve within the given time budget. Concretely, when considering a parallel portfolio containing both algorithms, the number of solved instances slightly increases to 1 817 out of 2 500 (vs 1 799 solved by Verinet and 1 754 solved by nnenum alone), while supplied with similar CPU resources (i.e., 1 800 CPU seconds per verifier, which represents half of the total given time budget of 3 600 CPU seconds per verification query as used in our experiments). Notice that leveraging parallel portfolios has already been shown to significantly improve the performance of MIP-based verification methods [<xref ref-type="bibr" rid="ref14">39</xref>].</p>
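<p>The portfolio numbers above follow from a simple rule: an instance counts as solved if at least one portfolio member finishes it within its per-verifier share of the budget. A small sketch, assuming per-instance running times with None marking an instance a verifier cannot solve (the data shown is hypothetical, not our measurements):</p>

```python
def portfolio_solved(runtimes_a, runtimes_b, per_verifier_budget):
    """Count instances solved by a two-verifier parallel portfolio.

    Each list holds one running time per instance, with None marking
    an instance the verifier could not solve at all.  An instance is
    solved if either verifier finishes within the per-verifier budget.
    """
    solved = 0
    for ta, tb in zip(runtimes_a, runtimes_b):
        ok_a = ta is not None and per_verifier_budget >= ta
        ok_b = tb is not None and per_verifier_budget >= tb
        if ok_a or ok_b:
            solved += 1
    return solved

# hypothetical running times (seconds) for three instances
print(portfolio_solved([10.0, None, 200.0], [None, 50.0, 4000.0], 1800))
```

<p>With a total budget of 3 600 CPU seconds split over two verifiers, each member gets 1 800 seconds, matching the setup described above.</p>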
      <p>On CIFAR instances, we found Neurify and Verinet to also have distinctive strengths over each other. This is shown in Figure 1b, where both algorithms could solve a substantial number of instances for which the other could not return a solution. Thus, when combined in a parallel portfolio, 963 instances can be solved (vs 915 solved by Neurify and 841 solved by Verinet alone, out of 2 500 instances), while using the same amount of CPU resources, i.e., 1 800 CPU seconds per verifier. These findings further emphasise the complementarity between the verification algorithms considered in our study. All remaining verifiers achieved substantially lower Shapley values and relative marginal contribution scores, indicating that they would not complement Neurify and Verinet well in a portfolio.</p>
      <p>Figure 2a shows the cumulative distribution function of running times over the MNIST problem instances. As seen in the figure, Verinet tends to solve these problem instances fastest; however, Neurify tended to show even better performance on those instances it was able to solve. We note that most of the instances unsolved by Neurify represent networks that were trained on images with 3 dimensions, whereas Neurify requires images used as network inputs to have 2 or 4 dimensions.</p>
      <p>Figure 2b shows a similar plot for the CIFAR problem
instances. Here, Neurify solved the largest fraction in
less time than other methods. This suggests that Neurify
is a very competitive verifier when applicable to the
specific network or input format.</p>
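<p>The curves in Figures 2a and 2b are empirical distributions of the form "fraction of instances solved within t CPU seconds". A minimal sketch of how one point on such a curve can be computed (the running times are hypothetical; None marks an unsolved instance):</p>

```python
def fraction_solved_within(runtimes, t):
    """Empirical CDF value: fraction of instances solved within t seconds.

    `runtimes` holds one running time per instance; None marks an
    instance the verifier failed to solve within the overall budget.
    """
    finished = [r for r in runtimes if r is not None and t >= r]
    return len(finished) / len(runtimes)

# hypothetical running times for four instances
print(fraction_solved_within([1.0, 10.0, None, 100.0], 10.0))
```

<p>Evaluating this function over a grid of time points yields the full cumulative distribution curve for one verifier.</p>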
      <p>For each of the remaining categories, we found that there is only one verifier that could effectively handle the respective problem instances. Specifically, instances from the ReLU+MaxPooling category can be processed by Marabou, although only a modest number of MNIST instances could be solved in this way. Networks containing Tanh activations can, in principle, be verified by Verinet, but the algorithm did not solve any CIFAR instances either. Lastly, Sigmoid-based networks can be handled by both Verinet and Marabou; however, only the former could solve MNIST instances within the given compute budget.</p>
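<p>The observations above can be condensed into a simple applicability table for dispatching a verification query to candidate verifiers. The mapping below is a hypothetical sketch distilled from this section, not an official compatibility matrix of the tools; in particular, the ReLU row assumes the four verifiers named in this section.</p>

```python
# hypothetical applicability map distilled from the observations in this
# section; not an official compatibility matrix of the tools
SUPPORTED = {
    "relu": ["Verinet", "nnenum", "Neurify", "Marabou"],
    "relu+maxpool": ["Marabou"],
    "tanh": ["Verinet"],
    "sigmoid": ["Verinet", "Marabou"],
}

def candidate_verifiers(activation):
    """Return the verifiers observed to handle a given activation category."""
    return SUPPORTED.get(activation.lower(), [])

print(candidate_verifiers("Tanh"))
```

<p>Such a table could serve as the first stage of a portfolio-based selector: filter by applicability, then allocate the CPU budget among the remaining candidates.</p>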
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this work, we assessed the performance of several local
robustness verification algorithms, i.e., algorithms used
to verify the robustness of an image classification
network against small input perturbations. To conclude, we
found that all considered methods support ReLU-based
networks, while other network types are strongly
under-supported. While this has been suspected in the
community, it has, to our knowledge, not yet been subject
to formal study. Furthermore, we presented evidence
for strong performance complementarity: even within
the same benchmark category, two verification systems
outperform each other on distinct sets of instances. As
we have demonstrated, this complementarity can be
exploited by combining individual verifiers into parallel
portfolios. Lastly, we showed that, in general, the
performance of verifiers strongly differs between image
datasets, with some methods achieving the best
performance on MNIST (in terms of the number of solved
instances and average running time) while falling behind
on CIFAR and vice versa. In future work, it would be
interesting to include a broader set of perturbation radii
and analyse in more detail how the relative performance
of verifiers depends on the given radius. Furthermore, we
are interested in expanding our analysis to GPU-based
verification algorithms.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <sec id="sec-8-1">
        <p>This research was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation program under GA No. 952215.</p>
        <p>[12] V. Tjeng, K. Xiao, R. Tedrake, Evaluating Robustness of Neural Networks with Mixed Integer Programming, in: Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019, pp. 1–21.
[13] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Efficient Formal Safety Analysis of Neural Networks, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 6369–6379.
[14] W. Xiang, H.-D. Tran, T. T. Johnson, Output Reachable Set Estimation and Verification for Multilayer Neural Networks, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 5777–5783.
[15] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and Harnessing Adversarial Examples, arXiv preprint arXiv:1412.6572 (2014).
[16] N. Papernot, P. McDaniel, X. Wu, S. Jha, A. Swami, Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks, in: Proceedings of the 37th IEEE Symposium on Security and Privacy (IEEE S&amp;P 2016), 2016, pp. 582–597.
[17] M. Casadio, E. Komendantskaya, M. L. Daggitt, W. Kokke, G. Katz, G. Amir, I. Refaeli, Neural Network Robustness as a Verification Property: A Principled Case Study, in: Proceedings of the 34th International Conference on Computer Aided Verification (CAV 2022), 2022, pp. 219–231.
[18] S. Bak, C. Liu, T. Johnson, The Second International Verification of Neural Networks Competition (VNN-COMP 2021): Summary and Results, arXiv preprint arXiv:2109.00498 (2021).
[19] L. Xu, F. Hutter, H. H. Hoos, K. Leyton-Brown, SATzilla: Portfolio-based algorithm selection for SAT, Journal of Artificial Intelligence Research 32 (2008) 565–606.
[20] S. Kadioglu, Y. Malitsky, A. Sabharwal, H. Samulowitz, M. Sellmann, Algorithm Selection and Scheduling, in: Proceedings of the Seventeenth International Conference on Principles and Practice of Constraint Programming (CP 2011), 2011, pp. 454–469.
[21] D. Shriver, S. Elbaum, M. B. Dwyer, DNNV: A Framework for Deep Neural Network Verification, in: Proceedings of the 33rd International Conference on Computer Aided Verification (CAV 2021), 2021, pp. 137–150.
[22] L. Li, X. Qi, T. Xie, B. Li, SoK: Certified robustness for deep neural networks, arXiv preprint arXiv:2009.04131 (2020).
[23] G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zeljić, D. L. Dill, M. J. Kochenderfer, C. Barrett, The Marabou Framework for Verification and Analysis of Deep Neural Networks, in: Proceedings of the 31st International Conference on Computer Aided Verification (CAV 2019), 2019, pp. 443–452.
[24] L. Pulina, A. Tacchella, Checking Safety of Neural Networks with SMT Solvers: A Comparative Evaluation, in: AI*IA, 2011, pp. 127–138.
[25] L. Pulina, A. Tacchella, NeVer: A Tool for Artificial Neural Networks Verification, Annals of Mathematics and Artificial Intelligence (2011) 403–425.
[26] L. Pulina, A. Tacchella, Challenging SMT Solvers to Verify Neural Networks, AI Communications (2012) 117–135.
[27] L. de Moura, N. Bjørner, Satisfiability Modulo Theories: An Appetizer, in: Proceedings of the Brazilian Symposium on Formal Methods (SBMF 2009), 2009, pp. 23–36.
[28] S. Dutta, S. Jha, S. Sankaranarayanan, A. Tiwari, Output Range Analysis for Deep Neural Networks, in: Proceedings of the Tenth NASA Formal Methods Symposium (NFM 2018), 2018, pp. 121–138.
[29] A. Lomuscio, L. Maganti, An approach to reachability analysis for feed-forward ReLU neural networks, arXiv preprint arXiv:1706.07351 (2017).
[30] A. H. Land, A. G. Doig, An Automatic Method of Solving Discrete Programming Problems, Econometrica (1960) 497–520.
[31] R. Bunel, I. Turkaslan, P. H. S. Torr, M. P. Kumar, J. Lu, P. Kohli, Branch and Bound for Piecewise Linear Neural Network Verification, Journal of Machine Learning Research (2020) 1574–1612.
[32] A. De Palma, R. Bunel, A. Desmaison, K. Dvijotham, P. Kohli, P. H. S. Torr, M. P. Kumar, Improved Branch and Bound for Neural Network Verification via Lagrangian Decomposition, arXiv preprint arXiv:2104.06718 (2021).
[33] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Formal Security Analysis of Neural Networks using Symbolic Intervals, in: Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1599–1614.
[34] S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, Z. Kolter, Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Neural Network Robustness Verification, in: Advances in Neural Information Processing Systems (NeurIPS 2021), 2021, pp. 29909–29921.
[35] G. Singh, T. Gehr, M. Püschel, M. Vechev, An Abstract Domain for Certifying Neural Networks, in: Proceedings of the 46th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2019), 2019, pp. 1–30.
[36] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, L. Daniel, Efficient Neural Network Robustness Certification with General Activation Functions, in: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018, pp. 4944–4953.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Julian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Owen</surname>
          </string-name>
          ,
          <article-title>Deep neural network compression for aircraft collision avoidance systems</article-title>
          ,
          <source>Journal of Guidance, Control, and Dynamics</source>
          <volume>42</volume>
          (
          <year>2019</year>
          )
          <fpage>598</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Intriguing properties of neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6199</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bastani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ioannou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lampropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vytiniotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Criminisi</surname>
          </string-name>
          ,
          <article-title>Measuring Neural Net Robustness with Constraints</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS</source>
          <year>2016</year>
          ),
          <year>2016</year>
          , pp.
          <fpage>2613</fpage>
          -
          <lpage>2621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Botoeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kouvaros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kronqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lomuscio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Misener</surname>
          </string-name>
          ,
          <article-title>Efficient Verification of ReLU-based Neural Networks via Dependency Analysis</article-title>
          ,
          <source>in: Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3291</fpage>
          -
          <lpage>3299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Bunel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Turkaslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Mudigonda</surname>
          </string-name>
          ,
          <article-title>A Unified View of Piecewise Linear Neural Network Verification</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dvijotham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stanforth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gowal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <article-title>A Dual Approach to Scalable Verification of Deep Networks</article-title>
          ,
          <source>in: Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence (UAI</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>550</fpage>
          -
          <lpage>559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ehlers</surname>
          </string-name>
          ,
          <article-title>Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 15th International Symposium on Automated Technology for Verification and Analysis (ATVA</source>
          <year>2017</year>
          ),
          <year>2017</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gehr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drachsler-Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tsankov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vechev</surname>
          </string-name>
          ,
          <article-title>AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation</article-title>
          ,
          <source>in: Proceedings of the 39th IEEE Symposium on Security and Privacy (IEEE S&amp;P</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Henriksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lomuscio</surname>
          </string-name>
          ,
          <article-title>Efficient Neural Network Verification via Adaptive Refinement and Adversarial Search</article-title>
          ,
          <source>in: Proceedings of the 24th European Conference on Artificial Intelligence (ECAI</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>2513</fpage>
          -
          <lpage>2520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Dill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Julian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          ,
          <article-title>Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Computer Aided Verification (CAV</source>
          <year>2017</year>
          ),
          <year>2017</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Scheibler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Winterer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <article-title>Towards Verification of Artificial Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 18th Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>G.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gehr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Püschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vechev</surname>
          </string-name>
          ,
          <article-title>Fast and Effective Robustness Certification</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>Improved Geometric Path Enumeration for Verifying ReLU Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 32nd International Conference on Computer Aided Verification (CAV</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>M.</given-names>
            <surname>König</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>van Rijn</surname>
          </string-name>
          ,
          <article-title>Speeding up neural network robustness verification via algorithm configuration and an optimised mixed integer linear programming solver portfolio</article-title>
          ,
          <source>Machine Learning</source>
          <volume>111</volume>
          (
          <year>2022</year>
          )
          <fpage>4565</fpage>
          -
          <lpage>4584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Stützle</surname>
          </string-name>
          ,
          <source>Stochastic Local Search: Foundations &amp; Applications</source>
          , Elsevier / Morgan Kaufmann,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bartz-Beielstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doerr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. v. d.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bossek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Eftimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kerschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. La</given-names>
            <surname>Cava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lopez-Ibanez</surname>
          </string-name>
          , et al.,
          <article-title>Benchmarking in Optimization: Best Practice and Open Issues</article-title>
          ,
          <source>arXiv preprint arXiv:2007.03488</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fréchette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kotthoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Michalak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rahwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          ,
          <article-title>Using the shapley value to analyze algorithm portfolios</article-title>
          ,
          <source>in: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>3397</fpage>
          -
          <lpage>3403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Selman</surname>
          </string-name>
          ,
          <article-title>Algorithm portfolios</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>126</volume>
          (
          <year>2001</year>
          )
          <fpage>43</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>