=Paper=
{{Paper
|id=Vol-3381/paper_25
|storemode=property
|title=Critically Assessing the State of the Art in CPU-based Local Robustness Verification
|pdfUrl=https://ceur-ws.org/Vol-3381/25.pdf
|volume=Vol-3381
|authors=Matthias KΓΆnig,Annelot Bosman,Holger Hoos,Jan van Rijn
|dblpUrl=https://dblp.org/rec/conf/aaai/KonigBHR23
}}
==Critically Assessing the State of the Art in CPU-based Local Robustness Verification==

Matthias König¹·*, Annelot W. Bosman¹, Holger H. Hoos¹·²·³ and Jan N. van Rijn¹

¹ Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
² Chair for AI Methodology, RWTH Aachen University, Germany
³ University of British Columbia, Canada

SafeAI 2023: Workshop on Artificial Intelligence Safety, Feb 13–14, 2023, Washington, D.C., US @AAAI-23
* Corresponding author: h.m.t.konig@liacs.leidenuniv.nl (M. König); a.w.bosman@liacs.leidenuniv.nl (A. W. Bosman); hh@aim.rwth-aachen.de (H. H. Hoos); j.n.van.rijn@liacs.leidenuniv.nl (J. N. van Rijn)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract
Recent research has proposed various methods to formally verify neural networks against minimal input perturbations.
This type of verification is referred to as local robustness verification. The research area of local robustness verification
is highly diverse, as verifiers rely on a multitude of techniques, including mixed integer programming and satisfiability
modulo theories. At the same time, problem instances differ based on the network to be verified, the verification property
and the specific network input. This gives rise to the question of which verification algorithm is most suitable to solve a
given verification problem. To answer this question, we perform a systematic performance analysis of several CPU-based
local robustness verification systems on a newly and carefully assembled set of 79 neural networks, each verified against
100 robustness properties. Notably, we show that there is no single best algorithm that dominates in performance across
all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the
potential of leveraging algorithm portfolios for more efficient local robustness verification. Furthermore, we confirm the
notion that most algorithms only support ReLU-based networks, while other activation functions remain under-supported.
Keywords
Benchmark Analysis, Neural Network Verification, Adversarial Robustness, Shapley Value, Algorithm Portfolios
1. Introduction

In recent years, deep learning methods based on neural networks have been increasingly applied within various safety-critical domains and use contexts, ranging from manoeuvre advisory systems in unmanned aircraft to face recognition systems in mobile phones (see, e.g., Julian et al. [1]). Furthermore, it is now well known that neural networks are vulnerable to adversarial examples [2], where a given input is manipulated in such a way that it is misclassified by the network. In the case of image recognition tasks, the required perturbation, whether it is adversarially crafted or arises accidentally, can be so small that it remains virtually undetectable to the human eye. Against this background, much work has focused on developing methods to provide formal guarantees regarding the behaviour of a given neural network [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. For instance, a network employed in autonomous driving for detecting traffic signs should always produce accurate predictions, even when the input is slightly perturbed; failing to do so could have fatal consequences. This specific type of assessment refers to local robustness verification, a broadly studied verification task, in which a network is systematically tested against various input perturbations under predefined norms, such as the ℓ∞ norm [15, 16].

Neural network verification is a highly diverse research area, and existing methods rely on a broad range of techniques. At the same time, neural networks differ in terms of their architecture, such as the number of hidden layers and nodes, the type of non-linearities, e.g., ReLU, Sigmoid or Tanh, or the type of operations they employ, e.g., pooling or convolutional layers. This diversity, both in terms of verification approaches and neural network design, makes it challenging for researchers or practitioners to assess and decide which method is most suitable for verifying a given neural network [17]. This challenge is amplified by the fact that the neural network verification community does not (yet) use commonly agreed evaluation protocols, which makes it difficult to draw clear conclusions from the literature regarding the capabilities and performance of existing verifiers. More precisely, existing studies use different benchmarks and, so far, have not provided an in-depth performance comparison of a broad range of verification algorithms.

Recently, a competition series has been initiated, in which several verifiers were applied to different benchmarks (i.e., networks, properties and datasets) and compared in terms of various performance measures, including the number of verified instances as well as running time [18]. While the results from these competitions have provided highly valuable insights into the general
progress in neural network verification, several questions are left unanswered. For instance, it remains unclear why a verification system participated in certain competition categories but not in others, although this information would help in understanding the technical capabilities and limitations of a given verifier. Furthermore, aggregation of the results makes it difficult to assess in detail the strengths or weaknesses of verifiers on different instances. Instead, looking at aggregated results, one can easily get the impression that a single approach dominates "across the board" – an assumption that is known to be inaccurate for other problems involving formal verification tasks; see, e.g., Xu et al. [19] or Kadioglu et al. [20] for SAT.

In this work, we focus exclusively on local robustness verification in image classification against perturbations under the ℓ∞ norm. This scenario represents a widely studied verification task, with a large number of networks being publicly available and many verifiers providing off-the-shelf support. Notice that most verification tasks can be translated into local robustness verification queries [21]; we, therefore, believe that our findings are broadly applicable. Moreover, we seek to overcome the limitations of existing benchmarking approaches and shed light on previously unanswered questions with regard to the state of the art in local robustness verification.

Our contributions are as follows:
(i) We analyse the current state of practice in benchmarking verification algorithms;
(ii) we perform a systematic benchmarking study of several, carefully chosen verification methods based on a newly assembled, large and diverse set of networks, including 38 CIFAR and 41 MNIST networks with different activation functions, each verified against 100 robustness properties, for which we consumed a total of 1.69 CPU years in running time;
(iii) we present a categorisation of verification benchmarks based on verifier compatibilities with different layer types and operations;
(iv) we quantify verifier performance in terms of the number of solved instances, running time, as well as marginal contribution and Shapley value, showing that top-performing verification algorithms strongly complement rather than consistently dominate each other in terms of performance – e.g., while the verifiers Neurify and Verinet achieved competitive performance on the CIFAR networks we considered, the former solved many instances unsolved by the latter and vice versa;
(v) lastly, we provide a public repository containing all experimental data, along with the newly assembled network collection.¹

¹ https://github.com/ADA-research/nn-verification-assessment

2. Background

Neural network verification methods seek to formally assess whether a trained neural network adheres to some predefined input-output property. In this study, we focus on local robustness properties. Given a trained neural network and a set of images as inputs, a robustness verification algorithm aims to verify whether or not there exist slight perturbations to an image that lead the network to predict an incorrect label. The maximum perturbation of each variable in the input, i.e., pixel in a given image, is predefined.

Formal verification algorithms can be either complete or incomplete [22]. An algorithm that is incomplete does not guarantee to report a solution for every given example; however, incomplete verification algorithms are typically sound, which means they will report that a property holds only if the property actually holds. On the other hand, an algorithm that is sound and complete will correctly state that a property holds whenever it holds, when given sufficient resources to run to completion. In this work, we focus on complete algorithms, as these arguably represent the most ambitious form of neural network verification, making them preferable over incomplete methods, especially in safety-critical applications.

Early work on complete verification of neural networks utilised satisfiability modulo theories (SMT) solvers [10, 23, 24, 25, 26], which determine whether a set of logic constraints is satisfiable [27]. The resulting verification problems are NP-complete and challenging to solve in practice. Alternatively, it is possible to formulate the verification task as a constraint optimisation problem using mixed integer programming (MIP) [4, 28, 29, 12]. MIP solvers essentially optimise an objective function subject to a set of constraints. MIP problems, similar to SMT problems, can be challenging to solve and tend to be computationally expensive (in terms of CPU time and memory).

To overcome the computational complexity of SMT and MIP, it has been proposed to use the well-known branch-and-bound algorithm [30] for solving the verification problem, regardless of whether it is modelled as MIP or SMT [31, 5, 32, 7, 13].

Besides these three popular approaches, recent work has studied symbolic interval propagation [4, 9, 33, 34], as well as polyhedra [35, 36], zonotope [8, 37] and star-set abstraction [38].
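The local robustness property considered throughout this paper can be stated formally as follows (a standard formulation, using our own notation rather than that of any specific cited work): given a network f, an input x with predicted label \arg\max_j f_j(x), and a perturbation radius \varepsilon, f is locally robust at x if

    \forall x' :\; \|x' - x\|_\infty \le \varepsilon \;\Rightarrow\; \arg\max_j f_j(x') = \arg\max_j f_j(x)

A complete verifier either proves this statement or produces a perturbed input x' that violates it, i.e., an adversarial example.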
3. Common Practices in Benchmarking Neural Network Verifiers

Considering the diversity in neural network verification problems, it is quite natural to assume that a single best algorithm does not exist, i.e., a method that outperforms all others under all circumstances. It is still hard to identify to what extent a method contributes to the state of the art, mainly because verification methods are typically evaluated (i) on a small number of benchmarks, which have often been created for the sole purpose of evaluating the method at hand, and (ii) against baseline methods for which it is often unclear how they were chosen, leading to several methods claiming state-of-the-art performance without having been directly compared. We note that in the context of local robustness verification, a benchmark most often represents a neural network classifier trained on the MNIST or CIFAR-10 dataset.

As previously mentioned, a competition series has been established with the goal of providing an objective and fair comparison of the state-of-the-art methods in neural network verification, in terms of scalability and speed [18]. The VNN Competition was held in the years 2020, 2021 and 2022, yet with different protocols (e.g., for running experiments, scoring, etc.), benchmarks and participants. In this discussion, we focus on the 2021 edition, as the report from 2022 had not yet been published at the time of this writing.

Within VNN 2021, a total of 9 benchmarks were considered, of which 7 represented test cases for local robustness verification of image classification networks. Benchmarks were proposed by the participants themselves and included a total of 11 CIFAR and 7 MNIST networks, which differed in terms of architecture components, such as non-linearities (e.g., ReLU, Tanh, Sigmoid) and layer operations (e.g., convolutional or pooling layers). Networks were trained on the CIFAR-10 and MNIST datasets, respectively. Moreover, each benchmark was composed of random image subsets, excluding images that were misclassified by the given network, along with varying perturbation radii.

This competition overcame several of the previously reported limitations with regard to the evaluation of network verifiers. Most notably, it covered a relatively large and diverse set of neural networks. Moreover, thanks to the active participation from the community, 12 verification algorithms were included in the competition. At the same time, we see room for further research into the performance of neural network verifiers.

Firstly, it is not clearly stated in the VNN competition report [18] why verifiers participated in certain competition categories, but not in others. While we assume this to be due to compatibility reasons, we could not find any formal explanation in the report, although this information could provide insight into relevant problem classes and suitable algorithms for solving these.

Secondly, GPU-accelerated tools were directly compared to those that rely solely on CPU resources, which we argue does not give due credit to the structural differences between these classes of algorithms. GPU-accelerated approaches employ advanced parallelisation schemes, giving rise to substantially higher computational demands than those required by methods running on CPUs, especially when restricted to a single core. For example, an Amazon EC2 GPU-based instance with 32 CPU cores costs around 1.4 times more than an equivalent CPU instance.²

Lastly, the competition seeks to determine the current state of the art; however, the competition ranking and scores do not sufficiently quantify the extent to which an algorithm actually contributes to the state of the art. In other words, it is in the nature of competitions to determine a winner, at least implicitly suggesting that a single approach generally outperforms all competitors. However, some verification algorithms might have limited but distinct areas of strength, which cannot be identified through aggregated performance measures, such as the total number of verified instances. Although the competition report [18] shows that individual verifier performance differs among benchmarks, it remains unclear whether all algorithms solve the same set of instances in the given benchmark, or if they complement each other. Similarly, it does not reveal whether or not methods are correlated in their performance.

² See https://aws.amazon.com/ec2/pricing/on-demand/

4. Verification Algorithms under Assessment

We consider 5 complete, CPU-based neural network verification algorithms; each of these was chosen because it fulfilled one of the following conditions: it was (i) ranked among the top five verification methods according to the 2021 VNN competition or (ii) supported by the recently published DNNV framework [21]. Altogether, we consider our set of algorithms to be representative of the developments in the area of complete neural network verification and, more specifically, local robustness verification.

All methods were employed with their default hyperparameter settings, as they would likely be used by practitioners. We note that the performance of a verifier might improve if its hyperparameters were optimised specifically for the given benchmark; however, most verifiers have dozens of hyperparameters (or employ combinatorial solvers that come with their own, extensive set of hyperparameters), which makes this a non-trivial task, requiring additional expertise and resources (see, e.g., König et al. [39]). The verification algorithms we considered are the following:

BaB. The algorithm proposed by Bunel et al. [5] restates the verification problem as a global optimisation problem, which is then solved using branch-and-bound search. It further incorporates algorithmic improvements
with regard to branching and bounding procedures, such as smart branching; i.e., before splitting, it computes fast bounds on each of the possible subdomains and chooses the one with the tightest bounds. We refer to this method as BaBSB for the remainder of this paper. BaBSB supports ReLU-based networks.

Marabou. The Marabou framework [23] employs SMT solving techniques, specifically the lazy search technique for handling non-linear constraints. Furthermore, Marabou employs deduction techniques to obtain information on the activation functions that can be used to simplify them. The core of the SMT solver is simplex-based, which means that the variable assignments are made using the simplex algorithm. Marabou supports ReLU and Sigmoid activations as well as MaxPooling operations.

Neurify. The verification algorithm proposed by Wang et al. [33] relies on symbolic interval propagation to create over-approximations, followed by a refinement strategy based on symbolic gradient information. The constraint refinement aims to tighten the bounds of the approximation of activation functions. Neurify can process networks containing ReLU activations.

nnenum. The verifier proposed by Bak et al. [38] utilises star sets to represent the values each layer of a neural network can attain. By propagating these through the network, it checks whether one or more of the star sets results in an adversarial example. This verifier can handle networks with ReLU activations.

Verinet. The verifier developed by Henriksen and Lomuscio [9] combines symbolic intervals with gradient-based adversarial local search for finding counter-examples. The authors further propose a splitting heuristic for interval propagation based on the influence of a given node on the bounds of the network output. Verinet supports networks containing ReLU, Sigmoid and Tanh activations.

5. Setup for Empirical Evaluation

In this work, we seek to provide a clearer picture of the state of the art in neural network verification. More specifically, we argue that the state of the art is not just defined by a single verification algorithm, as there might be verifiers which on their own perform poorly but still make meaningful contributions by excelling on limited instance subsets that are challenging for other verification methods.

In the following, we present an overview of how we set up our benchmark study, i.e., how we selected problem instances and verification algorithms. Furthermore, we provide details on the software we used and the execution environment in which our experiments were carried out.

5.1. Problem Instances

For our assessment, we compiled a high-quality set of problem instances for local robustness verification. Following best practices in other research areas, such as optimisation [40, 41], the benchmark should be representative and diverse, where the former refers to how well the difficulty of the benchmark is aligned with that of real-world instances from the same problem class, and the latter means that the problem set should generally contain problems with a wide range of difficulties.

Overall, our benchmark comprises 79 image classification networks, of which 38 are trained on the CIFAR-10 dataset and 41 are trained on the MNIST dataset. To ensure the representativeness of the problem set, all networks were sampled from the neural network verification literature, i.e., networks used in existing work on local robustness verification and provided in public repositories; in other words, the characteristics of the networks in our benchmark are assumed to match those of networks generally used for evaluating verification algorithms.

We further want our problem set to be diverse. Therefore, we ensured that the considered networks differ in size, i.e., the number of hidden layers and nodes, as well as in the type of non-linearities (e.g., ReLU or Tanh) or layer operations (e.g., pooling layers) they employ.

For each network, we verified 100 local robustness properties; more precisely, we sampled 100 images from the dataset on which the network had been trained and verified local robustness for each, with perturbation radius ε set to 0.012. This perturbation radius was chosen as a median of values we found in the literature: the smallest value we found was 1/255 in the work by Li et al. [22], whereas Botoeva et al. [4] and Wang et al. [13] utilised larger values, such as 0.05.

Lastly, we split our benchmark set into different categories based on verifier compatibilities. This means a verifier is only applied to categories it is able to process. The categories, as well as the instance set size for each category, are shown in Table 1.

5.2. Evaluation Metrics

In order to assess the performance of the various methods, we compute four performance metrics: the average running time, the number of solved instances, the relative marginal contribution and the Shapley value [42] of each verifier. Although these metrics present aggregated measures, they reflect algorithm performance on an instance level and in relation to other methods included in the comparison; a more detailed explanation is provided in the following paragraphs. Notice that we do not penalise timeouts when computing average running time; i.e., the maximum running time equals the given time limit.
Table 1
Instance set size for each benchmark category. Solvable instances are those solved by at least one (i.e., any) or all of the considered verifiers. The column "Verifiers employed" lists the verification algorithm(s) suitable for the respective category.

                      MNIST                    CIFAR
Category             Total    Any    All      Total   Any   All    Verifiers employed
ReLU                 2 500  1 913     42      2 500   972     0    BaBSB, Marabou, Neurify, nnenum, Verinet
ReLU + MaxPool         400      5      5        100     0     0    Marabou
Tanh                   600    556    556        600     0     0    Verinet
Sigmoid                600    581      0        600     0     0    Marabou, Verinet
The marginal contribution is computed as follows. Define P as a set of given verifiers and let v(P) be the total score of set P. Here, the total score v(P) consists of the number of instances verified by at least one verifier in set P. We compute the marginal contribution per algorithm to determine how much the total performance of all algorithms (in terms of solved instances) decreases when the given algorithm is removed from the set of all algorithms, assuming they were employed in a parallel algorithm portfolio. Such portfolios are sets of algorithms that are run in parallel on each given problem instance and complement each other in terms of performance across an instance set [43]. Formally, to determine the marginal contribution of any of the verifiers i to portfolio P, one needs to know the value of v(P) and v(P \setminus \{i\}), where P \setminus \{i\} is the portfolio minus verifier i. Thus, the marginal contribution of verifier i is expressed as

    \mathit{MC}_i = v(P) - v(P \setminus \{i\})    (1)

Following this terminology, we can define the number of solved instances by verifier i via a set consisting only of verifier i, \mathit{Solved}_i = v(\{i\}) - v(\emptyset). In other words, the number of solved instances employs a set of size one, while the marginal contribution employs the set of all verifiers under consideration. The relative marginal contribution then represents the marginal contribution of a given verifier as a proportion of the sum of every method's absolute marginal contribution.

Lastly, the Shapley value is the average marginal contribution of a verifier over all possible joining orders, where joining order refers to the order in which the verifiers are added to a parallel portfolio. This value complements the previous two metrics, as it does not assume a particular order in which algorithms are added to the portfolio. To be precise, the number of solved instances simply represents a joining order in which the considered algorithm comes first, whereas the marginal contribution metric assumes a joining order in which it comes last. However, using fixed orders, as is the case for the marginal contribution, might not reveal possible interactions between the given method and other algorithms; e.g., it might understate the importance of a single algorithm given the presence of another, highly correlated algorithm. In such a case, both algorithms would not be assigned any marginal contribution, even though one of them should be included in a potential portfolio. This is captured by the Shapley value: define n as the number of verifiers under consideration and C as the set of all subsets of the verifiers under consideration, including the empty set, such that C is of size \sum_{k=0}^{n} \frac{n!}{k!\,(n-k)!} = 2^n. Finally, define C_i as all sets in C in which verifier i is present. The Shapley value of verifier i is then calculated as

    \varphi_i = \frac{1}{n!} \sum_{S \in C_i} \big( v(S) - v(S \setminus \{i\}) \big)    (2)
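To make the computation of these metrics concrete, the following sketch derives solved counts, marginal contributions, relative marginal contributions and Shapley values from per-verifier sets of solved instance IDs. This is an illustrative script of our own; the verifier names and toy data are hypothetical, and the Shapley values are computed with the standard permutation weights, which realise the average over joining orders described above:

```python
from itertools import combinations
from math import factorial

# Toy input: for each verifier, the set of instance IDs it solved (hypothetical data).
solved = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}}
verifiers = list(solved)
n = len(verifiers)

def v(subset):
    """Portfolio score: number of instances solved by at least one verifier in subset."""
    return len(set().union(*[solved[m] for m in subset]))

# Marginal contribution (Eq. 1): loss in coverage when verifier m is removed.
mc = {m: v(verifiers) - v([x for x in verifiers if x != m]) for m in verifiers}
# Relative marginal contribution: proportion of the summed absolute contributions.
rmc = {m: mc[m] / sum(abs(c) for c in mc.values()) for m in verifiers}

# Shapley value: average marginal contribution over all joining orders,
# computed subset-wise with the standard weights |S|! (n - |S| - 1)! / n!.
shapley = {}
for m in verifiers:
    others = [x for x in verifiers if x != m]
    total = 0.0
    for k in range(n):
        for s in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (v(list(s) + [m]) - v(list(s)))
    shapley[m] = total

print(mc, rmc, shapley)  # the Shapley values sum to v(all verifiers)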
5.3. Execution Environment and Software Used

Our experiments were carried out on a cluster of machines equipped with Intel Xeon E5-2683 CPUs with 32 cores, 40 MB cache size and 94 GB RAM, running CentOS Linux 7. For each verification method, we limited the number of available CPU cores to a single core. Each query was given a time budget of 3 600 seconds and a memory budget of 3 GB. Generally, we executed the verification algorithms through the DNNV interface, version 0.4.8. DNNV is a framework that transforms a network and property specification into a unified format, which can then be solved by a given method [21]. More specifically, DNNV takes as input a network in the ONNX format, along with a property specification, and then translates the network and property into the input format required by the verifier. After running the verifier on the transformed problem, it returns either sat if the property was falsified or unsat if the property was proven to hold.
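For concreteness, the kind of property we verify can be written in DNNV's Python-embedded property language (DNNP) roughly as follows. This is a sketch modelled on the examples in the DNNV documentation; the file path is a placeholder, and the exact syntax may differ between DNNV versions:

```python
# robustness.dnnp -- sketch of an l-infinity local robustness property for DNNV
from dnnv.properties import *

N = Network("N")                  # bound to the ONNX network given on the command line
x = Image("path/to/image.npy")    # placeholder path to the (correctly classified) input
epsilon = 0.012                   # perturbation radius used in our experiments

Forall(
    x_,
    Implies(
        # x_ lies within the l-infinity ball of radius epsilon around x ...
        ((x - epsilon) <= x_) & (x_ <= (x + epsilon)),
        # ... then the predicted class must not change
        argmax(N(x_)) == argmax(N(x)),
    ),
)
```

Such a property file, together with an ONNX network, would then be handed to a verifier through DNNV, e.g., roughly `dnnv robustness.dnnp --network N network.onnx --verinet` (again, flag names as per the DNNV documentation; we do not reproduce the exact invocation here).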
Table 2
Performance comparison of local robustness verification algorithms in terms of the number of solved instances, relative marginal contribution (RMC), Shapley value (φ) and average CPU running time, computed for each category with ε set to 0.012.

ReLU
             MNIST                                        CIFAR
Verifier    Solved   RMC     φ    Avg. time [CPU s]      Solved   RMC     φ    Avg. time [CPU s]
BaBSB          358   0.22   118   3 241                     307   0.00    86   2 924
Marabou      1 001   0.19   312   1 801                     400   0.00   117   2 153
Neurify        871   0.25   265   1 964                     915   0.75   411     235
nnenum       1 754   0.17   600     389                      76   0.05    28   3 337
Verinet      1 799   0.16   618     263                     841   0.20   330     500

ReLU + MaxPool
             MNIST                                        CIFAR
Verifier    Solved   RMC     φ    Avg. time [CPU s]      Solved   RMC     φ    Avg. time [CPU s]
Marabou          5   1.00     5      57                       0   0.00     0   3 600

Tanh
             MNIST                                        CIFAR
Verifier    Solved   RMC     φ    Avg. time [CPU s]      Solved   RMC     φ    Avg. time [CPU s]
Verinet        556   1.00   556      55                       0   0.00     0   3 600

Sigmoid
             MNIST                                        CIFAR
Verifier    Solved   RMC     φ    Avg. time [CPU s]      Solved   RMC     φ    Avg. time [CPU s]
Marabou          0   0.00     0   3 600                       0   0.00     0   3 600
Verinet        581   1.00   581      55                       0   0.00     0   3 600
6. Results and Discussion

In the following, we provide an in-depth discussion of the results obtained from our experiments. Table 1 shows the categories we devised, along with the resulting instance set sizes, as well as information on which verifier has been employed for each category.

Table 2 reports the number of problem instances solved by each verifier per network category, the relative marginal contribution, the Shapley value and the average running time. On ReLU-based MNIST networks, we found Verinet to be the best-performing verifier, solving 1 799 out of 2 500 instances, while achieving a Shapley value of 618. However, taking relative marginal contribution into account, we found that Neurify achieved the highest relative marginal contribution of 0.25 (compared to 0.16 for Verinet), indicating that it could verify a sizeable fraction of instances on which other methods failed to return a solution. Moreover, the relative marginal contribution scores show that each method could solve a sizeable fraction of instances unsolved by any of the other methods.

On ReLU-based CIFAR networks, it should first be noted that no verification problem instance could be solved by all verifiers, highlighting the structural differences between instances and the sensitivity of the verification approaches to those instances. That said, Neurify slightly outperformed Verinet in terms of the number of solved instances (915 vs 841 out of 2 500). Furthermore, Neurify achieved a much larger relative marginal contribution than Verinet (0.75 vs 0.20), which means that the former could solve a relatively large number of instances which could not be solved by the other methods. Generally, relative marginal contribution scores are much less evenly distributed among verifiers than on the MNIST dataset.

Figures 1a and 1b show an instance-level comparison of the two best-performing algorithms (in terms of Shapley value) in the ReLU category for each dataset. In Figure 1a, we see that on MNIST networks, both Verinet and nnenum solved instances that the other could not solve within the given time budget. Concretely, when considering a parallel portfolio containing both algorithms, the number of solved instances slightly increases to 1 817 out of 2 500 (vs 1 799 solved by Verinet and 1 754 solved by nnenum alone), while supplied with similar CPU resources (i.e., 1 800 CPU seconds per verifier, which represents half of the total given time budget of 3 600 CPU seconds per verification query as used in our experiments). Notice that leveraging parallel portfolios has already been shown to significantly improve the performance of MIP-based verification methods [39].
[Figure 1: two scatter plots of per-instance CPU time [s] of Verinet (y-axis) against nnenum on (a) MNIST and against Neurify on (b) CIFAR, with both axes log-scaled from 10⁻³ to 10⁵ seconds.]
Figure 1: Performance comparison of the two top-performing verification methods (in terms of Shapley value) in the ReLU category on (a) MNIST and (b) CIFAR networks. The plots show that each verifier outperforms the other on some instances, while neither method dominates across the entire instance set. This illustrates the complementary strengths of the verification algorithms. The diagonal line indicates equal performance of the two methods.
[Figure 2: cumulative distribution plots of the fraction of instances solved (0 to 1) by BaBSB, Marabou, Neurify, nnenum and Verinet as a function of CPU time [s] from 10⁰ to 10³, for (a) MNIST and (b) CIFAR.]
Figure 2: Fraction of instances solved by the considered verification algorithms in the ReLU category as a function of CPU running time for (a) MNIST and (b) CIFAR networks. In (a), we find that Verinet solves the most instances in the least time, while in (b) Neurify does.
On CIFAR instances, we found Neurify and Verinet to also have distinctive strengths over each other. This is shown in Figure 1b, where both algorithms could solve a substantial number of instances that the other could not return a solution for. Thus, when combined in a parallel portfolio, 963 instances can be solved (vs 915 solved by Neurify and 841 solved by Verinet alone, out of 2 500 instances), while using the same amount of CPU resources, i.e., 1 800 CPU seconds per verifier. These findings further emphasise the complementarity between the verification algorithms considered in our study. All remaining verifiers achieved substantially lower Shapley values and relative marginal contribution scores, indicating that they would not complement Neurify and Verinet well in a portfolio.

Figure 2a shows the cumulative distribution function of running times over the MNIST problem instances. As seen in the figure, Verinet tends to solve these problem instances fastest; however, Neurify tended to show even better performance on those instances it was able to solve. We note that most of the instances unsolved by Neurify represent networks that were trained on images with 3 dimensions, whereas Neurify requires images used as network inputs to have 2 or 4 dimensions.
Figure 2b shows a similar plot for the CIFAR problem instances. Here, Neurify solved the largest fraction of instances in less time than the other methods. This suggests that Neurify is a very competitive verifier when applicable to the specific network or input format.

For each of the remaining categories, we found that there is only one verifier that could effectively handle the respective problem instances. Specifically, instances from the ReLU+MaxPooling category can be processed by Marabou, although only a modest number of MNIST instances could be solved in this way. Networks containing Tanh activations can, in principle, be verified by Verinet, but the algorithm likewise did not solve any CIFAR instances. Lastly, Sigmoid-based networks can be handled by both Verinet and Marabou; however, only the former could solve MNIST instances within the given compute budget.

7. Conclusions and Future Work

In this work, we assessed the performance of several local robustness verification algorithms, i.e., algorithms used to verify the robustness of an image classification network against small input perturbations. To conclude, we found that all considered methods support ReLU-based networks, while other network types are strongly under-supported. While this has been suspected in the community, it has, to our knowledge, not yet been subject to formal study. Furthermore, we presented evidence for strong performance complementarity: even within the same benchmark category, two verification systems outperform each other on distinct sets of instances. As we have demonstrated, this complementarity can be exploited by combining individual verifiers into parallel portfolios. Lastly, we showed that, in general, the performance of verifiers strongly differs between image datasets, with some methods achieving the best performance on MNIST (in terms of the number of solved instances and average running time) while falling behind on CIFAR, and vice versa. In future work, it would be interesting to include a broader set of perturbation radii and analyse in more detail how the relative performance of verifiers depends on the given radius. Furthermore, we are interested in expanding our analysis to GPU-based verification algorithms.

Acknowledgements

This research was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215.

References

[1] K. D. Julian, M. J. Kochenderfer, M. P. Owen, Deep neural network compression for aircraft collision avoidance systems, Journal of Guidance, Control, and Dynamics 42 (2019) 598–608.
[2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199 (2014).
[3] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, A. Criminisi, Measuring Neural Net Robustness with Constraints, in: Advances in Neural Information Processing Systems (NeurIPS 2016), 2016, pp. 2613–2621.
[4] E. Botoeva, P. Kouvaros, J. Kronqvist, A. Lomuscio, R. Misener, Efficient Verification of ReLU-based Neural Networks via Dependency Analysis, in: Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20), 2020, pp. 3291–3299.
[5] R. R. Bunel, I. Turkaslan, P. Torr, P. Kohli, P. K. Mudigonda, A Unified View of Piecewise Linear Neural Network Verification, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 1–10.
[6] K. Dvijotham, R. Stanforth, S. Gowal, T. A. Mann, P. Kohli, A Dual Approach to Scalable Verification of Deep Networks, in: Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence (UAI 2018), 2018, pp. 550–559.
[7] R. Ehlers, Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks, in: Proceedings of the 15th International Symposium on Automated Technology for Verification and Analysis (ATVA 2017), 2017, pp. 269–286.
[8] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, M. Vechev, AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation, in: Proceedings of the 39th IEEE Symposium on Security and Privacy (IEEE S&P 2018), 2018, pp. 3–18.
[9] P. Henriksen, A. Lomuscio, Efficient Neural Network Verification via Adaptive Refinement and Adversarial Search, in: Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), 2020, pp. 2513–2520.
[10] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks, in: Proceedings of the 29th International Conference on Computer Aided Verification (CAV 2017), 2017, pp. 97–117.
[11] K. Scheibler, L. Winterer, R. Wimmer, B. Becker, Towards Verification of Artificial Neural Networks, in: Proceedings of the 18th Workshop on Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015), 2015, pp. 30–40.
[12] V. Tjeng, K. Xiao, R. Tedrake, Evaluating Robustness of Neural Networks with Mixed Integer Programming, in: Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019, pp. 1–21.
[13] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Efficient Formal Safety Analysis of Neural Networks, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 6369–6379.
[14] W. Xiang, H.-D. Tran, T. T. Johnson, Output Reachable Set Estimation and Verification for Multilayer Neural Networks, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 5777–5783.
[15] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and Harnessing Adversarial Examples, arXiv preprint arXiv:1412.6572 (2014).
[16] N. Papernot, P. McDaniel, X. Wu, S. Jha, A. Swami, Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks, in: Proceedings of the 37th IEEE Symposium on Security and Privacy (IEEE S&P 2016), 2016, pp. 582–597.
[17] M. Casadio, E. Komendantskaya, M. L. Daggitt, W. Kokke, G. Katz, G. Amir, I. Refaeli, Neural Network Robustness as a Verification Property: A Principled Case Study, in: Proceedings of the 34th International Conference on Computer Aided Verification (CAV 2022), 2022, pp. 219–231.
[18] S. Bak, C. Liu, T. Johnson, The Second International Verification of Neural Networks Competition (VNN-COMP 2021): Summary and Results, arXiv preprint arXiv:2109.00498 (2021).
[19] L. Xu, F. Hutter, H. H. Hoos, K. Leyton-Brown, SATzilla: Portfolio-based Algorithm Selection for SAT, Journal of Artificial Intelligence Research 32 (2008) 565–606.
[20] S. Kadioglu, Y. Malitsky, A. Sabharwal, H. Samulowitz, M. Sellmann, Algorithm Selection and Scheduling, in: Proceedings of the Seventeenth International Conference on Principles and Practice of Constraint Programming (CP 2011), 2011, pp. 454–469.
[21] D. Shriver, S. Elbaum, M. B. Dwyer, DNNV: A Framework for Deep Neural Network Verification, in: Proceedings of the 33rd International Conference on Computer Aided Verification (CAV 2021), 2021, pp. 137–150.
[22] L. Li, X. Qi, T. Xie, B. Li, SoK: Certified Robustness for Deep Neural Networks, arXiv preprint arXiv:2009.04131 (2020).
[23] G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zeljić, D. L. Dill, M. J. Kochenderfer, C. Barrett, The Marabou Framework for Verification and Analysis of Deep Neural Networks, in: Proceedings of the 31st International Conference on Computer Aided Verification (CAV 2019), 2019, pp. 443–452.
[24] L. Pulina, A. Tacchella, Checking Safety of Neural Networks with SMT Solvers: A Comparative Evaluation, in: AI*IA, 2011, pp. 127–138.
[25] L. Pulina, A. Tacchella, NeVer: A Tool for Artificial Neural Networks Verification, Annals of Mathematics and Artificial Intelligence (2011) 403–425.
[26] L. Pulina, A. Tacchella, Challenging SMT Solvers to Verify Neural Networks, AI Communications (2012) 117–135.
[27] L. de Moura, N. Bjørner, Satisfiability Modulo Theories: An Appetizer, in: Proceedings of the Brazilian Symposium on Formal Methods (SBMF 2009), 2009, pp. 23–36.
[28] S. Dutta, S. Jha, S. Sankaranarayanan, A. Tiwari, Output Range Analysis for Deep Neural Networks, in: Proceedings of the Tenth NASA Formal Methods Symposium (NFM 2018), 2018, pp. 121–138.
[29] A. Lomuscio, L. Maganti, An approach to reachability analysis for feed-forward ReLU neural networks, arXiv preprint arXiv:1706.07351 (2017).
[30] A. H. Land, A. G. Doig, An Automatic Method of Solving Discrete Programming Problems, Econometrica (1960) 497–520.
[31] R. Bunel, I. Turkaslan, P. H. S. Torr, M. P. Kumar, J. Lu, P. Kohli, Branch and Bound for Piecewise Linear Neural Network Verification, Journal of Machine Learning Research (2020) 1574–1612.
[32] A. De Palma, R. Bunel, A. Desmaison, K. Dvijotham, P. Kohli, P. H. S. Torr, M. P. Kumar, Improved Branch and Bound for Neural Network Verification via Lagrangian Decomposition, arXiv preprint arXiv:2104.06718 (2021).
[33] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Formal Security Analysis of Neural Networks using Symbolic Intervals, in: Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1599–1614.
[34] S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, Z. Kolter, Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Neural Network Robustness Verification, in: Advances in Neural Information Processing Systems (NeurIPS 2021), 2021, pp. 29909–29921.
[35] G. Singh, T. Gehr, M. Püschel, M. Vechev, An Abstract Domain for Certifying Neural Networks, in: Proceedings of the 46th ACM SIGPLAN Symposium on Principles of Programming Languages (ACM-POPL 2019), 2019, pp. 1–30.
[36] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, L. Daniel, Efficient Neural Network Robustness Certification with General Activation Functions, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 4944–4953.
[37] G. Singh, T. Gehr, M. Mirman, M. Püschel, M. Vechev, Fast and Effective Robustness Certification, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 1–12.
[38] S. Bak, H.-D. Tran, K. Hobbs, T. T. Johnson, Improved Geometric Path Enumeration for Verifying ReLU Neural Networks, in: Proceedings of the 32nd International Conference on Computer Aided Verification (CAV 2020), 2020, pp. 66–96.
[39] M. König, H. H. Hoos, J. N. van Rijn, Speeding up neural network robustness verification via algorithm configuration and an optimised mixed integer linear programming solver portfolio, Machine Learning 111 (2022) 4565–4584.
[40] H. H. Hoos, T. Stützle, Stochastic Local Search: Foundations & Applications, Elsevier / Morgan Kaufmann, 2004.
[41] T. Bartz-Beielstein, C. Doerr, D. v. d. Berg, J. Bossek, S. Chandrasekaran, T. Eftimov, A. Fischbach, P. Kerschke, W. La Cava, M. Lopez-Ibanez, et al., Benchmarking in Optimization: Best Practice and Open Issues, arXiv preprint arXiv:2007.03488 (2020).
[42] A. Fréchette, L. Kotthoff, T. Michalak, T. Rahwan, H. Hoos, K. Leyton-Brown, Using the Shapley Value to Analyze Algorithm Portfolios, in: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16), 2016, pp. 3397–3403.
[43] C. P. Gomes, B. Selman, Algorithm portfolios, Artificial Intelligence 126 (2001) 43–62.