CEUR Workshop Proceedings, Vol-3381, paper 25. PDF: https://ceur-ws.org/Vol-3381/25.pdf. DBLP: https://dblp.org/rec/conf/aaai/KonigBHR23
Critically Assessing the State of the Art
in CPU-based Local Robustness Verification
Matthias KΓΆnig1,* , Annelot W. Bosman1 , Holger H. Hoos1,2,3 and Jan N. van Rijn1
1 Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
2 Chair for AI Methodology, RWTH Aachen University, Germany
3 University of British Columbia, Canada


Abstract
Recent research has proposed various methods to formally verify neural networks against minimal input perturbations. This type of verification is referred to as local robustness verification. The research area of local robustness verification is highly diverse, as verifiers rely on a multitude of techniques, including mixed integer programming and satisfiability modulo theories. At the same time, problem instances differ based on the network to be verified, the verification property and the specific network input. This gives rise to the question of which verification algorithm is most suitable to solve a given verification problem. To answer this question, we perform a systematic performance analysis of several CPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, each verified against 100 robustness properties. Notably, we show that there is no single best algorithm that dominates in performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. Furthermore, we confirm the notion that most algorithms only support ReLU-based networks, while other activation functions remain under-supported.

Keywords
Benchmark Analysis, Neural Network Verification, Adversarial Robustness, Shapley Value, Algorithm Portfolios



SafeAI 2023: Workshop on Artificial Intelligence Safety, Feb 13-14, 2023, Washington, D.C., US @ AAAI-23
* Corresponding author.
h.m.t.konig@liacs.leidenuniv.nl (M. KΓΆnig); a.w.bosman@liacs.leidenuniv.nl (A. W. Bosman); hh@aim.rwth-aachen.de (H. H. Hoos); j.n.van.rijn@liacs.leidenuniv.nl (J. N. van Rijn)
Β© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.


1. Introduction

In recent years, deep learning methods based on neural networks have been increasingly applied within various safety-critical domains and use contexts, ranging from manoeuvre advisory systems in unmanned aircraft to face recognition systems in mobile phones (see, e.g., Julian et al. [1]). Furthermore, it is now well known that neural networks are vulnerable to adversarial examples [2], where a given input is manipulated in such a way that it is misclassified by the network. In the case of image recognition tasks, the required perturbation, whether it is adversarially crafted or arises accidentally, can be so small that it remains virtually undetectable to the human eye. Against this background, much work has focused on developing methods to provide formal guarantees regarding the behaviour of a given neural network [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. For instance, a network employed in autonomous driving for detecting traffic signs should always produce accurate predictions, even when the input is slightly perturbed; failing to do so could have fatal consequences. This specific type of assessment refers to local robustness verification, a broadly studied verification task, in which a network is systematically tested against various input perturbations under predefined norms, such as the π‘™βˆž norm [15, 16].

Neural network verification is a highly diverse research area, and existing methods rely on a broad range of techniques. At the same time, neural networks differ in terms of their architecture, such as the number of hidden layers and nodes, the type of non-linearities, e.g., ReLU, Sigmoid or Tanh, or the type of operations they employ, e.g., pooling or convolutional layers. This diversity, both in terms of verification approaches and neural network design, makes it challenging for researchers or practitioners to assess and decide which method is most suitable for verifying a given neural network [17]. This challenge is amplified by the fact that the neural network verification community does not (yet) use commonly agreed evaluation protocols, which makes it difficult to draw clear conclusions from the literature regarding the capabilities and performance of existing verifiers. More precisely, existing studies use different benchmarks and, so far, have not provided an in-depth performance comparison of a broad range of verification algorithms.

Recently, a competition series has been initiated, in which several verifiers were applied to different benchmarks (i.e., networks, properties and datasets) and compared in terms of various performance measures, including the number of verified instances as well as running time [18]. While the results from these competitions have provided highly valuable insights into the general
progress in neural network verification, several questions are left unanswered. For instance, it remains unclear why a verification system participated in certain competition categories but not in others, although this information would help in understanding the technical capabilities and limitations of a given verifier. Furthermore, aggregation of the results makes it difficult to assess in detail the strengths or weaknesses of verifiers on different instances. Instead, looking at aggregated results, one can easily get the impression that a single approach dominates β€˜across the board’ – an assumption that is known to be inaccurate for other problems involving formal verification tasks; see, e.g., Xu et al. [19] or Kadioglu et al. [20] for SAT.

In this work, we focus exclusively on local robustness verification in image classification against perturbations under the π‘™βˆž norm. This scenario represents a widely studied verification task, with a large number of networks being publicly available and many verifiers providing off-the-shelf support. Notice that most verification tasks can be translated into local robustness verification queries [21]; we, therefore, believe that our findings are broadly applicable. Moreover, we seek to overcome the limitations of existing benchmarking approaches and shed light on previously unanswered questions with regard to the state of the art in local robustness verification.

Our contributions are as follows:
(i) We analyse the current state of practice in benchmarking verification algorithms;
(ii) we perform a systematic benchmarking study of several, carefully chosen verification methods based on a newly assembled, large and diverse set of networks, including 38 CIFAR and 41 MNIST networks with different activation functions, each verified against 100 robustness properties, for which we consumed a total of 1.69 CPU years in running time;
(iii) we present a categorisation of verification benchmarks based on verifier compatibilities with different layer types and operations;
(iv) we quantify verifier performance in terms of the number of solved instances, running time, as well as marginal contribution and Shapley value, showing that top-performing verification algorithms strongly complement rather than consistently dominate each other in terms of performance – e.g., while the verifiers Neurify and Verinet achieved competitive performance on the CIFAR networks we considered, the former solved many instances unsolved by the latter and vice versa;
(v) lastly, we provide a public repository containing all experimental data, along with the newly assembled network collection.ΒΉ

ΒΉ https://github.com/ADA-research/nn-verification-assessment


2. Background

Neural network verification methods seek to formally assess whether a trained neural network adheres to some predefined input-output property. In this study, we focus on local robustness properties. Given a trained neural network and a set of images as inputs, a robustness verification algorithm aims to verify whether or not there exist slight perturbations to an image that lead the network to predict an incorrect label. The maximum perturbation of each variable in the input, i.e., of each pixel in a given image, is predefined.

Formal verification algorithms can be either complete or incomplete [22]. An algorithm that is incomplete does not guarantee to report a solution for every given example; however, incomplete verification algorithms are typically sound, which means they will report that a property holds only if the property actually holds. On the other hand, an algorithm that is sound and complete will correctly state that a property holds whenever it does, provided it is given sufficient resources to run to completion. In this work, we focus on complete algorithms, as those arguably represent the most ambitious form of neural network verification, making them preferable over incomplete methods, especially in safety-critical applications.

Early work on complete verification of neural networks utilised satisfiability modulo theories (SMT) solvers [10, 23, 24, 25, 26], which determine whether a set of logic constraints is satisfiable [27]. The resulting verification problems are NP-complete and challenging to solve in practice. Alternatively, it is possible to formulate the verification task as a constraint optimisation problem using mixed integer programming (MIP) [4, 28, 29, 12]. MIP solvers essentially optimise an objective function subject to a set of constraints. MIP problems, similar to SMT problems, can be challenging to solve and tend to be computationally expensive (in terms of CPU time and memory).

To overcome the computational complexity of SMT and MIP, it has been proposed to use the well-known branch-and-bound algorithm [30] for solving the verification problem, regardless of whether it is modelled as MIP or SMT [31, 5, 32, 7, 13].

Besides these three popular approaches, recent work has studied symbolic interval propagation [4, 9, 33, 34] as well as polyhedra [35, 36], zonotope [8, 37] and star-set abstraction [38].


3. Common Practices in Benchmarking Neural Network Verifiers

Considering the diversity in neural network verification problems, it is quite natural to assume that a single best
algorithm does not exist, i.e., a method that outperforms all others under all circumstances. It is still hard to identify to what extent a method contributes to the state of the art, mainly because verification methods are typically evaluated (i) on a small number of benchmarks, which have often been created for the sole purpose of evaluating the method at hand, and (ii) against baseline methods for which it is often unclear how they were chosen, leading to several methods claiming state-of-the-art performance without having been directly compared. We note that in the context of local robustness verification, a benchmark most often represents a neural network classifier trained on the MNIST or CIFAR-10 dataset.

As previously mentioned, a competition series has been established, with the goal of providing an objective and fair comparison of the state-of-the-art methods in neural network verification, in terms of scalability and speed [18]. The VNN Competition was held in the years 2020, 2021 and 2022, albeit with different protocols (e.g., for running experiments, scoring, etc.), benchmarks and participants. In this discussion, we focus on the 2021 edition, as the report from 2022 has not yet been published at the time of this writing.

Within VNN 2021, a total of 9 benchmarks were considered, of which 7 represented test cases for local robustness verification of image classification networks. Benchmarks were proposed by the participants themselves and included a total of 11 CIFAR and 7 MNIST networks, which differed in terms of architecture components, such as non-linearities (e.g., ReLU, Tanh, Sigmoid) and layer operations (e.g., convolutional or pooling layers). Networks were trained on the CIFAR-10 and MNIST datasets, respectively. Moreover, each benchmark was composed of random image subsets, excluding images that were misclassified by the given network, along with varying perturbation radii.

This competition overcame several of the previously reported limitations with regard to the evaluation of network verifiers. Most notably, it covered a relatively large and diverse set of neural networks. Moreover, thanks to the active participation from the community, 12 verification algorithms were included in the competition. At the same time, we see room for further research into the performance of neural network verifiers.

Firstly, it is not clearly stated in the VNN competition report [18] why verifiers participated in certain competition categories, but not in others. While we assume this to be due to compatibility reasons, we could not find any formal explanation in the report, although this information could provide insight into relevant problem classes and suitable algorithms for solving these.

Secondly, GPU-accelerated tools were directly compared to those that rely solely on CPU resources, which we argue does not give due credit to the structural differences between these classes of algorithms. GPU-accelerated approaches employ advanced parallelisation schemes, giving rise to substantially higher computational demands than those required by methods running on CPUs, especially when restricted to a single core. For example, an Amazon EC2 GPU-based instance with 32 CPU cores costs around 1.4 times more than an equivalent CPU instance.Β²

Β² See https://aws.amazon.com/ec2/pricing/on-demand/

Lastly, the competition seeks to determine the current state of the art; however, the competition ranking and scores do not sufficiently quantify the extent to which an algorithm actually contributes to the state of the art. In other words, it is in the nature of competitions to determine a winner, at least implicitly suggesting that a single approach generally outperforms all competitors. However, some verification algorithms might have limited but distinct areas of strength, which cannot be identified through aggregated performance measures, such as the total number of verified instances. Although the competition report [18] shows that individual verifier performance differs among benchmarks, it remains unclear whether all algorithms solve the same set of instances in a given benchmark, or whether they complement each other. Similarly, it does not reveal whether or not methods are correlated in their performance.


4. Verification Algorithms under Assessment

We consider 5 complete, CPU-based neural network verification algorithms; each of these was chosen because it fulfilled one of the following conditions: it was (i) ranked among the top five verification methods according to the 2021 VNN competition or (ii) supported by the recently published DNNV framework [21]. Altogether, we consider our set of algorithms to be representative of the developments in the area of complete neural network verification and, more specifically, local robustness verification.

All methods were employed with their default hyperparameter settings, as they would likely be used by practitioners. We note that the performance of a verifier might improve if its hyperparameters were optimised specifically for the given benchmark; however, most verifiers have dozens of hyperparameters (or employ combinatorial solvers that come with their own, extensive set of hyperparameters), which makes this a non-trivial task, requiring additional expertise and resources (see, e.g., KΓΆnig et al. [39]). The verification algorithms we considered are the following:

BaB. The algorithm proposed by Bunel et al. [5] restates the verification problem as a global optimisation problem, which is then solved using branch-and-bound search. It further incorporates algorithmic improvements
with regard to branching and bounding procedures, such as smart branching; i.e., before splitting, it computes fast bounds on each of the possible subdomains and chooses the one with the tightest bounds. We refer to this method as BaBSB for the remainder of this paper. BaBSB supports ReLU-based networks.

Marabou. The Marabou framework [23] employs SMT solving techniques, specifically lazy search for handling non-linear constraints. Furthermore, Marabou employs deduction techniques to obtain information on the activation functions that can be used to simplify them. The core of the SMT solver is simplex-based, which means that the variable assignments are made using the simplex algorithm. Marabou supports ReLU and Sigmoid activations as well as MaxPooling operations.

Neurify. The verification algorithm proposed by Wang et al. [33] relies on symbolic interval propagation to create over-approximations, followed by a refinement strategy based on symbolic gradient information. The constraint refinement aims to tighten the bounds of the approximation of activation functions. Neurify can process networks containing ReLU activations.

nnenum. The verifier proposed by Bak et al. [38] utilises star sets to represent the values each layer of a neural network can attain. By propagating these through the network, it checks whether one or more of the star sets results in an adversarial example. This verifier can handle networks with ReLU activations.

Verinet. The verifier developed by Henriksen and Lomuscio [9] combines symbolic intervals with gradient-based adversarial local search for finding counterexamples. The authors further propose a splitting heuristic for interval propagation based on the influence of a given node on the bounds of the network output. Verinet supports networks containing ReLU, Sigmoid and Tanh activations.


5. Setup for Empirical Evaluation

In this work, we seek to provide a clearer picture of the state of the art in neural network verification. More specifically, we argue that the state of the art is not just defined by a single verification algorithm, as there might be verifiers which on their own perform poorly but still make meaningful contributions by excelling on limited instance subsets that are challenging for other verification methods.

In the following, we present an overview of how we set up our benchmark study, i.e., how we selected problem instances and verification algorithms. Furthermore, we provide details on the software we used and the execution environment in which our experiments were carried out.

5.1. Problem Instances

For our assessment, we compiled a high-quality set of problem instances for local robustness verification. Following best practices in other research areas, such as optimisation [40, 41], the benchmark should be representative and diverse, where the former refers to how well the difficulty of the benchmark is aligned with that of real-world instances from the same problem class, and the latter means that the problem set should generally contain problems with a wide range of difficulties.

Overall, our benchmark comprises 79 image classification networks, of which 38 are trained on the CIFAR-10 dataset and 41 are trained on the MNIST dataset. To ensure the representativeness of the problem set, all networks were sampled from the neural network verification literature, i.e., networks used in existing work on local robustness verification and provided in public repositories; in other words, the characteristics of the networks in our benchmark are assumed to match those of networks generally used for evaluating verification algorithms.

We further want our problem set to be diverse. Therefore, we ensured that the considered networks differ in size, i.e., the number of hidden layers and nodes, as well as in the type of non-linearities (e.g., ReLU or Tanh) or layer operations (e.g., pooling layers) they employ.

For each network, we verified 100 local robustness properties; more precisely, we sampled 100 images from the dataset on which the network had been trained and verified the network for local robustness with perturbation radius πœ– set at 0.012. This perturbation radius was chosen as a median of values we found in the literature: the smallest value we found was 1/255 in the work by Li et al. [22], whereas Botoeva et al. [4] or Wang et al. [13] utilised larger values, such as 0.05.

Lastly, we split our benchmark set into different categories based on verifier compatibilities. This means a verifier is only applied to categories it is able to process. The categories as well as the instance set size for each category are shown in Table 1.

5.2. Evaluation Metrics

In order to assess the performance of the various methods, we compute four performance metrics: the average running time, the number of solved instances, the relative marginal contribution and the Shapley value [42] of each verifier. Although these metrics present aggregated measures, they reflect algorithm performance on an instance level and in relation to other methods included in the comparison; a more detailed explanation is provided in the following paragraphs. Notice that we do not penalise timeouts when computing average running time; i.e., the maximum running time equals the given time limit.
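To make the verification queries of Section 5.1 concrete: each property instance asks whether any image within the π‘™βˆž ball of radius πœ– = 0.012 around the original image changes the predicted label. The sketch below states what a counterexample to such a property must satisfy; the two-class linear β€œnetwork” and the specific inputs are purely illustrative, not taken from the benchmark.

```python
# A counterexample x_adv to local robustness of classifier f at input x,
# with radius eps under the l-infinity norm, must (1) stay inside the
# eps-ball around x and (2) flip the predicted label.

def linf_dist(x, x_adv):
    """l-infinity distance: largest per-pixel absolute difference."""
    return max(abs(a - b) for a, b in zip(x, x_adv))

def argmax(scores):
    """Index of the largest output score, i.e., the predicted class."""
    return max(range(len(scores)), key=scores.__getitem__)

def is_counterexample(f, x, x_adv, eps):
    """True iff x_adv violates local robustness of f at x with radius eps."""
    return linf_dist(x, x_adv) <= eps and argmax(f(x_adv)) != argmax(f(x))

# Toy two-class linear "network" (hypothetical weights, for illustration only).
def f(x):
    return [x[0] + x[1], 2.0 * x[0]]

x = [0.5, 0.51]        # f(x) = [1.01, 1.00] -> class 0
x_adv = [0.511, 0.5]   # within the eps = 0.012 ball, f(x_adv) -> class 1
print(is_counterexample(f, x, x_adv, 0.012))  # -> True
```

A complete verifier must either prove that no such x_adv exists anywhere in the ball (π‘’π‘›π‘ π‘Žπ‘‘) or exhibit one (π‘ π‘Žπ‘‘); the check above only validates a candidate counterexample.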
Table 1
Instance set size for each benchmark category. Solvable instances are those solved by at least one (i.e., any) or all of the
considered verifiers. The column β€œVerifiers employed” lists the suitable verification algorithm(s) for the respective
category.

                                    MNIST                   CIFAR
 Category                   Total     Solvable      Total         Solvable    Verifiers employed
                                      Any     All                 Any   All
 ReLU                       2 500    1 913    42    2 500         972    0    BaBSB, Marabou, Neurify, nnenum, Verinet
 ReLU + MaxPool               400        5     5      100           0    0    Marabou
 Tanh                         600      556   556      600           0    0    Verinet
 Sigmoid                      600      581     0      600           0    0    Marabou, Verinet
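The β€œAny” and β€œAll” counts in Table 1 are simply the sizes of the union and intersection of the per-verifier sets of solved instances. A minimal sketch, using hypothetical instance IDs rather than actual benchmark data:

```python
# Hypothetical sets of instance IDs solved by each verifier in one category.
solved = {
    "Neurify": {1, 2, 3, 5},
    "Verinet": {2, 3, 4},
    "nnenum":  {2, 5, 6},
}

any_solved = set.union(*solved.values())         # solved by at least one verifier
all_solved = set.intersection(*solved.values())  # solved by every verifier

print(len(any_solved), len(all_solved))  # -> 6 1
```

A large gap between the β€œAny” and β€œAll” counts, as for the ReLU category in Table 1, already indicates that the verifiers solve substantially different subsets of instances.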



   The marginal contribution is computed as follows. De-           e.g., it might understate the importance of a single algo-
fine 𝑐 as a set of given verifiers and let 𝑣(𝑐) be the total       rithm given the presence of another, highly correlated
score of set 𝑐. Here, the total score 𝑣(𝑐) consists of the         algorithm. In such a case, both algorithms would not be
number of instances verified by at least one verifier in           assigned any marginal contribution, even though one of
set 𝑐. We compute the marginal contribution per algo-              them should be included in a potential portfolio. This is
rithm to determine how much the total performance of               captured by the Shapley value: define 𝑛 as the number
all algorithms (in terms of solved instances) decreases            of verifiers under consideration and 𝐢 as the set of all
when the given algorithm is removed from the set of all            combinations of all subsets of verifiers under consider-
algorithms if they were employed in a parallel algorithm           ation including the empty set, where set 𝐢 will be of
portfolio. Such portfolios are sets of algorithms that are         size 𝑛
                                                                         βˆ‘οΈ€           𝑛!
                                                                                              . Finally, define 𝐢𝑖 as all sets in
run in parallel on each given problem instance and com-                     π‘˜=0
                                                                                π‘˜! Β· (𝑛 βˆ’ π‘˜)!
plement each other in terms of performance across an instance set [43]. Formally, to determine the marginal contribution of any of the verifiers 𝑖 to portfolio 𝑐, one needs to know the value of 𝑣(𝑐) and 𝑣(𝑐 βˆ– {𝑖}), where 𝑐 βˆ– {𝑖} is the portfolio minus verifier 𝑖. Thus, the marginal contribution of verifier 𝑖 is expressed as

    MC𝑖 = 𝑣(𝑐) βˆ’ 𝑣(𝑐 βˆ– {𝑖})    (1)

Following this terminology, we can define the number of solved instances by verifier 𝑖 via a set consisting only of verifier 𝑖, Solved𝑖 = 𝑣({𝑖}) βˆ’ 𝑣(βˆ…). In other words, the number of solved instances employs a set of size one, while the marginal contribution employs the set of all verifiers under consideration. The relative marginal contribution then expresses the marginal contribution of a given verifier as a proportion of the sum of all methods’ absolute marginal contributions.
   Lastly, the Shapley value is the average marginal contribution of a verifier over all possible joining orders, where joining order refers to the order in which the verifiers are added to a parallel portfolio. This value complements the previous two metrics, as it does not assume a particular order in which algorithms are added to the portfolio. To be precise, the number of solved instances simply represents a joining order in which the considered algorithm comes first, whereas the marginal contribution metric assumes a joining order in which it comes last. However, using fixed orders, as is the case for the marginal contribution, might not reveal possible interactions between the given method and other algorithms. Let 𝐢𝑖 denote the set of sub-portfolios in which verifier 𝑖 is present. The Shapley value of verifier 𝑖 is then calculated as

    πœ‘π‘– = (1/𝑛!) Β· βˆ‘_{π‘—βˆˆπΆπ‘–} (𝑣(𝑗) βˆ’ 𝑣(𝑗 βˆ– {𝑖}))    (2)

5.3. Execution Environment and Software Used

Our experiments were carried out on a cluster of machines equipped with Intel Xeon E5-2683 CPUs with 32 cores, 40 MB cache size and 94 GB RAM, running CentOS Linux 7. For each verification method, we limited the number of available CPU cores to a single core. Each query was given a time budget of 3 600 seconds and a memory budget of 3 GB. Generally, we executed the verification algorithms through the DNNV interface, version 0.4.8. DNNV is a framework that transforms a network and property specification into a unified format, which can then be solved by a given method [21]. More specifically, DNNV takes as input a network in the ONNX format, along with a property specification, and then translates the network and property to the input format required by the verifier. After running the verifier on the transformed problem, it returns either π‘ π‘Žπ‘‘ if the property was falsified or π‘’π‘›π‘ π‘Žπ‘‘ if the property was proven to hold.
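To make the three complementarity metrics of Equations (1) and (2) concrete, the following sketch computes them for a toy three-verifier portfolio, where 𝑣 maps a sub-portfolio to the number of instances solved by at least one of its members. The per-verifier solved sets are illustrative placeholders, not our experimental data:

```python
from itertools import permutations

# Hypothetical sets of instance IDs solved by each verifier (illustrative only).
solved = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}}
verifiers = list(solved)
full = set(verifiers)

def v(coalition):
    """Value of a sub-portfolio: instances solved by at least one member."""
    return len(set().union(*(solved[m] for m in coalition))) if coalition else 0

# Equation (1): marginal contribution of each verifier to the full portfolio.
mc = {i: v(full) - v(full - {i}) for i in verifiers}
# Relative marginal contribution: MC as a share of the summed absolute MCs.
rmc = {i: mc[i] / sum(mc.values()) for i in verifiers}

def shapley(i):
    """Equation (2): average marginal contribution over all n! joining orders."""
    orders = list(permutations(verifiers))
    total = 0
    for order in orders:
        before = set(order[: order.index(i)])  # verifiers that joined earlier
        total += v(before | {i}) - v(before)
    return total / len(orders)

print(mc)                                  # {'A': 2, 'B': 1, 'C': 1}
print({i: shapley(i) for i in verifiers})  # {'A': 3.0, 'B': 2.0, 'C': 1.0}
```

Note that the Shapley values sum to the value of the full portfolio (here 6), whereas the marginal contributions alone do not; this is the efficiency property that makes the Shapley value a useful complement to the other two metrics.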
Table 2
Performance comparison of local robustness verification algorithms in terms of the number of solved instances, relative marginal contribution (RMC), Shapley value (πœ‘) and average CPU running time, computed for each category with πœ– set at 0.012.

ReLU
                         MNIST                                   CIFAR
Verifier     Solved   RMC     πœ‘   Avg. Time      Solved   RMC     πœ‘   Avg. Time
                                    [CPU s]                             [CPU s]
BaBSB           358   0.22   118       3 241        307   0.00    86       2 924
Marabou       1 001   0.19   312       1 801        400   0.00   117       2 153
Neurify         871   0.25   265       1 964        915   0.75   411         235
nnenum        1 754   0.17   600         389         76   0.05    28       3 337
Verinet       1 799   0.16   618         263        841   0.20   330         500

ReLU+Maxpool
                         MNIST                                   CIFAR
Verifier     Solved   RMC     πœ‘   Avg. Time      Solved   RMC     πœ‘   Avg. Time
                                    [CPU s]                             [CPU s]
Marabou           5   1.00     5          57          0   0.00     0       3 600

Tanh
                         MNIST                                   CIFAR
Verifier     Solved   RMC     πœ‘   Avg. Time      Solved   RMC     πœ‘   Avg. Time
                                    [CPU s]                             [CPU s]
Verinet         556   1.00   556          55          0   0.00     0       3 600

Sigmoid
                         MNIST                                   CIFAR
Verifier     Solved   RMC     πœ‘   Avg. Time      Solved   RMC     πœ‘   Avg. Time
                                    [CPU s]                             [CPU s]
Marabou           0   0.00     0       3 600          0   0.00     0       3 600
Verinet         581   1.00   581          55          0   0.00     0       3 600


6. Results and Discussion

In the following, we provide an in-depth discussion of the results obtained from our experiments. Table 1 shows the categories we devised, along with the resulting instance set sizes, as well as information on which verifier has been employed for each category.
   Table 2 reports the number of problem instances solved by each verifier per network category, the relative marginal contribution, the Shapley value and the average running time. On ReLU-based MNIST networks, we found Verinet to be the best-performing verifier, solving 1 799 out of 2 500 instances, while achieving a Shapley value of 618. However, taking relative marginal contribution into account, we found that Neurify achieved the highest relative marginal contribution of 0.25 (compared to 0.16 for Verinet), indicating that it could verify a sizeable fraction of instances on which other methods failed to return a solution. Moreover, the relative marginal contribution scores show that each method could solve a sizeable fraction of instances unsolved by any of the other methods.
   On ReLU-based CIFAR networks, it should first be noted that no verification problem instance could be solved by all verifiers, highlighting the structural differences between instances and the sensitivity of the verification approaches to those instances. That said, Neurify slightly outperformed Verinet in terms of the number of solved instances (915 vs 841 out of 2 500). Furthermore, Neurify achieved a much larger relative marginal contribution than Verinet (0.75 vs 0.20), which means that the former could solve a relatively large number of instances which could not be solved by the other methods. Generally, relative marginal contribution scores are much less evenly distributed among verifiers than on the MNIST dataset.
   Figures 1a and 1b show an instance-level comparison of the two best-performing algorithms (in terms of Shapley value) in the ReLU category for each dataset. In Figure 1a, we see that on MNIST networks, both Verinet and nnenum solved instances that the other one could not solve within the given time budget. Concretely, when considering a parallel portfolio containing both algorithms, the number of solved instances slightly increases to 1 817 out of 2 500 (vs 1 799 solved by Verinet and 1 754 solved by nnenum alone), while supplied with similar CPU resources (i.e., 1 800 CPU seconds per verifier, which represents half of the total given time budget of 3 600
[Figure 1: log–log scatter plots of per-instance CPU time, axes ranging from 10^βˆ’3 to 10^5 s; (a) Verinet vs nnenum on MNIST, (b) Verinet vs Neurify on CIFAR.]
Figure 1: Performance comparison of the two top-performing verification methods (in terms of Shapley value) in the ReLU category on (a) MNIST and (b) CIFAR networks. The plots show that each verifier outperforms the other on some instances, while neither method dominates across the entire instance set. This illustrates the complementary strengths of the verification algorithms. The diagonal line indicates equal performance of the two methods.

[Figure 2: fraction of instances solved (0.0–1.0) as a function of CPU time [s] (log scale, 10^0 to 10^3) for BaBSB, Marabou, Neurify, nnenum and Verinet; (a) MNIST, (b) CIFAR.]
Figure 2: Fraction of instances solved by the considered verification algorithms in the ReLU category as a function of CPU running time for (a) MNIST and (b) CIFAR networks. In (a), Verinet solves the largest fraction of instances within the least time, while in (b) Neurify does.
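The curves in Figure 2 are empirical cumulative distribution functions over per-instance running times. A minimal sketch of how one such curve is computed, using hypothetical solve times with timeouts recorded as infinity:

```python
import math

def fraction_solved(times, t):
    """Fraction of instances solved within t CPU seconds (timeouts = math.inf)."""
    return sum(1 for x in times if x <= t) / len(times)

# Illustrative per-instance CPU times for one verifier (not real measurements).
times = [0.5, 3.0, 40.0, 900.0, math.inf]
curve = [(t, fraction_solved(times, t)) for t in (1, 10, 100, 1000, 3600)]
print(curve)  # [(1, 0.2), (10, 0.4), (100, 0.6), (1000, 0.8), (3600, 0.8)]
```

Plotting such (time, fraction) pairs on a logarithmic time axis yields exactly the kind of curve shown in the figure; a flat tail indicates instances that remain unsolved within the full budget.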

CPU seconds per verification query as used in our experiments). Notice that leveraging parallel portfolios has already been shown to significantly improve the performance of MIP-based verification methods [39].
   On CIFAR instances, we found Neurify and Verinet to also have distinctive strengths over each other. This is shown in Figure 1b, where both algorithms could solve a substantial number of instances for which the other could not return a solution. Thus, when combined in a parallel portfolio, 963 instances can be solved (vs 915 solved by Neurify and 841 solved by Verinet alone, out of 2 500 instances), while using the same amount of CPU resources, i.e., 1 800 CPU seconds per verifier. These findings further emphasise the complementarity between the verification algorithms considered in our study. All remaining verifiers achieved substantially lower Shapley values and relative marginal contribution scores, indicating that they would not complement Neurify and Verinet well in a portfolio.
   Figure 2a shows the cumulative distribution function of running times over the MNIST problem instances. As seen in the figure, Verinet tends to solve these problem instances fastest; however, Neurify tended to show even better performance on those instances it was able to solve. We note that most of the instances unsolved by Neurify represent networks that were trained on images with 3 dimensions, whereas Neurify requires images used as network inputs to have 2 or 4 dimensions.
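The portfolio figures above can be reproduced mechanically from per-instance running times: when two verifiers share a machine, each gets half of the 3 600-second budget, and an instance counts as solved if either verifier finishes within its share. A sketch on hypothetical data (not our actual measurements):

```python
import math

HALF_BUDGET = 3600 / 2  # 1 800 CPU s per verifier in a two-verifier portfolio

def solved_alone(times, budget=3600):
    """Instances solved by a single verifier given the full time budget."""
    return sum(1 for t in times if t <= budget)

def solved_portfolio(times_a, times_b, budget=HALF_BUDGET):
    """Instances solved by at least one verifier within its half of the budget."""
    return sum(1 for ta, tb in zip(times_a, times_b) if min(ta, tb) <= budget)

# Illustrative per-instance CPU times (math.inf = unsolved within 3 600 s).
t_neurify = [12.0, math.inf, 90.0, math.inf]
t_verinet = [math.inf, 300.0, 2500.0, 40.0]

print(solved_alone(t_neurify), solved_alone(t_verinet))  # 2 3
print(solved_portfolio(t_neurify, t_verinet))            # 4
```

In this toy example, the portfolio solves more instances than either verifier alone, mirroring the complementarity observed in our experiments; note that an instance solvable only slowly by one verifier (here, 2 500 s) can still be covered by the other within the halved budget.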
   Figure 2b shows a similar plot for the CIFAR problem instances. Here, Neurify solved the largest fraction of instances in less time than the other methods. This suggests that Neurify is a very competitive verifier when applicable to the specific network or input format.
   For each of the remaining categories, we found that only one verifier could effectively handle the respective problem instances. Specifically, instances from the ReLU+MaxPooling category can be processed by Marabou, although only a modest number of MNIST instances could be solved in this way. Networks containing Tanh activations can, in principle, be verified by Verinet, but the algorithm likewise did not solve any CIFAR instances. Lastly, Sigmoid-based networks can be handled by both Verinet and Marabou; however, only the former could solve MNIST instances within the given compute budget.

7. Conclusions and Future Work

In this work, we assessed the performance of several local robustness verification algorithms, i.e., algorithms used to verify the robustness of an image classification network against small input perturbations. To conclude, we found that all considered methods support ReLU-based networks, while other network types are strongly under-supported. While this has been suspected in the community, it has, to our knowledge, not yet been subject to formal study. Furthermore, we presented evidence for strong performance complementarity: even within the same benchmark category, two verification systems outperform each other on distinct sets of instances. As we have demonstrated, this complementarity can be exploited by combining individual verifiers into parallel portfolios. Lastly, we showed that, in general, the performance of verifiers strongly differs between image datasets, with some methods achieving the best performance on MNIST (in terms of the number of solved instances and average running time) while falling behind on CIFAR, and vice versa. In future work, it would be interesting to include a broader set of perturbation radii and to analyse in more detail how the relative performance of verifiers depends on the given radius. Furthermore, we are interested in expanding our analysis to GPU-based verification algorithms.

Acknowledgements

This research was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215.

References

 [1] K. D. Julian, M. J. Kochenderfer, M. P. Owen, Deep neural network compression for aircraft collision avoidance systems, Journal of Guidance, Control, and Dynamics 42 (2019) 598–608.
 [2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199 (2014).
 [3] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, A. Criminisi, Measuring Neural Net Robustness with Constraints, in: Advances in Neural Information Processing Systems (NeurIPS 2016), 2016, pp. 2613–2621.
 [4] E. Botoeva, P. Kouvaros, J. Kronqvist, A. Lomuscio, R. Misener, Efficient Verification of ReLU-based Neural Networks via Dependency Analysis, in: Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20), 2020, pp. 3291–3299.
 [5] R. R. Bunel, I. Turkaslan, P. Torr, P. Kohli, P. K. Mudigonda, A Unified View of Piecewise Linear Neural Network Verification, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 1–10.
 [6] K. Dvijotham, R. Stanforth, S. Gowal, T. A. Mann, P. Kohli, A Dual Approach to Scalable Verification of Deep Networks, in: Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018), 2018, pp. 550–559.
 [7] R. Ehlers, Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks, in: Proceedings of the 15th International Symposium on Automated Technology for Verification and Analysis (ATVA 2017), 2017, pp. 269–286.
 [8] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, M. Vechev, AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation, in: Proceedings of the 39th IEEE Symposium on Security and Privacy (IEEE S&P 2018), 2018, pp. 3–18.
 [9] P. Henriksen, A. Lomuscio, Efficient Neural Network Verification via Adaptive Refinement and Adversarial Search, in: Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), 2020, pp. 2513–2520.
[10] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks, in: Proceedings of the 29th International Conference on Computer Aided Verification (CAV 2017), 2017, pp. 97–117.
[11] K. Scheibler, L. Winterer, R. Wimmer, B. Becker, Towards Verification of Artificial Neural Networks, in: Proceedings of the 18th Workshop on Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015), 2015, pp. 30–40.
[12] V. Tjeng, K. Xiao, R. Tedrake, Evaluating Robustness of Neural Networks with Mixed Integer Programming, in: Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019, pp. 1–21.
[13] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Efficient Formal Safety Analysis of Neural Networks, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 6369–6379.
[14] W. Xiang, H.-D. Tran, T. T. Johnson, Output Reachable Set Estimation and Verification for Multilayer Neural Networks, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 5777–5783.
[15] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and Harnessing Adversarial Examples, arXiv preprint arXiv:1412.6572 (2014).
[16] N. Papernot, P. McDaniel, X. Wu, S. Jha, A. Swami, Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks, in: Proceedings of the 37th IEEE Symposium on Security and Privacy (IEEE S&P 2016), 2016, pp. 582–597.
[17] M. Casadio, E. Komendantskaya, M. L. Daggitt, W. Kokke, G. Katz, G. Amir, I. Refaeli, Neural Network Robustness as a Verification Property: A Principled Case Study, in: Proceedings of the 34th International Conference on Computer Aided Verification (CAV 2022), 2022, pp. 219–231.
[18] S. Bak, C. Liu, T. Johnson, The Second International Verification of Neural Networks Competition (VNN-COMP 2021): Summary and Results, arXiv preprint arXiv:2109.00498 (2021).
[19] L. Xu, F. Hutter, H. H. Hoos, K. Leyton-Brown, SATzilla: Portfolio-based Algorithm Selection for SAT, Journal of Artificial Intelligence Research 32 (2008) 565–606.
[20] S. Kadioglu, Y. Malitsky, A. Sabharwal, H. Samulowitz, M. Sellmann, Algorithm Selection and Scheduling, in: Proceedings of the 17th International Conference on Principles and Practice of Constraint Programming (CP 2011), 2011, pp. 454–469.
[21] D. Shriver, S. Elbaum, M. B. Dwyer, DNNV: A Framework for Deep Neural Network Verification, in: Proceedings of the 33rd International Conference on Computer Aided Verification (CAV 2021), 2021, pp. 137–150.
[22] L. Li, X. Qi, T. Xie, B. Li, SoK: Certified Robustness for Deep Neural Networks, arXiv preprint arXiv:2009.04131 (2020).
[23] G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zeljić, D. L. Dill, M. J. Kochenderfer, C. Barrett, The Marabou Framework for Verification and Analysis of Deep Neural Networks, in: Proceedings of the 31st International Conference on Computer Aided Verification (CAV 2019), 2019, pp. 443–452.
[24] L. Pulina, A. Tacchella, Checking Safety of Neural Networks with SMT Solvers: A Comparative Evaluation, in: AI*IA, 2011, pp. 127–138.
[25] L. Pulina, A. Tacchella, NeVer: A Tool for Artificial Neural Networks Verification, Annals of Mathematics and Artificial Intelligence (2011) 403–425.
[26] L. Pulina, A. Tacchella, Challenging SMT Solvers to Verify Neural Networks, AI Communications (2012) 117–135.
[27] L. de Moura, N. BjΓΈrner, Satisfiability Modulo Theories: An Appetizer, in: Proceedings of the Brazilian Symposium on Formal Methods (SBMF 2009), 2009, pp. 23–36.
[28] S. Dutta, S. Jha, S. Sankaranarayanan, A. Tiwari, Output Range Analysis for Deep Neural Networks, in: Proceedings of the Tenth NASA Formal Methods Symposium (NFM 2018), 2018, pp. 121–138.
[29] A. Lomuscio, L. Maganti, An approach to reachability analysis for feed-forward ReLU neural networks, arXiv preprint arXiv:1706.07351 (2017).
[30] A. H. Land, A. G. Doig, An Automatic Method of Solving Discrete Programming Problems, Econometrica (1960) 497–520.
[31] R. Bunel, I. Turkaslan, P. H. S. Torr, M. P. Kumar, J. Lu, P. Kohli, Branch and Bound for Piecewise Linear Neural Network Verification, Journal of Machine Learning Research (2020) 1574–1612.
[32] A. De Palma, R. Bunel, A. Desmaison, K. Dvijotham, P. Kohli, P. H. S. Torr, M. P. Kumar, Improved Branch and Bound for Neural Network Verification via Lagrangian Decomposition, arXiv preprint arXiv:2104.06718 (2021).
[33] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Formal Security Analysis of Neural Networks using Symbolic Intervals, in: Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1599–1614.
[34] S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, Z. Kolter, Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Neural Network Robustness Verification, in: Advances in Neural Information Processing Systems (NeurIPS 2021), 2021, pp. 29909–29921.
[35] G. Singh, T. Gehr, M. PΓΌschel, M. Vechev, An Abstract Domain for Certifying Neural Networks, in: Proceedings of the 46th ACM SIGPLAN Symposium on Principles of Programming Languages (ACM-POPL 2019), 2019, pp. 1–30.
[36] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, L. Daniel, Efficient Neural Network Robustness Certification with General Activation Functions, in: Advances in Neural Information Processing Systems (NeurIPS 2018) 31 (2018), pp. 4944–4953.
[37] G. Singh, T. Gehr, M. Mirman, M. PΓΌschel, M. Vechev, Fast and Effective Robustness Certification, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 1–12.
[38] S. Bak, H.-D. Tran, K. Hobbs, T. T. Johnson, Improved Geometric Path Enumeration for Verifying ReLU Neural Networks, in: Proceedings of the 32nd International Conference on Computer Aided Verification (CAV 2020), 2020, pp. 66–96.
[39] M. KΓΆnig, H. H. Hoos, J. N. van Rijn, Speeding up neural network robustness verification via algorithm configuration and an optimised mixed integer linear programming solver portfolio, Machine Learning 111 (2022) 4565–4584.
[40] H. H. Hoos, T. StΓΌtzle, Stochastic Local Search: Foundations & Applications, Elsevier / Morgan Kaufmann, 2004.
[41] T. Bartz-Beielstein, C. Doerr, D. v. d. Berg, J. Bossek, S. Chandrasekaran, T. Eftimov, A. Fischbach, P. Kerschke, W. La Cava, M. Lopez-Ibanez, et al., Benchmarking in Optimization: Best Practice and Open Issues, arXiv preprint arXiv:2007.03488 (2020).
[42] A. FrΓ©chette, L. Kotthoff, T. Michalak, T. Rahwan, H. Hoos, K. Leyton-Brown, Using the Shapley Value to Analyze Algorithm Portfolios, in: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16), 2016, pp. 3397–3403.
[43] C. P. Gomes, B. Selman, Algorithm portfolios, Artificial Intelligence 126 (2001) 43–62.