<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Critically Assessing the State of the Art in CPU-based Local Robustness Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthias König</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annelot W. Bosman</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Holger H. Hoos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan N. van Rijn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair for AI Methodology, RWTH Aachen University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leiden Institute of Advanced Computer Science, Leiden University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of British Columbia</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent research has proposed various methods to formally verify neural networks against minimal input perturbations. This type of verification is referred to as local robustness verification. The research area of local robustness verification is highly diverse, as verifiers rely on a multitude of techniques, including mixed integer programming and satisfiability modulo theories. At the same time, problem instances differ based on the network to be verified, the verification property and the specific network input. This gives rise to the question of which verification algorithm is most suitable to solve a given verification problem. To answer this question, we perform a systematic performance analysis of several CPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, each verified against 100 robustness properties. Notably, we show that there is no single best algorithm that dominates in performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. Furthermore, we confirm the notion that most algorithms only support ReLU-based networks, while other activation functions remain under-supported.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmark Analysis</kwd>
        <kwd>Neural Network Verification</kwd>
        <kwd>Adversarial Robustness</kwd>
        <kwd>Shapley Value</kwd>
        <kwd>Algorithm Portfolios</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, deep learning methods based on neural networks have been increasingly applied within various safety-critical domains and use contexts, ranging from manoeuvre advisory systems in unmanned aircraft to face recognition systems in mobile phones (see, e.g., Julian et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Furthermore, it is now well known that neural networks are vulnerable to adversarial examples [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where a given input is manipulated in such a way that it is misclassified by the network. In the case of image recognition tasks, the required perturbation, whether it is adversarially crafted or arises accidentally, can be so small that it remains virtually undetectable to the human eye. Against this background, much work has focused on developing methods to provide formal guarantees regarding the behaviour of a given neural network [
        <xref ref-type="bibr" rid="ref10 ref11 ref3 ref4 ref5 ref6 ref7 ref8 ref9">3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14</xref>
        ]. For instance, a network employed in autonomous driving for detecting traffic signs should always produce accurate predictions, even when the input is slightly perturbed; failing to do so could have fatal consequences. This specific type of assessment refers to local robustness verification, a broadly studied verification task, in which a network is systematically tested against various input perturbations under predefined norms, such as the ℓ∞ norm [15, 16].
      </p>
      <p>
        Neural network verification is a highly diverse research area, and existing methods rely on a broad range of techniques. At the same time, neural networks differ in terms of their architecture, such as the number of hidden layers and nodes, the type of non-linearities, e.g., ReLU, Sigmoid or Tanh, or the type of operations they employ, e.g., pooling or convolutional layers. This diversity, both in terms of verification approaches and neural network design, makes it challenging for researchers or practitioners to assess and decide which method is most suitable for verifying a given neural network [17]. This challenge is amplified by the fact that the neural network verification community does not (yet) use commonly agreed evaluation protocols, which makes it difficult to draw clear conclusions from the literature regarding the capabilities and performance of existing verifiers. More precisely, existing studies use different benchmarks and, so far, have not provided an in-depth performance comparison of a broad range of verification algorithms.
      </p>
      <p>SafeAI 2023: Workshop on Artificial Intelligence Safety, Feb 13–14, 2023, Washington, D.C., US @AAAI-23. * Corresponding author: M. König; a.w.bosman@liacs.leidenuniv.nl (A. W. Bosman); hh@aim.rwth-aachen.de (H. H. Hoos); j.n.van.rijn@liacs.leidenuniv.nl (J. N. van Rijn). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>Recently, a competition series has been initiated, in which several verifiers were applied to different benchmarks (i.e., networks, properties and datasets) and compared in terms of various performance measures, including the number of verified instances as well as running time [18]. While the results from these competitions have provided highly valuable insights into the general progress in neural network verification, several questions are left unanswered. For instance, it remains unclear why a verification system participated in certain competition categories but not in others, although this information would help in understanding the technical capabilities and limitations of a given verifier. Furthermore, aggregation of the results makes it difficult to assess in detail the strengths or weaknesses of verifiers on different instances. Instead, looking at aggregated results, one can easily get the impression that a single approach dominates ‘across the board’ – an assumption that is known to be inaccurate for other problems involving formal verification tasks; see, e.g., Xu et al. [19] or Kadioglu et al. [20] for SAT.</p>
      <p>In this work, we focus exclusively on local robustness verification in image classification against perturbations under the ℓ∞ norm. This scenario represents a widely studied verification task, with a large number of networks being publicly available and many verifiers providing off-the-shelf support. Notice that most verification tasks can be translated into local robustness verification queries [21]; we, therefore, believe that our findings are broadly applicable. Moreover, we seek to overcome the limitations of existing benchmarking approaches and shed light on previously unanswered questions with regard to the state of the art in local robustness verification.</p>
      <p>Our contributions are as follows: (i) We analyse the current state of practice in benchmarking verification algorithms; (ii) we perform a systematic benchmarking study of several, carefully chosen verification methods based on a newly assembled, large and diverse set of networks, including 38 CIFAR and 41 MNIST networks with different activation functions, each verified against 100 robustness properties, for which we consumed a total of 1.69 CPU years in running time; (iii) we present a categorisation of verification benchmarks based on verifier compatibilities with different layer types and operations; (iv) we quantify verifier performance in terms of the number of solved instances, running time, as well as marginal contribution and Shapley value, showing that top-performing verification algorithms strongly complement rather than consistently dominate each other in terms of performance – e.g., while the verifiers Neurify and Verinet achieved competitive performance on the CIFAR networks we considered, the former solved many instances unsolved by the latter and vice versa; (v) lastly, we provide a public repository containing all experimental data, along with the newly assembled network collection (https://github.com/ADA-research/nn-verification-assessment).</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Background</title>
      <p>Neural network verification methods seek to formally assess whether a trained neural network adheres to some predefined input-output property. In this study, we focus on local robustness properties. Given a trained neural network and a set of images as inputs, a robustness verification algorithm aims to verify whether or not there exist slight perturbations to an image that lead the network to predict an incorrect label. The maximum perturbation of each variable in the input, i.e., pixel in a given image, is predefined.</p>
      <p>Formal verification algorithms can be either complete or incomplete [22]. An algorithm that is incomplete does not guarantee to report a solution for every given example; however, incomplete verification algorithms are typically sound, which means they will report that a property holds only if the property actually holds. On the other hand, an algorithm that is sound and complete will correctly state that a property holds whenever it holds, when given sufficient resources to be run to completion. In this work, we focus on complete algorithms, as those arguably represent the most ambitious form of neural network verification, making them preferable over incomplete methods, especially in safety-critical applications.</p>
      <p>
        Early work on complete verification of neural networks utilised satisfiability modulo theories (SMT) solvers [
        <xref ref-type="bibr" rid="ref10">10, 23, 24, 25, 26</xref>
        ], which determine whether a set of logic constraints is satisfiable [27]. The resulting verification problems are NP-complete and challenging to solve in practice. Alternatively, it is possible to formulate the verification task as a constraint optimisation problem using mixed integer programming (MIP) [
        <xref ref-type="bibr" rid="ref4">4, 28, 29, 12</xref>
        ]. MIP solvers essentially optimise an objective function subject to a set of constraints. MIP problems, similar to SMT problems, can be challenging to solve and tend to be computationally expensive (in terms of CPU time and memory). To overcome the computational complexity of SMT and MIP, it has been proposed to use the well-known branch-and-bound algorithm [30] for solving the verification problem, regardless of whether it is modelled as MIP or SMT [
        <xref ref-type="bibr" rid="ref5 ref7">31, 5, 32, 7, 13</xref>
        ]. Besides these three popular approaches, recent work has studied symbolic interval propagation [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9, 33, 34</xref>
        ] and polyhedra [35, 36], zonotope [
        <xref ref-type="bibr" rid="ref12 ref8">8, 37</xref>
        ] and star-set abstraction [
        <xref ref-type="bibr" rid="ref13">38</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Common Practices in Benchmarking Neural Network Verifiers</title>
      <sec id="sec-4-1">
        <p>1 https://github.com/ADA-research/nn-verification-assessment</p>
        <p>Considering the diversity in neural network verification problems, it is quite natural to assume that a single best algorithm does not exist, i.e., a method that outperforms all others under all circumstances. It is still hard to identify to what extent a method contributes to the state of the art, mainly because verification methods are typically evaluated (i) on a small number of benchmarks, which have often been created for the sole purpose of evaluating the method at hand, and (ii) against baseline methods for which it is often unclear how they were chosen, leading to several methods claiming state-of-the-art performance without having been directly compared. We note that in the context of local robustness verification, a benchmark most often represents a neural network classifier trained on the MNIST or CIFAR-10 dataset, respectively.</p>
        <p>As previously mentioned, a competition series has been established, with the goal of providing an objective and fair comparison of the state-of-the-art methods in neural network verification, in terms of scalability and speed [18]. The VNN Competition was held in the years 2020, 2021 and 2022, yet with different protocols (e.g., for running experiments, scoring, etc.), benchmarks and participants. In this discussion, we focus on the 2021 edition, as the report from 2022 has not yet been published at the time of this writing.</p>
        <p>Within VNN 2021, a total of 9 benchmarks were considered, of which 7 represented test cases for local robustness verification of image classification networks. Benchmarks were proposed by the participants themselves and included a total of 11 CIFAR and 7 MNIST networks, which differed in terms of architecture components, such as non-linearities (e.g., ReLU, Tanh, Sigmoid) and layer operations (e.g., convolutional or pooling layers). Networks were trained on the CIFAR-10 and MNIST datasets, respectively. Moreover, each benchmark was composed of random image subsets, excluding images that were misclassified by the given network, along with varying perturbation radii.</p>
        <p>This competition overcame several of the previously reported limitations with regard to the evaluation of network verifiers. Most notably, it covered a relatively large and diverse set of neural networks. Moreover, thanks to the active participation from the community, 12 verification algorithms were included in the competition. At the same time, we see room for further research into the performance of neural network verifiers.</p>
        <p>Firstly, it is not clearly stated in the VNN competition report [18] why verifiers participated in certain competition categories, but not in others. While we assume this to be due to compatibility reasons, we could not find any formal explanation in the report, although this information could provide insight into relevant problem classes and suitable algorithms for solving these.</p>
        <p>Secondly, GPU-accelerated tools were directly compared to those that rely solely on CPU resources, which we argue does not give due credit to the structural differences between these classes of algorithms. GPU-accelerated approaches employ advanced parallelisation schemes, giving rise to substantially higher computational demands than those required by methods running on CPUs, especially when restricted to a single core. For example, an Amazon EC2 GPU-based instance with 32 CPU cores costs around 1.4 times more than an equivalent CPU instance (see https://aws.amazon.com/ec2/pricing/on-demand/).</p>
        <p>Lastly, the competition seeks to determine the current state of the art; however, the competition ranking and scores do not sufficiently quantify the extent to which an algorithm actually contributes to the state of the art. In other words, it is in the nature of competitions to determine a winner, at least implicitly suggesting that a single approach generally outperforms all competitors. However, some verification algorithms might have limited but distinct areas of strength, which cannot be identified through aggregated performance measures, such as the total number of verified instances. Although the competition report [18] shows that individual verifier performance differs among benchmarks, it remains unclear whether all algorithms solve the same set of instances in the given benchmark, or if they complement each other. Similarly, it does not reveal whether or not methods are correlated in their performance.</p>
        <p>4. Verification Algorithms under Assessment</p>
        <p>We consider 5 complete, CPU-based neural network verification algorithms; each of these was chosen because it fulfilled one of the following conditions: it was (i) ranked among the top five verification methods according to the 2021 VNN competition or (ii) supported by the recently published DNNV framework [21]. Altogether, we consider our set of algorithms to be representative of the developments in the area of complete neural network and, more specifically, local robustness verification.</p>
        <p>
          All methods were employed with their default hyperparameter settings, as they would likely be used by practitioners. We note that the performance of a verifier might improve if its hyperparameters were optimised specifically for the given benchmark; however, most verifiers have dozens of hyperparameters (or employ combinatorial solvers that come with their own, extensive set of hyperparameters), which makes this a non-trivial task, requiring additional expertise and resources (see, e.g., König et al. [
          <xref ref-type="bibr" rid="ref14">39</xref>
          ]). The verification algorithms we considered are the following:
        </p>
        <p>
          BaB. The algorithm proposed by Bunel et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] restates the verification problem as a global optimisation problem, which is then solved using branch-and-bound search. It further incorporates algorithmic improvements with regard to branching and bounding procedures, such as smart branching; i.e., before splitting, it computes fast bounds on each of the possible subdomains and chooses the one with the tightest bounds. We refer to this method as BaBSB for the remainder of this paper. BaBSB supports ReLU-based networks.
        </p>
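        <p>To illustrate how branch-and-bound with bound-guided branching operates, the following is a minimal, hypothetical sketch: a two-input toy function with naive interval lower bounds, where a subdomain is proven safe once its fast lower bound is positive, a concrete counterexample is sought at the midpoint, and the subdomain with the tightest bound is explored first. This is an illustration of the general scheme only, not the BaBSB implementation.</p>

```python
# Sketch of branch-and-bound verification with bound-guided ("smart")
# branching on a hypothetical toy model: verify that g(x) > 0 for all x
# in an input box, where bounds come from naive interval arithmetic.

def relu(v):
    return max(v, 0.0)

def g(x):
    # toy "network": two ReLU units over two inputs, fixed weights
    return relu(2.0 * x[0] - x[1]) + relu(x[1]) - 0.5

def lower_bound(box):
    # crude, sound interval lower bound of g over box = [(lo, hi), (lo, hi)]
    lo0, hi0 = box[0]
    lo1, hi1 = box[1]
    h1 = max(2.0 * lo0 - hi1, 0.0)   # lower bound of relu(2*x0 - x1)
    h2 = max(lo1, 0.0)               # lower bound of relu(x1)
    return h1 + h2 - 0.5

def split(box):
    # split the widest input dimension in half
    widths = [hi - lo for lo, hi in box]
    d = widths.index(max(widths))
    lo, hi = box[d]
    mid = (lo + hi) / 2.0
    left, right = list(box), list(box)
    left[d] = (lo, mid)
    right[d] = (mid, hi)
    return left, right

def verify(box, depth=0, max_depth=20):
    if lower_bound(box) > 0.0:
        return True                  # property proven on this subdomain
    mid = [(lo + hi) / 2.0 for lo, hi in box]
    if g(mid) <= 0.0:
        return False                 # concrete counterexample found
    if depth == max_depth:
        return False                 # give up at depth limit
    left, right = split(box)
    # "smart branching" flavour: explore the subdomain with the
    # tightest (largest) lower bound first
    first, second = sorted((left, right), key=lower_bound, reverse=True)
    return verify(first, depth + 1, max_depth) and verify(second, depth + 1, max_depth)

print(verify([(0.4, 0.6), (0.4, 0.6)]))   # property holds on this box
```

The same call on the box [0, 1] × [0, 1] returns False, since points such as (0, 0) violate the property.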
        <p>Marabou. The Marabou framework [23] employs
SMT solving techniques, specifically the lazy search
technique for handling non-linear constraints. Furthermore,
Marabou employs deduction techniques to obtain
information on the activation functions that can be used to
simplify them. The core of the SMT solver is
simplex-based, which means that the variable assignments are
made using the simplex algorithm. Marabou supports
ReLU and Sigmoid activations as well as MaxPooling
operations.</p>
        <p>Neurify. The verification algorithm proposed by
Wang et al. [33] relies on symbolic interval propagation
to create over-approximations, followed by a refinement
strategy based on symbolic gradient information. The
constraint refinement aims to tighten the bounds of the
approximation of activation functions. Neurify can
process networks containing ReLU activations.</p>
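        <p>The over-approximation step can be illustrated with plain (non-symbolic) interval bound propagation, a simpler relative of the symbolic intervals Neurify uses; the sketch below (hypothetical weights, not Neurify's code) soundly propagates elementwise bounds through affine and ReLU layers.</p>

```python
# Minimal sketch of interval bound propagation through a ReLU network;
# the weights below are hypothetical and chosen only for illustration.
import numpy as np

def propagate_interval(lo, hi, weights, biases):
    """Propagate elementwise input bounds [lo, hi] through affine + ReLU layers."""
    for k, (W, b) in enumerate(zip(weights, biases)):
        Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
        # sound affine bounds: positive weights take the same-side bound,
        # negative weights take the opposite-side bound
        new_lo = Wp @ lo + Wn @ hi + b
        new_hi = Wp @ hi + Wn @ lo + b
        if k + 1 != len(weights):        # ReLU on all hidden layers
            new_lo = np.maximum(new_lo, 0.0)
            new_hi = np.maximum(new_hi, 0.0)
        lo, hi = new_lo, new_hi
    return lo, hi

W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.array([-0.2])
lo, hi = propagate_interval(np.array([0.4, 0.4]), np.array([0.6, 0.6]), [W1, W2], [b1, b2])
print(lo, hi)
```

If the resulting lower bound on the output is already positive, the property is verified without refinement; otherwise a refinement step (as in Neurify) tightens the activation bounds.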
        <p>
          nnenum. The verifier proposed by Bak et al. [
          <xref ref-type="bibr" rid="ref13">38</xref>
          ]
utilises star sets to represent the values each layer of a
neural network can attain. By propagating these through
the network, it checks whether one or more of the star
sets results in an adversarial example. This verifier can
handle networks with ReLU activations.
        </p>
        <p>
          Verinet. The verifier developed by Henriksen and
Lomuscio [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] combines symbolic intervals with
gradient-based adversarial local search for finding
counterexamples. The authors further propose a splitting
heuristic for interval propagation based on the influence of a
given node on the bounds of the network output. Verinet
supports networks containing ReLU, Sigmoid and Tanh
activations.
        </p>
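        <p>The counterexample-search component can be sketched as follows; note that Verinet uses gradient-based search, whereas this hypothetical toy version probes each input coordinate at the boundary of the ℓ∞ ball instead, and the "network" is a stand-in score function.</p>

```python
# Simplified sketch of adversarial local search inside an L-infinity ball.
# Verinet guides this search with gradients; here we probe coordinates at
# the ball boundary instead. predict() is a hypothetical toy score:
# a non-positive value stands for a misclassification (counterexample).
import numpy as np

def predict(x):
    return 1.0 - 3.0 * x[0] + 0.5 * x[1]

def local_search(x0, eps, steps=50):
    """Search for x with max-norm distance to x0 at most eps and predict(x) <= 0."""
    x = x0.copy()
    for _ in range(steps):
        if predict(x) <= 0.0:
            return x                       # counterexample found
        best = x
        for i in range(len(x)):
            for delta in (-eps, eps):
                cand = x.copy()
                # stay inside the ball around x0 and the valid pixel range
                cand[i] = np.clip(x0[i] + delta, 0.0, 1.0)
                if predict(cand) < predict(best):
                    best = cand
        if np.array_equal(best, x):
            return None                    # local minimum, search failed
        x = best
    return x if predict(x) <= 0.0 else None

print(local_search(np.array([0.3, 0.5]), eps=0.012))
```

A failed search does not prove robustness; it only means the (incomplete) search step found no counterexample, and the verifier must fall back on its bounding and splitting machinery.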
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Setup for Empirical Evaluation</title>
      <p>In this work, we seek to provide a clearer picture of the
state of the art in neural network verification. More
specifically, we argue that the state of the art is not just
defined by a single verification algorithm, as there might
be verifiers which on their own perform poorly but still
make meaningful contributions by excelling on limited
instance subsets that are challenging for other
verification methods.</p>
      <p>
        In the following, we will present an overview of how
we set up our benchmark study, i.e., how we selected
problem instances and verification algorithms.
Furthermore, we will provide details on the software we used
and the execution environment in which our experiments
were carried out.
For our assessment, we compiled a high-quality set of
problem instances for local robustness verification.
Following best practices in other research areas, such as
optimisation [
        <xref ref-type="bibr" rid="ref15 ref16">40, 41</xref>
        ], the benchmark should be
representative and diverse, where the former refers to how well
the difficulty of the benchmark is aligned with that of
real-world instances from the same problem class, and
the latter means that the problem set should generally
contain problems with a wide range of difficulties.
      </p>
      <p>Overall, our benchmark is comprised of 79 image
classification networks, of which 38 are trained on the
CIFAR-10 dataset and 41 are trained on the MNIST dataset. To
ensure the representativeness of the problem set, all
networks were sampled from the neural network verification
literature, i.e., networks used in existing work on local
robustness verification and provided in public repositories;
in other words, the characteristics of the networks in
our benchmark are assumed to match those of networks
generally used for evaluating verification algorithms.</p>
      <p>We further want our problem set to be diverse.
Therefore, we ensured that the considered networks differ in
size, i.e., the number of hidden layers and nodes, as well
as the type of non-linearities (e.g., ReLU or Tanh) or layer
operations (e.g., pooling layers) they employ.</p>
      <p>
        For each network, we verified 100 local robustness
properties; more precisely, we sampled 100 images from
the dataset on which the network has been trained and
verified for local robustness with perturbation radius ε
set at 0.012. This perturbation radius was chosen to be
a median of values we found in literature: the smallest
value we found was 1/255 in the work by Li et al. [22],
whereas Botoeva et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or Wang et al. [13] utilised
larger values, such as 0.05.
      </p>
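      <p>Concretely, one such local robustness query can be constructed as an ℓ∞ ball of radius ε = 0.012 around a correctly classified image, with bounds clipped to the valid pixel range; a minimal sketch (hypothetical input, not our benchmark code):</p>

```python
# Constructing one local robustness query: elementwise input bounds for
# an L-infinity ball of radius eps around an image, clipped to [0, 1].
import numpy as np

def robustness_query(image, eps=0.012):
    lower = np.clip(image - eps, 0.0, 1.0)
    upper = np.clip(image + eps, 0.0, 1.0)
    # the verifier must show the predicted label is constant on [lower, upper]
    return lower, upper

image = np.full((28, 28), 0.5)     # hypothetical MNIST-sized input
lo, hi = robustness_query(image)
print(lo.min(), hi.max())
```

The clipping matters near the extremes of the pixel range: for a pixel at 0, the lower bound stays at 0 rather than becoming negative.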
      <p>Lastly, we split our benchmark set into diferent
categories based on verifier compatibilities. This means a
verifier is only applied to categories it is able to process.
The categories as well as the instance set size for each
category are shown in Table 1.</p>
      <sec id="sec-5-1">
        <title>5.2. Evaluation Metrics</title>
        <p>
          In order to assess the performance of the various
methods, we compute four performance metrics: the average
running time, the number of solved instances, the
relative marginal contribution and the Shapley value [
          <xref ref-type="bibr" rid="ref17">42</xref>
          ] of
each verifier. Although these metrics present aggregated
measures, they reflect algorithm performance on an
instance level and in relation to other methods included
in the comparison; a more detailed explanation will be
provided in the following paragraphs. Notice that we do
not penalise timeouts when computing average running
time; i.e., the maximum running time equals the given
time limit.
        </p>
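        <p>For illustration, the set-based score v(S), the marginal contribution and the Shapley value can be computed from a per-verifier solved matrix; the sketch below uses a hypothetical three-verifier, four-instance example, not our actual experimental data.</p>

```python
# Illustration of the portfolio-based metrics on a hypothetical example:
# solved[v][j] = 1 if verifier v solved instance j within the time budget.
from itertools import permutations
from math import factorial

solved = {"A": [1, 1, 0, 0], "B": [0, 1, 1, 0], "C": [0, 1, 0, 1]}
INSTANCES = range(4)

def score(subset):
    # v(S): number of instances solved by at least one verifier in S
    return sum(any(solved[v][j] for v in subset) for j in INSTANCES)

def marginal_contribution(v, verifiers):
    # Eq. (1): drop in total score when v is removed from the full set
    return score(verifiers) - score([u for u in verifiers if u != v])

def shapley(v, verifiers):
    # average marginal contribution over all joining orders, as in Eq. (2)
    total = 0
    for order in permutations(verifiers):
        pos = order.index(v)
        prefix = list(order[:pos])
        total += score(prefix + [v]) - score(prefix)
    return total / factorial(len(verifiers))

verifiers = list(solved)
for v in verifiers:
    print(v, score([v]), marginal_contribution(v, verifiers),
          round(shapley(v, verifiers), 3))
```

Note that the Shapley values sum to the score of the full portfolio, whereas marginal contributions alone need not; this is exactly why the two metrics complement each other.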
        <p>[Table 1: benchmark categories (ReLU, ReLU + MaxPool, Tanh, Sigmoid) and totals; the per-category instance counts are not preserved in this version of the text.]</p>
        <p>
          The marginal contribution is computed as follows. Define 𝒜 as a set of given verifiers and let v(𝒜) be the total score of set 𝒜. Here, the total score v(𝒜) consists of the number of instances verified by at least one verifier in set 𝒜. We compute the marginal contribution per algorithm to determine how much the total performance of all algorithms (in terms of solved instances) decreases when the given algorithm is removed from the set of all algorithms, assuming they were employed in a parallel algorithm portfolio. Such portfolios are sets of algorithms that are run in parallel on each given problem instance and complement each other in terms of performance across an instance set [
          <xref ref-type="bibr" rid="ref18">43</xref>
          ]. Formally, to determine the marginal contribution of any of the verifiers i in portfolio 𝒜, one needs to know the value of v(𝒜) and v(𝒜 ∖ {i}), where 𝒜 ∖ {i} is the portfolio minus verifier i. Thus, the marginal contribution of verifier i is expressed as
        </p>
        <p>MC_i = v(𝒜) − v(𝒜 ∖ {i})    (1)</p>
        <p>Following this terminology, we can define the number of solved instances by verifier i via a set consisting only of verifier i, Solved_i = v({i}) − v(∅). In other words, the number of solved instances employs a set of size one and the marginal contribution employs a set of all verifiers under consideration. The relative marginal contribution then represents the marginal contribution of a given verifier as a proportion of the sum of every method's absolute marginal contribution.</p>
        <p>Lastly, the Shapley value is the average marginal contribution of a verifier over all possible joining orders, where joining order refers to the order in which the verifiers are added to a parallel portfolio. This value complements the previous two metrics, as it does not assume a particular order in which algorithms are added to the portfolio. To be precise, the number of solved instances simply represents a joining order in which the considered algorithm comes first, whereas the marginal contribution metric assumes a joining order in which it comes last. However, using fixed orders, as is the case for the marginal contribution, might not reveal possible interactions between the given method and other algorithms; e.g., it might understate the importance of a single algorithm given the presence of another, highly correlated algorithm. In such a case, both algorithms would not be assigned any marginal contribution, even though one of them should be included in a potential portfolio. This is captured by the Shapley value: define n as the number of verifiers under consideration and Π as the set of all subsets of verifiers under consideration, including the empty set, where Π will be of size Σ_{k=0}^{n} n! / (k! · (n − k)!) = 2^n. Finally, define Π_i as all sets in which verifier i is present. The Shapley value of verifier i is then calculated as</p>
        <p>φ_i = (1/n!) · Σ_{S ∈ Π_i} (|S| − 1)! · (n − |S|)! · (v(S) − v(S ∖ {i}))    (2)</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Execution Environment and Software Used</title>
        <p>Our experiments were carried out on a cluster of machines equipped with Intel Xeon E5-2683 CPUs with 32 cores, 40 MB cache size and 94 GB RAM, running CentOS Linux 7. For each verification method, we limited the number of available CPU cores to a single core. Each query was given a time budget of 3 600 seconds and a memory budget of 3 GB. Generally, we executed the verification algorithms through the DNNV interface, version 0.4.8.</p>
        <p>DNNV is a framework that transforms a network and property specification into a unified format, which can then be solved by a given method [21]. More specifically, DNNV takes as input a network in the ONNX format, along with a property specification, and then translates the network and property to the input format required by the verifier. After running the verifier on the transformed problem, it reports whether the property was falsified (i.e., a counterexample was found) or proven to hold.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>by all verifiers, highlighting the structural diferences
between instances and the sensitivity of the verification
apIn the following, we provide an in-depth discussion of the proaches to those instances. That said, Neurify slightly
results obtained from our experiments. Table 1 shows the outperformed Verinet in terms of the number of solved
categories we devised, along with the resulting instance instances (915 vs 841 out of 2 500). Furthermore, Neurify
set sizes as well as information on which verifier has achieved a much larger relative marginal contribution
been employed for each category. than Verinet (0.75 vs 0.20), which means that the
for</p>
      <p>Table 2 reports the number of problem instances mer could solve a relatively large number of instances
solved by each verifier per network category, the rel- which could not be solved by the other methods.
Generative marginal contribution, the Shapley value and the ally, relative marginal contribution scores are much less
average running time. On ReLU-based MNIST networks, evenly distributed among verifiers when compared to
we found Verinet to be the best-performing verifier, solv- the MNIST dataset.
ing 1 799 out of 2 500 instances, while achieving a Shapley Figures 1a and 1b show an instance-level
comparivalue of 618. However, taking relative marginal contribu- son of the two best-performing algorithms (in terms of
tion into account, we found that Neurify achieved the Shapley value) in the ReLU category for each dataset. In
highest relative marginal contribution of 0.25 (compared Figure 1a, we see that on MNIST networks, both Verinet
to 0.16 for Verinet), indicating that it could verify a siz- and nnenum solved instances that the other one could
able fraction of instances on which other methods failed not solve within the given time budget. Concretely, when
to return a solution. Moreover, the relative marginal considering a parallel portfolio containing both
algocontribution scores show that each method could solve rithms, the number of solved instances slightly increases
a sizeable fraction of instances unsolved by any of the to 1 817 out of 2 500 (vs 1 799 solved by Verinet and 1 754
other methods. solved by nnenum alone), while supplied with similar</p>
      <p>On ReLU-based CIFAR networks, it should first be CPU resources (i.e., 1 800 CPU seconds per verifier, which
noted that no verification problem instance can be solved represents half of the total given time budget of 3 600
1.0
0.8
d
e
v
l
o
s
se0.6
c
n
a
ts
n
if
o0.4
n
o
it
c
ra
F
0.2
10 2
1.0
0.8
d
e
v
l
o
s
se0.6
c
n
a
ts
n
if
o0.4
n
o
it
c
ra
F
0.2</p>
      <p>Figures 1a and 1b show an instance-level comparison of the two best-performing algorithms (in terms of Shapley value) in the ReLU category for each dataset. In Figure 1a, we see that on MNIST networks, both Verinet and nnenum solved instances that the other one could not solve within the given time budget. Concretely, when considering a parallel portfolio containing both algorithms, the number of solved instances slightly increases to 1 817 out of 2 500 (vs 1 799 solved by Verinet and 1 754 solved by nnenum alone), while supplied with similar CPU resources (i.e., 1 800 CPU seconds per verifier, which represents half of the total given time budget of 3 600 CPU seconds per verification query as used in our experiments). Notice that leveraging parallel portfolios has already been shown to significantly improve the performance of MIP-based verification methods [<xref ref-type="bibr" rid="ref14">39</xref>].</p>
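<p>The portfolio numbers above follow from a simple rule: an instance counts as solved if at least one portfolio member finishes it within its per-verifier share of the budget. A small sketch, assuming per-instance running times with None marking an instance a verifier cannot solve (the data shown is hypothetical, not our measurements):</p>

```python
def portfolio_solved(runtimes_a, runtimes_b, per_verifier_budget):
    """Count instances solved by a two-verifier parallel portfolio.

    Each list holds one running time per instance, with None marking
    an instance the verifier could not solve at all.  An instance is
    solved if either verifier finishes within the per-verifier budget.
    """
    solved = 0
    for ta, tb in zip(runtimes_a, runtimes_b):
        ok_a = ta is not None and per_verifier_budget >= ta
        ok_b = tb is not None and per_verifier_budget >= tb
        if ok_a or ok_b:
            solved += 1
    return solved

# hypothetical running times (seconds) for three instances
print(portfolio_solved([10.0, None, 200.0], [None, 50.0, 4000.0], 1800))
```

<p>With a total budget of 3 600 CPU seconds split over two verifiers, each member gets 1 800 seconds, matching the setup described above.</p>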
      <p>On CIFAR instances, we found Neurify and Verinet to also have distinctive strengths over each other. This is shown in Figure 1b, where both algorithms could solve a substantial number of instances for which the other could not return a solution. Thus, when combined in a parallel portfolio, 963 instances can be solved (vs 915 solved by Neurify and 841 solved by Verinet alone, out of 2 500 instances), while using the same amount of CPU resources, i.e., 1 800 CPU seconds per verifier. These findings further emphasise the complementarity between the verification algorithms considered in our study. All remaining verifiers achieved substantially lower Shapley values and relative marginal contribution scores, indicating that they would not complement Neurify and Verinet well in a portfolio.</p>
      <p>Figure 2a shows the cumulative distribution function of running times over the MNIST problem instances. As seen in the figure, Verinet tends to solve these problem instances fastest; however, Neurify tended to show even better performance on those instances it was able to solve. We note that most of the instances unsolved by Neurify represent networks that were trained on images with 3 dimensions, whereas Neurify requires images used as network inputs to have 2 or 4 dimensions.</p>
      <p>Figure 2b shows a similar plot for the CIFAR problem
instances. Here, Neurify solved the largest fraction in
less time than other methods. This suggests that Neurify
is a very competitive verifier when applicable to the
specific network or input format.</p>
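<p>The curves in Figures 2a and 2b are empirical distributions of the form "fraction of instances solved within t CPU seconds". A minimal sketch of how one point on such a curve can be computed (the running times are hypothetical; None marks an unsolved instance):</p>

```python
def fraction_solved_within(runtimes, t):
    """Empirical CDF value: fraction of instances solved within t seconds.

    `runtimes` holds one running time per instance; None marks an
    instance the verifier failed to solve within the overall budget.
    """
    finished = [r for r in runtimes if r is not None and t >= r]
    return len(finished) / len(runtimes)

# hypothetical running times for four instances
print(fraction_solved_within([1.0, 10.0, None, 100.0], 10.0))
```

<p>Evaluating this function over a grid of time points yields the full cumulative distribution curve for one verifier.</p>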
      <p>For each of the remaining categories, we found that there is only one verifier that could effectively handle the respective problem instances. Specifically, instances from the ReLU+MaxPooling category can be processed by Marabou, although only a modest number of MNIST instances could be solved in this way. Networks containing Tanh activations can, in principle, be verified by Verinet, but the algorithm did not solve any CIFAR instances either. Lastly, Sigmoid-based networks can be handled by both Verinet and Marabou; however, only the former could solve MNIST instances within the given compute budget.</p>
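<p>The observations above can be condensed into a simple applicability table for dispatching a verification query to candidate verifiers. The mapping below is a hypothetical sketch distilled from this section, not an official compatibility matrix of the tools; in particular, the ReLU row assumes the four verifiers named in this section.</p>

```python
# hypothetical applicability map distilled from the observations in this
# section; not an official compatibility matrix of the tools
SUPPORTED = {
    "relu": ["Verinet", "nnenum", "Neurify", "Marabou"],
    "relu+maxpool": ["Marabou"],
    "tanh": ["Verinet"],
    "sigmoid": ["Verinet", "Marabou"],
}

def candidate_verifiers(activation):
    """Return the verifiers observed to handle a given activation category."""
    return SUPPORTED.get(activation.lower(), [])

print(candidate_verifiers("Tanh"))
```

<p>Such a table could serve as the first stage of a portfolio-based selector: filter by applicability, then allocate the CPU budget among the remaining candidates.</p>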
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this work, we assessed the performance of several local
robustness verification algorithms, i.e., algorithms used
to verify the robustness of an image classification
network against small input perturbations. To conclude, we
found that all considered methods support ReLU-based
networks, while other network types are strongly
under-supported. While this has been suspected in the
community, it has, to our knowledge, not yet been subject
to formal study. Furthermore, we presented evidence
for strong performance complementarity: even within
the same benchmark category, two verification systems
outperform each other on distinct sets of instances. As
we have demonstrated, this complementarity can be
exploited by combining individual verifiers into parallel
portfolios. Lastly, we showed that, in general, the
performance of verifiers strongly differs between image
datasets, with some methods achieving the best
performance on MNIST (in terms of the number of solved
instances and average running time) while falling behind
on CIFAR and vice versa. In future work, it would be
interesting to include a broader set of perturbation radii
and analyse in more detail how the relative performance
of verifiers depends on the given radius. Furthermore, we
are interested in expanding our analysis to GPU-based
verification algorithms.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <sec id="sec-8-1">
        <p>This research was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation program under GA No. 952215.</p>
        <p>[12] V. Tjeng, K. Xiao, R. Tedrake, Evaluating Robustness of Neural Networks with Mixed Integer Programming, in: Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019, pp. 1–21.
[13] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Efficient Formal Safety Analysis of Neural Networks, in: Advances in Neural Information Processing Systems (NeurIPS 2018), 2018, pp. 6369–6379.
[14] W. Xiang, H.-D. Tran, T. T. Johnson, Output Reachable Set Estimation and Verification for Multilayer Neural Networks, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 5777–5783.
[15] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and Harnessing Adversarial Examples, arXiv preprint arXiv:1412.6572 (2014).
[16] N. Papernot, P. McDaniel, X. Wu, S. Jha, A. Swami, Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks, in: Proceedings of the 37th IEEE Symposium on Security and Privacy (IEEE S&amp;P 2016), 2016, pp. 582–597.
[17] M. Casadio, E. Komendantskaya, M. L. Daggitt, W. Kokke, G. Katz, G. Amir, I. Refaeli, Neural Network Robustness as a Verification Property: A Principled Case Study, in: Proceedings of the 34th International Conference on Computer Aided Verification (CAV 2022), 2022, pp. 219–231.
[18] S. Bak, C. Liu, T. Johnson, The Second International Verification of Neural Networks Competition (VNN-COMP 2021): Summary and Results, arXiv preprint arXiv:2109.00498 (2021).
[19] L. Xu, F. Hutter, H. H. Hoos, K. Leyton-Brown, SATzilla: Portfolio-based algorithm selection for SAT, Journal of Artificial Intelligence Research 32 (2008) 565–606.
[20] S. Kadioglu, Y. Malitsky, A. Sabharwal, H. Samulowitz, M. Sellmann, Algorithm Selection and Scheduling, in: Proceedings of the Seventeenth International Conference on Principles and Practice of Constraint Programming (CP 2011), 2011, pp. 454–469.
[21] D. Shriver, S. Elbaum, M. B. Dwyer, DNNV: A Framework for Deep Neural Network Verification, in: Proceedings of the 33rd International Conference on Computer Aided Verification (CAV 2021), 2021, pp. 137–150.
[22] L. Li, X. Qi, T. Xie, B. Li, SoK: Certified robustness for deep neural networks, arXiv preprint arXiv:2009.04131 (2020).
[23] G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zeljić, D. L. Dill, M. J. Kochenderfer, C. Barrett, The Marabou Framework for Verification and Analysis of Deep Neural Networks, in: Proceedings of the 31st International Conference on Computer Aided Verification (CAV 2019), 2019, pp. 443–452.
[24] L. Pulina, A. Tacchella, Checking Safety of Neural Networks with SMT Solvers: A Comparative Evaluation, in: AI*IA, 2011, pp. 127–138.
[25] L. Pulina, A. Tacchella, NeVer: A Tool for Artificial Neural Networks Verification, Annals of Mathematics and Artificial Intelligence (2011) 403–425.
[26] L. Pulina, A. Tacchella, Challenging SMT Solvers to Verify Neural Networks, AI Communications (2012) 117–135.
[27] L. de Moura, N. Bjørner, Satisfiability Modulo Theories: An Appetizer, in: Proceedings of the Brazilian Symposium on Formal Methods (SBMF 2009), 2009, pp. 23–36.
[28] S. Dutta, S. Jha, S. Sankaranarayanan, A. Tiwari, Output Range Analysis for Deep Neural Networks, in: Proceedings of the Tenth NASA Formal Methods Symposium (NFM 2018), 2018, pp. 121–138.
[29] A. Lomuscio, L. Maganti, An approach to reachability analysis for feed-forward ReLU neural networks, arXiv preprint arXiv:1706.07351 (2017).
[30] A. H. Land, A. G. Doig, An Automatic Method of Solving Discrete Programming Problems, Econometrica (1960) 497–520.
[31] R. Bunel, I. Turkaslan, P. H. S. Torr, M. P. Kumar, J. Lu, P. Kohli, Branch and Bound for Piecewise Linear Neural Network Verification, Journal of Machine Learning Research (2020) 1574–1612.
[32] A. De Palma, R. Bunel, A. Desmaison, K. Dvijotham, P. Kohli, P. H. S. Torr, M. P. Kumar, Improved Branch and Bound for Neural Network Verification via Lagrangian Decomposition, arXiv preprint arXiv:2104.06718 (2021).
[33] S. Wang, K. Pei, J. Whitehouse, J. Yang, S. Jana, Formal Security Analysis of Neural Networks using Symbolic Intervals, in: Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1599–1614.
[34] S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, Z. Kolter, Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Neural Network Robustness Verification, in: Advances in Neural Information Processing Systems (NeurIPS 2021), 2021, pp. 29909–29921.
[35] G. Singh, T. Gehr, M. Püschel, M. Vechev, An Abstract Domain for Certifying Neural Networks, in: Proceedings of the 46th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2019), 2019, pp. 1–30.
[36] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, L. Daniel, Efficient Neural Network Robustness Certification with General Activation Functions, in: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018, pp. 4944–4953.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Julian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Owen</surname>
          </string-name>
          ,
          <article-title>Deep neural network compression for aircraft collision avoidance systems</article-title>
          ,
          <source>Journal of Guidance, Control, and Dynamics</source>
          <volume>42</volume>
          (
          <year>2019</year>
          )
          <fpage>598</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Intriguing properties of neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6199</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bastani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ioannou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lampropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vytiniotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Criminisi</surname>
          </string-name>
          ,
          <article-title>Measuring Neural Net Robustness with Constraints</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS</source>
          <year>2016</year>
          ),
          <year>2016</year>
          , pp.
          <fpage>2613</fpage>
          -
          <lpage>2621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Botoeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kouvaros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kronqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lomuscio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Misener</surname>
          </string-name>
          ,
          <article-title>Efficient Verification of ReLU-based Neural Networks via Dependency Analysis</article-title>
          ,
          <source>in: Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3291</fpage>
          -
          <lpage>3299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Bunel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Turkaslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Mudigonda</surname>
          </string-name>
          ,
          <article-title>A Unified View of Piecewise Linear Neural Network Verification</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dvijotham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stanforth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gowal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <article-title>A Dual Approach to Scalable Verification of Deep Networks</article-title>
          ,
          <source>in: Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence (UAI</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>550</fpage>
          -
          <lpage>559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ehlers</surname>
          </string-name>
          ,
          <article-title>Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 15th International Symposium on Automated Technology for Verification and Analysis (ATVA</source>
          <year>2017</year>
          ),
          <year>2017</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gehr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drachsler-Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tsankov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vechev</surname>
          </string-name>
          ,
          <article-title>AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation</article-title>
          ,
          <source>in: Proceedings of the 39th IEEE Symposium on Security and Privacy (IEEE S&amp;P</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Henriksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lomuscio</surname>
          </string-name>
          ,
          <article-title>Efficient Neural Network Verification via Adaptive Refinement and Adversarial Search</article-title>
          ,
          <source>in: Proceedings of the 24th European Conference on Artificial Intelligence (ECAI</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>2513</fpage>
          -
          <lpage>2520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Dill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Julian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          ,
          <article-title>Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Computer Aided Verification (CAV</source>
          <year>2017</year>
          ),
          <year>2017</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Scheibler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Winterer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <article-title>Towards Verification of Artificial Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 18th Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>G.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gehr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Püschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vechev</surname>
          </string-name>
          ,
          <article-title>Fast and Effective Robustness Certification</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS</source>
          <year>2018</year>
          ),
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>Improved Geometric Path Enumeration for Verifying ReLU Neural Networks</article-title>
          ,
          <source>in: Proceedings of the 32nd International Conference on Computer Aided Verification (CAV</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>M.</given-names>
            <surname>König</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>van Rijn</surname>
          </string-name>
          ,
          <article-title>Speeding up neural network robustness verification via algorithm configuration and an optimised mixed integer linear programming solver portfolio</article-title>
          ,
          <source>Machine Learning</source>
          <volume>111</volume>
          (
          <year>2022</year>
          )
          <fpage>4565</fpage>
          -
          <lpage>4584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Stützle</surname>
          </string-name>
          ,
          <source>Stochastic Local Search: Foundations &amp; Applications</source>
          , Elsevier / Morgan Kaufmann,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bartz-Beielstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doerr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. v. d.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bossek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Eftimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kerschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. La</given-names>
            <surname>Cava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lopez-Ibanez</surname>
          </string-name>
          , et al.,
          <article-title>Benchmarking in Optimization: Best Practice and Open Issues</article-title>
          ,
          <source>arXiv preprint arXiv:2007.03488</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fréchette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kotthoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Michalak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rahwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          ,
          <article-title>Using the shapley value to analyze algorithm portfolios</article-title>
          ,
          <source>in: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>3397</fpage>
          -
          <lpage>3403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Selman</surname>
          </string-name>
          ,
          <article-title>Algorithm portfolios</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>126</volume>
          (
          <year>2001</year>
          )
          <fpage>43</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>