Formal Verification of Tree Ensembles against Real-World
Composite Geometric Perturbations
Valency Oscar Colaco1,∗ , Simin Nadjm-Tehrani1
1
    Department of Computer and Information Science, Linköping University, Sweden


                                       Abstract
                                       Since machine learning components are now being considered for integration in safety-critical systems, safety stakeholders
                                       should be able to provide convincing arguments that the systems are safe for use in realistic deployment settings. In the case
                                       of vision-based systems, the use of tree ensembles calls for formal stability verification against a host of composite geometric
                                       perturbations that the system may encounter. Such perturbations are a combination of an affine transformation like rotation,
                                       scaling, or translation and a pixel-wise transformation like changes in lighting. However, existing verification approaches
                                       mostly target small norm-based perturbations, and do not account for composite geometric perturbations. In this work,
                                       we present a novel method to precisely define the desired stability regions for these types of perturbations. We propose a
                                       feature space modelling process that generates abstract intervals which can be passed to VoTE, an efficient formal verification
                                       engine that is specialised for tree ensembles. Our method is implemented as an extension to VoTE by defining a new property
                                       checker. The applicability of the method is demonstrated by verifying classifier stability and computing metrics associated
                                       with stability and correctness, i.e., robustness, fragility, vulnerability, and breakage, in two case studies. In both case studies,
                                       targeted data augmentation pre-processing steps were applied for robust model training. Our results show that even models
                                       trained with augmented data are unable to handle these types of perturbations, thereby emphasising the need for certified
                                       robust training for tree ensembles.

                                       Keywords
                                       Machine Learning, Formal Verification, Tree Ensembles, Composite Perturbations, Geometric Perturbations, Random Forests,
                                       Gradient Boosting Machines, Semantic Perturbations, Stability, Robustness, Trustworthy AI, Trustworthy Computing


1. Introduction                                                                                     soning with compelling and comprehensible arguments
                                                                                                    for a system being acceptably safe in a given context [25].
Recent advancements in machine learning are now being
considered for integration in safety-critical systems like                                                       The formal methods community has recently been
medical equipment, critical infrastructure, and transport                                                        exploring the robustness properties of vision-based AI
networks (autonomous vehicles, avionics, spaceflights)                                                           systems used in autonomous vehicles [5]. An important
where software flaws could be detrimental to humans                                                              role of these vision-based systems is sending correct
and the environment. While these systems are subject to                                                          information about road signs to the vehicle’s control
strict regulations, a lack of evidence that the machine                                                          system. However, the road signs encountered in the
learning (ML) models deployed in these systems are safe                                                          real world may appear different to that in the test set.
for operation, can directly affect their trustworthiness.                                                        Real-world phenomena like the banking of roads cause
Considering the inherently probabilistic nature of                                                               changes to the road surface topography in which the
these models, standard industry practices like software                                                          outer edges are raised at an angle above the inner edge
testing are often unsuitable to provide any guarantees                                                           to provide the necessary centripetal force to the vehicles
about operational safety. In this context, to help ensure                                                        for safe turns. This phenomenon, combined with rapid
trustworthiness, Artificial Intelligence (AI) systems can                                                        changes in lighting (time of day, light reflections from
benefit from scrutiny through formal methods. Formal                                                             vehicles), and the fact that road signs have reflective
verification can be a tool to analyse the safety properties                                                      elements, can produce different versions of the road
of software operating over a multi-dimensional space.                                                            signs. Along with being rotated, parts of these signs
Evidence based on formal verification can be used in a                                                           may appear lighter and darker depending on the light
safety assurance case, which is a structured record of rea-                                                      interactions (absorption, reflection, and so on). The
                                                                                                                 same principle can be applied to other geometric
The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023),
                                                                                                                 transformations like scaling, in which the road sign
Feb 13-14, 2023, Washington, D.C., US
∗
     Corresponding author.                                                                                       appears zoomed out when the vehicle is at a distance,
Envelope-Open valency.colaco@liu.se (V. O. Colaco);                                                              and translation, in which the road sign gradually
simin.nadjm-tehrani@liu.se (S. Nadjm-Tehrani)                                                                    moves out of the camera frame as it is approached
Orcid 0000-0001-6405-4794 (V. O. Colaco); 0000-0002-1485-0802                                                    by the vehicle. While the different versions of these
(S. Nadjm-Tehrani)
                    © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License road signs are correctly interpretable by humans, an
    CEUR
    Workshop
    Proceedings
                    Attribution 4.0 International (CC BY 4.0)
                    CEUR Workshop Proceedings (CEUR-WS.org)
                  http://ceur-ws.org
                  ISSN 1613-0073
                                                                                                                 ML model could misread these signs and the effects


                                                                                                1
could be disastrous. This calls for formal verification       recent survey [11] gives a broad overview of the safety
of these models to ensure classifier stability in the         and trustworthiness of deep neural networks with
presence of composite geometric perturbations that the        a specific focus on topics like verification, testing,
system may encounter in real-world deployment settings.       adversarial defence, and interpretability.

In this paper, we present a method to precisely Engstrom et al. [12], show that state of the art
capture these perturbations as abstract intervals to verify neural network classifiers are vulnerable to natural
tree-ensemble based classifiers against specified safety classes of perturbations, especially simple translations
properties. Since our goal is to support safety arguments and rotations for a significant fraction of their inputs.
in assurance cases, the verifiability of the models is of Another interesting observation they made is that data
paramount importance. The simple structure of the augmentation does not necessarily provide a significant
tree-based models makes them easier to systematically boost to robustness, which we also confirm for tree
analyze in the formal verification process. However, ensembles in this paper.
larger models may still prove hard to verify due to the
combinatorial explosion. We present a method where Pei et al. [13] and Mohapatra et al. [14] propose
composite geometric perturbations are applied to input verification frameworks called VERIVIS and Semantify-
data and shown to prove a lack of stability even in the NN for evaluating the robustness of machine learning
presence of data augmentation during training. This is systems. The VERIVIS framework checks for violations
demonstrated by applying our method to digit/character of the parameter space (like the range of angles) when
recognition problems. The contributions of this paper the inputs are subjected to semantic perturbations
are as follows,                                             while the Semantify-NN framework converts semantic
                                                            perturbations to 𝑙𝑝 -norm based threat models thereby
 - A method to capture composite geometric perturba-
                                                            enabling the use of any 𝑙𝑝 -norm based verification
    tions based on real-world phenomena (like the com-
                                                            method for robustness verification of deep neural
    bined effects of axial rotations, scaling, or transla-
                                                            networks. Balunović et al. [15] study the certification
    tion together with changes in lighting) in the interval
                                                            of neural networks against natural geometric transfor-
    domain through feature space modelling for vision-
                                                            mations like rotation and scaling. They use sampling,
    based systems.
                                                            interpolation and Lipschitz optimisation to find the
 - Realisation of the method, implemented as an exten- bounds for their certification process. The authors
    sion to the Verifier of Tree Ensembles [3] (VoTE), a implement their certification framework in a system
    formal verification tool that uses abstract interpreta- called DEEPG that uses a restricted Polyhedra with
    tion, along with the application of the method to two custom approximations for the operations used in several
    case studies commonly found in the literature.          geometric transformations. Our approach, however,
                                                            uses a feature space modelling process in conjunction
The rest of the paper is structured as follows. Section with an abstraction function to get precise upper and
2 discusses related works on the verification of tree en- lower bounds for the inputs as they are perturbed by
sembles against different classes of perturbations. Sec- real-world elements.
tion 3 presents the preliminaries on tree-ensembles, the
VoTE toolsuite, and composite geometric perturbations. Wu et al.           [16], study the certification of deep
Section 4 presents our abstraction function and feature neural networks against real-world distribution shifts.
space modelling process along with the implementation They train a generative model to learn perturbations
in VoTE. Section 5 presents the application of the method from the data to improve robustness precision. Yang
to two case studies commonly found in the literature. et al. [17] mention that geometric image transforma-
Section 6 concludes the paper and summarizes the learn- tions that arise in the real world, such as scaling and
ings.                                                       rotation, have been shown to easily deceive deep neural
                                                              networks. They propose a training formulation known
2. Related Works                                              as Certified Geometric Training (CGT) that improves the
                                                              deterministic certified robustness of neural networks
Recent machine learning advances are gradually finding        with respect to geometric transformations. Gowal et al.
their way into safety-critical systems. However, there        [19], use generative models to systematically transfer
is a lack of verification techniques to build formal          image attributes that are likely to be misclassified across
arguments for operational safety in such systems. There       image instances to achieve better robustness. They
has been extensive research on formal verification of         also mention that 𝑙𝑝 -norm bounded perturbations do
𝑙𝑝 -norm bounded perturbations (mathematical distances        not necessarily cover plausible real-world variations
that define the perturbations) in neural networks. A          that preserve the semantics of the input. We agree


                                                          2
that composite geometric perturbations based on tree structure is evaluated in a top-down manner, where
real-world phenomena (that preserve the semantics of decision functions determine which path to take towards
the input) are outside the 𝑙𝑝 -norm reach. Hence, we the leaves. Each internal node in the decision tree is
deploy a feature space modelling process for defining associated with a decision function that recursively splits
the multidimensional perturbation regions. Also, while the input space, thereby separating regions from each
using generative models can help with robust training, other. When a leaf is hit, the output 𝑦 ∈ ℝ𝑚 associated
this in itself cannot be used for building up a safety case with the leaf becomes the output associated with the
in systems that use ML components. Our approach aims input from which the top-down evaluation started.
to provide formal proof-based evidence to support safety
assurances in these systems.                                In this paper, we only consider binary trees with
                                                            linear decision functions with one input variable,
Outside formal verification and robust training, which Irsoy et al. [4] call univariate hard decision
significant research has been done to tackle the issue of trees. State-of-the-art implementations of tree-based
real-world perturbations like axial rotations in images ensembles typically use univariate hard decision trees,
in the form of rotation-invariant networks [20] and for example, scikit-learn [6] and CatBoost [7].
rotation-equivariant networks [21]. In Scher et al. [10],
the authors formally define real-world perturbations and 3.2. Random Forests
differentiate them between adversarial perturbations and
distribution shifts. They model the perturbations around Decision trees are known to suffer from a phenomenon
a point as a probability distribution and succeed in the called overfitting in which the models are fitted so
estimation of robustness using Monte Carlo Sampling for tightly to their training data that they memorize the data
low to medium dimensional data. While their method itself instead of learning the patterns associated with
provides probabilistic guarantees, our approach aims to the data in order to make generalized predictions. To
provide formal deterministic guarantees which can form counteract these issues with decision trees concerning
the basis for reasoning about safety in safety-critical bias and variance, Breiman [1] proposes Random Forests.
systems.
                                                            A random forest 𝑓 ∶ 𝑋 𝑛 → ℝ𝑚 is an ensemble of
                                                            𝐵 decision trees that produces outputs by averaging the
3. Preliminaries                                            values emitted by each individual tree. i.e.,
In this section, we present the background knowledge on                                    𝐵
                                                                                        1
tree-based ensembles, semantic perturbations, interpolat-                      𝑓 (𝑥) = ∑ 𝑡𝑏 (𝑥)                 (2)
                                                                                        𝐵 𝑏=1
ed/geometric and pixel-wise transformations, composite
geometric perturbations based on real-world phenomena,
and the toolsuite VoTE [3].                               Where 𝑡𝑏 is the 𝑏 𝑡ℎ tree in the ensemble

                                                           To reduce the correlation between trees, each tree
3.1. Decision Trees                                        is trained on overlapping random subsets of the training
Decision trees are employed as prediction models in ma- data (with replacement) using techniques like bagging
chine learning and can be used in vision-based systems to or boosting.
solve image classification problems (among other appli-
cations). A decision tree model contains rules to predict 3.3. Gradient Boosting Machines
the output class and implements a prediction function
𝑡 ∶ 𝑋 𝑛 → ℝ𝑚 that maps disjoint sets of points 𝑋𝑖 ⊂ 𝑋 𝑛 to Similar to Random Forests, Freidman [2] proposes
a single output point 𝑦𝑖 ∈ ℝ𝑚 , i.e.,                      gradient boosting machines that employ several decision
                                                           trees to create a prediction function. Unlike Random
                     (𝑦1,1 , ..., 𝑦 1,𝑚 ) 𝑥 ∈ 𝑋 1          Forests, these trees are trained in a sequential manner
                    ⎧
             𝑡(𝑥) = ⋮                                  (1) with each succeeding tree correcting the errors made
                    ⎨                                      by the predecessor tree. The errors are corrected
                    ⎩(𝑦𝑘,1 , ..., 𝑦𝑘,𝑚 ) 𝑥 ∈ 𝑋𝑘            using gradient descent (hence the name). Although
                                                   𝑘       conceptually different from random forests in a learning
Where 𝑘 is the number of disjoint sets and 𝑋 𝑛 = ⋃ 𝑋𝑖      context, these two models have several features in
                                                  𝑖=1      common when performing predictions.
The n-dimensional input domain 𝑋 𝑛 includes ele-
ments 𝑥 as tuples in which each input variable 𝑥𝑖 A gradient boosting machine 𝑓 ∶ 𝑋 𝑛 → ℝ𝑚 is an
captures some feature of interest of the system [3]. The


                                                         3
ensemble of 𝐵 additive decision trees, i.e.,                         in this work, the property checker defined earlier (see Fig-
                               𝐵                                     ure 1) checks for stability. The VoTE Core is instantiated
                      𝑓 (𝑥) = ∑ 𝑡𝑏 (𝑥)                    (3)        from the ensemble subject to verification, 𝑓 ∶ 𝑋 𝑛 → 𝑅𝑚 .
                              𝑏=1                                    It takes as input a hyperrectangle defining 𝑋 𝑛 , and the
                                                                     outcome of the formal verification is pass (when the tree
Where 𝑡𝑏 is the 𝑏 𝑡ℎ tree in the ensemble                            ensemble satisfies the concerned property) and fail when
                                                                     it does not.
3.4. Classifier
Decision trees and tree-ensemble models can be used as
                                                                     3.6. Semantic Perturbations
classifiers to categorize samples from an input domain        As opposed to perturbations that change the value of each
into one or more classes and assign each sample a la-         feature by a constant value, semantic perturbations take
bel unique to its class. In this paper, we only consider      an input image and 𝑝 parameters to perform a parameter-
functions that map each point from an input domain            ized operation that perturbs the image while preserving
to exactly one class. Let 𝑓 (𝑥) = (𝑦1 , ..., 𝑦𝑚 ) represent   its semantic information. Semantics-preserving pertur-
a model trained to predict the probability 𝑦𝑖 associated      bations can include object-level shifts (like changing the
with a class 𝑖 within disjoint regions in the input domain,   shape or size of an object in a color classification prob-
where 𝑚 is the number of classes. A classifier 𝑓𝑐 (𝑥) may     lem), geometric transformations (scaling, rotations) and
then be defined as,                                           common image corruptions (fog, blurs) [8]. Most of these
                                                              perturbations cannot be naturally represented using the
                     𝑓𝑐 (𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑦𝑖                   (4) 𝑙 -norm (‖ 𝑥 − 𝑥 ′ ‖ ≤ 𝜖) that most robustness verification
                                 𝑖                             𝑝                  𝑝
                                                              methods are designed for. Using a large value of 𝜖 to
3.5. Verifier of Tree Ensembles                               try and capture these types of perturbations is usually
                                                              not recommended in practice as the image quality de-
VoTE [3] (Verifier of Tree Ensembles) is a toolsuite for for- grades significantly [8]. Hence, we propose a feature
mally verifying that tree ensembles comply with specific space modelling process to precisely capture these multi-
properties that the user can define through implementing dimensional perturbations.
a property checker.
                                                                     3.7. Geometric Perturbations
                                                                     Geometric Perturbations are a class of semantic per-
                                                                     turbations [8] that involve an affine transformation on
                                                                     each pixel’s row and column indices, followed by an
                                                                     interpolation operation [17]. In this paper, we consider
                                                                     three types of affine transformations - rotations, scaling
                                                                     and translation.

                                                                     Let 𝑇𝜃 ∶ ℝ2 → ℝ2 (in 2D) be an invertible affine
                                                                     transformation (like rotation) parameterized by 𝜃 (like
                                                                     the angle of rotation). Let 𝜙𝑥 (𝑗) and 𝜙𝑦 (𝑖) be functions
                                                                     that transform 𝑖, 𝑗 pixel indices to 𝑥, 𝑦 co-ordinates with
                                                                     respect to the center of the image. However, as these
      Figure 1: Verifier of Tree Ensembles [3] (VoTE)                transformed co-ordinates may not align exactly with the
                                                                     integer-valued pixel indices, bilinear interpolation ( 𝐼 ) is
                                                                     necessary. Yang et al. [17] define the general form of a
The tool is based on abstract interpretation which yields
                                                                     geometric perturbation for an image 𝑋 as,
equivalence classes in a tree ensemble, i.e. sets of points
                                                                                          ′
in the input domain that yield the same output tuple                                 𝑋𝑖,𝑗 = 𝐼 (𝑇𝜃−1 (𝜙𝑥 (𝑗) , 𝜙𝑦 (𝑖)))         (5)
through an iterative refinement in the interval domain,
such that the lowest level of abstract (the concrete) do-            Where 𝑋 ′ is the perturbed image. For each geometric
main is equivalent to an exhaustive search down to all               perturbation, they present the inverse transform
leaves in the model. This technique can be used to verify            function 𝑇𝜃−1 which is used to instantiate equation (5),
any property of interest defined within VoTE’s property
checker. For instance, in [26], VoTE was used to verify              Rotation, parameterized by angle, 𝜃 ∈ [0, 2𝜋]
versatile application-specific properties of the aircraft col-                           𝑐𝑜𝑠𝜃    𝑠𝑖𝑛𝜃 𝑥       𝑥𝑐𝑜𝑠𝜃 + 𝑦𝑠𝑖𝑛𝜃
lision avoidance system called ACAS Xu [18]. However,                  𝑇𝜃−1 (𝑥, 𝑦) = [               ][ ] = [               ] (6)
                                                                                         −𝑠𝑖𝑛𝜃   𝑐𝑜𝑠𝜃 𝑦      −𝑥𝑠𝑖𝑛𝜃 + 𝑦𝑐𝑜𝑠𝜃


                                                                 4
Translation, parameterized by horizontal and vertical             4. Proposed Method
shifts ℎ ∈ ℝ, 𝑣 ∈ ℝ
                                                           In this section, we present a method to capture composite
                       −1 (𝑥, 𝑦) = [𝑥 − ℎ]
                      𝑇ℎ,𝑣                             (7)
                                                           geometric perturbations based on real-world phenomena
                                    𝑦 −𝑣                   (like the combined effects of axial rotations and changes
                                                           in lighting) in the form of abstract intervals. The ba-
Scaling, parameterized by a scaling factor, 𝜙 ∈ ℝ, 𝜙 > −1, sic idea is to pass these abstract intervals to the VoTE
                       1                                   property checker [3] to verify stability and compute the
                           0     𝑥       𝑥/(1 + 𝜙)         extended stability metrics, which will be explained in
      𝑇𝜙−1 (𝑥, 𝑦) = [ 1+𝜙  1  ] [   ] = [           ]  (8)
                       0          𝑦       𝑦/(1 + 𝜙)        this section.
                          1+𝜙


3.8. Pixel-wise Transformations                                   4.1. Abstracting the Perturbations
Pixel-wise transformations are defined by the cumulative          For a particular input pixel 𝑥, if 𝑠 is the total number of
effects of brightness and contrast (i.e., simply changes          geometric transformation (eg. rotation) steps for a given
in lighting) acting on a pixel 𝑋𝑖,𝑗 of an image 𝑋 where           transformation, there exists a function, 𝜆(𝑥) that projects
the respective brightness and contrast parameters are             the effects of the transformation on this pixel at each step
𝑐, 𝑏 ∈ ℝ. These transformations are defined by Balunovic
et al. [15] and Mohapatra et al. [14] as,                                             𝜆(𝑥) = {𝑥1 , ..., 𝑥𝑠 }              (11)

            ′                                              such that 𝑥𝑖 is the outcome of the transformation at each
          𝑋𝑖,𝑗 = min(1, max(0, (1 + 𝑐) ⋅ 𝑋𝑖,𝑗 + 𝑏))     (9)
                                                           step 𝑖, 1 ≤ 𝑖 ≤ 𝑠. In order to capture these pixel values in
In our work, the pixel-wise transformations are parame- the interval domain, we use the abstraction function 𝛼,
terized by darkening and brightening offsets, 𝜁𝐿 , 𝜁𝐻 ∈ ℝ, which is designed to compute an over-approximation of
and the upper and lower bounds for the perturbations the problem,
are specified as,
                                                                         𝛼(𝜆(𝑥)) = (𝑚𝑖𝑛 𝜆(𝑥), 𝑚𝑎𝑥 𝜆(𝑥))            (12)
                  ′
                𝑋𝑖,𝑗 = (𝑋𝑖,𝑗 − 𝜁𝐿 , 𝑋𝑖,𝑗 + 𝜁𝐻 )        (10)   For a composite geometric transformation involving a
                                                              combination of one affine transformation (rotation, scal-
3.9. Composite Geometric Perturbations ing or translation) along with changes in lighting con-
                                                              ditions where 𝜁𝐿 and 𝜁𝐻 are real-world darkening and
Composite geometric perturbations based on real-world
                                                              brightening margins, if 𝑝 represents the initial pixel 𝑣𝑎𝑙𝑢𝑒
phenomena are a combination of one affine transforma-
                                                              before the transformations, the above (single transforma-
tion (like rotation, translation or scaling) and a pixel-wise
                                                              tion) abstraction function can be redefined as,
transformation (like changes in lighting). These pertur-
bations are visually represented in figure 2 as,                𝛼(𝜆(𝑥)) = ((𝑚𝑖𝑛 𝜆(𝑥)−𝜁 )−𝑝, (𝑚𝑎𝑥 𝜆(𝑥)+𝜁 )−𝑝) (13)
                                                                                            𝐿                     𝐻

                                                                  The redefined abstraction function returns a tuple (𝜖𝐿 , 𝜖𝐻 )
                                                                  that represents the perturbation margins of the pixel
                                                                  where,
                                                                                  𝜖𝐿 = (𝑚𝑖𝑛 𝜆(𝑥) − 𝜁𝐿 ) − 𝑝
                                                                                  𝜖𝐻 = (𝑚𝑎𝑥 𝜆(𝑥) + 𝜁𝐻 ) − 𝑝
                                                                  The lower and upper perturbation bounds are computed
                                                                  by applying the perturbation margins (𝜖𝐿 , 𝜖𝐻 ) to the
                                                                  pixel’s intensity as (𝑝 + 𝜖𝐿 , 𝑝 + 𝜖𝐻 )

                                                                  Running Example
                                                                  If a pixel 𝑥 in an image with an initial intensity, 𝑝 = 5 is
                                                                  rotated between 0° and 5° in 5 steps, from equation (11),
  Figure 2: Composite geometric perturbations based on
                real-world phenomena                                                 𝜆(𝑥) = {6, 2, 3, 1, 4}
                                                                  Applying equation (12),

                                                                                       𝛼(𝜆(𝑥)) = (1, 6)


                                                              5
If 𝜁𝐿 = 𝜁𝐻 = 1, applying equation (13),                                   4.3. Verification Workflow
             𝛼(𝜆(𝑥)) = ((1 − 1) − 5, (6 + 1) − 5)            In this section, we present the verification workflow that
                                                             shows the feature space modelling process incorporated
                      (𝜖𝐿 , 𝜖𝐻 ) = (−5, 2)                   into the VoTE [3] framework. We will use one of the case
                                                             studies in section 5.1 to illustrate the flow in Figure 3. If
These pixel-level perturbation margins are then applied we consider rotations between 0° and 10°, 𝐼 (which is
                                                                                                              𝐿
to the initial pixel value before the transformations to get the image generated by applying 𝜖 associated with all
                                                                                                    𝐿
the pixel-level perturbation bounds that are representa- pixels in a particular image) and 𝐼 (the corresponding
                                                                                                   𝐻
tive of composite geometric perturbations (like rotations image generated by applying 𝜖 ) represent the first and
                                                                                               𝐻
and changes in lighting).                                    last images in the verification process. We then ask the
                                                             VoTE Property Checker to verify all the images in the
       (𝑝 + 𝜖𝐿 , 𝑝 + 𝜖𝐻 ) = (5 + (−5), 5 + 2) = (0, 7)
                                                             transformed series.
From this example, we see that the pixel 𝑥 can take on
any value between (0, 7) during a composite geometric
transformation.

4.2. Feature Space Modelling
The abstraction process helps determine the upper and
lower pixel-level perturbation margins in order to achieve
efficient computations during the verification process.
Since VoTE operates based on abstracting each point
with an interval in one dimension, when applying to a
multidimensional space we need to extend the abstraction
function through feature space modelling to simulate
these perturbations for vision-based systems.

 Algorithm 1: Feature Space Modelling: Rotation
     Input: Angle 1 (𝑛1 ), Angle 2 (𝑛2 ), Offset-High (𝜁𝐻 ),
            Rot. Steps (r), Offset-Low (𝜁𝐿 ), Images (𝐼 )
     Output: Perturbation Definitions (𝜏 )
1 𝑁 , 𝜏 ← ∅, ( )      ▷ 𝑁 = Set of angles to apply
2 𝑁 ← generate-set-of-angles (𝑛1 , 𝑛2 , 𝑟)
3 for 𝑒𝑎𝑐ℎ 𝑖𝑚𝑎𝑔𝑒, 𝑖 ∈ 𝐼 do
4     Δ𝑖 ← ( )          ▷ Image level perturbations
5     for each pixel, 𝑥 ∈ 𝑖 do
6         𝜆←∅            ▷ Transformed pixel values                              Figure 3: Verification Workflow (MNIST)
7         for 𝑒𝑎𝑐ℎ 𝑎𝑛𝑔𝑙𝑒, 𝜈 ∈ 𝑁 do                                        The property checker returns a pass when the classifier
8             𝑥𝜈 ← rotate-image (𝑖, 𝜈)                                    makes stable predictions on all the images in the trans-
9             𝜆 ← 𝜆 ∪ {𝑥𝜈 }                                               formed series. If the result of the verification process
10           (𝜖𝐿 , 𝜖𝐻 ) ← 𝛼 (𝜆) ▷ 𝛼 f r o m e q u a t i o n ( 1 3 )       is fail, the property does not hold (the classifier is not
11           append (𝜖𝐿 , 𝜖𝐻 ) to Δ𝑖                                      stable). In this case, the VoTE property checker can be
12       append Δ𝑖 to 𝜏                                                   configured to return a counterexample that can help with
                                                                          further analysis. The illustration of the approach in more
                                                                          detail is described with use cases in section 5.
Note that while Algorithm 1 contains axial rotations as
the affine transformation, this can be easily adapted to the              4.4. Classifier Correctness
corresponding other affine transformations, i.e., scaling
                                                                          The correctness of a classifier (𝑓𝑐 ), with respect to an input
or translation computed by similar respective algorithms
                                                                          𝑥, denoted by 𝐼 𝑠𝐶𝑜𝑟𝑟𝑒𝑐𝑡(𝑓𝑐 , 𝑥) is proven when 𝑓𝑐 predicts
analogously.
                                                                          the correct label (𝑙) for an input sample (𝑥) according to
                                                                          a ground truth. This notion of correctness computed
                                                                          over a test set (𝑋𝑡𝑒𝑠𝑡 ) can be used to quantify accuracy as


                                                                      6
                                                                          |{ 𝑥 ∈ 𝑋𝑡𝑒𝑠𝑡 | 𝐼 𝑠𝐶𝑜𝑟𝑟𝑒𝑐𝑡(𝑓𝑐 , 𝑥) ∧ ¬𝐼 𝑠𝑆𝑡𝑎𝑏𝑙𝑒(𝑓𝑐 , 𝑥) }|
                                                                     F=
                                                                                                    |𝑋𝑡𝑒𝑠𝑡 |
                   |{ 𝑥 ∈ 𝑋𝑡𝑒𝑠𝑡 | 𝐼 𝑠𝐶𝑜𝑟𝑟𝑒𝑐𝑡(𝑓𝑐 , 𝑥)}|
     accuracy =                                          (14)             |{ 𝑥 ∈ 𝑋𝑡𝑒𝑠𝑡 | ¬𝐼 𝑠𝐶𝑜𝑟𝑟𝑒𝑐𝑡(𝑓𝑐 , 𝑥) ∧ 𝐼 𝑠𝑆𝑡𝑎𝑏𝑙𝑒(𝑓𝑐 , 𝑥) }|
                                  |𝑋𝑡𝑒𝑠𝑡 |                           V=
                                                                                                   |𝑋𝑡𝑒𝑠𝑡 |
4.5. Safety Property: Classifier Stability                                |{ 𝑥 ∈ 𝑋𝑡𝑒𝑠𝑡 | ¬𝐼 𝑠𝐶𝑜𝑟𝑟𝑒𝑐𝑡(𝑓𝑐 , 𝑥) ∧ ¬𝐼 𝑠𝑆𝑡𝑎𝑏𝑙𝑒(𝑓𝑐 , 𝑥) }|
                                                                    B=
                                                                                                    |𝑋𝑡𝑒𝑠𝑡 |
Accuracy on a test set is the most commonly used metric
to evaluate a classification model. However, accuracy
alone is not enough for convincing safety arguments                 We incorporate these definitions into VoTE’s Property
in models that use machine learning. In this paper, we              Checker which formally verifies classifier stability and
consider Stability, a property commonly used in related             automatically computes these metrics at the same time.
works. Robustness (and other metrics consistent with
the literature) are generalised to a notion of classifier
stability and correctness [9]. Note that compliance with            4.7. VoTE Property Checker
stability and its associated metrics alone isn’t generally      Recall that from the preliminaries, the VoTE property
enough to ensure system safety. Safety Engineers                checker (VoTE-Pc) returns a pass if the property is satis-
typically define requirements on software functions that        fied. For a tree-ensemble model (𝑓𝑐 ), and a test sample 𝑥
are much richer than this property alone.                       (𝑥 ∈ 𝑋𝑡𝑒𝑠𝑡 ), the following equations are used to verify sta-
                                                                bility and correctness, which can be extended to compute
Let 𝑓𝑐 ∶ 𝑥 → 𝑦 be the classifier subject to veri- the metrics from the equations in section 4.6.
fication where 𝑥 (the image) is an n-array vector
consisting of individual pixels, and 𝑦 is the output class               is_correct ← VoTE-Pc (𝑥 , 𝐼 𝑠𝐶𝑜𝑟𝑟𝑒𝑐𝑡(𝑓𝑐 , 𝑥))   (16)
or label. 𝜖𝐿 , 𝜖𝐻 ∈ ℝ are lower and upper perturbation
                                                                            is_stable ← VoTE-Pc (𝑥 , 𝐼 𝑠𝑆𝑡𝑎𝑏𝑙𝑒(𝑓𝑐 , 𝑥))  (17)
margins (𝜖𝐿 < 𝜖𝐻 ) associated with an individual pixel
in the image. If 𝑛 is the number of pixels in the image,
Δ = ((𝜖𝐿1 , 𝜖𝐻1 ), ..., (𝜖𝐿𝑛 , 𝜖𝐻𝑛 )) is a sequence containing 5. Case Studies
tuples of pixel-level perturbation margins associated
with the image. 𝜔 is an n-tuple of perturbations, with We apply our three affine transformations combined with
𝜔 = (𝜔1 , ..., 𝜔𝑛 ) and 𝜖𝐿𝑖 ⩽ 𝜔𝑖 ⩽ 𝜖𝐻𝑖 . If we consider a test changes in lighting to two case studies. We verify sta-
sample 𝑥 and all possible variations of 𝜔, the classifier 𝑓𝑐 bility and compute the extended metrics associated with
is stable on that sample (denoted by 𝐼 𝑠𝑆𝑡𝑎𝑏𝑙𝑒(𝑓𝑐 , 𝑥)) if classifier stability and correctness. Due to space restric-
and only if                                                     tions, we will only show the outcomes of composite rota-
                                                                tions in one study and composite scaling in the other. We
              𝑓𝑐 (𝑥) = 𝑓𝑐 (𝑥′ ) , where 𝑥′ = 𝑥 + 𝜔         (15) use scikit-learn [6] to train random forests and CatBoost
                                                                [7] to train gradient boosting machines. Experiments are
The above expression can easily be extended over a test conducted on a Windows 11 Machine running Ubuntu
set (𝑋𝑡𝑒𝑠𝑡 ). Note that 𝜔 when added to the base image 20.04 in WSL Mode (Windows Subsystem for Linux). The
(𝑥) produces a version of that base image that represents machine comes equipped with an Intel Core i7-10875H
the combined effects of a geometric transformation and CPU and 16 GB RAM.
changes in lighting. These different versions of the base
image can then be analyzed by the verification engine.
                                                                    5.1. Digit Recognition
4.6. Extended Stability Metrics                                  The MNIST dataset [22] is a collection of hand-written
                                                                 digits commonly used to evaluate machine learning al-
Stability and Robustness are standard metrics commonly gorithms. The dataset contains 70,000 gray scale images
used in the literature on adversarial machine learning. with a resolution of 28x28 pixels at 8bpp, encoded as a
They are generally used to quantify the security of clas- tuple of 784 pixels. The input regions surrounding each
sifiers at test time (for e.g., evasion attacks). Beyond sample in the test set were defined using feature space
stability, Ranzato and Zanella [9] define four additional modelling (section 4.3). The parameters for rotation were
metrics - robustness (R), fragility (F), vulnerability (V) and chosen as follows: Rotation Range = 0 − 10°, Rotation
breakage (B) that combine classifier stability with predic- Steps = 1001. For the changes in lighting, the parameters
tion accuracy in the presence of input perturbations. We were chosen as: 𝜁 = 2 (darkening) and 𝜁 = 3 (bright-
                                                                                    𝐿                      𝐻
recall                                                           ening). To make this case study even more realistic, the
        |{ 𝑥 ∈ 𝑋𝑡𝑒𝑠𝑡 | 𝐼 𝑠𝐶𝑜𝑟𝑟𝑒𝑐𝑡(𝑓𝑐 , 𝑥) ∧ 𝐼 𝑠𝑆𝑡𝑎𝑏𝑙𝑒(𝑓𝑐 , 𝑥) }| training data was augmented (doubled to 120,000 training
   R=                                                            samples) by adding samples that were randomly rotated
                                 |𝑋𝑡𝑒𝑠𝑡 |


                                                                7
in the range of ±10° using Keras [24]. The verification                                           and out). For the changes in lighting, the parameters
process was carried out on 10,000 test samples.                                                   were chosen as: 𝜁𝐿 = 2 (darkening) and 𝜁𝐻 = 3 (bright-
                                                                                                  ening). To make this case study even more realistic, the
 Param.       A (%)            R (%)           F (%)           B (%)              T (s)
 d   B     RF      GB       RF       GB     RF       GB     RF       GB     RF         GB         training data was augmented (doubled to 249,600 training
 3
 3
      3
      5
          51.26
          57.26
                  61.75
                  71.60
                          14.36
                           7.03
                                   13.61
                                   15.81
                                           36.90
                                           50.23
                                                   48.14
                                                   55.82
                                                           48.74
                                                           42.74
                                                                   38.25
                                                                   28.37
                                                                           0.21
                                                                           0.22
                                                                                      0.40
                                                                                      0.26
                                                                                                  samples) by adding samples that were randomly scaled
 5    5   76.59   79.56    2.75    40.89   73.84   38.67   23.41   20.44   0.22       0.26        in the range of ±10% using Keras [24]. The verification
 5   10   80.23   87.16    0.69    31.00   79.54   56.16   19.77   12.84   0.23       0.56
 5   15   81.95   89.94    0.88    23.74   81.07   66.20   18.05   10.06   0.36       13.33       process was carried out on 10,000 test samples.
 5   20   83.20   91.61    1.37    20.21   81.83   71.40   16.80    8.39   1.81      168.12
                                                                                                    Param.       A (%)          R (%)          F (%)           B (%)              T (s)
    Table 1: Results (Combined Rotations and Lighting)                                              d   B     RF      GB      RF    GB      RF       GB     RF       GB     RF         GB
                                                                                                    3   3    30.85   36.44   2.34   1.55   28.51   34.89   69.15   65.36   0.22       0.19
                                                                                                    3   5    35.39   43.92   1.72   1.43   33.67   42.49   64.61   56.08   0.24       0.22
                                                                                                    5   5    45.83   50.75   3.46   2.23   42.37   48.52   54.17   49.25   0.23       0.26
                                                                                                    5   10   51.27   61.83   2.99   0.68   48.28   61.15   48.73   38.17   0.33       0.79
Table 1 lists random forests (RF) and gradient boost-                                               5   15   53.42   68.04   2.83   0.13   50.59   67.91   46.58   31.96   2.01       21.50
ing machines (GB) included in the experiments with                                                  5   20   54.41   71.56   3.51   0.10   50.90   71.46   45.59   28.44   6.10      264.69

their maximum tree depth (d), number of trees (B), ac-            Table 2: Results (Combined Scaling and Lighting)
curacy (A), the extended stability metrics - robustness
(R), fragility (F), and breakage (B) along with the elapsed
time (T) taken for verification (in seconds). The column Table 2 lists random forests (RF) and gradient boosting
for vulnerability was omitted as it was zero in all experi- machines (GB) included in the experiments with their
ments.                                                      maximum tree depth (d), number of trees (B), accuracy
                                                            (A), the extended stability metrics - robustness (R),
                                                            fragility (F), and breakage (B) along with the elapsed
                                                            time (T) taken for verification (in seconds). Again, the
                                                            column for vulnerability was omitted as it was zero in
                                                            all experiments.

                                                                                                  In regard to this case study, if we consider the
                                                                                                  EMNIST [23] letter a and scaling in the range of ±10 %,
                                                                                                  with 𝜁𝐿 = 2 and 𝜁𝐻 = 3, 𝐼𝐿 and 𝐼𝐻 represent the first and
                                                                                                  last images in the verification process. We then ask the
                                                                                                  VoTE Property Checker to verify all the images in the
                                                                                                  transformed series. Figure 5 shows a fragment of the
                                                                                                  verification flow.


Figure 4: Crafting Perturbations (Composite Rotations)

Figure 4 shows the construction of the first and last im-
ages in the verification series for a particular sample.
From this figure, it is visible that the two images, i.e., the
first and the last resemble the ground truth, and the rest
of the images in the series are similar in that sense.

5.2. Handwritten Character Recognition
In our next experiments, we use the EMNIST-Letters
dataset, which is a variant of the EMNIST Dataset [23]
that includes upper and lower case handwritten char-
acters. This dataset contains 145,600 samples across 26
balanced classes. Each sample has a resolution of 28x28    Figure 5: Verification Workflow Fragment (EMNIST)
pixels at 8bpp, encoded as a tuple of 784 pixels. The in-
put regions surrounding each sample in the test set were
                                                          Discussions
defined using feature space modelling. The parameters
for scaling were chosen as follows: Scaling Steps = 201, From the experimental results in Table 1 and Table 2,
Scale Factor = ±10% which represents a 10% Zoom (in we note that the robustness of the learned models with


                                                                                              8
respect to composite geometric perturbations is much               verify these models within a fraction of a second for
lower compared to the standard model accuracy (which               most of the experiments. For larger models, while the
is somewhat expected). In the first case study, gradi-             verification time does increase, this issue is resolvable
ent boosting machines performed slightly better than               since due to the memory-efficient structure of the VoTE
random forests in terms of robustness. Increasing the              core algorithm, computations can always be parallelised
number of trees improves accuracy but causes a drop in             (not used in our experiments).
robustness. This indicates that the models were over-
fitted and that adding compositely perturbed images to             VoTE captures multiple inputs (precisely) in the
the training set may improve robustness. The high num-             interval domain, but whenever a prediction model
bers for fragility indicated that while these samples were         violates the stability property, the property checker
identified correctly, the classifier was unable to make            returns a counterexample which lies in the violating
stable predictions on them in presence of these types of           region. Figure 7 represents a few of the many coun-
perturbations. We would also like to mention that the ver-         terexamples discovered by VoTE during the verification
ification process would require a domain expert, due to            process (MNIST-Digits). Since the perturbations are
the complex nature of these perturbations. For instance,           almost indistinguishable by the naked eye, they are
if you pick a large rotation range (or a large scaling or          highlighted in pink. These counterexamples can be used
translation range), there will be a significant number of          for debugging or further analysis to repair or retrofit the
images in the verification series that won’t resemble the          models iteratively until provable stability is eventually
ground truth label. In this case, the verification process         achieved. For instance, counterexamples used for guided
needs to be done in small ranges.                                  model training can be embedded into a loss function to
                                                                   achieve better stability.


                                                                                 Figure 7: Counterexamples


                                                                   6. Conclusions and Future Works
                                                                   Applying formal verification to show that a system
       Figure 6: Comparing Verification Metrics                    exhibits its intended behaviour can help increase the
                                                                   trustworthiness of the decisions made by intelligent
We repeated the experiments in the second case study by            systems. In this work, we show that tree-ensemble based
varying the scale factors as: ±2%, ±4%, ±6%, ±8%, ±10%             models with comparatively good prediction accuracies
for the largest Gradient Boosting Machine. In figure               perform spectacularly poorly when presented with
6, we can see that as the scale factor is increased, the           samples containing composite geometric perturbations
robustness of the model decreases and the verification             based on real-world phenomena.
runtime increases. This is expected as the number of
perturbed pixels increases as the scale factor is increased.       Apart from the profound impacts on safety, these
                                                                   types of perturbations can also violate the security
Also, while the tree ensemble models chosen for                    properties of classifiers which could be detrimental
the experiments could be considered trivial, we would              in safety-critical scenarios. To address this issue,
like to point out that the verification problem was                the parameters in our method can be tweaked to
certainly non-trivial considering the composite nature             simultaneously check for adversarial perturbations in
of these perturbations, and the fact that the models were          addition to composite geometric perturbations. This
trained on high-dimensional data. What is positive in              represents a step towards safe and secure AI. Finally, our
this context is the fact that performing such analyses             method can also be used with different types of models
in tractable time is at all possible given the enormous            and verification engines depending on the context.
search space for these perturbations.
                                                                   For future works, we plan on exploring different
Considering the complexity of the verification                     abstract domains which can be used to tightly capture
problem, it is impressive that VoTE was able to efficiently        these perturbations to increase the scalability of VoTE to


                                                               9
even higher-dimensional inputs. We also plan to explore            Verification, testing, adversarial attack and defence,
different tree selection strategies, i.e., a systematic            and interpretability. Comput. Sci. Rev., 37, 100270.
analysis of the order in which the trees in the ensemble        [12] Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., &
are analyzed in VoTE’s abstraction-refinement pipeline             Madry, A. (2019). Exploring the Landscape of Spatial
to further enhance the efficiency of the verification              Robustness. ICML.
process against the large nature of these perturbations.        [13] Pei, K., Cao, Y., Yang, J., & Jana, S.S. (2019). Towards
                                                                   Practical Verification of Machine Learning: The Case
                                                                   of Computer Vision Systems. ICSE Workshop on Test-
Acknowledgements                                                   ing for Deep Learning and Deep Learning for Testing
                                                                [14] Mohapatra, J., Weng, T., Chen, P., Liu, S., & Daniel,
This work was partially supported by the Wallenberg
                                                                   L. (2020). Towards Verifying Robustness of Neural Net-
AI, Autonomous Systems and Software Program (WASP)
                                                                   works Against A Family of Semantic Perturbations.
funded by the Knut and Alice Wallenberg Foundation.
                                                                   2020 IEEE/CVF Conference on Computer Vision and
                                                                   Pattern Recognition (CVPR), 241-249.
References                                                      [15] Balunovic, M., Baader, M., Singh, G., Gehr, T., &
                                                                   Vechev, M.T. (2019). Certifying Geometric Robustness
[1] Breiman, L. (2004). Random Forests. Machine Learn-             of Neural Networks. NeurIPS.
   ing, 45, 5-32.                                               [16] Wu, H., Tagomori, T., Robey, A., Yang, F., Matni,
[2] Friedman, J.H. (2001). Greedy function approxima-              N., Pappas, G., Hassani, H., Pasareanu, C.S., & Barrett,
   tion: A gradient boosting machine. Annals of Statis-            C. (2022). Toward Certified Robustness Against Real-
   tics, 29, 1189-1232.                                            World Distribution Shifts. ArXiv, abs/2206.03669.
[3] Törnblom, J., & Nadjm-Tehrani, S. (2019). An                [17] Yang, R., Laurel, J., Misailovic, S., & Singh, G. (2022).
   Abstraction-Refinement Approach to Formal Verifica-             Provable Defense Against Geometric Transforma-
   tion of Tree Ensembles. SAFECOMP Workshops.                     tions. arXiv preprint arXiv:2207.11177.
[4] Irsoy, O., Yildiz, O.T., & Alpaydin, E. (2012). Soft        [18] Katz, G., Barrett, C.W., Dill, D.L., Julian, K.D., &
   decision trees. Proceedings of the 21st International           Kochenderfer, M.J. (2017). Reluplex: An Efficient SMT
   Conference on Pattern Recognition (ICPR2012), 1819-             Solver for Verifying Deep Neural Networks. Interna-
   1822.                                                           tional Conference on Computer Aided Verification.
[5] Wing, J. M. (2021). Trustworthy AI. In Communica-           [19] Gowal, S., Qin, C., Huang, P., cemgil, T., Dvijotham,
   tions of the ACM (Vol. 64, Issue 10, pp. 64–71). As-            K., Mann, T.A., & Kohli, P. (2020). Achieving Robust-
   sociation for Computing Machinery (ACM). https://-              ness in the Wild via Adversarial Mixing With Disen-
   doi.org/10.1145/3448248                                         tangled Representations. 2020 IEEE/CVF Conference
[6] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel,            on Computer Vision and Pattern Recognition (CVPR),
   V., Thirion, B., Grisel, O., Blondel, M., Louppe, G.,           1208-1217.
   Prettenhofer, P., Weiss, R., Weiss, R.J., Vanderplas, J.,    [20] Hong, T., Hu, M., Yin, T., & Wang, S. (2022). A Multi-
   Passos, A., Cournapeau, D., Brucher, M., Perrot, M., &          Scale Convolutional Neural Network for Rotation-
   Duchesnay, E. (2011). Scikit-learn: Machine Learning            Invariant Recognition. Electronics.
   in Python. J. Mach. Learn. Res., 12, 2825-2830.              [21] Peri, A., Mehta, K., Mishra, A., Milford, M., Garg,
[7] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush,           S., & Krishna, K.M. (2022). ReF - Rotation Equiv-
   A. V., & Gulin, A. (2018). CatBoost: unbiased boosting          ariant Features for Local Feature Matching. ArXiv,
   with categorical features. Advances in neural infor-            abs/2203.05206.
   mation processing systems, 31.                               [22] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998).
[8] Gokhale, T., Anirudh, R., Kailkhura, B., Thiagara-             Gradient-based learning applied to document recog-
   jan, J.J., Baral, C., & Yang, Y. (2021). Attribute-Guided       nition. Proc. IEEE, 86, 2278-2324.
   Adversarial Training for Robustness to Natural Per-          [23] Cohen, G., Afshar, S., Tapson, J.C., & Schaik, A.V.
   turbations. AAAI.                                               (2017). EMNIST: an extension of MNIST to handwrit-
[9] Ranzato, F., & Zanella, M. (2020). Abstract Interpre-          ten letters. ArXiv, abs/1702.05373.
   tation of Decision Tree Ensemble Classifiers. AAAI.          [24] Chollet, F. (2018). Keras: The Python Deep Learning
[10] Scher, S., & Trügler, A. (2022). Robustness of Ma-            library. https://keras.io
   chine Learning Models Beyond Adversarial Attacks.            [25] Cleland, G. M., Sujan, M. A., Habli, I., & Medhurst,
   ArXiv, abs/2204.10046.                                          J. (2012). Evidence: using safety cases in industry and
[11] Huang, X., Kroening, D., Ruan, W., Sharp, J., Sun,            healthcare. The Health Foundation.
   Y., Thamo, E., Wu, M., & Yi, X. (2020). A survey of          [26] Törnblom, J., & Nadjm-Tehrani, S. (2021). Scaling
   safety and trustworthiness of deep neural networks:             up Memory-Efficient Formal Verification Tools for
                                                                   Tree Ensembles. ArXiv, abs/2105.02595.


                                                           10