<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weight-based Semantic Testing Approach for Deep Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amany Alshareef</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Berthier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Schewe</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowei Huang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>OCamlPro</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Liverpool</institution>
          ,
          <addr-line>Liverpool L69 3BX</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
<p>While deep learning models have achieved state-of-the-art performance in a variety of fields, their susceptibility to adversarial examples has raised serious concerns over their application in safety-critical domains. Existing testing methodologies fail to consider interactions between neurons and the semantic representation that is formed in the DNN through the training process. This paper proposes a weight-based testing metric that uses feature importance weights to measure the coverage of the test set and to facilitate the generation of additional test cases targeting features with higher weights. Evaluations were conducted to compare the initial and final coverage of the proposed weighting approach with the normal BN-based feature coverage. The testing coverage experiments indicated that the proposed weight metrics achieved higher coverage than the original feature metrics, while maintaining the effectiveness of finding adversarial samples during the test case generation process.</p>
      </abstract>
      <kwd-group>
<kwd>DNN testing metrics</kwd>
        <kwd>Bayesian abstraction</kwd>
        <kwd>Feature coverage</kwd>
        <kwd>Importance weights</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Software testing provides evidence to demonstrate that the system meets its requirements or is error-free. The fact that deep learning models are data-driven, not requirements-driven, makes defining their testing criteria challenging. Technically, the accuracy of learning models is reported based on the test dataset. This standard metric for measuring a model's overall performance is neither sufficient nor trustworthy in safety-related domains, where most testing scenarios are randomly chosen from the entire dataset. Besides, the provided test data may not have good coverage of the data distribution the model is trained on, and may not represent the data obtained in the real world.</p>
      <p>Furthermore, most of the currently proposed DNN testing techniques rely on neuron activation as a metric to measure test data coverage. Such a criterion has not been shown to correlate with the system's decision logic [1]. Moreover, these methods aim to transform the input data space to generate more test inputs, and completely ignore the model-internal representations and their roles in the output decision. Observing that real-world high-dimensional data lie on low-dimensional manifolds motivates investigating where the data lie and modelling that space for analysis, instead of remaining confined to the input domain. There has been little attempt to understand machine learning's hidden representations and generate additional test cases based on them. This paper presents a testing approach for neural networks that leverages the learned representations and feature importance to evaluate the test data's coverage. The features' importance weights reflect how the contribution of each hidden feature, extracted from the lower-dimensional latent feature space, to the overall output of the network is distributed across the network. This identified feature contribution enables the determination of the causal relationship between the neurons and the model behaviour. Therefore, the importance weights relate to the semantic representation and provide insights into the interaction mechanism underlying the output decision-making process.</p>
      <p>The proposed approach to designing weight-based semantic testing metrics for neural networks uses the Bayesian network abstraction model of Berthier et al. [2]. The authors introduced a dimensionality reduction technique using feature extraction algorithms to abstract the behaviour of a neural network into a Bayesian network (BN). The work in [3] utilised that BN model to quantify the importance of a neural network's latent features: it developed a BN-based sensitivity analysis algorithm that estimates the importance of a neural network's latent features by analysing an associated BN's sensitivity to distributional shifts. They integrated various metrics to compute the difference between the original probability distributions represented by the abstracted BN and the distributions obtained after perturbation. Each latent feature was then assigned a weight value based on the measured sensitivity distance.</p>
      <p>The IJCAI-2023 AISafety and SafeRL Joint Workshop. * Corresponding author: amany.alshareef@liverpool.ac.uk (A. Alshareef); nicolas.berthier@ocamlpro.com (N. Berthier); sven.schewe@liverpool.ac.uk (S. Schewe); xiaowei.huang@liverpool.ac.uk (X. Huang). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>In this work, we transform the traditional binary coverage approach into a weighted probability problem and define our coverage metric based on the latent features' importance. The neural network's latent feature space refers to the internal, hidden representations learned by the network. These representations are not directly observable, but rather are formed as a result of the network processing input data and capturing their important patterns [4]. The proposed weight-based testing criterion emphasises that maximum test coverage is obtained from the presence of important features that have a dominating influence on other features and the output decision. To summarise, the main contributions of the paper are:</p>
      <p>• Semantic testing metrics that measure a test dataset's coverage based on the calculated feature weights.
• A guided systematic approach that samples test cases targeting the higher-priority features.
• Empirical studies on the quality of the proposed weight-based coverage compared with the original BN-based coverage.</p>
      <p>Most structural testing approaches are based on the idea of activating as many neurons in the network as possible during the testing phase: the more neurons that are activated with a specific value, the more complete the testing of the network is considered to be. Neuron activation was first introduced as a systematic metric by Pei et al. [5], who calculate the number of activated neurons (w.r.t. the ReLU activation function) during testing. They proposed DeepXplore, a white-box differential testing algorithm for generating test inputs that can discover inconsistencies between multiple DNNs. Following the neuron coverage (NC) principle, DeepTest [6] and DLFuzz [7] have made some improvements to it. Although the NC metric has been shown to be effective at finding hidden bugs and has been used to test real-world DNNs, investigations by [8] demonstrated that NC is too coarse and easy to achieve. Further approaches such as DeepGauge [9] and DeepCover [8] have been developed to extend neuron coverage, with a focus on various activation value factors.</p>
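<p>To make the neuron coverage (NC) principle concrete, the following sketch computes the fraction of neurons that fire (w.r.t. a ReLU-style threshold) for at least one test input; the activation lists and threshold are illustrative assumptions rather than DeepXplore's actual instrumentation.</p>

```python
def neuron_coverage(activations_per_input, threshold=0.0):
    """activations_per_input: one activation vector per test input."""
    n_neurons = len(activations_per_input[0])
    covered = set()
    for acts in activations_per_input:
        for j, a in enumerate(acts):
            if a > threshold:
                covered.add(j)  # neuron j fired for at least one input
    return len(covered) / n_neurons

# Two inputs over four neurons: neurons 0, 1 and 3 fire at least once.
acts = [[0.5, 0.0, 0.0, 0.2],
        [0.0, 0.9, 0.0, 0.0]]
print(neuron_coverage(acts))  # 0.75
```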
      <sec>
        <title>2. Existing Testing Techniques</title>
        <p>Deep neural network testing is an active research area, as these learning systems are being deployed in safety-critical applications. Numerous techniques have been developed to address the challenges of testing these learning systems in terms of test coverage criteria, test generation, and test oracles.</p>
        <sec>
          <title>2.1. Testing Metrics and Coverage Criteria</title>
          <p>Testing coverage metrics are measurements used to evaluate the adequacy of testing by providing a quantitative evaluation of how thoroughly a deep neural network has been tested according to specific criteria. Coverage-guided deep neural network testing techniques are a class of testing methods that aim to increase the coverage of the network during testing, with the goal of covering different regions of the input space and revealing as many potential bugs and unexpected behaviours as possible. Enforcing higher coverage during the testing process makes the network under investigation more likely to be robust and reliable. We divide the related existing testing works in the literature into two categories: (i) structural coverage metrics, which are defined based on the syntactic characteristics of the NNs; and (ii) high-level semantic coverage metrics, which concentrate on the semantic representations created by NNs.</p>
          <p>Beyond that, more testing metrics, i.e., quantitative projection coverage [10], safety coverage [11] and surprise coverage [12], have been designed based on the activation functions and the syntactic connections between neurons in successive layers.</p>
          <p>Unfortunately, neuron activation and other structural coverage techniques have proven to be less effective in validating the safety behaviours of intelligent systems. A study by Li et al. [13] showed that there is no correlation between the number of misclassified natural input tests and their structural coverage on the corresponding neural networks. There is still considerable ambiguity about how such coverage criteria directly relate to the decision logic of black-box machine learning systems; especially in that case, the semantic relationship between layers is ignored. Additionally, structural coverage has a limited correlation with network robustness: high neuron coverage does not imply that the network is robust to all possible inputs or will behave well on unseen data [1, 14].</p>
          <p>There are relatively few testing strategies that address the semantic aspects of a DNN's internal representation. One recent effort is the BN-based feature coverage introduced in [2], which is improved with weights in this study. Two testing coverage metrics are defined based on the suggested BN abstraction: the BN-based feature coverage (BFCov) and the BN-based feature-dependence coverage (BFdCov). These metrics give the proportion of hidden features, or causal relationships between them, that are adequately exercised by a set of inputs. Moreover, the authors implemented a combined metric for the two above: BFxCov(ℬ, X) = BFCov(ℬ, X) × BFdCov(ℬ, X). For space limits, we include the BFCov metric below and refer the reader to the original document for the rest.</p>
        </sec>
      </sec>
      <sec id="sec-1-1">
        <title>BN-based Feature Coverage</title>
        <p>Most proposed testing approaches have focused on structural testing coverage, measuring the coverage of a dataset based on individual neuron activations. In contrast, the BN-based feature coverage below is defined over the hidden features abstracted by the BN.</p>
        <p>Definition 2.1 (BN-based Feature Coverage). Given a trained DNN N, the BFCov(ℬ_{N,X}, X) coverage of a non-empty set of inputs X ⊂ D is obtained via the BN abstraction ℬ_{N,X} as:</p>
        <p>BFCov(ℬ_{N,X}, X) =def (1 / |V|) · ∑_{⟨λ,i⟩ ∈ V} |{f♯ ∈ F♯_{λ,i} : p♯_{λ,i}(f♯) ≥ ε}| / |F♯_{λ,i}|   (1)</p>
        <p>Informally, BFCov(ℬ_{N,X}, X) ranges over [0, 1] and gives the percentage of feature intervals that are adequately exercised by X. Intuitively, the coverage metric checks the marginal probability p♯_{λ,i} of every interval f♯ ∈ F♯_{λ,i} in each of the BN's nodes, and counts those that appear with a probability bigger than ε.</p>
      </sec>
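<p>One simple reading of this metric can be sketched as follows: for each BN node, take the fraction of its feature intervals whose marginal probability reaches ε, and average over the nodes. The probability tables are toy values; a real implementation would estimate them from the BN abstraction.</p>

```python
def bfcov(marginals, eps=0.01):
    """marginals: one list of interval probabilities per BN node."""
    per_node = [sum(1 for p in probs if p >= eps) / len(probs)
                for probs in marginals]
    return sum(per_node) / len(per_node)

marginals = [[0.6, 0.4, 0.0],   # node 1: one interval never exercised
             [0.2, 0.5, 0.3]]   # node 2: fully exercised
print(bfcov(marginals))         # (2/3 + 1) / 2
```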
      <sec id="sec-1-2">
        <title>Importance-driven Testing</title>
        <p>Furthermore, a closely related work to the proposed importance-based semantic testing is the DeepImportance approach presented in [15]. The authors developed a testing approach based on the Importance-Driven Coverage (IDC) test adequacy criterion, which employs layer-wise relevance propagation to identify the important neurons. They then evaluated the test's adequacy by targeting different combinations of the important neurons' behaviours.</p>
      </sec>
      <sec id="sec-1-3">
        <title>2.2. Test Cases Generation Algorithms</title>
        <p>The existing set of test case generation algorithms for DNNs is categorised into: (i) input mutation based, which generates new test inputs, either natural or adversarial, by altering the original data using transformation rules; (ii) fuzzing based, which generates invalid random input data to detect faults and vulnerabilities in the model; and (iii) symbolic execution based, an analysis technique that tests whether specific inputs cause each part of a system to be executed.</p>
        <p>Concolic testing is a technique in which concrete execution directs a symbolic analysis to generate a high-coverage test suite. DeepConcolic, introduced in [16], used a concolic testing algorithm that alternates between concrete executions, which evaluate the test input using the trained DNN under test, and symbolic analyses, which synthesise new test inputs based on some test target chosen to increase coverage.</p>
      </sec>
      <sec id="sec-1-4">
        <title>2.3. Test Cases Evaluation</title>
        <p>A test oracle is a reference or ground truth that provides the expected output for a given input; it is used to compare the output of the system under test in order to determine its accuracy.</p>
        <p>Overall, the structural coverage criteria focus on the patterns that appear in the outputs of ReLU activation functions, while the semantic coverage metrics are high-level criteria that focus on the features that have been learned by the hidden layers of the DNN. The proposed semantic metrics are based on the model-internal representations and their contribution to the output behaviours.</p>
      </sec>
      <sec id="sec-1-5">
        <title>3. Bayesian Network Abstraction</title>
        <p>The BN abstraction is constructed in three steps: 1. Feature extraction, which extracts the latent features of selected layers; 2. Feature discretisation: the extracted features range over a continuous space, therefore each feature component is discretised into a finite set of feature intervals according to various strategies, i.e., density- and uniform-based; 3. Bayesian network construction, which consists in representing the probabilistic distribution of each extracted feature with a node in the BN.</p>
      </sec>
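<p>The three construction steps can be sketched as follows, assuming the feature values have already been extracted (e.g., by PCA) for two successive layers; the uniform discretisation and the toy values are illustrative simplifications of the actual pipeline.</p>

```python
# Step 1 (feature extraction) is assumed done, e.g. by PCA: 'layer1' and
# 'layer2' hold one extracted feature value per training input.

def uniform_intervals(values, n_bins):
    # Step 2: uniform discretisation into n_bins feature intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Interval index of each value; the last bin is closed on the right.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def marginal_table(bins, n_bins):
    # Step 3a: marginal probability table for a first-layer node.
    counts = [0] * n_bins
    for b in bins:
        counts[b] += 1
    return [c / len(bins) for c in counts]

def conditional_table(parent_bins, child_bins, n_bins):
    # Step 3b: CPT of a node given its parent, cpt[p][c] = Pr(c | p).
    cpt = [[0] * n_bins for _ in range(n_bins)]
    for p, c in zip(parent_bins, child_bins):
        cpt[p][c] += 1
    return [[n / sum(row) if sum(row) else 0.0 for n in row] for row in cpt]

layer1 = [0.1, 0.4, 0.9, 0.95]  # toy feature values on layer 1
layer2 = [1.0, 1.2, 2.0, 2.1]   # toy feature values on layer 2
b1, b2 = uniform_intervals(layer1, 2), uniform_intervals(layer2, 2)
print(marginal_table(b1, 2))          # [0.5, 0.5]
print(conditional_table(b1, b2, 2))   # [[1.0, 0.0], [0.0, 1.0]]
```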
      <sec id="sec-1-6">
        <title>Preliminaries</title>
        <p>Each node is associated with either a marginal probability table, for hidden features of the first layer, or a conditional probability table (CPT), for hidden or output layers.</p>
        <p>The BN ℬ_{N,X} = (V, E, P) is an abstracted model constructed from the DNN N and the training dataset X. V is the set of nodes containing the extracted latent features from N. Each feature is defined as a pair ⟨λ, i⟩ and partitioned into a finite set of intervals denoted with ♯ exponents. E is the set of directed edges indicating dependencies between features in successive layers, and P maps each node in V to a probability table representing the conditional probability of the current feature over its parent features w.r.t. X.</p>
        <p>The feature sensitivity analysis process discussed in [3] is calculated based on the change in the BN's probability distribution as follows:</p>
        <p>w_{λ,i} = d_m(P_ref, P′_{λ,i}) / ∑_{⟨λ′,i′⟩ ∈ F♯} d_m(P_ref, P′_{λ′,i′})   (2)</p>
        <sec id="sec-1-6-1">
          <title>Sensitivity Weights</title>
          <p>Here, w_{λ,i} is the sensitivity weight of the feature f♯_{λ,i}, P_ref is the original (reference) probability distribution represented by the BN, P′_{λ,i} is the probability distribution after perturbing f♯_{λ,i}, and d_m is a function returning the distance between two probability distributions according to a given metric m. F♯ is the set of considered latent features.</p>
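<p>A minimal sketch of this computation, assuming total-variation distance for d_m and, for brevity, comparing distributions of a single BN node (the paper integrates several distance metrics; this choice is only illustrative):</p>

```python
def tv_distance(p, q):
    # Total-variation distance between two discrete distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def sensitivity_weights(p_ref, perturbed):
    # Equation (2): each feature's weight is its share of the total distance.
    dists = [tv_distance(p_ref, q) for q in perturbed]
    total = sum(dists)
    return [d / total for d in dists]

p_ref = [0.5, 0.3, 0.2]            # reference distribution of a BN node
perturbed = [[0.4, 0.4, 0.2],      # after perturbing latent feature 0
             [0.1, 0.3, 0.6]]      # after perturbing latent feature 1
w = sensitivity_weights(p_ref, perturbed)
print(w)  # feature 1, causing the larger shift, receives the larger weight
```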
          <p>In the accompanying example, the latent features are extracted from selected layers of a CNN: max_pooling2d_1, activation_6, …</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Weight-based Testing</title>
      <sec id="sec-2-1">
        <title>4.1. Weight Feature Metric</title>
        <p>This section provides a detailed technical description of the proposed weight-based semantic testing metrics and algorithm. We first introduce and define the concepts of feature coverage; then, we describe how the test cases are generated using the concolic testing algorithm.</p>
        <p>The BN abstraction and the hidden feature weights are utilised to develop new coverage metrics that assess the quality of a test dataset, reporting its coverage based on the non-uniform contribution theory. That is, the metrics focus on the semantic values of the neuron activations instead of the syntactic values of the adjusted weights, which is a very local and less decisive criterion.</p>
        <sec id="sec-2-1-1">
          <title>Definition 4.1 (Weight-based Feature Coverage)</title>
          <p>Given a trained DNN N, the weight-based feature coverage of a non-empty set of inputs X ⊂ D is obtained via the BN abstraction ℬ_{N,X} as:</p>
          <p>WFCov(ℬ_{N,X}, X) =def ∑_{⟨λ,i⟩ ∈ V} w_{λ,i} · |{f♯ ∈ F♯_{λ,i} : p♯_{λ,i}(f♯) ≥ ε}| / |F♯_{λ,i}|   (3)</p>
          <p>Here, the sensitivity weight w_{λ,i} replaces the factor that acts as an equal weight for every node in the BFCov metric.</p>
        </sec>
      </sec>
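<p>A sketch of this weighted coverage, in which each node's interval coverage is scaled by its importance weight instead of a uniform factor; the weights and tables below are toy values:</p>

```python
def wfcov(nodes, eps=0.01):
    """nodes: list of (importance_weight, marginal_probabilities) pairs;
    the weights are assumed to sum to 1 over all nodes."""
    cov = 0.0
    for weight, probs in nodes:
        exercised = sum(1 for p in probs if p >= eps)
        cov += weight * exercised / len(probs)
    return cov

nodes = [(0.6, [0.5, 0.5, 0.0]),   # 2 of 3 intervals exercised
         (0.4, [0.3, 0.3, 0.4])]   # all 3 intervals exercised
print(wfcov(nodes))  # 0.6 * 2/3 + 0.4 * 1
```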
      <sec id="sec-2-2">
        <title>4.2. Weight Feature Dependence Metric</title>
        <p>Further, the causal relationships exercised by a dataset X, as defined by the BN's conditional probabilities, are used to develop the following coverage metric:</p>
        <p>Definition 4.2 (Weight-based Feature Dependence Coverage). Given a trained DNN N, the weight-based feature dependence coverage is obtained via the ℬ_{N,X} as:</p>
        <p>WFdCov(ℬ_{N,X}, X) =def ∑_{⟨λ,i⟩ ∈ V+} w_{λ,i} · |{(f♯, g♯) ∈ F♯_{λ,i} × F♯_{λ−1} : P(f♯ | g♯) ≥ ε ∨ p♯(g♯) &lt; ε}| / |F♯_{λ,i} × F♯_{λ−1}|   (4)</p>
        <p>V+ represents the set of nodes excluding the input layer, for which a conditional probability table does not exist. Intuitively, in the same manner as the weighted feature coverage, we iterate over all nodes in V+ and calculate a weighted coverage. For each node, we look at its CPT, which lives in the space F♯_{λ,i} × F♯_{λ−1}, and look at all of the values larger than ε, i.e., P(f♯ | g♯) ≥ ε. Differently from the previous feature coverage, the values for which the marginal distribution is smaller than ε are also included, i.e., p♯(g♯) &lt; ε. For the metric to be independent of the weights assigned to first-layer nodes, the weights are renormalised so that those of the nodes in V+ sum to 1 (cf. Example 3).</p>
        <p>[Figure 2: an example BN with two features per layer, each discretised into three intervals. The features f_{1,0} and f_{1,1} have marginal tables. Features f_{3,0} and f_{3,1} are illustrated with a complete conditional probability table, while other CPTs have the same length (number of intervals to the number of parents) but are shortened in the diagram. The weight column shows the per-node probability.]</p>
        <p>Example 3. Continuing Example 2, the weighted feature dependence coverage is considered now. The function iterates over the last 4 out of 6 nodes, for which a CPT exists. For this example, let us calculate the coverage for the node f_{3,0}, with ε = 0.01. Taking a look into its CPT, there are 26 out of 27 items with probabilities larger than 0.01. Furthermore, all marginal probabilities are larger than 0.01 too, which finally means that the coverage is 26/27. Now, we calculate how much the node contributes to the total weighted feature dependence coverage. Because the weights in Figure 2 are normalised to sum to 1 over all nodes, they first have to be renormalised, so that only the ones with a CPT sum to 1. The normalisation constant is just the sum over all but the first layer's weights, which for our example amounts to 0.8687. Finally, the contribution of the node f_{3,0} to the total coverage amounts to 0.1730/0.8687 · 26/27 = 0.1917. Similarly, for the node f_{3,1}, there are 25 out of 27 values with probability larger than 0.01, which means the node will contribute to the total coverage as 0.2697/0.8687 · 25/27 = 0.2875. Summing up the contributions from all 4 nodes with a CPT, and assuming there is one probability less than 0.01 for each of f_{2,0} and f_{2,1}, the final weighted feature dependence coverage amounts to 0.2190 + 0.2532 + 0.1917 + 0.2875 = 0.9514.</p>
        <sec id="sec-2-2-1">
          <title>4.3. Generalised Weighted Feature Coverage</title>
          <p>To deliver a consistent coverage measure that is based on every probability entry in the BN, the two feature metrics (3) and (4) can be combined to produce the generalised weight feature coverage. This generalised weighted feature metric gives a single, unified coverage. In the simplest approach, one can consider the two coverages decoupled from each other and simply multiply them: WFCovTot(ℬ_{N,X}, X) =def WFCov(ℬ_{N,X}, X) × WFdCov(ℬ_{N,X}, X).</p>
        </sec>
      </sec>
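<p>The arithmetic of Example 3 can be replayed directly; the weights (0.1730, 0.2697), the normalisation constant 0.8687, and the two remaining contributions are the values quoted in the example:</p>

```python
norm = 0.8687                        # sum of the weights of nodes with a CPT
c30 = 0.1730 / norm * 26 / 27        # node f_{3,0}: 26 of 27 CPT entries covered
c31 = 0.2697 / norm * 25 / 27        # node f_{3,1}: 25 of 27 CPT entries covered
total = 0.2190 + 0.2532 + c30 + c31  # plus the two contributions stated in the text
print(c30, c31, total)               # roughly 0.1917, 0.2875 and 0.9514
```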
      <sec id="sec-2-3">
        <title>Per-node Combination</title>
        <p>This product is in most situations sufficient; however, the other possibility is to average the two metrics at the per-node level:</p>
        <p>WFCovTot(ℬ_{N,X}, X) =def ∑_{⟨λ,i⟩ ∈ V} w_{λ,i} · cov(⟨λ,i⟩), where cov(⟨λ,i⟩) is the node's weight-based feature coverage WFCov_{⟨λ,i⟩} if λ = 1, and the average of its feature and feature-dependence coverages otherwise.   (5)</p>
        <sec>
          <title>4.4. Coverage Criteria</title>
          <p>Weight-based Feature Coverage Criterion. A non-empty set of inputs X ⊂ D satisfies the weight-based feature coverage criterion obtained via the BN abstraction ℬ_{N,X} if WFCov(ℬ_{N,X}, X) = 1.</p>
          <p>Weight-based Feature-dependence Coverage Criterion. A non-empty set of inputs X ⊂ D satisfies the weight-based feature-dependence coverage criterion obtained via the BN abstraction ℬ_{N,X} if WFdCov(ℬ_{N,X}, X) = 1.</p>
        </sec>
        <sec>
          <title>4.5. Concolic Test Generation</title>
        </sec>
      </sec>
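<p>The two combination strategies can be sketched as follows; the per-node variant lets first-layer nodes (which have no CPT) contribute their feature coverage alone. All values and names are illustrative:</p>

```python
def combined_product(wfcov_val, wfdcov_val):
    # Decoupled combination: simply multiply the two coverages.
    return wfcov_val * wfdcov_val

def combined_per_node(nodes):
    """nodes: list of (weight, layer, feature_cov, dependence_cov_or_None)."""
    total = 0.0
    for weight, layer, fc, dc in nodes:
        # First-layer nodes have no CPT, so only feature coverage counts.
        node_cov = fc if layer == 1 else (fc + dc) / 2
        total += weight * node_cov
    return total

nodes = [(0.2, 1, 1.0, None),   # first-layer node: marginal table only
         (0.8, 2, 0.5, 0.9)]    # deeper node: has a CPT
print(combined_product(0.8, 0.9))   # roughly 0.72
print(combined_per_node(nodes))     # 0.2*1.0 + 0.8*0.7
```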
      <sec id="sec-2-4">
        <title>Algorithm 1: Test Dataset Generation</title>
        <p>Input: N ← DNN under test; X ← data set; ℬ_{N,X} ← abstract BN; W ← features' sensitivity weights.
Output: test inputs X0, coverage.</p>
        <p>1: X0 ← sample initial seed test inputs from X
2: ℬ_{N,X0} ← initialise the BN probability tables with X0
3: test_targets ← intervals with probability ≤ ε
4: for i = 1 to max iterations do
5:   t ← the target in test_targets with the highest weight in W
6:   select a test input x ∈ X0
7:   construct an LP problem based on t
8:   solve the optimisation objective: min ‖(n_{1,1}, …, n_{1,|L1|}) − (x_{1,1}, …, x_{1,|L1|})‖∞
9:   x′ ← (n_{1,1}, …, n_{1,|L1|})
10:  if x′ passes the oracle then
11:    x′ ← newly generated test input
12:    if N(x′) = N(x) then
13:      X0 ← X0 ∪ {x′}
14:      update ℬ_{N,X0} probabilities
15:      update coverage
16:    else
17:      x′ ← adversarial input
18:    end if
19:  end if
20: end for</p>
      </sec>
      <sec id="sec-2-6">
        <title>Implementation</title>
        <p>The weight-based feature metrics are implemented on the DeepConcolic tool¹, and the feature weights are used as criteria to direct the concolic testing algorithm. A detailed description of the test generation procedure is provided in Algorithm 1.</p>
        <p>For a given DNN N trained on a dataset X and the associated abstract BN ℬ_{N,X}, the feature weights W are calculated for all extracted features. We assume that suitable feature extraction and discretisation have been applied on the training samples to obtain the structure of the abstraction ℬ_{N,X}. The test generation procedure starts by randomly sampling an initial seed set of test inputs X0 from the X data set that is correctly classified by N, and initialising the probability tables in the BN to produce ℬ_{N,X0}. Next, the algorithm identifies the test target intervals test_targets = {f♯_{λ,i}} through analysing the non-epsilon probabilities of the marginal or conditional probability tables in ℬ_{N,X0}. The non-epsilon probabilities are those that are less than ε and not yet met by the current set of input test cases in X0. Thus, the test_targets consist of a set of hidden feature interval(s) that should be elicited by the test inputs to be generated.</p>
        <p>The test case generation algorithm then iterates a bounded number of times according to the following. First, it identifies the target t ∈ test_targets with the highest importance weight in W, and selects a test input x ∈ X0 based on some heuristics, such as closeness to the targeted interval t. The implemented assumption to find a good-enough candidate input x is searching for an input x ∈ X0 whose feature value is close to the target interval boundaries. Then, it constructs an LP problem based on t and solves the optimisation objective that seeks to minimise the distance between the activations of the input neurons and x. This problem is formulated as:</p>
        <p>Minimise: ‖(n_{1,1}, …, n_{1,|L1|}) − (x_{1,1}, …, x_{1,|L1|})‖∞   (6)</p>
        <p>where n_{1,1}, …, n_{1,|L1|} is the set of all input neurons.</p>
        <p>After solving the LP problem and extracting the newly generated test input x′ from the values of the input neurons, x′ = (n_{1,1}, …, n_{1,|L1|}), the algorithm checks two properties of the new input x′. Does x′ pass the oracle, i.e., is it structurally close enough to x w.r.t. the L∞ norm? If yes, does x′ receive the same classification label as x, in other words, is N(x′) = N(x)? If yes, then x′ is considered a valid input and added to the test inputs: X0 = X0 ∪ {x′}. Otherwise, x′ is considered adversarial for N, as x′ is both deemed close enough to the input x from which it is derived and not assigned the same classification label as x by N. Accordingly, the probabilities in ℬ_{N,X0} are updated to account for the new test x′, and the coverage is recalculated. The test case generation continues as long as the test criteria obtained via the new ℬ_{N,X0} are not yet satisfied.</p>
        <p>Note that the new test x′ may not actually improve the reported coverage if it is merely "closer" to the target interval than x without hitting it. The expectation is that x′ will later be selected to generate a new input x′′ according to the same process, and the target interval may eventually be reached.</p>
        <p>¹ The tool is available at https://github.com/TrustAI/DeepConcolic.</p>
      </sec>
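<p>The control flow of Algorithm 1 can be sketched as below. The LP optimisation is replaced by a toy perturbation towards the target interval, and the oracle is a plain L∞ closeness check; both are stand-ins for the actual DeepConcolic machinery, and all names are illustrative:</p>

```python
def linf(a, b):
    # L-infinity distance between two inputs.
    return max(abs(x - y) for x, y in zip(a, b))

def generate_tests(classify, feature_of, targets, seeds,
                   max_iter=100, oracle_radius=0.3):
    tests, adversarial = list(seeds), []
    for _ in range(max_iter):
        if not targets:
            break
        lo, hi = targets[0]  # highest-weight target interval first
        # Heuristic: pick the test whose feature value is closest to the target.
        x = min(tests, key=lambda t: abs(feature_of(t) - (lo + hi) / 2))
        # Toy stand-in for the LP step: nudge the input so that its
        # feature value moves to the middle of the target interval.
        shift = (lo + hi) / 2 - feature_of(x)
        x_new = [v + shift for v in x]
        if linf(x_new, x) > oracle_radius:
            continue  # fails the oracle: too far from the original input
        if classify(x_new) == classify(x):
            tests.append(x_new)  # valid new test input
            if feature_of(x_new) >= lo and hi >= feature_of(x_new):
                targets.pop(0)   # target interval reached
        else:
            adversarial.append(x_new)  # close to x but classified differently
    return tests, adversarial

f = lambda t: sum(t) / len(t)           # toy latent feature
c = lambda t: 1 if sum(t) >= 0 else 0   # toy classifier
tests, adv = generate_tests(c, f, [(0.1, 0.2)], [[0.0, 0.0]])
print(len(tests), len(adv))  # 2 0
```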
    </sec>
    <sec id="sec-3">
      <title>5. Evaluation</title>
      <p>are: one, three, and five. Finally, the extended Concolic
testing tool is run on both DNNs models with a
maximum 100 iterations per run. Each run is initialised with
uniformly drawn test sets 0 of 10 and 100 correctly
classified inputs.</p>
      <p>This section reports on the experimental analysis
conducted to evaluate the performance of the suggested
coverage metrics and the usability of the weight in guiding
the adapted Concolic test case generation. The first set of
analyses examined the quality of the existing BN-based 5.3. Results and Discussion
feature metrics originated by Berthier et al. [2]. Then, the
eficiency of the developed feature weights was tested To analyse the testing outcomes, it is necessary to
careand compared to the previous coverage results.The re- fully select and decide how to split diferent categories.
search questions that were investigated are: Since the experiment objective is to demonstrate the
increase in coverage over the run time, the primary
vari1. RQ1: Do existing BN-based testing metrics guar- ables will be the initial coverage, the final coverage, and
antee covering a model’s critical parts and direct- the time it takes to obtain the final coverage. So, there
ing their test generation algorithm to target the are two numerical parameters: run-time and coverage.
most relevant features? Other parameters are categorical: initial or final; ICA
2. RQ2: Does the proposed coverage metrics deliver or PCA; initial test sizes. Therefore, plotting the result
a reliable testing measure in terms of reporting in space of run-time vs. coverage, and have one error
the coverage prioritising the important internal point representing each categorical class - initial PCA,
representation of the model? initial ICA, final PCA and final ICA, will illustrate the
desired intention. For each of those variables, the errors
are calculated as following: For the run-time, standard
5.1. Datasets and Models normal error is expected, so the mean and one standard
Two trained CNN models have been trained for the exper- deviation are calculated. This amounts to 68% interval
iments: the first one targets the Fashion-MNIST classifi- around the mean. For coverage, however, distribution is
cation problem with 89.03% validation accuracy, and the neither normal nor symmetric. Therefore, median and
second model targets the CIFAR-10 dataset with 81.00% 68% interval around it is computed, equivalent to the
validation accuracy. The models are reasonably sized, previous case.
with more than 10 layers, including blocks of
convolutional and max-pooling layers, followed by a series of RQ1: Coverage Quality Analysis Using Existing
dense layers. Three diferent layers with various func- Metrics. The plots in Figure 3 show the results of a
tionality are chosen for the testing to fairly cover all types standard test generation process, for two of our datasets.
of layers. For the two models, the considered layers are First and second rows show BFCov and BFxCov metrics
the convolutional ReLU, 2d max pooling, and dense ReLU. respectively. Every column difers in the initial test size
Note that our proposed testing approach is applicable 0 ∈ {10, 100}. Each individual plot shows initial and
to any size of neural network since it is based on an ab- final coverage distributions (their medians and 68%
restracted model that performs a dimension reduction on gions), for PCA and ICA methods. The interpretation is
any number of desired layers. that higher median line on coverage, better the median
coverage and smaller the errors.
5.2. Experimental Setup The analysed outcomes illustrate that test generation
process consistently enlarges the median of the coverage,
In the following experiments, the high-level criteria is which is expected. However, the spread of a distribution
used to investigate how a test dataset exercises the set of hidden features that has been learned from the training dataset and is internally represented by a given layer of the CNNs. Therefore, the reliance is placed on the latent features learned by the trained CNN models. Multiple strategies for linear dimensionality reduction and discretisation of each feature component were applied to construct various BN abstraction schemes. Two linear feature extraction techniques were selected, PCA and ICA, with two to five extracted features for each of the abstracted layers. Kernel Density Estimation (KDE) and uniform-based discretisation are considered, with varying numbers of uniform partition bins.</p>
      <p>The distribution of the results stays similar, with a few exceptions. A larger number of runs could improve the precision of the results; however, we believe the main reason for such a spread is that only runs with a higher initial coverage managed to improve, while the ones with a low initial coverage were hard to improve and stayed the same. It can be observed that the constant initial coverage of 0.33, which appeared frequently in all testing situations, did not increase in most cases (note that the minimum coverage, both initial and final, is 0.33 in all charts). Considering the initial test size, we can see that a larger initial test size, i.e., |X0| = 100, consistently results in larger (or sometimes comparable) coverage. A larger X0 gives the synthesis algorithm more leeway to find candidates from which to derive new inputs that hit target intervals that are not exercised by any test in X0.</p>
      <p>For PCA and ICA, there is no apparent difference between the two methods, with one exception for Fashion-MNIST, |X0| = 100, and the BFCov metric, where PCA results in a much tighter distribution. As the same is not visible for the BFxCov metric, the significance of this result cannot be assessed. The BFCov and BFxCov metrics generally agree on the level of improvement during the testing. Considering runtime, the charts show that the ICA method is slightly more expensive in all situations.</p>
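The abstraction pipeline described above (linear feature extraction on layer activations followed by interval discretisation) can be sketched roughly as follows. All function names, the SVD-based PCA, and the parameter choices are illustrative assumptions, not the authors' implementation (which also supports ICA and KDE-based discretisation):

```python
import numpy as np

def fit_pca(acts, n_components=2):
    """Fit a linear PCA projection on layer activations (n_samples x n_units)."""
    mean = acts.mean(axis=0)
    # principal directions come from the SVD of the centred activations
    _, _, vt = np.linalg.svd(acts - mean, full_matrices=False)
    return mean, vt[:n_components]

def uniform_bins(feats, n_bins=4):
    """Uniform-width interval boundaries per extracted feature."""
    return [np.linspace(f.min(), f.max(), n_bins + 1) for f in feats.T]

def discretise(feats, bins):
    """Map each extracted feature value to the index of its interval."""
    return np.stack([np.clip(np.digitize(f, b[1:-1]), 0, len(b) - 2)
                     for f, b in zip(feats.T, bins)], axis=1)

# toy "layer activations" for 200 inputs over 10 hidden units
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 10))
mean, comps = fit_pca(acts, n_components=2)
feats = (acts - mean) @ comps.T          # 2 extracted features per input
bins = uniform_bins(feats, n_bins=4)     # 4 intervals per feature
traces = discretise(feats, bins)         # interval index per feature per input
```

Each row of `traces` is then one combination trace over the abstracted layer's feature intervals, from which the BN abstraction can be estimated.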
<p>Giving a deeper inspection of the all-finals coverage, a query for an achieved coverage of 1.00 shows that it was obtained only twice, with the bfc criterion on the CIFAR-10 dataset. Both situations occurred with an initial test size of 100, using ICA with two and three extracted features per layer and the KDE discretisation method. This implies that a total of 254 combination traces out of 256 (64 per testing criterion per CNN model) satisfied neither BFCov(ℬ, X) = 1 nor BFxCov(ℬ, X) = 1 after 100 iterations. Observing the final coverage in Figure 3, shown in red and yellow colours, the average median final coverage is around 0.87, which means that 0.13 of the networks' features remain untested. What if the uncovered features are vital elements of the neural network? There are neither guarantees nor any information about the untested elements. This issue will be evidenced in the following experiments.</p>
      <p>RQ2: Weight Features Coverage using Proposed Metrics. The following experiments assess whether the weight-based approach exhibits advantages in improving the BN-based feature coverage. In particular, the study examines whether the weight-based feature metrics will achieve higher coverage with less run time than the original metrics.</p>
      <p>Figure 4 shows the results for the weighted coverage, in an arrangement equivalent to that of the previous Figure 3. Comparing the two figures, it can be seen that the minimal starting coverage in the majority of the plots is greater than the value of 0.33, which occurred often in the previous experiment. This expected increase in initial coverage results from the fact that one coverage is weighted and the other is not. This small growth gives a greater opportunity for the coverage to be improved during the testing. One example from the weight-coverage testing experiment gave an initial coverage of 0.3895021093 that increased to a final coverage of 0.7261016195 with 51 newly generated inputs. This is in line with the preceding finding, which reported that testing started with a higher initial coverage has a better chance of increasing it. Furthermore, the initial coverages in all plots, except for CIFAR-10 with |X0| = 100, are consistent with the initial coverages in Figure 3, which indicates that most of the features with higher weights are not covered yet.</p>
      <p>Considering the final coverage, the charts show a significant improvement in the coverage for WFCovTot compared to BFxCov for both datasets and both the PCA and ICA methods. The reason for this is that the generation process is led by the most important parts of the BN, which have larger weights. A notable observation is that the minimum final coverage increased considerably, which indicates that the higher-importance intervals were covered first.</p>
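The qualitative difference between the unweighted and weighted notions of coverage discussed here can be sketched as follows. The set-based definitions, the toy traces, and the weight values are illustrative assumptions, not the paper's exact BFCov/WFCov definitions:

```python
import numpy as np

def interval_coverage(traces, n_bins):
    """Unweighted coverage: fraction of (feature, interval) pairs hit by tests."""
    n_feats = traces.shape[1]
    hit = {(f, traces[i, f]) for i in range(len(traces)) for f in range(n_feats)}
    return len(hit) / (n_feats * n_bins)

def weighted_coverage(traces, n_bins, weights):
    """Weighted variant: each covered interval contributes its importance weight."""
    n_feats = traces.shape[1]
    hit = {(f, traces[i, f]) for i in range(len(traces)) for f in range(n_feats)}
    return sum(weights[f, b] for f, b in hit) / weights.sum()

traces = np.array([[0, 1], [0, 1], [2, 1]])   # 3 tests, 2 features, bins 0..3
w = np.array([[4.0, 1.0, 1.0, 1.0],           # feature 0: interval 0 is important
              [1.0, 4.0, 1.0, 1.0]])          # feature 1: interval 1 is important
print(interval_coverage(traces, 4))            # 0.375 (3 of 8 intervals hit)
print(weighted_coverage(traces, 4, w))         # ≈0.643: same tests, higher score
```

The same test set scores higher under the weighted metric whenever it happens to hit high-weight intervals, which is why the initial weighted coverage in Figure 4 tends to exceed the unweighted 0.33 baseline.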
      <p>This trend of covering the higher-weight intervals first is the same for all coverages except for the F-MNIST model with |X0| = 10.</p>
      <p>Finally, considering runtime, the weighted coverage takes more time for the initial computation, as it spends additional time calculating the feature weights. However, convergence is reached slightly faster than in the non-weighted case. The improvement is of a few percent, and thus not significant.</p>
    <sec>
      <title>5.4. Further Results</title>
      <p>The above experiments clearly demonstrated the effectiveness of the weighted coverage compared with the basic coverage. Both metrics were able to generate new sets of inputs that achieve high coverage. The plot in Figure 5 shows the growth of the generated test set with respect to the testing iterations. Overall, between 10% and 60% of the iterations produce new test cases. However, WFCov and WFdCov enforced higher coverage on the more relevant features of the training dataset. Consider the testing scenario illustrated in Table 1: with the bfc criterion, the testing algorithm was able to generate 25 test inputs, of which two hit desired intervals: the third interval of the first feature extracted from the max_pooling2d_1 layer, and the fourth interval of the second feature extracted from the same layer. The coverage increased in the same proportions. On the other hand, the wfc criterion systematically picked out the intervals from the _ set, and the algorithm was able to synthesise new tests for three high-weight intervals before the 100 iterations were over. Note that a full experiment document with clear coverage numbers and feature weights presented in a tabular manner will be uploaded to the arXiv database.</p>
    </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>This paper introduced a weight-based semantic testing approach that measures how well a DNN is tested by focusing on the important features of the DNN using its abstracted Bayesian network. Investigating the high-level feature weights revealed the network's internal decision mechanism and how it processes the input data. The developed weighted feature metrics achieved higher testing coverage than the original metrics, with an emphasis on covering important learned representations. The test generation algorithm is directed to synthesise new inputs targeting features with higher importance scores. The conducted experiments empirically validated the applicability and effectiveness of the proposed weight metrics. This serves as a strong argument in favour of increasing the trustworthy performance of DNN models.</p>
      <sec>
        <title>References</title>
        <p>[1] S. Yan, G. Tao, X. Liu, J. Zhai, S. Ma, L. Xu, X. Zhang, Correlations between deep neural network model coverage criteria and model quality, in: Proc. of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 775–787.</p>
        <p>[2] N. Berthier, A. Alshareef, J. Sharp, S. Schewe, X. Huang, Abstraction and symbolic execution of deep neural networks with Bayesian approximation of hidden features, arXiv preprint arXiv:2103.03704 (2021).</p>
        <p>[3] A. Alshareef, N. Berthier, S. Schewe, X. Huang, Quantifying the importance of latent features in neural networks, in: CEUR Workshop Proceedings, volume 3087, 2022.</p>
      </sec>
      <p>[Table 1, flattened by text extraction: per-criterion (bfc, wfc) results for one BN specification, including the values 0.0237, 0.0159, 0.8254, 0.8413, 0.8646, 0.88, 0.8964 and the number of generated inputs (#gen); the original row and column layout is not recoverable.]</p>
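One way to read the contrast between the bfc and wfc criteria in this scenario is as a difference in how the next synthesis target is selected. The sketch below is an illustrative reading only: the weight values are hypothetical, loosely echoing magnitudes from Table 1, and the selection functions are not the authors' algorithm:

```python
def next_target(uncovered, weights, weighted=True):
    """Pick the next (feature, interval) target for test synthesis.

    uncovered: set of (feature, interval) pairs not yet hit by any test.
    weights:   dict mapping (feature, interval) -> importance weight.
    A bfc-style criterion takes any uncovered interval; a wfc-style
    criterion greedily takes the uncovered interval of highest weight.
    """
    if not uncovered:
        return None
    if weighted:
        return max(uncovered, key=lambda t: weights[t])
    return min(uncovered)  # arbitrary but deterministic choice

# hypothetical interval weights (two low, two high)
weights = {(0, 2): 0.0237, (1, 3): 0.0159, (0, 3): 0.8254, (1, 1): 0.8413}
uncovered = set(weights)
order = []
while uncovered:
    t = next_target(uncovered, weights)  # wfc-style: highest weight first
    order.append(t)
    uncovered.discard(t)
print(order)  # the two high-weight intervals come first
```

Under this reading, the wfc criterion spends its iteration budget on the high-weight intervals first, which matches the observation that it synthesised tests for three high-weight intervals within the 100 iterations.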
    </sec>
  </body>
  <back>
  </back>
</article>