<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weight-based Semantic Testing Approach for Deep Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amany Alshareef</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Berthier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Schewe</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowei Huang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>OCamlPro</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Liverpool</institution>
          ,
          <addr-line>Liverpool L69 3BX</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
<p>While deep learning models have achieved state-of-the-art performance in a variety of fields, their susceptibility to adversarial examples has raised serious concerns over their application in safety-critical domains. Existing testing methodologies fail to consider interactions between neurons and the semantic representation that is formed in the DNN through the training process. This paper proposes a weight-based testing metric that uses feature importance weights to measure the coverage of the test set and to facilitate the generation of additional test cases targeting features with higher weights. Evaluations were conducted to compare the initial and final coverage of the proposed weighting approach with the normal BN-based feature coverage. The testing coverage experiments indicated that the proposed weight metrics achieved higher coverage than the original feature metrics, while maintaining the effectiveness of finding adversarial samples during the test case generation process.</p>
      </abstract>
      <kwd-group>
<kwd>DNN testing metrics</kwd>
        <kwd>Bayesian abstraction</kwd>
        <kwd>Feature coverage</kwd>
        <kwd>Importance weights</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Software testing provides evidence to demonstrate that the system meets its requirements or is error-free. The fact that deep learning models are data-driven, not requirements-driven, makes defining their testing criteria challenging. Technically, the accuracy of learning models is reported based on the test dataset. This standard metric for measuring a model's overall performance is neither sufficient nor trustworthy in safety-related domains, where most testing scenarios are randomly chosen from the entire dataset. Besides, the provided test data may not have good coverage of the data distribution the model is trained on, and may not represent the data obtained in the real world.</p>
      <p>Furthermore, most of the currently proposed DNN testing techniques rely on neuron activation as a metric to measure test data coverage. Such a criterion has not been shown to correlate with the system's decision logic [1]. Moreover, these methods aim to transform the input data space to generate more test inputs, and completely ignore the model-internal representations and their roles in the output decision. Observing that real-world high-dimensional data lie on low-dimensional manifolds motivates investigating where the data lie and modelling that space for analysis, instead of remaining confined to the input domain. There has been little attempt to understand machine learning's hidden representations and generate additional test cases based on them. This paper presents a testing approach for neural networks that leverages the learned representations and feature importance to evaluate the test data's coverage. The features' importance weights reflect how the contribution of each hidden feature, extracted from the lower-dimensional latent feature space, to the overall output of the network is distributed across the network. This identified feature contribution enables the determination of the causal relationship between the neurons and the model behaviour. Therefore, the importance weights relate to the semantic representation and provide insights into the interaction mechanism underlying the output decision-making process.</p>
      <p>The proposed approach to designing weight-based semantic testing metrics for neural networks uses the Bayesian network abstraction model of Berthier et al. [2]. The authors introduced a dimensionality reduction technique using feature extraction algorithms to abstract the behaviour of a neural network into a Bayesian network (BN). The work in [3] utilised that BN model to quantify the importance of a neural network's latent features: it developed a BN-based sensitivity analysis algorithm that estimates the importance of a neural network's latent features by analysing an associated BN's sensitivity to distributional shifts. They integrated various metrics to compute the difference between the original probability distributions represented by the abstracted BN and the distributions obtained after perturbation. Each latent feature was then assigned a weight value based on the measured sensitivity distance.</p>
      <p>The IJCAI-2023 AISafety and SafeRL Joint Workshop. * Corresponding author: amany.alshareef@liverpool.ac.uk (A. Alshareef); nicolas.berthier@ocamlpro.com (N. Berthier); sven.schewe@liverpool.ac.uk (S. Schewe); xiaowei.huang@liverpool.ac.uk (X. Huang). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>In this work, we transform the traditional binary coverage approach into a weighted probability problem and define our coverage metric based on the latent features' importance. The neural network's latent feature space refers to the internal, hidden representations learned by the network. These representations are not directly observable, but rather are formed as a result of the network processing input data and capturing their important patterns [4]. The proposed weight-based testing criterion emphasises that maximum test coverage is obtained from the presence of important features that have a dominating influence on other features and the output decision. To summarise, the main contributions of the paper are:</p>
      <p>• Semantic testing metrics that measure a test dataset's coverage based on the calculated feature weights.
• A guided systematic approach that samples test cases targeting the higher-priority features.
• Empirical studies on the quality of the proposed weight-based coverage compared with the original BN-based coverage.</p>
      <p>Most structural testing approaches are based on the idea of activating as many neurons in the network as possible during the testing phase: the more neurons that are activated with a specific value, the more complete the testing of the network is considered to be. Neuron activation was first introduced as a systematic metric by Pei et al. [5], who calculate the number of activated neurons (w.r.t. the ReLU activation function) during testing. They proposed DeepXplore, a white-box differential testing algorithm for generating test inputs that can discover inconsistencies between multiple DNNs. Following the neuron coverage (NC) principle, DeepTest [6] and DLFuzz [7] have made some improvements to it. Although the NC metric has been shown to be effective at finding hidden bugs and has been used to test real-world DNNs, investigations by [8] demonstrated that NC is too coarse and easy to achieve. Further approaches such as DeepGauge [9] and DeepCover [8] have been developed to extend neuron coverage, with a focus on various activation value factors.</p>
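<p>To make the neuron coverage (NC) principle concrete, the following sketch computes the fraction of neurons that fire (w.r.t. a ReLU-style threshold) for at least one test input; the activation lists and threshold are illustrative assumptions rather than DeepXplore's actual instrumentation.</p>

```python
def neuron_coverage(activations_per_input, threshold=0.0):
    """activations_per_input: one activation vector per test input."""
    n_neurons = len(activations_per_input[0])
    covered = set()
    for acts in activations_per_input:
        for j, a in enumerate(acts):
            if a > threshold:
                covered.add(j)  # neuron j fired for at least one input
    return len(covered) / n_neurons

# Two inputs over four neurons: neurons 0, 1 and 3 fire at least once.
acts = [[0.5, 0.0, 0.0, 0.2],
        [0.0, 0.9, 0.0, 0.0]]
print(neuron_coverage(acts))  # 0.75
```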
      <sec>
        <title>2. Existing Testing Techniques</title>
        <p>Deep neural network testing is an active research area, as these learning systems are being deployed in safety-critical applications. Numerous techniques have been developed to address the challenges of testing these learning systems in terms of test coverage criteria, test generation, and test oracles.</p>
        <sec>
          <title>2.1. Testing Metrics and Coverage Criteria</title>
          <p>Testing coverage metrics are measurements used to evaluate the adequacy of testing by providing a quantitative evaluation of how thoroughly a deep neural network has been tested according to specific criteria. Coverage-guided deep neural network testing techniques are a class of testing methods that aim to increase the coverage of the network during testing, with the goal of covering different regions of the input space and revealing as many potential bugs and unexpected behaviours as possible. Enforcing higher coverage during the testing process makes the network under investigation more likely to be robust and reliable. We divide the related existing testing works in the literature into two categories: (i) structural coverage metrics, which are defined based on the syntactic characteristics of the NNs; and (ii) high-level semantic coverage metrics, which concentrate on the semantic representations created by NNs.</p>
          <p>Beyond that, more testing metrics, i.e., quantitative projection coverage [10], safety coverage [11] and surprise coverage [12], have been designed based on the activation functions and the syntactic connections between neurons in successive layers.</p>
          <p>Unfortunately, neuron activation and other structural coverage techniques have proven to be less effective in validating the safety behaviours of intelligent systems. A study by Li et al. [13] showed that there is no correlation between the number of misclassified natural input tests and their structural coverage on the corresponding neural networks. There is still considerable ambiguity about how such coverage criteria directly relate to the decision logic of black-box machine learning systems; especially in that case, the semantic relationship between layers is ignored. Additionally, structural coverage has a limited correlation with network robustness: high neuron coverage does not imply that the network is robust to all possible inputs or will behave well on unseen data [1, 14].</p>
          <p>There are relatively few testing strategies that address the semantic aspects of a DNN's internal representation. One recent effort is the BN-based feature coverage introduced in [2], which is improved with weights in this study. Two testing coverage metrics are defined based on the suggested BN abstraction: the BN-based feature coverage (BFCov) and the BN-based feature-dependence coverage (BFdCov). These metrics give the proportion of hidden features, or causal relationships between them, that are adequately exercised by a set of inputs. Moreover, the authors implemented a combined metric for the two above: BFxCov(ℬ, X) = BFCov(ℬ, X) × BFdCov(ℬ, X). For space limits, we include the BFCov metric below and refer the reader to the original document for the rest.</p>
        </sec>
      </sec>
      <sec id="sec-1-1">
        <title>BN-based Feature Coverage</title>
        <p>Most proposed testing approaches have focused on structural testing coverage, measuring the coverage of a dataset based on individual neuron activations. In contrast, the BN-based feature coverage below is defined over the hidden features abstracted by the BN.</p>
        <p>Definition 2.1 (BN-based Feature Coverage). Given a trained DNN N, the BFCov(ℬ_{N,X}, X) coverage of a non-empty set of inputs X ⊂ D is obtained via the BN abstraction ℬ_{N,X} as:</p>
        <p>BFCov(ℬ_{N,X}, X) =def (1 / |V|) · ∑_{⟨λ,i⟩ ∈ V} |{f♯ ∈ F♯_{λ,i} : p♯_{λ,i}(f♯) ≥ ε}| / |F♯_{λ,i}|   (1)</p>
        <p>Informally, BFCov(ℬ_{N,X}, X) ranges over [0, 1] and gives the percentage of feature intervals that are adequately exercised by X. Intuitively, the coverage metric checks the marginal probability p♯_{λ,i} of every interval f♯ ∈ F♯_{λ,i} in each of the BN's nodes, and counts those that appear with a probability bigger than ε.</p>
      </sec>
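<p>One simple reading of this metric can be sketched as follows: for each BN node, take the fraction of its feature intervals whose marginal probability reaches ε, and average over the nodes. The probability tables are toy values; a real implementation would estimate them from the BN abstraction.</p>

```python
def bfcov(marginals, eps=0.01):
    """marginals: one list of interval probabilities per BN node."""
    per_node = [sum(1 for p in probs if p >= eps) / len(probs)
                for probs in marginals]
    return sum(per_node) / len(per_node)

marginals = [[0.6, 0.4, 0.0],   # node 1: one interval never exercised
             [0.2, 0.5, 0.3]]   # node 2: fully exercised
print(bfcov(marginals))         # (2/3 + 1) / 2
```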
      <sec id="sec-1-2">
        <title>Importance-driven Testing</title>
        <p>Furthermore, a closely related work to the proposed importance-based semantic testing is the DeepImportance approach presented in [15]. The authors developed a testing approach based on the Importance-Driven Coverage (IDC) test adequacy criterion, which employs layer-wise relevance propagation to identify the important neurons. They then evaluated the test's adequacy by targeting different combinations of the important neurons' behaviours.</p>
      </sec>
      <sec id="sec-1-3">
        <title>2.2. Test Cases Generation Algorithms</title>
        <p>The existing set of test case generation algorithms for DNNs is categorised into: (i) input mutation based, which generates new test inputs, either natural or adversarial, by altering the original data using transformation rules; (ii) fuzzing based, which generates invalid random input data to detect faults and vulnerabilities in the model; and (iii) symbolic execution based, an analysis technique that tests whether specific inputs cause each part of a system to be executed.</p>
        <p>Concolic testing is a technique in which concrete execution directs a symbolic analysis to generate a high-coverage test suite. DeepConcolic, introduced in [16], used a concolic testing algorithm that alternates between concrete executions, which evaluate the test input using the trained DNN under test, and symbolic analyses, which synthesise new test inputs based on some test target chosen to increase coverage.</p>
      </sec>
      <sec id="sec-1-4">
        <title>2.3. Test Cases Evaluation</title>
        <p>A test oracle is a reference or ground truth that provides the expected output for a given input; it is used to compare the output of the system under test in order to determine its accuracy.</p>
        <p>Overall, the structural coverage criteria focus on the patterns that appear in the outputs of ReLU activation functions, while the semantic coverage metrics are high-level criteria that focus on the features that have been learned by the hidden layers of the DNN. The proposed semantic metrics are based on the model-internal representations and their contribution to the output behaviours.</p>
      </sec>
      <sec id="sec-1-5">
        <title>3. Bayesian Network Abstraction</title>
        <p>The BN abstraction is constructed in three steps: 1. Feature extraction, which extracts the latent features of selected layers; 2. Feature discretisation: the extracted features range over a continuous space, therefore each feature component is discretised into a finite set of feature intervals according to various strategies, i.e., density- and uniform-based; 3. Bayesian network construction, which consists in representing the probabilistic distribution of each extracted feature with a node in the BN.</p>
      </sec>
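<p>The three construction steps can be sketched as follows, assuming the feature values have already been extracted (e.g., by PCA) for two successive layers; the uniform discretisation and the toy values are illustrative simplifications of the actual pipeline.</p>

```python
# Step 1 (feature extraction) is assumed done, e.g. by PCA: 'layer1' and
# 'layer2' hold one extracted feature value per training input.

def uniform_intervals(values, n_bins):
    # Step 2: uniform discretisation into n_bins feature intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Interval index of each value; the last bin is closed on the right.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def marginal_table(bins, n_bins):
    # Step 3a: marginal probability table for a first-layer node.
    counts = [0] * n_bins
    for b in bins:
        counts[b] += 1
    return [c / len(bins) for c in counts]

def conditional_table(parent_bins, child_bins, n_bins):
    # Step 3b: CPT of a node given its parent, cpt[p][c] = Pr(c | p).
    cpt = [[0] * n_bins for _ in range(n_bins)]
    for p, c in zip(parent_bins, child_bins):
        cpt[p][c] += 1
    return [[n / sum(row) if sum(row) else 0.0 for n in row] for row in cpt]

layer1 = [0.1, 0.4, 0.9, 0.95]  # toy feature values on layer 1
layer2 = [1.0, 1.2, 2.0, 2.1]   # toy feature values on layer 2
b1, b2 = uniform_intervals(layer1, 2), uniform_intervals(layer2, 2)
print(marginal_table(b1, 2))          # [0.5, 0.5]
print(conditional_table(b1, b2, 2))   # [[1.0, 0.0], [0.0, 1.0]]
```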
      <sec id="sec-1-6">
        <title>Preliminaries</title>
        <p>Each node is associated with either a marginal probability table, for hidden features of the first layer, or a conditional probability table (CPT), for hidden or output layers.</p>
        <p>The BN ℬ_{N,X} = (V, E, P) is an abstracted model constructed from the DNN N and the training dataset X. V is the set of nodes containing the extracted latent features from N. Each feature is defined as a pair ⟨λ, i⟩ and partitioned into a finite set of intervals denoted with ♯ exponents. E is the set of directed edges indicating dependencies between features in successive layers, and P maps each node in V to a probability table representing the conditional probability of the current feature over its parent features w.r.t. X.</p>
        <p>The feature sensitivity analysis process discussed in [3] is calculated based on the change in the BN's probability distribution as follows:</p>
        <p>w_{λ,i} = d_m(P_ref, P′_{λ,i}) / ∑_{⟨λ′,i′⟩ ∈ F♯} d_m(P_ref, P′_{λ′,i′})   (2)</p>
        <sec id="sec-1-6-1">
          <title>Sensitivity Weights</title>
          <p>Here, w_{λ,i} is the sensitivity weight of the feature f♯_{λ,i}, P_ref is the original (reference) probability distribution represented by the BN, P′_{λ,i} is the probability distribution after perturbing f♯_{λ,i}, and d_m is a function returning the distance between two probability distributions according to a given metric m. F♯ is the set of considered latent features.</p>
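<p>A minimal sketch of this computation, assuming total-variation distance for d_m and, for brevity, comparing distributions of a single BN node (the paper integrates several distance metrics; this choice is only illustrative):</p>

```python
def tv_distance(p, q):
    # Total-variation distance between two discrete distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def sensitivity_weights(p_ref, perturbed):
    # Equation (2): each feature's weight is its share of the total distance.
    dists = [tv_distance(p_ref, q) for q in perturbed]
    total = sum(dists)
    return [d / total for d in dists]

p_ref = [0.5, 0.3, 0.2]            # reference distribution of a BN node
perturbed = [[0.4, 0.4, 0.2],      # after perturbing latent feature 0
             [0.1, 0.3, 0.6]]      # after perturbing latent feature 1
w = sensitivity_weights(p_ref, perturbed)
print(w)  # feature 1, causing the larger shift, receives the larger weight
```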
          <p>In the accompanying example, the latent features are extracted from selected layers of a CNN: max_pooling2d_1, activation_6, …</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Weight-based Testing</title>
      <sec id="sec-2-1">
        <title>4.1. Weight Feature Metric</title>
        <p>This section provides a detailed technical description of the proposed weight-based semantic testing metrics and algorithm. We first introduce and define the concepts of feature coverage; then, we describe how the test cases are generated using the concolic testing algorithm.</p>
        <p>The BN abstraction and the hidden feature weights are utilised to develop new coverage metrics that assess the quality of a test dataset, reporting its coverage based on the non-uniform contribution theory. That is, the metrics focus on the semantic values of the neuron activations instead of the syntactic values of the adjusted weights, which is a very local and less decisive criterion.</p>
        <sec id="sec-2-1-1">
          <title>Definition 4.1 (Weight-based Feature Coverage)</title>
          <p>Given a trained DNN N, the weight-based feature coverage of a non-empty set of inputs X ⊂ D is obtained via the BN abstraction ℬ_{N,X} as:</p>
          <p>WFCov(ℬ_{N,X}, X) =def ∑_{⟨λ,i⟩ ∈ V} w_{λ,i} · |{f♯ ∈ F♯_{λ,i} : p♯_{λ,i}(f♯) ≥ ε}| / |F♯_{λ,i}|   (3)</p>
          <p>Here, the sensitivity weight w_{λ,i} replaces the factor that acts as an equal weight for every node in the BFCov metric.</p>
        </sec>
      </sec>
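<p>A sketch of this weighted coverage, in which each node's interval coverage is scaled by its importance weight instead of a uniform factor; the weights and tables below are toy values:</p>

```python
def wfcov(nodes, eps=0.01):
    """nodes: list of (importance_weight, marginal_probabilities) pairs;
    the weights are assumed to sum to 1 over all nodes."""
    cov = 0.0
    for weight, probs in nodes:
        exercised = sum(1 for p in probs if p >= eps)
        cov += weight * exercised / len(probs)
    return cov

nodes = [(0.6, [0.5, 0.5, 0.0]),   # 2 of 3 intervals exercised
         (0.4, [0.3, 0.3, 0.4])]   # all 3 intervals exercised
print(wfcov(nodes))  # 0.6 * 2/3 + 0.4 * 1
```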
      <sec id="sec-2-2">
        <title>4.2. Weight Feature Dependence Metric</title>
        <p>Further, the causal relationships exercised by a dataset X, as defined by the BN's conditional probabilities, are used to develop the following coverage metric:</p>
        <p>Definition 4.2 (Weight-based Feature Dependence Coverage). Given a trained DNN N, the weight-based feature dependence coverage is obtained via the ℬ_{N,X} as:</p>
        <p>WFdCov(ℬ_{N,X}, X) =def ∑_{⟨λ,i⟩ ∈ V+} w_{λ,i} · |{(f♯, g♯) ∈ F♯_{λ,i} × F♯_{λ−1} : P(f♯ | g♯) ≥ ε ∨ p♯(g♯) &lt; ε}| / |F♯_{λ,i} × F♯_{λ−1}|   (4)</p>
        <p>V+ represents the set of nodes excluding the input layer, for which a conditional probability table does not exist. Intuitively, in the same manner as the weighted feature coverage, we iterate over all nodes in V+ and calculate a weighted coverage. For each node, we look at its CPT, which lives in the space F♯_{λ,i} × F♯_{λ−1}, and look at all of the values larger than ε, i.e., P(f♯ | g♯) ≥ ε. Differently from the previous feature coverage, the values for which the marginal distribution is smaller than ε are also included, i.e., p♯(g♯) &lt; ε. For the metric to be independent of the weights assigned to first-layer nodes, the weights are renormalised so that those of the nodes in V+ sum to 1 (cf. Example 3).</p>
        <p>[Figure 2: an example BN with two features per layer, each discretised into three intervals. The features f_{1,0} and f_{1,1} have marginal tables. Features f_{3,0} and f_{3,1} are illustrated with a complete conditional probability table, while other CPTs have the same length (number of intervals to the number of parents) but are shortened in the diagram. The weight column shows the per-node probability.]</p>
        <p>Example 3. Continuing Example 2, the weighted feature dependence coverage is considered now. The function iterates over the last 4 out of 6 nodes, for which a CPT exists. For this example, let us calculate the coverage for the node f_{3,0}, with ε = 0.01. Taking a look into its CPT, there are 26 out of 27 items with probabilities larger than 0.01. Furthermore, all marginal probabilities are larger than 0.01 too, which finally means that the coverage is 26/27. Now, we calculate how much the node contributes to the total weighted feature dependence coverage. Because the weights in Figure 2 are normalised to sum to 1 over all nodes, they first have to be renormalised, so that only the ones with a CPT sum to 1. The normalisation constant is just the sum over all but the first layer's weights, which for our example amounts to 0.8687. Finally, the contribution of the node f_{3,0} to the total coverage amounts to 0.1730/0.8687 · 26/27 = 0.1917. Similarly, for the node f_{3,1}, there are 25 out of 27 values with probability larger than 0.01, which means the node will contribute to the total coverage as 0.2697/0.8687 · 25/27 = 0.2875. Summing up the contributions from all 4 nodes with a CPT, and assuming there is one probability less than 0.01 for each of f_{2,0} and f_{2,1}, the final weighted feature dependence coverage amounts to 0.2190 + 0.2532 + 0.1917 + 0.2875 = 0.9514.</p>
        <sec id="sec-2-2-1">
          <title>4.3. Generalised Weighted Feature Coverage</title>
          <p>To deliver a consistent coverage measure that is based on every probability entry in the BN, the two feature metrics (3) and (4) can be combined to produce the generalised weight feature coverage. This generalised weighted feature metric gives a single, unified coverage. In the simplest approach, one can consider the two coverages decoupled from each other and simply multiply them: WFCovTot(ℬ_{N,X}, X) =def WFCov(ℬ_{N,X}, X) × WFdCov(ℬ_{N,X}, X).</p>
        </sec>
      </sec>
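<p>The arithmetic of Example 3 can be replayed directly; the weights (0.1730, 0.2697), the normalisation constant 0.8687, and the two remaining contributions are the values quoted in the example:</p>

```python
norm = 0.8687                        # sum of the weights of nodes with a CPT
c30 = 0.1730 / norm * 26 / 27        # node f_{3,0}: 26 of 27 CPT entries covered
c31 = 0.2697 / norm * 25 / 27        # node f_{3,1}: 25 of 27 CPT entries covered
total = 0.2190 + 0.2532 + c30 + c31  # plus the two contributions stated in the text
print(c30, c31, total)               # roughly 0.1917, 0.2875 and 0.9514
```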
      <sec id="sec-2-3">
        <title>Per-node Combination</title>
        <p>This product is in most situations sufficient; however, the other possibility is to average the two metrics at the per-node level:</p>
        <p>WFCovTot(ℬ_{N,X}, X) =def ∑_{⟨λ,i⟩ ∈ V} w_{λ,i} · cov(⟨λ,i⟩), where cov(⟨λ,i⟩) is the node's weight-based feature coverage WFCov_{⟨λ,i⟩} if λ = 1, and the average of its feature and feature-dependence coverages otherwise.   (5)</p>
        <sec>
          <title>4.4. Coverage Criteria</title>
          <p>Weight-based Feature Coverage Criterion. A non-empty set of inputs X ⊂ D satisfies the weight-based feature coverage criterion obtained via the BN abstraction ℬ_{N,X} if WFCov(ℬ_{N,X}, X) = 1.</p>
          <p>Weight-based Feature-dependence Coverage Criterion. A non-empty set of inputs X ⊂ D satisfies the weight-based feature-dependence coverage criterion obtained via the BN abstraction ℬ_{N,X} if WFdCov(ℬ_{N,X}, X) = 1.</p>
        </sec>
        <sec>
          <title>4.5. Concolic Test Generation</title>
        </sec>
      </sec>
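<p>The two combination strategies can be sketched as follows; the per-node variant lets first-layer nodes (which have no CPT) contribute their feature coverage alone. All values and names are illustrative:</p>

```python
def combined_product(wfcov_val, wfdcov_val):
    # Decoupled combination: simply multiply the two coverages.
    return wfcov_val * wfdcov_val

def combined_per_node(nodes):
    """nodes: list of (weight, layer, feature_cov, dependence_cov_or_None)."""
    total = 0.0
    for weight, layer, fc, dc in nodes:
        # First-layer nodes have no CPT, so only feature coverage counts.
        node_cov = fc if layer == 1 else (fc + dc) / 2
        total += weight * node_cov
    return total

nodes = [(0.2, 1, 1.0, None),   # first-layer node: marginal table only
         (0.8, 2, 0.5, 0.9)]    # deeper node: has a CPT
print(combined_product(0.8, 0.9))   # roughly 0.72
print(combined_per_node(nodes))     # 0.2*1.0 + 0.8*0.7
```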
      <sec id="sec-2-4">
        <title>Algorithm 1: Test Dataset Generation</title>
        <p>Input: N ← DNN under test; X ← data set; ℬ_{N,X} ← abstract BN; W ← features' sensitivity weights.
Output: test inputs X0, coverage.</p>
        <p>1: X0 ← sample initial seed test inputs from X
2: ℬ_{N,X0} ← initialise the BN probability tables with X0
3: test_targets ← intervals with probability ≤ ε
4: for i = 1 to max iterations do
5:   t ← the target in test_targets with the highest weight in W
6:   select a test input x ∈ X0
7:   construct an LP problem based on t
8:   solve the optimisation objective: min ‖(n_{1,1}, …, n_{1,|L1|}) − (x_{1,1}, …, x_{1,|L1|})‖∞
9:   x′ ← (n_{1,1}, …, n_{1,|L1|})
10:  if x′ passes the oracle then
11:    x′ ← newly generated test input
12:    if N(x′) = N(x) then
13:      X0 ← X0 ∪ {x′}
14:      update ℬ_{N,X0} probabilities
15:      update coverage
16:    else
17:      x′ ← adversarial input
18:    end if
19:  end if
20: end for</p>
      </sec>
      <sec id="sec-2-6">
        <title>Implementation</title>
        <p>The weight-based feature metrics are implemented on the DeepConcolic tool¹, and the feature weights are used as criteria to direct the concolic testing algorithm. A detailed description of the test generation procedure is provided in Algorithm 1.</p>
        <p>For a given DNN N trained on a dataset X and the associated abstract BN ℬ_{N,X}, the feature weights W are calculated for all extracted features. We assume that suitable feature extraction and discretisation have been applied on the training samples to obtain the structure of the abstraction ℬ_{N,X}. The test generation procedure starts by randomly sampling an initial seed set of test inputs X0 from the X data set that is correctly classified by N, and initialising the probability tables in the BN to produce ℬ_{N,X0}. Next, the algorithm identifies the test target intervals test_targets = {f♯_{λ,i}} through analysing the non-epsilon probabilities of the marginal or conditional probability tables in ℬ_{N,X0}. The non-epsilon probabilities are those that are less than ε and not yet met by the current set of input test cases in X0. Thus, the test_targets consist of a set of hidden feature interval(s) that should be elicited by the test inputs to be generated.</p>
        <p>The test case generation algorithm then iterates a bounded number of times according to the following. First, it identifies the target t ∈ test_targets with the highest importance weight in W, and selects a test input x ∈ X0 based on some heuristics, such as closeness to the targeted interval t. The implemented assumption to find a good-enough candidate input x is searching for an input x ∈ X0 whose feature value is close to the target interval boundaries. Then, it constructs an LP problem based on t and solves the optimisation objective that seeks to minimise the distance between the activations of the input neurons and x. This problem is formulated as:</p>
        <p>Minimise: ‖(n_{1,1}, …, n_{1,|L1|}) − (x_{1,1}, …, x_{1,|L1|})‖∞   (6)</p>
        <p>where n_{1,1}, …, n_{1,|L1|} is the set of all input neurons.</p>
        <p>After solving the LP problem and extracting the newly generated test input x′ from the values of the input neurons, x′ = (n_{1,1}, …, n_{1,|L1|}), the algorithm checks two properties of the new input x′. Does x′ pass the oracle, i.e., is it structurally close enough to x w.r.t. the L∞ norm? If yes, does x′ receive the same classification label as x, in other words, is N(x′) = N(x)? If yes, then x′ is considered a valid input and added to the test inputs: X0 = X0 ∪ {x′}. Otherwise, x′ is considered adversarial for N, as x′ is both deemed close enough to the input x from which it is derived and not assigned the same classification label as x by N. Accordingly, the probabilities in ℬ_{N,X0} are updated to account for the new test x′, and the coverage is recalculated. The test case generation continues as long as the test criteria obtained via the new ℬ_{N,X0} are not yet satisfied.</p>
        <p>Note that the new test x′ may not actually improve the reported coverage if it is merely "closer" to the target interval than x without hitting it. The expectation is that x′ will later be selected to generate a new input x′′ according to the same process, and the target interval may eventually be reached.</p>
        <p>¹ The tool is available at https://github.com/TrustAI/DeepConcolic.</p>
      </sec>
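<p>The control flow of Algorithm 1 can be sketched as below. The LP optimisation is replaced by a toy perturbation towards the target interval, and the oracle is a plain L∞ closeness check; both are stand-ins for the actual DeepConcolic machinery, and all names are illustrative:</p>

```python
def linf(a, b):
    # L-infinity distance between two inputs.
    return max(abs(x - y) for x, y in zip(a, b))

def generate_tests(classify, feature_of, targets, seeds,
                   max_iter=100, oracle_radius=0.3):
    tests, adversarial = list(seeds), []
    for _ in range(max_iter):
        if not targets:
            break
        lo, hi = targets[0]  # highest-weight target interval first
        # Heuristic: pick the test whose feature value is closest to the target.
        x = min(tests, key=lambda t: abs(feature_of(t) - (lo + hi) / 2))
        # Toy stand-in for the LP step: nudge the input so that its
        # feature value moves to the middle of the target interval.
        shift = (lo + hi) / 2 - feature_of(x)
        x_new = [v + shift for v in x]
        if linf(x_new, x) > oracle_radius:
            continue  # fails the oracle: too far from the original input
        if classify(x_new) == classify(x):
            tests.append(x_new)  # valid new test input
            if feature_of(x_new) >= lo and hi >= feature_of(x_new):
                targets.pop(0)   # target interval reached
        else:
            adversarial.append(x_new)  # close to x but classified differently
    return tests, adversarial

f = lambda t: sum(t) / len(t)           # toy latent feature
c = lambda t: 1 if sum(t) >= 0 else 0   # toy classifier
tests, adv = generate_tests(c, f, [(0.1, 0.2)], [[0.0, 0.0]])
print(len(tests), len(adv))  # 2 0
```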
    </sec>
    <sec id="sec-3">
      <title>5. Evaluation</title>
      <p>are: one, three, and five. Finally, the extended Concolic
testing tool is run on both DNNs models with a
maximum 100 iterations per run. Each run is initialised with
uniformly drawn test sets 0 of 10 and 100 correctly
classified inputs.</p>
      <p>This section reports on the experimental analysis
conducted to evaluate the performance of the suggested
coverage metrics and the usability of the weight in guiding
the adapted Concolic test case generation. The first set of
analyses examined the quality of the existing BN-based 5.3. Results and Discussion
feature metrics originated by Berthier et al. [2]. Then, the
eficiency of the developed feature weights was tested To analyse the testing outcomes, it is necessary to
careand compared to the previous coverage results.The re- fully select and decide how to split diferent categories.
search questions that were investigated are: Since the experiment objective is to demonstrate the
increase in coverage over the run time, the primary
vari1. RQ1: Do existing BN-based testing metrics guar- ables will be the initial coverage, the final coverage, and
antee covering a model’s critical parts and direct- the time it takes to obtain the final coverage. So, there
ing their test generation algorithm to target the are two numerical parameters: run-time and coverage.
most relevant features? Other parameters are categorical: initial or final; ICA
2. RQ2: Does the proposed coverage metrics deliver or PCA; initial test sizes. Therefore, plotting the result
a reliable testing measure in terms of reporting in space of run-time vs. coverage, and have one error
the coverage prioritising the important internal point representing each categorical class - initial PCA,
representation of the model? initial ICA, final PCA and final ICA, will illustrate the
desired intention. For each of those variables, the errors
are calculated as following: For the run-time, standard
5.1. Datasets and Models normal error is expected, so the mean and one standard
Two trained CNN models have been trained for the exper- deviation are calculated. This amounts to 68% interval
iments: the first one targets the Fashion-MNIST classifi- around the mean. For coverage, however, distribution is
cation problem with 89.03% validation accuracy, and the neither normal nor symmetric. Therefore, median and
second model targets the CIFAR-10 dataset with 81.00% 68% interval around it is computed, equivalent to the
validation accuracy. The models are reasonably sized, previous case.
with more than 10 layers, including blocks of
convolutional and max-pooling layers, followed by a series of RQ1: Coverage Quality Analysis Using Existing
dense layers. Three diferent layers with various func- Metrics. The plots in Figure 3 show the results of a
tionality are chosen for the testing to fairly cover all types standard test generation process, for two of our datasets.
of layers. For the two models, the considered layers are First and second rows show BFCov and BFxCov metrics
the convolutional ReLU, 2d max pooling, and dense ReLU. respectively. Every column difers in the initial test size
Note that our proposed testing approach is applicable 0 ∈ {10, 100}. Each individual plot shows initial and
to any size of neural network since it is based on an ab- final coverage distributions (their medians and 68%
restracted model that performs a dimension reduction on gions), for PCA and ICA methods. The interpretation is
any number of desired layers. that higher median line on coverage, better the median
coverage and smaller the errors.
5.2. Experimental Setup The analysed outcomes illustrate that test generation
process consistently enlarges the median of the coverage,
In the following experiments, the high-level criteria is which is expected. However, the spread of a distribution
used to investigate how a test dataset exercises the set of hidden features that has been learned from the training dataset and is internally represented by a given layer of the CNNs. Therefore, the reliance is placed on the latent features learned by the trained CNN models. Multiple strategies for linear dimensionality reduction and discretisation of each feature component were applied to construct various BN abstraction schemes. Two linear feature extraction techniques were selected, PCA and ICA, with two to five extracted features for each of the abstracted layers. Kernel Density Estimation (KDE) and uniform-based discretisation are considered, with varying numbers of uniform partition bins.</p>
      <p>The distribution of the results stays similar, with a few exceptions. A larger number of runs could improve the precision of the results; however, we believe the main reason for such a spread is that only runs with a higher initial coverage managed to improve, while the ones with a low initial coverage were hard to improve and stayed the same. It can be observed that the constant initial coverage of 0.33, which appeared frequently in all testing situations, did not increase in most cases (note that the minimum coverage, both initial and final, is 0.33 in all charts). Considering the initial test size, we can see that a larger initial test size, i.e., |X0| = 100, consistently results in larger (or sometimes comparable) coverage. A larger X0 gives the synthesis algorithm more leeway to find candidates from which to derive new inputs that hit target intervals that are not exercised by any test in X0.</p>
      <p>For PCA and ICA, there is no apparent difference between the two methods, with one exception for Fashion-MNIST, |X0| = 100, and the BFCov metric, where PCA results in a much tighter distribution. As the same is not visible for the BFxCov metric, the significance of this result cannot be assessed. The BFCov and BFxCov metrics generally agree on the level of improvement during the testing. Considering runtime, the charts show that the ICA method is slightly more expensive in all situations.</p>
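The abstraction pipeline described above (linear feature extraction on layer activations followed by interval discretisation) can be sketched roughly as follows. All function names, the SVD-based PCA, and the parameter choices are illustrative assumptions, not the authors' implementation (which also supports ICA and KDE-based discretisation):

```python
import numpy as np

def fit_pca(acts, n_components=2):
    """Fit a linear PCA projection on layer activations (n_samples x n_units)."""
    mean = acts.mean(axis=0)
    # principal directions come from the SVD of the centred activations
    _, _, vt = np.linalg.svd(acts - mean, full_matrices=False)
    return mean, vt[:n_components]

def uniform_bins(feats, n_bins=4):
    """Uniform-width interval boundaries per extracted feature."""
    return [np.linspace(f.min(), f.max(), n_bins + 1) for f in feats.T]

def discretise(feats, bins):
    """Map each extracted feature value to the index of its interval."""
    return np.stack([np.clip(np.digitize(f, b[1:-1]), 0, len(b) - 2)
                     for f, b in zip(feats.T, bins)], axis=1)

# toy "layer activations" for 200 inputs over 10 hidden units
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 10))
mean, comps = fit_pca(acts, n_components=2)
feats = (acts - mean) @ comps.T          # 2 extracted features per input
bins = uniform_bins(feats, n_bins=4)     # 4 intervals per feature
traces = discretise(feats, bins)         # interval index per feature per input
```

Each row of `traces` is then one combination trace over the abstracted layer's feature intervals, from which the BN abstraction can be estimated.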
<p>Giving a deeper inspection of the all-finals coverage, a query for an achieved coverage of 1.00 shows that it was obtained only twice, with the bfc criterion on the CIFAR-10 dataset. Both situations occurred with an initial test size of 100, using ICA with two and three extracted features per layer and the KDE discretisation method. This implies that a total of 254 combination traces out of 256 (64 per testing criterion per CNN model) satisfied neither BFCov(ℬ, X) = 1 nor BFxCov(ℬ, X) = 1 after 100 iterations. Observing the final coverage in Figure 3, shown in red and yellow colours, the average median final coverage is around 0.87, which means that 0.13 of the networks' features remain untested. What if the uncovered features are vital elements of the neural network? There are neither guarantees nor any information about the untested elements. This issue will be evidenced in the following experiments.</p>
      <p>RQ2: Weight Features Coverage using Proposed Metrics. The following experiments assess whether the weight-based approach exhibits advantages in improving the BN-based feature coverage. In particular, the study examines whether the weight-based feature metrics will achieve higher coverage with less run time than the original metrics.</p>
      <p>Figure 4 shows the results for the weighted coverage, in an arrangement equivalent to that of the previous Figure 3. Comparing the two figures, it can be seen that the minimal starting coverage in the majority of the plots is greater than the value of 0.33, which occurred often in the previous experiment. This expected increase in initial coverage results from the fact that one coverage is weighted and the other is not. This small growth gives a greater opportunity for the coverage to be improved during the testing. One example from the weight-coverage testing experiment gave an initial coverage of 0.3895021093 that increased to a final coverage of 0.7261016195 with 51 newly generated inputs. This is in line with the preceding finding, which reported that testing started with a higher initial coverage has a better chance of increasing it. Furthermore, the initial coverages in all plots, except for CIFAR-10 with |X0| = 100, are consistent with the initial coverages in Figure 3, which indicates that most of the features with higher weights are not covered yet.</p>
      <p>Considering the final coverage, the charts show a significant improvement in the coverage for WFCovTot compared to BFxCov for both datasets and both the PCA and ICA methods. The reason for this is that the generation process is led by the most important parts of the BN, which have larger weights. A notable observation is that the minimum final coverage increased considerably, which indicates that the higher-importance intervals were covered first.</p>
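The qualitative difference between the unweighted and weighted notions of coverage discussed here can be sketched as follows. The set-based definitions, the toy traces, and the weight values are illustrative assumptions, not the paper's exact BFCov/WFCov definitions:

```python
import numpy as np

def interval_coverage(traces, n_bins):
    """Unweighted coverage: fraction of (feature, interval) pairs hit by tests."""
    n_feats = traces.shape[1]
    hit = {(f, traces[i, f]) for i in range(len(traces)) for f in range(n_feats)}
    return len(hit) / (n_feats * n_bins)

def weighted_coverage(traces, n_bins, weights):
    """Weighted variant: each covered interval contributes its importance weight."""
    n_feats = traces.shape[1]
    hit = {(f, traces[i, f]) for i in range(len(traces)) for f in range(n_feats)}
    return sum(weights[f, b] for f, b in hit) / weights.sum()

traces = np.array([[0, 1], [0, 1], [2, 1]])   # 3 tests, 2 features, bins 0..3
w = np.array([[4.0, 1.0, 1.0, 1.0],           # feature 0: interval 0 is important
              [1.0, 4.0, 1.0, 1.0]])          # feature 1: interval 1 is important
print(interval_coverage(traces, 4))            # 0.375 (3 of 8 intervals hit)
print(weighted_coverage(traces, 4, w))         # ≈0.643: same tests, higher score
```

The same test set scores higher under the weighted metric whenever it happens to hit high-weight intervals, which is why the initial weighted coverage in Figure 4 tends to exceed the unweighted 0.33 baseline.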
      <p>This trend of covering the higher-weight intervals first is the same for all coverages except for the F-MNIST model with |X0| = 10.</p>
      <p>Finally, considering runtime, the weighted coverage takes more time for the initial computation, as it spends additional time calculating the feature weights. However, convergence is reached slightly faster than in the non-weighted case. The improvement is of a few percent, and thus not significant.</p>
    <sec>
      <title>5.4. Further Results</title>
      <p>The above experiments clearly demonstrated the effectiveness of the weighted coverage compared with the basic coverage. Both metrics were able to generate new sets of inputs that achieve high coverage. The plot in Figure 5 shows the growth of the generated test set with respect to the testing iterations. Overall, between 10% and 60% of the iterations produce new test cases. However, WFCov and WFdCov enforced higher coverage on the more relevant features of the training dataset. Consider the testing scenario illustrated in Table 1: with the bfc criterion, the testing algorithm was able to generate 25 test inputs, of which two hit desired intervals: the third interval of the first feature extracted from the max_pooling2d_1 layer, and the fourth interval of the second feature extracted from the same layer. The coverage increased in the same proportions. On the other hand, the wfc criterion systematically picked out the intervals from the _ set, and the algorithm was able to synthesise new tests for three high-weight intervals before the 100 iterations were over. Note that a full experiment document with clear coverage numbers and feature weights presented in a tabular manner will be uploaded to the arXiv database.</p>
    </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>This paper introduced a weight-based semantic testing approach that measures how well a DNN is tested by focusing on the important features of the DNN using its abstracted Bayesian network. Investigating the high-level feature weights revealed the network's internal decision mechanism and how it processes the input data. The developed weighted feature metrics achieved higher testing coverage than the original metrics, with an emphasis on covering important learned representations. The test generation algorithm is directed to synthesise new inputs targeting features with higher importance scores. The conducted experiments empirically validated the applicability and effectiveness of the proposed weight metrics. This serves as a strong argument in favour of increasing the trustworthy performance of DNN models.</p>
      <sec>
        <title>References</title>
        <p>[1] S. Yan, G. Tao, X. Liu, J. Zhai, S. Ma, L. Xu, X. Zhang, Correlations between deep neural network model coverage criteria and model quality, in: Proc. of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 775–787.</p>
        <p>[2] N. Berthier, A. Alshareef, J. Sharp, S. Schewe, X. Huang, Abstraction and symbolic execution of deep neural networks with Bayesian approximation of hidden features, arXiv preprint arXiv:2103.03704 (2021).</p>
        <p>[3] A. Alshareef, N. Berthier, S. Schewe, X. Huang, Quantifying the importance of latent features in neural networks, in: CEUR Workshop Proceedings, volume 3087, 2022.</p>
      </sec>
      <p>[Table 1, flattened by text extraction: per-criterion (bfc, wfc) results for one BN specification, including the values 0.0237, 0.0159, 0.8254, 0.8413, 0.8646, 0.88, 0.8964 and the number of generated inputs (#gen); the original row and column layout is not recoverable.]</p>
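One way to read the contrast between the bfc and wfc criteria in this scenario is as a difference in how the next synthesis target is selected. The sketch below is an illustrative reading only: the weight values are hypothetical, loosely echoing magnitudes from Table 1, and the selection functions are not the authors' algorithm:

```python
def next_target(uncovered, weights, weighted=True):
    """Pick the next (feature, interval) target for test synthesis.

    uncovered: set of (feature, interval) pairs not yet hit by any test.
    weights:   dict mapping (feature, interval) -> importance weight.
    A bfc-style criterion takes any uncovered interval; a wfc-style
    criterion greedily takes the uncovered interval of highest weight.
    """
    if not uncovered:
        return None
    if weighted:
        return max(uncovered, key=lambda t: weights[t])
    return min(uncovered)  # arbitrary but deterministic choice

# hypothetical interval weights (two low, two high)
weights = {(0, 2): 0.0237, (1, 3): 0.0159, (0, 3): 0.8254, (1, 1): 0.8413}
uncovered = set(weights)
order = []
while uncovered:
    t = next_target(uncovered, weights)  # wfc-style: highest weight first
    order.append(t)
    uncovered.discard(t)
print(order)  # the two high-weight intervals come first
```

Under this reading, the wfc criterion spends its iteration budget on the high-weight intervals first, which matches the observation that it synthesised tests for three high-weight intervals within the 100 iterations.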
    </sec>
  </body>
  <back>
  </back>
</article>