Proceedings of the 2nd Workshop on Artificial Intelligence and Formal Verification, Logics, Automata and Synthesis (OVERLAY), September 25, 2020




                Bayesian Neural Predictive Monitoring

Luca Bortolussi1,4 , Francesca Cairoli1 , Nicola Paoletti2 , S. A. Smolka3 , and S. D. Stoller3
        1 Department of Mathematics and Geoscience, University of Trieste, Trieste
        2 Department of Computer Science, Royal Holloway University, London
        3 Department of Computer Science, Stony Brook University, USA
        4 Modelling and Simulation Group, Saarland University, Germany


                                                    Abstract

                 Neural State Classification (NSC) is a recently proposed method for runtime
             predictive monitoring of Hybrid Automata (HA) using deep neural networks
             (DNNs). NSC predictors have very high accuracy, yet are prone to prediction
             errors that can negatively impact reliability. To overcome this limitation, we
             present Neural Predictive Monitoring (NPM), a technique that complements NSC
             predictions with estimates of the predictive uncertainty. These measures yield
             principled criteria for the rejection of predictions likely to be incorrect, without
             knowing the true reachability values. We also present an active learning method
             that significantly reduces the NSC predictor’s error rate and the percentage of
             rejected predictions. NPM is based on Bayesian techniques, namely Bayesian Neural Networks
             and Gaussian Processes, to learn the predictor and the rejection rule, respectively. Our
             approach is highly efficient, with computation times on the
             order of milliseconds, and effective, managing in our experimental evaluation to
             successfully reject almost all incorrect predictions.

1    Problem Statement
Hybrid systems are a central model for many safety-critical cyber-physical system applications [2]. Their
verification typically amounts to solving a hybrid automata (HA) reachability checking problem [11].
Given an HA M, with state space X, time-bounded reachability checking is concerned with establishing
whether, given an initial state x and a set of target states D ⊆ X – typically a set of unsafe states to
avoid – the HA admits a trajectory starting from x that reaches D within a time T . It is denoted as
M |= Reach(D, x, T ). Due to its high computational cost, reachability checking is usually limited to
design-time (offline) analysis [7]. Our focus is on the online analysis of hybrid systems and, in particular,
on the predictive monitoring (PM) problem [9], i.e., the problem of predicting, at runtime, whether a
violation can be reached from the current state within a given time bound. We aim to derive a function
that can solve the PM problem. In solving this problem, we assume a distribution X of HA states and
seek the monitor that predicts HA reachability with minimal error probability w.r.t. X .
Problem 1 (Predictive monitoring for HA reachability). Given an HA M with state space X, a distribution
X over X, a time bound T and set of unsafe states D ⊆ X, find a function F ∗ : X → {0, 1} that minimizes
the probability that F ∗ makes a mistake in predicting the truth value of M |= Reach(D, x, T ):
                                  Pr_{x∼X} (F∗(x) ≠ 1(M |= Reach(D, x, T))),



where 1 is the indicator function. A state x ∈ X is called positive w.r.t. a predictor F : X → {0, 1} if
F (x) = 1. Otherwise, x is called negative (w.r.t. F ).
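    To make the objective of Problem 1 concrete, the error probability can be estimated empirically by Monte Carlo sampling from X. The Python sketch below is only illustrative; sample_state (a sampler for X) and reach_oracle (a wrapper around a reachability checker) are hypothetical helpers, not part of the method itself.

    def empirical_error(F, sample_state, reach_oracle, T, n=10_000):
        # Monte Carlo estimate of Pr_{x~X}( F(x) != 1(M |= Reach(D, x, T)) ).
        # sample_state() draws x ~ X; reach_oracle(x, T) returns the true
        # time-bounded reachability value (both are placeholders).
        mistakes = 0
        for _ in range(n):
            x = sample_state()
            mistakes += int(F(x) != int(reach_oracle(x, T)))
        return mistakes / n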

     Finding F∗, i.e., a function approximation with minimal error probability, is a classical
machine learning problem, in particular a supervised classification problem, with F∗ being the classifier,
i.e., the function mapping HA state inputs x into one of two classes: positive or negative.

Training set. In supervised learning, one minimizes a measure of the empirical prediction error w.r.t. a
training set. In our case, the training set Zt is obtained from a finite sample Xt of X by labelling the training
inputs x ∈ Xt using some reachability oracle, that is, a hybrid automata reachability checker like [1, 6, 8, 10].
Hence, given a sample Xt of X , the training set is defined by Zt = {(x, 1(M |= Reach(D, x, T))) | x ∈ Xt }.
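    As a sketch of how Zt could be assembled in practice (assuming, as above, hypothetical sample_state and reach_oracle helpers wrapping a sampler for X and one of the reachability checkers [1, 6, 8, 10]):

    import numpy as np

    def build_training_set(sample_state, reach_oracle, T, n_samples):
        # Z_t = {(x, 1(M |= Reach(D, x, T))) | x in X_t}, with X_t sampled from X.
        X_t = np.array([sample_state() for _ in range(n_samples)])
        y_t = np.array([int(reach_oracle(x, T)) for x in X_t])  # 1 = D reachable from x
        return X_t, y_t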

Prediction errors. No machine learning technique can completely avoid prediction errors. Therefore,
we have to deal with predictive monitors F that are prone to prediction errors, which are of two kinds:
false positives (FPs) and false negatives (FNs). These errors are denoted by the predicates fp(x) and
fn(x), respectively, whereas pe(x) = fn(x) ∨ fp(x) denotes a generic prediction error.

    Let f be the discriminant function associated with the classifier F, i.e., the function that maps the inputs
into class likelihoods. A central objective of this work is to derive, given a predictor f , a rejection criterion
Rf able to identify states x that are wrongly classified by F , i.e., FNs and FPs, without knowing the true
reachability value of x. Further, Rf should be optimal, that is, it should ensure minimal probability of
rejection errors w.r.t. the state distribution X .
    Our solution relies on enriching each prediction with a measure of predictive uncertainty: given f , we
define a function uf : X → U mapping an HA state x ∈ X into some measure uf (x) of the uncertainty of
f about x. The only requirements are that uf must be point-specific and should not use any knowledge
about the true reachability value. The function uf is then used to build an optimal error detection
criterion, as explained below.
Problem 2 (Uncertainty-based error detection). Given a reachability predictor f , a distribution X over
HA states X, a predictive uncertainty measure uf : X → U over some uncertainty domain U , and a
kind of error e ∈ {pe, fn, fp}, find an optimal error detection rule G∗f,e : U → {0, 1}, i.e., a function that
minimizes the probability that G∗f,e makes a mistake in recognizing errors of type e:

                                          Pr_{x∼X} (e(x) ≠ G∗f,e (uf (x))).

    As for Problem 1, we can obtain a sub-optimal solution Gf,e to Problem 2 by expressing the latter as
a supervised learning problem, where the inputs are, once again, sampled according to X and labelled
using a reachability oracle. These observations need to be independent of the training set Zt introduced
above, i.e., the set used to learn the reachability predictor f .
    For e ∈ {pe, fp, fn}, the final rejection rule Rf,e for detecting HA states where the reachability
prediction should not be trusted, and thus rejected, is readily obtained by the composition of the
uncertainty measure and the error detection rule Rf,e = Gf,e ◦ uf : X → {0, 1}, where Rf,e (x) = 1 if the
prediction on state x is rejected; Rf,e (x) = 0, otherwise.

Validation set. As done for the training set, a set Zv is obtained from a sample Xv of X , independent
of Xt . Given an uncertainty measure uf , we build a validation set Wve = {(uf (x), 1(e(x))) | x ∈ Xv }
which is used to learn Gf,e . The inputs of Wve , referred to as Uv , are the uncertainty measures evaluated
on Xv : Uv = {uf (x) | x ∈ Xv }. Each input uf (x) ∈ Uv is then labelled respectively with 1 or 0 depending
on whether or not the classifier f makes an error of type e.
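    A minimal sketch of this construction, under the assumption that predict(x) returns the class chosen by f, uncertainty(x) returns uf(x), and reach_oracle(x, T) gives the true reachability label (all three are hypothetical placeholders):

    import numpy as np

    def build_error_detection_set(X_v, predict, uncertainty, reach_oracle, T, kind="pe"):
        # W_v^e = {(u_f(x), 1(e(x))) | x in X_v}, for e in {pe, fp, fn}.
        U_v, labels = [], []
        for x in X_v:
            y_true, y_pred = int(reach_oracle(x, T)), int(predict(x))
            fp = (y_pred == 1 and y_true == 0)   # false positive
            fn = (y_pred == 0 and y_true == 1)   # false negative
            err = {"pe": fp or fn, "fp": fp, "fn": fn}[kind]
            U_v.append(uncertainty(x))
            labels.append(int(err))
        return np.array(U_v), np.array(labels)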

     A graphical scheme of the full NPM pipeline is presented in Figure 1.



[Figure 1 block diagram: the state distribution and the reachability oracle produce the training set Zt and the validation set Zv; Zt is used to train the Neural State Classifier F, which feeds the uncertainty quantification and the error detection criterion; an Active Learning loop feeds rejected states back into Zt and Zv.]


Figure 1: Overview of the NPM framework. Training of the neural state classifier F and retraining via
active learning are performed offline. The only components used at runtime are the classifier F and the
error-detection criterion.


2     Methods
In the NSC method of [14], deep neural networks are used to solve Problem 1. In this work, we rely on
Bayesian neural networks (BNNs) [5], which combine the best of probabilistic models and neural networks
by learning a posterior distribution over the weights of the neural network. BNNs benefit from the
statistical guarantees of probabilistic models and from the ability of neural networks to act as universal
function approximators. In a nutshell, a prior is placed over the network parameters, the network output
is used to compute the likelihood of the observations under a specified distribution, and from these one
obtains the posterior distribution over the parameters, either by Hamiltonian Monte Carlo (HMC)
sampling [4] or by Variational Inference (VI) [5]. The rationale behind this choice is that BNNs are well
suited to solve Problem 2, as they capture predictive uncertainty via the posterior predictive
distribution [5], which accounts for the uncertainty about the parameters. The function uf maps a point
x ∈ X to the mean µx and the variance σx2 of the empirical approximation of the predictive distribution.

Bayesian uncertainty measure. Given observations Zt and a BNN fw with w ∼ p(w|Zt ), we define
the Bayesian uncertainty measure uf : X → U ⊆ R2 as the function mapping an input x into the empirical
mean, µx , and variance, σx2 , of the predictive distribution p(ŷ | x, Zt ), where ŷ is the class predicted by
fw for a sample w from the posterior p(w | Zt ). Formally, ∀x ∈ X, uf (x) = (µx , σx2 ).
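    A minimal sketch of this measure, assuming posterior weight samples w_1, ..., w_M are already available (e.g., from HMC or VI) and a hypothetical forward(x, w) that returns the positive-class probability of the network with weights w:

    import numpy as np

    def bayesian_uncertainty(x, weight_samples, forward):
        # Empirical approximation of the predictive distribution p(y_hat | x, Z_t):
        # one prediction per posterior sample, then its mean and variance.
        preds = np.array([forward(x, w) for w in weight_samples])
        mu_x = preds.mean()    # close to 0 or 1 when the BNN is confident
        var_x = preds.var()    # large when the posterior samples disagree
        return mu_x, var_x     # u_f(x) = (mu_x, sigma_x^2)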

    An unseen input x must have sufficiently low uncertainty values in order for the prediction to be
accepted. We seek to find those uncertainty values that optimally separate the points in Uv – the validation
set – in relation to their classes, that is, separate points yielding errors from those that do not. Optimal
thresholds can be automatically identified by solving an additional supervised learning problem; in this
work, we leverage Gaussian Process classification (GPC) [15].
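    The sketch below uses scikit-learn's Gaussian Process classifier as a stand-in for this GPC step and composes it with uf into the rejection rule Rf,e; the kernel and hyper-parameters are purely illustrative, not those used in our experiments.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF

    def train_error_detector(U_v, err_labels):
        # Fit G_{f,e}: U -> {0, 1} on validation uncertainties (mu_x, sigma_x^2)
        # labelled with whether f made an error of type e on that state.
        gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
        gpc.fit(U_v, err_labels)
        return gpc

    def rejection_rule(gpc, uncertainty, x):
        # R_{f,e}(x) = G_{f,e}(u_f(x)): 1 means the prediction on x is rejected.
        u = np.asarray(uncertainty(x), dtype=float).reshape(1, -1)
        return int(gpc.predict(u)[0])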

2.1    Active Learning
As the accuracy of a classifier increases with the quality and the quantity of observed data, adding samples
to Zt will generate a more accurate predictor, and similarly, adding samples to Wv will lead to more
precise error detection. Ideally, one wants to maximize accuracy while using the least possible amount of
additional samples, because obtaining labeled data is expensive (in NPM, labelling each sample entails
solving a reachability checking problem), and the size of the datasets affects the complexity of the
resulting learning problem. Therefore, to improve the accuracy of our learning models efficiently, we need
a strategy to identify the most “informative” additional samples.
    For this purpose, we propose an uncertainty-aware active learning solution, where the re-training
points are derived by first sampling a large pool of unlabeled data, and then considering only those points


where the current predictor f is still uncertain according to the rejection rule Rf .

Active learning algorithm. We query the HA reachability checker to label the points rejected by Rf
and we divide them into two groups: one group is added to the training set Zt , producing the augmented
dataset Zta ; the other is added to the validation set Zv , producing Zva . The first step consists in retraining
the reachability predictor on Zta . Let fa denote the new predictor. The second step requires extracting
the augmented validation dataset Wva from Zva . To do so, we must first train a new uncertainty measure
ufa so as to reflect the new predictor fa . Then, the new error detection rule Gfa is trained on Wva . In
conclusion, this process leads to an updated rejection rule Rfa = Gfa ◦ ufa , which is expected to have a
reduced rate of incorrect rejections.
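    One possible realization of a retraining round is sketched below; pool (a large set of unlabeled states sampled from X), reject (the current rejection rule Rf) and reach_oracle are hypothetical placeholders, and the actual retraining of fa, ufa and Gfa on the augmented sets is left to the surrounding code.

    import numpy as np

    def active_learning_round(pool, reject, reach_oracle, T, Z_t, Z_v, split=0.5, seed=0):
        # Z_t and Z_v are (list_of_states, list_of_labels) pairs; only the states
        # rejected by R_f are labelled and split between the two augmented sets.
        rng = np.random.default_rng(seed)
        (X_t, y_t), (X_v, y_v) = ([*Z_t[0]], [*Z_t[1]]), ([*Z_v[0]], [*Z_v[1]])
        for x in pool:
            if not reject(x):
                continue                       # keep only uncertain states
            y = int(reach_oracle(x, T))        # query the reachability checker
            if rng.random() < split:           # add to the training set Z_t^a ...
                X_t.append(x); y_t.append(y)
            else:                              # ... or to the validation set Z_v^a
                X_v.append(x); y_v.append(y)
        # The caller then retrains f on Z_t^a, recomputes u_{f_a} on Z_v^a,
        # and retrains the error detection rule G_{f_a}.
        return (X_t, y_t), (X_v, y_v)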

    A frequentist approach is also possible [16]: the NSC is a deterministic neural network, the uncertainty
is measured using the theory of conformal prediction [3, 13, 17], and the rejection rule is given by a
support vector classifier [5].


3       Experimental Results
We experimentally evaluate the proposed method on a benchmark of six hybrid system models with
varying degrees of complexity. We consider four deterministic case studies: the spiking neuron (SN) action
potential model [14] and the classic inverted pendulum (IP) on a cart, both two-dimensional models with
non-linear dynamics; the artificial pancreas (AP) [12], a six-dimensional non-linear model; and the
helicopter model (HC) [14], a linear model with 29 state variables. In addition, we analyze two
non-deterministic models with non-linear dynamics: a cruise controller (CC) [14], whose input space has
four dimensions, and a triple water tank (TWT, from http://dreal.github.io/benchmarks/networks/water/),
which is a three-dimensional model.

Performance measures. We want our method to be capable of working at runtime, which means it
must be extremely fast in making predictions and deciding whether to trust them. The time required
to train the reachability predictor and the error detection rule does not affect its runtime efficiency, as
it is performed in advance (offline) and only once. Furthermore, to avoid overfitting, we did not tune the
architecture (i.e., the number of neurons and hidden layers) to optimize the performance on our data and,
for the sake of simplicity, we chose the same architecture for all the case studies. However, each case
study requires fine tuning of the inference hyper-parameters, and the VI hyper-parameters differ from the
HMC ones. Keeping that in mind, the relevant performance metrics for NPM are the accuracy of the
reachability predictor F, the error detection rate, which measures the proportion of errors made by F on
the test set that are actually recognized by Rf, and the rejection rate, which measures the overall
proportion of test points rejected by Rf. Clearly, we want our method to be reliable and thus detect the
majority of prediction errors (high detection rate) without being overly conservative, i.e., while keeping a
low rejection rate, in order not to reduce its effectiveness.

Computational costs. Training an NPM requires training the state classifier, generating the datasets
Wv , which requires computing the uncertainty values for each point in Zv , and finally training the error
rejection rule. All these steps are performed offline. On the other hand, the time required to evaluate
the NPM, on a given test input x∗ , is composed of the time needed to make a prediction and the time
needed to choose whether to accept it or not, which take on the order of milliseconds. NPM’s efficiency is
not directly affected by the complexity of the system under analysis but only by the complexity of the
underlying learning problems.

Evaluation. We consider three configurations: the initial configuration, where the predictor and error
detection rules are derived via supervised learning; the active configuration, where the initial models
are retrained via our uncertainty-aware active learning; and the passive configuration, where the initial
models are retrained using a uniform sampling strategy to augment the dataset with the same number
of observations as in the active configuration. In this way, we can evaluate the benefits of employing an
uncertainty-based criterion to retrain our models.
    Our experimental evaluation demonstrates that our reachability predictors attain high accuracies,
consistently above 97.4%, and that the benefits of active learning are visible in an overall reduction of the
rejection rate and an overall increase in the state-classifier accuracy compared to the passive approach
(see Table 1). Furthermore, VI scales better than HMC with respect to the dimension of the system, and
we observe that on average VI outperforms HMC: the initial VI approach yields an NSC accuracy of
98.95%, a rejection rate of 5.19% and an error detection rate of 86.18%. The passive results introduce
only minor improvements, whereas active learning yields a significant reduction in the rejection rate
(from 5.19% to 2.36%), an increase in the NSC accuracy (from 98.95% to 99.32%) and an increase in the
overall detection rate (from 86.18% to 91.85%).

                               VI                               HMC
                   Acc.    Det. rate  Rej. rate     Acc.    Det. rate  Rej. rate
      Artificial Pancreas (AP)
       Initial     99.29     95.77      5.17        98.95     98.09      8.32
       Active      99.71     96.55      1.75        99.38     90.32      4.17
       Passive     99.47    100.00      3.29        95.12    100.00     26.79
      Inverted Pendulum (IP)
       Initial     99.58     80.95      1.68        87.45    100.00     24.72
       Active      99.58     85.71      1.31        99.15     87.06      4.17
       Passive     99.71     75.81      1.13        98.49    100.00     13.02
      Cruise Controller (CC)
       Initial     99.01     97.98      5.53        97.22     99.64      8.41
       Active      99.84    100.00      1.25        99.47     94.34      3.14
       Passive     99.74    100.00      1.46        95.75     97.03      8.42
      Triple Water Tank (TWT)
       Initial     99.60     77.50      1.11        97.50     96.80      5.04
       Active      99.67     84.85      1.61        99.20     95.00      3.70
       Passive     99.50     75.40     20.32        91.86     52.58     11.31
      Helicopter (HC)
       Initial     98.14     88.71     13.64        97.74     89.82     14.71
       Active      98.90     92.86      1.98        97.74     73.01      7.06
       Passive     98.66     87.54      9.94        97.77     65.47      6.40
      Spiking Neuron (SN)
       Initial     98.18     91.21      7.91        98.32     74.40      9.89
       Active      98.20     91.11      6.26        98.89     87.38      5.91
       Passive     98.20     98.52     14.57        98.21     74.86     14.85

Table 1: Comparison of the initial, active and passive configurations, for both VI and HMC, on the six
case studies. Acc. denotes the NSC accuracy (%), Det. rate the error detection rate (%) and Rej. rate
the rejection rate (%).



4    Conclusion
Our approach avoids the prohibitive computational cost of reachability checking, which is infeasible at runtime, while
improving on traditional runtime verification by being able to detect future violations in a preemptive
way. Among future directions, we plan to extend our approach to support more complex and real-world
systems that include noise and partial observability.


References


 [1] M. Althoff. An introduction to CORA 2015. In Proc. of the Workshop on Applied Verification for
     Continuous and Hybrid Systems, 2015.

 [2] R. Alur. Formal verification of hybrid systems. In Proceedings of the Ninth ACM International
     Conference on Embedded Software (EMSOFT), pages 273–278, Oct 2011.
 [3] V. Balasubramanian, S.-S. Ho, and V. Vovk. Conformal prediction for reliable machine learning:
     theory, adaptations and applications. Newnes, 2014.
 [4] M. Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint
     arXiv:1701.02434, 2017.
 [5] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
 [6] S. Bogomolov, M. Forets, G. Frehse, K. Potomkin, and C. Schilling. JuliaReach: a toolbox for
     set-based reachability. In Proceedings of the 22nd ACM International Conference on Hybrid Systems:
     Computation and Control, pages 39–44, 2019.
 [7] T. Brihaye, L. Doyen, G. Geeraerts, J. Ouaknine, J.-F. Raskin, and J. Worrell. On reachability for
     hybrid automata over bounded time. In L. Aceto, M. Henzinger, and J. Sgall, editors, Automata,
     Languages and Programming, pages 416–427, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
 [8] X. Chen, E. Ábrahám, and S. Sankaranarayanan. Flow*: An analyzer for non-linear hybrid systems.
     In International Conference on Computer Aided Verification, pages 258–263. Springer, 2013.
 [9] X. Chen and S. Sankaranarayanan. Model predictive real-time monitoring of linear systems. In
     Real-Time Systems Symposium (RTSS), 2017 IEEE, pages 297–306. IEEE, 2017.
[10] S. Gao, S. Kong, and E. M. Clarke. dReal: An SMT solver for nonlinear theories over the reals. In
     International Conference on Automated Deduction, pages 208–214. Springer, 2013.

[11] T. A. Henzinger, P. W. Kopke, A. Puri, and P. Varaiya. What’s decidable about hybrid automata?
     Journal of computer and system sciences, 57(1):94–124, 1998.
[12] N. Paoletti, K. S. Liu, S. A. Smolka, and S. Lin. Data-driven robust control for type 1 diabetes under
     meal and exercise uncertainties. In International Conference on Computational Methods in Systems
     Biology, pages 214–232. Springer, 2017.
[13] H. Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. In
     Tools in artificial intelligence. InTech, 2008.
[14] D. Phan, N. Paoletti, T. Zhang, R. Grosu, S. A. Smolka, and S. D. Stoller. Neural state classification
     for hybrid systems. In Automated Technology for Verification and Analysis, volume 11138 of Lecture
     Notes in Computer Science, pages 422–440, 2018.
[15] C. E. Rasmussen and C. K. Williams. Gaussian processes for machine learning. MIT Press, 2006.
[16] L. Bortolussi, F. Cairoli, N. Paoletti, S. A. Smolka, and S. D. Stoller. Neural predictive monitoring.
     In Runtime Verification: 19th International Conference, RV 2019, Porto, Portugal, October 8–11, 2019,
     Proceedings, volume 11757, page 129. Springer Nature, 2019.
[17] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic learning in a random world. Springer Science
     & Business Media, 2005.