A GPGPU–based Simulator for Prism: Statistical Verification of Results of PMC [extended abstract] Marcin Copik1 and Artur Rataj2 and Bożena Woźna-Szcześniak3 1 RWTH Aachen University, Schinkelstr. 2, 52062 Aachen, Germany mcopik@gmail.com 2 IITiS, Polish Academy of Sciences ul. Bałtycka 5, 44-100 Gliwice, Poland arturrataj@gmail.com 3 IMCS, Jan Długosz University Al. Armii Krajowej 13/15, 42-200 Czȩstochowa, Poland. b.wozna@ajd.czest.pl Abstract. We describe a GPGPU–based Monte Carlo simulator integrated with Prism. It supports Markov chains with discrete or continuous time and a subset of properties expressible in PCTL, CSL and their variants extended with rewards. The simulator allows an automated statistical verification of results obtained us- ing Prism’s formal methods. Keywords: GPGPU, Monte Carlo simulation, Prism, probabilistic model checking, statistical model checking, probabilistic logics. 1 Introduction We present a GPGPU–based simulator which extends the model checker Prism [9]. The simulator uses the Monte Carlo method for a statistical probabilistic model check- ing [14, 10] (SPMC). SPMC involves a generation of a large number of random paths (i.e. samples) in a probabilistic Markov chain, evaluating a given property on each path, and finally finding an average of these evaluations, which approximate a correct value of the property. Monte Carlo methods typically are able to precisely compute confidence intervals (CI) around the approximated value. The GPGPU simulator (further called S G ) is integrated with Prism, which al- lows to check a single model implementation using either one of Prism’s probabilistic model checking (PMC) methods, or S G . S G supports the same models and properties as the Prism’s CPU–based simulator (further denoted S C ), yet the latter, lacking GPGPU acceleration and on–the–fly compilation of the model, is considerably slower. Beside the simulator itself, we present its simple application: an automated method of verifying property values computed using Prism’s formal methods. Namely, the user may request, that certain property classes be computed in two steps: 1. A PMC step. A property is computed using one of Prism ’s formal PMC methods; 2. An automated statistical verification (ASV) step. The obtained property value v is statistically evaluated using S G ; it is then checked if v fits into a number of confidence intervals (CI) of various confidence levels. Because a significant imprecision in a PMC method is typically caused by some mechanism for reducing computational complexity, the user might tend to disable such a mechanism, rather than wait for a statistical verification. This is why it is crucial that the verifying simulator be fast: it should effectively save user’s time, by providing data in tight CI within a short time. The paper is constructed as follows. Section 2 describes what is currently sup- ported by S G . In Sec. 3 we provide a description of some implementation details of S G . In Sec. 4 ASV is discussed Section 5 presents a case study. Finally, the last section concludes the paper. 2 Supported models and properties S G accepts a specific set of properties, for two finite probabilistic classes of Markov chains: a discrete-time Markov chain (DTMC) and a continuous-time Markov chain (CTMC). Unlike DTMC, where each transition corresponds to a discrete time-step, in a CTMC transitions occur in continuous time given by a negative exponential distribution. Both of the classes can be enriched with rewards structures, resulting respectively in rDTMC and rCTMC. A reward structure allows to specify two distinct types of rewards: state (instantaneous) and transition (cumulative) ones, assigned respectively to states and transitions by means of a reward function. Formal definitions of all of the above systems can be found e.g. in [8]. The temporal logics Probabilistic Computation Tree Logic (PCTL) [6] and Con- tinuous Stochastic Logic (CSL) [1] can be used to specify properties for respectively DTMCs and CTMCs. S G recognises only flat subsets of each logic. We will refer to these subsets as respectively FlatPCTL and FlatCSL. Definition 1 (Syntax of FlatPCTL). Let a ∈ AP be an atomic proposition, ∼∈ {<, ≤ , ≥, >}, p ∈ [0, 1] a probability bound, and k is a non–negative integer or ∞. The syntax of FlatPCTL is defined inductively as follows: φ ::= P∼p [ψ], ψ ::= Xφ1 | G≤k φ1 | F≤k φ1 | φ1 U≤k φ1 | φ1 R≤k φ1 , φ1 ::= a | φ1 ∧ φ1 | ¬φ1 . In the syntax above, we distinguish between state formulae φ , φ1 and path formulae ψ, which are evaluated over states and paths, respectively. A property of a model is always expressed as a state formula. The path modalities (i.e., next state – X, bounded globally – G, bounded eventually – F≤k , bounded until – U≤k , and bounded release – R≤k ), which are standard in temporal logics, can occur only within the scope of the probabilistic operator P∼p [·]. Intuitively, a state s satisfies P∼p [ψ] if the probability of taking a path from s satis- fying path formula ψ meets the bound ∼ p. Next, Xφ is true if φ is satisfied in the next state; G≤k φ is true if φ holds for all time-steps that are less or equal to k; F≤k φ is true if φ is satisfied within k time-steps; φ1 U≤k φ2 is true if φ2 is satisfied within k time-steps and φ1 is true from now on until φ2 becomes true. φ1 R≤k φ2 is true if either φ1 is satisfied within k time-steps and φ2 is true from now on up to the point where φ1 becomes true, or φ2 holds for all time-steps that are less or equal to k. The formal semantics over DTMC can be found e.g. in [6, 8]. S G supports also an extension of FlatPCTL allowing specifications over reward structures by means of the following state formulae: R∼r [C≤k ] | R∼r [I=k ] | R∼r [Fφ ] where ∼∈ {<, ≤, ≥, >}, r ∈ IR≥0 , k ∈ IN, and φ is a FlatPCTL formula. The formal semantics over rDTMC can be found in [8]. Here we only provide an intuition. Namely, a state s of an rDTMC satisfies R∼r [C≤k ], if from state s the expected reward cumulated after k time-steps satisfies ∼ r. Next, a state s of an rDTMC satisfies R∼r [I=k ], if from state s the expected state reward at time-step k satisfies ∼ r. Finally, a state s of an rDTMC satisfies R∼r [Fφ ], if from state s the expected reward cumulated before a state satisfying φ is reached meets the bound ∼ r. Definition 2 (Syntax of FlatCSL). Let a and p be as in Definition 1, and I be an interval of IR≥0 . The syntax of FlatCSL is defined inductively as follows: φ ::= P∼ p [ψ], ψ ::= Xφ1 | GI φ1 | FI φ1 | φ1 UI φ1 | φ1 RI φ1 , φ1 ::= a | φ1 ∧ φ1 | ¬φ1 . Satisfying P∼p [ψ] and path modalities are the same for FlatCSL as for FlatPCTL, except that the parameter of the modalities is an interval I of the non-negative reals, rather than an integer upper bound. For example, the path formula φ1 UI φ2 holds if φ2 is satisfied at some time instant in the interval I and always earlier φ1 holds. S G supports an extension of FlatCSL allowing specifications over reward structures in a manner similar to FlatPCTL, the only difference are the time bounds: R∼r [C≤t ] | R∼r [I=t ] | R∼r [Fφ ] where ∼∈ {<, ≤, ≥, >}, r,t ∈ IR≥0 , and φ is a FlatCSL formula. The extension of CTMC with rewards is analogous to that of DTMC, given above, barring the mentioned differences in time bounds. The formal semantics over rCTMC can be found in [8], see though that S G does not support therein mentioned steady state. 3 Implementation of S G In a case of Markov models implemented in the Prism language, a generation of ran- dom simulation paths is not computationally expensive. A single transition consists of an evaluation of its guards, an enumeration of updates if viable, a random selection of subsequent transitions and finally an estimation of properties. The syntax of guards and updates allows simple arithmetical and logical operations and a few basic mathematical functions, such as a power or a logarithm. Prism comes already with the mentioned CPU–based simulator S C , well–suited to debugging tasks but slow, as it is sequential and generates each path by reinterpreting a model specification. Instead of such an on–the–fly reinterpretation, the tool Ymer [15] compiles expres- sions to a form which is faster to evaluate repeatedly. Another tool APMC [7], in, turn provides a translation of Prism models to C programs which are later compiled and executed. Another obstacle preventing the S C from achieving a reasonable performance is its inherent sequentiality. A Monte Carlo simulation is considered to be embarrassingly parallel – it samples the model by generating a large number of independent random paths. Prism has approached this problem by providing the ability to perform dis- tributed sampling, Ymer and APMC support distributed sampling as well. The latest version of Ymer implement multi–threaded sampling as well [16]. The improvement in parallel and distributed sampling is limited by the number of threads supported on multi–core CPU systems. A processor with a rich set of instruc- tions and multiple cache levels is a perfect tool for complex and general problems but using it for a simple simulation of a moderate Markov model would be a very expen- sive over–engineering, both financial cost– and energy–wise. On the contrary, recent advances in GPGPUs made them a very efficient replacement for such computations, which benefit from massive parallelism. This simultaneous execution of hundreds and thousands lightweight threads on a GPGPU comes at a price: well-known limitations include a restriction of a group of threads to execute the same instruction at the same time or a burdensome memory model with a high cost of non–regular memory access patterns. We believe though, that those restrictions do not play a significant role in a Monte Carlo simulation of Prism models. For that purpose we have chosen OpenCL [13] as a framework and programming language for GPGPU simulation. 3.1 Architecture Our decision to implement the simulator engine for Prism using OpenCL has been driven by its capability of supporting many types of devices offered by different ven- dors, the most common being graphic cards and multi–core CPU servers. We will fur- ther use an OpenCL–specific terminology. To simplify the implementation we have used OpenCL bindings for Java, the main source language of Prism, provided through the JavaCL library [2]. Fig. 1 presents a general scheme of the simulator. Within Prism there is no sig- nificant difference between starting a simulation in using S C or S G . In both cases a model and a list of properties is required. Then, a just–in–time source-to-source trans- lation of the model and properties produces a dedicated compute kernel (block Kernel generator in Fig. 1) which is basically a program for a number of vector processors, which use a single–instruction–multiple–data paradigm. The approach is very differ- ent from the reinterpretation scheme implemented in S C , which stores the model and properties in memory structures, and not as an automatically generated program. S G ex- ecutes kernels implemented in OpenCL C language, a modified version of C99, which has been adapted to OpenCL’s device abstraction and stripped from features usually not allowed on a device, such as recursion or function pointers. Those simple programs can be effectively compiled into native bytecode of the device (typically, a graphic card). The syntax of Prism language makes the translation process rather straightforward due to the similarity between its expressions and the OpenCL C language. Only mi- nor and automatically applied changes allow to obtain an expression valid in the latter language. 3.2 Method A scheme of generating a single random path (sample) is presented in Algorithm 1. Capitalised identifiers indicate constants which are injected into kernel source. Many Prism model OpenCL device Kernel generator OpenCL compiler Simulator runtime Simulation PCTL properties results Fig. 1. The OpenCL-based simulator engine in Prism. details have been omitted in the case of continuous time models. For example, in CTMC both times of entering and leaving state may be required in certain situations, such as a bounded until or a property with cumulative reward. The fargument idx is an unique identifier of kernel instance, offset specifies how many paths have been processed in kernels previously computed on the device. The argument seed seeds the generator of pseudorandom numbers. As the scheme is an equivalent of an OpenCL kernel, the two last arguments are OpenCL–specific storage buffers in the device memory, where the kernel is allowed to save results of property verification, and also the length of created path used to display sampling statistics. The first loop (lines 5–7) resets the collected statistical data about properties, the second loop (lines 8–27) generates the path until any of the following: its maximum length is reached (line 8), a deadlock (line 14) or a self–loop (line 18) occurs, or all properties are verified precisely enough (line 24). The last loop (lines 29–31) copies the collected data into the global memory. The implementation had to be specially adapted for GPGPU devices, given that there can be a significant slow–down if the kernel diverges from the SIMD paradigm. Another example of such change is choosing always the smallest, most space efficient integer type for holding a state variable, which is possible as Prism models specify ranges for each such variable. For efficiency, if a simulated model updates a variable with a value exceeding these ranges, then the behaviour is undefined. A large speed–up can be achieved by handling an update synchronised between Prism modules in one step. If such update is performed on the same copy of state vector, it may induce a race condition of the type Read-After-Write. S G detects such situations and creates addi- tional copies of the affected variables, if necessary, instead of relying on the OpenCL compiler, which might handle the issue less effectively. Creating only one instance of a state vector is crucial for performance because it decreases significantly memory space used by the kernel, as discussed later with memory complexity of the algorithm. The simulation kernel is also capable of detecting when there is only one transition avail- able, and it does not change the state. Such a behaviour indicates that there is only a single self–loop in the current state, which is interpreted by S G as a stop condition. If there is an unbounded property which has not been satisfied yet, its value is not going to change. If there is a property with a lower bound, which has not been reached yet, it can be evaluated immediately and the process of simulating the current path ends. Our pseudorandom number generator of choice is Random123 [12], which provides a performance satisfying our needs. For more details on model conversion into a form for GPGPU computation see [3]. Algorithm 1 A generic path generation algorithm for OpenCL kernel. 1: procedure KERNEL(idx, o f f set, seed, path lengths, property results) 2: prng ← initialize prng(seed, idx, o f f set) . A distinct seed for each path 3: state vector ← INITIAL STATE VECTOR 4: time ← 0 5: for each property p ∈ PROPERTIES do 6: reset(results p) 7: end for 8: for i < MAX PATH LENGTH do 9: active updates ← evaluate guards(state vector) . Single and synchronised 10: time update(time) . More complex for CTMC 11: for each property p ∈ PROPERTIES do 12: active properties ←property p update(state vector, results p, time) 13: end for 14: if active updates = 0 then 15: break . Deadlock detected 16: end if 17: no change ← update(prng, state vector, active updates) 18: if no change ∧ active updates = 1 then 19: for each property p ∈ PROPERTIES do 20: active properties ←property p update(state vector, results p, time) 21: end for 22: break . Loop detected 23: end if 24: if ¬active properties then 25: break . Stop sampling 26: end if 27: end for 28: path lengths[idx + o f f set] = i . Save results in global memory 29: for each property ∈ property results do 30: property[idx + o f f set] = results p . Save results in global memory 31: end for 32: end procedure A generated kernel is passed as a string of source code to a specific device compiler (block OpenCL compiler in Fig. 1). This compilation does not add a significant overhead on modern OpenCL platforms – we have found that it typically takes less than one second for tested models. A kernel represented in device bytecode is sent to simulator’s runtime (see again Fig. 1), where a range of work items is enqueued on the device, as described in more details in the next section. Each one of them is responsible for producing exactly one random path through the model and its identifier idx, is paired with an offset, in order to produce a unique key across many separate kernel enqueued on the device. The key is necessary for correct accessing memory storage and unique random seeds for each generated path. 4 Automatic statistical verification The ASV step is optionally triggered at the user request, in one of the following ways: – unconditionally; – if a quantitative property is being computed, like the probability value; – if a steady state is detected prematurely in an iterative PMC method. The last criterion is discussed in more detail in the example in Sec. 5. The ability to process an extensive number of samples in a very short time often allows for a reliable and fast ASV of a property value v obtained using PMC. In the ASV step, in order to save the user’s time, S G must finish within a predefined time Tmax . After the ASV step, the user is presented with a set of diagnostics, so that he can estimate the correctness of v. Let the simulator estimate a value w of the same property. Let RCI be the ratio of the width of CI at confidence level 90% to w. The following independent diagnostics can be presented: – Tmax too low to reach RCI < RCI CI −2 and can be max ; Rmax has a default value of 1 · 10 customised; – v within a CI of a confidence level x, outside a CI of a confidence level y, where x ∈ L = {90%, 95%, 99%}, y ∈ L ∨ {< 90%}. 5 Case study An instantaneous reward in a CTMC can be estimated by weighting the reward func- tion over a probabilistic distribution of states at a time instant t. Prism computes the distribution by uniformisation (Jensen’s method) [5] which discretises the CTMC with respect to a constant rate. Then, probabilities are approximated by a finite summation Z of Poisson–distributed steps of the derived DTMC. The number of these steps de- pends on the precision required, and is computed using the Fox–Glynn method [4]. Yet, in order to shorten computation time, when performing that summation, Prism also tests, if a steady state has been reached, by finding a maximum difference, either rela- tive or absolute, between elements in solution vectors from two successive steps. If the difference is smaller than a constant threshold ε, the summation is terminated early. Prism’s default criterion of the termination is to use a relative difference and ε = 10−6 . The criterion can be customised in order to set a compromise between preci- sion and computational complexity. To illustrate the ASV, we will discuss a model where the detection of a steady state is premature if the default termination criterion is used. In effect, were the automatic SV not enabled, the user would obtain incorrect results without a warning. 300 16 14 250 5 · 106 12 200 4 · 106 10 time [sec] Exp(Cr ) 150 8 Ti 3 · 106 N Ts formal, ε = 1 · 10−6 6 N 100 2 · 106 simulation 4 50 CI 99% 1 · 106 CI x100, 99% 2 0 0 0 0 500 1000 1500 10 100 1000 t t (a) (b) Fig. 2. (a) Estimates of Exp(Cr ), CIs are narrow enough to be hardly seen, thus their magnification by 100 is also show. (b) elapsed CPU time and the number of samples per single property, Ti and Ts are total times of, respectively, the formal method performing Z, and the statistical estimation on S G , N is the number of samples chosen by S G in order to fit within Tmax . We will use a modified Model 2 from [11]. It is a simple CTMC with multiple clients and servers, in which some of the servers can occasionally be broken. To trigger the said premature termination of Z, we modify the model by using constants rs = 1, rl = 500 (see [11] for details), i.e. the server is slower at initiating a connection with a client, but faster at processing a request. We ask for an expected instantaneous reward value Cr , equal to the number of clients requesting at a given time instant t. Let the user choose the default termination criterion, and let he also request, that SV be used for up to Tmax = 1.5 sec. after a PMC step such that a steady state detection has terminated early Z (or any other formal iterative method which uses ε). Exp(Cr ) against t, found using both a formal PMC and S G , is depicted in Fig. 2(a). We see that the model undergoes a fast change at t ≈ 100, during which the increase of requests becomes rapidly slower. The constant Tmax allows for a fairly narrow CI at 99% confidence level – its width never reaches 0.1. In the case of the discussed diagram RCI < 4 · 10−4 for any t – such a narrow CI makes it probable that following the said change at t ≈ 100, the PMC results become less and less precise. This would trig respective warnings, that values from the results of the PMC step fall outside a CI of a high confidence level. The relative temporal overhead of SV for the chosen Tmax is illustrated in Fig. 2(b). It is largest for small t and makes Prism run for about 50% longer. For t & 200 Prism needs less than 10% of an additional CPU time to perform the SV step. The figure also shows the number of samples which can be computed within Tmax ; wee see that the number decreases for larger t due to longer paths which need to be generated by the simulator. In order to obtain the CPU times in Fig. 2(b), we have used an AMD R9 Nano, a modern mid-range GPU with 4096 stream processors. Let us compare that GPU to 14 12 10 time [sec] 8 6 CPU–based 4 GPGPU–based 2 0 0 500 1000 1500 2000 t Fig. 3. Computation time on two OpenCL devices. another OpenCL device, containing two Intel Xeon E5-2630 v3 CPUs with clock fre- quency 2.40GHz, each of them providing 8 cores with HyperThreading, resulting in 32 threads for OpenCL. Fig. 3 compares the simulation time from Fig. 2(b) to that of the CPU OpenCL device, both evaluating the same number of samples. In the case of the latter device, we see times in the range of 8 and 12 seconds, i.e. it is several times slower. This shows the advantage of streams processors in the case of a large number of independent, non–memory intensive tasks. S C has been excluded from the chart as it is not optimised for speed. At t = 2000, where long paths make it especially advantageous to use the on–the–fly compilation, it took S C over 130 minutes to evaluate the same reward property on the same CPU, thousands of times slower comparing a respective simulation on a GPGPU. 6 Discussion The SV limits itself now to diagnostics, but it could be straightforwardly extended to influence on the PMC step. For example, if a property pi turns out to be computed imprecisely in the PMC step, and the following property to compute pi+1 differs only in the time instant t, the SV could automatically decrease ε for the computation of pi+1 . We expect to release an open-source version of the tool in the following months. The further development would be focused on a parallelisation across multiple OpenCL devices, with a dynamic and automatic load balancing. References 1. Aziz, A., Sanwal, K., Singhal, V., Brayton, R.: Model-checking continuous-time markov chains. Transactions on Computational Logic (TOCL) 1(1), 162–170 (2000) 2. Chafik, O.: Javacl. https://github.com/nativelibs4java/JavaCL (2016) 3. Copik, M.: GPU-accelerated stochastic simulator engine for PRISM model checker. Bache- lor’s Thesis, Silesian University of Technology, Gliwice, Poland (2014) 4. Fox, B.L., Glynn, P.W.: Computing poisson probabilities. Commun. ACM 31(4), 440–445 (1988) 5. Gross, D., Miller, D.: The randomization technique as a modeling tool and solution procedure for transient Markov processes. Operations Research 32(2), 926–944 (1984) 6. Hansson, H., Jonsson, B.: A logic for reasoning about time and reliability. Formal aspects of computing 6(5), 512–535 (1994) 7. Herault, T., Lassaigne, R., Peyronnet, S.: Apmc 3.0: Approximate verification of discrete and continuous time markov chains. In: Proceedings of the 3rd International Conference on the Quantitative Evaluation of Systems, QEST ’06, pp. 129–130. IEEE Computer Society, Washington, DC, USA (2006). DOI 10.1109/QEST.2006.5. URL http://dx.doi.org/ 10.1109/QEST.2006.5 8. Kwiatkowska, M., Norman, G., Parker, D.: Stochastic model checking. In: Formal Methods for the Design of Computer, Communication and Software Systems: Performance Evaluation (SFM’07), LNCS, vol. 4486, pp. 220–270. Springer (2007) 9. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: Verification of probabilistic real- time systems. In: G. Gopalakrishnan, S. Qadeer (eds.) Proc. 23rd International Conference on Computer Aided Verification (CAV’11), LNCS, vol. 6806, pp. 585–591. Springer (2011) 10. Legay, A., Delahaye, B., Bensalem, S.: Statistical model checking: An overview. In: Inter- national Conference on Runtime Verification, pp. 122–135. Springer (2010) 11. Pourranjbar, A., Hillston, J., Bortolussi, L.: Don’t just go with the flow: Cautionary tales of fluid flow approximation. In: Computer Performance Engineering - 9th European Workshop, pp. 156–171 (2012) 12. Salmon, J.K., Moraes, M.A., Dror, R.O., Shaw, D.E.: Parallel random numbers: As easy as 1, 2, 3. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pp. 16:1–16:12. ACM, New York, NY, USA (2011). DOI 10.1145/2063384.2063405. URL http://doi.acm.org/10.1145/ 2063384.2063405 13. Stone, J.E., Gohara, D., Shi, G.: Opencl: A parallel programming standard for heterogeneous computing systems. Computing in science & engineering 12(1-3), 66–73 (2010) 14. Younes, H., Kwiatkowska, M., Norman, G., Parker, D.: Numerical vs. statistical probabilistic model checking. International Journal on Software Tools for Technology Transfer (STTT) 8(3), 216–228 (2006) 15. Younes, H.L.S.: Ymer: A statistical model checker. In: Proceedings of the 17th Interna- tional Conference on Computer Aided Verification, CAV’05, pp. 429–433. Springer-Verlag, Berlin, Heidelberg (2005). DOI 10.1007/11513988 43. URL http://dx.doi.org/ 10.1007/11513988_43 16. Younes, H.L.S.: Ymer. https://github.com/hlsyounes/ymer (2016)