=Paper=
{{Paper
|id=Vol-2267/213-217-paper-39
|storemode=property
|title=Scalable semantic virtual machine framework for language-agnostic static analysis
|pdfUrl=https://ceur-ws.org/Vol-2267/213-217-paper-39.pdf
|volume=Vol-2267
|authors=Maxim Menshchikov
}}
==Scalable semantic virtual machine framework for language-agnostic static analysis==
Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10-14, 2018

Maxim Menshchikov
Saint Petersburg State University
E-mail: maximmenshchikov@gmail.com

The more static program analysis spreads in industry, the clearer it becomes that real-world usage requires something more than just optimized algorithms to make execution fast. Distributing computations across cluster nodes is one of the viable ways of speeding up the process. This research demonstrates an approach to organizing distributed static analysis – "Semantic Virtual Machines" – based upon symbolic execution of language-agnostic intermediate codes of programs. These codes belong to a purpose-built language for program representation with semantic objects behind it. As the objects are fully serializable, they can be saved to and later retrieved from a distributed database, which opens up the possibility of spreading an analyzer across many cluster nodes. The Semantic Hypervisor is the core of the system: it acts as a master node managing runnable virtual machines and as a relay between them and the semantics storage. The main efforts are put into developing intercommunication mechanisms that are effective in most scenarios, as well as into integrating the approach into an existing static analyzer. The study presents the architecture, describes its advantages and disadvantages, and suggests solutions for the biggest problems observed during the research. The empirical study shows improved performance compared to a single-node run, and the result scales almost linearly due to the high locality of data. The main result is a scalable and unified language-agnostic architecture for static analysis backed by semantic data in distributed storage. The significance of the research lies in the high achieved performance alongside a clean architecture.

Keywords: static analysis, distribution, scalability, semantic virtual machine, semantic hypervisor

© 2018 Maxim Menshchikov

1. Introduction

Static analysis has an inherent trade-off: less precision and more abstraction lead to lower detection quality, while higher precision leads to increased execution time. Real-world applications tend to grow in source code size over time, and the state space grows even faster. Static analysis can benefit from distributed computation, but a lack of effectiveness makes it questionable whether distribution should be used at all. This paper suggests an approach which shows good results in our testing. The novelty of the approach is that the intermediate code is distributed rather than the source code of an application, and a good level of parallelism is achieved for intraprocedural and, to some extent, interprocedural analysis types.

About the project. Our analyzer is a work-in-progress tool aimed at finding bugs in programs via abstract interpretation and proving their absence using model checking. This is done in a language-agnostic manner using syntax unification and the "semantic virtual machine" intermediate representation described in this paper.
2. Preliminary conversions

Before doing the core part of the analysis, source code has to be read. As multiple languages are supported, generalized abstract syntax trees are retrieved and converted to language-agnostic intermediate codes. That is done in MapReduce [2] fashion according to the algorithm described in [1], except that intermediate codes are generated. The key to distribution is that all objects are serializable and can be stored in a NoSQL database such as MongoDB, to be merged at a later stage.

To distribute files among nodes, we use a greedy algorithm: a suggested input size is estimated for each node, after which files are allocated to nodes, taking the biggest remaining file at each step. This algorithm is suboptimal, but measuring code complexity up front would be even costlier while still giving an inaccurate estimate of the full analysis time.
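As a rough illustration of this allocation strategy, the following C++ sketch assigns the largest remaining file to the least-loaded node. The names (SourceFile, distributeFiles) are hypothetical, and file size stands in for the estimated input cost; this is a sketch of the idea rather than the analyzer's actual implementation.

 #include <algorithm>
 #include <cstddef>
 #include <cstdint>
 #include <string>
 #include <vector>
 struct SourceFile {
     std::string path;
     std::uint64_t sizeBytes;  // cheap proxy for the expected analysis cost
 };
 // Greedy distribution: sort files by size (descending), then repeatedly give
 // the biggest remaining file to the node with the smallest accumulated load.
 std::vector<std::vector<SourceFile>> distributeFiles(std::vector<SourceFile> files,
                                                      std::size_t nodeCount) {
     if (nodeCount == 0)
         return {};
     std::sort(files.begin(), files.end(),
               [](const SourceFile &a, const SourceFile &b) { return a.sizeBytes > b.sizeBytes; });
     std::vector<std::vector<SourceFile>> perNode(nodeCount);
     std::vector<std::uint64_t> load(nodeCount, 0);
     for (auto &file : files) {
         auto target = static_cast<std::size_t>(
             std::min_element(load.begin(), load.end()) - load.begin());
         load[target] += file.sizeBytes;
         perNode[target].push_back(std::move(file));
     }
     return perNode;
 }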
2.1. Virtual Machine language

The key properties of our intermediate codes are as follows. Commands are simple to read and process, and the expression grammar is not defined by the language itself: we consider the expression language a separate entity supported throughout the static analyzer. As in [1], the semantics is based on resources, expressions, and fragments. Fragments, which are essentially groups of expressions with similar properties, make up the execution flow but are not preserved during the transformation to virtual machine codes. Each variable or function has a corresponding resource assigned. Expressions apply actions to operands such as resources and other expressions. Memory accesses are modeled precisely using immutable constant references, constant values, and mutable variables; the latter are used only for propagating return values during function inlining, while the first two object types help simulate low-level load/store semantics.

Our command set includes an instruction for declarative resource preloading – declare (resource). Usually resource declaration can be omitted; however, there are a few exceptions. The first is dynamic arrays, whose size is defined at an exact code point. The second is preloading resources from a distributed database before their use, which improves performance. The other instructions are:

• assign, init, branch [loop], end branch, invoke, return, ... – more common commands.
• check [false expr], constraint [true expr] – constraint system management commands.

An example of the C → IR transformation is presented in Table 1.

Table 1. Various code representations

Regular code in C:
 int f(int x)
 {
     if (x == 0)
         return 1;
     return 0;
 }
 printf("%d", f(4));

Function-separated intermediate codes:
 f: enter function (Variable(x))
 branch x == 0
 return 1
 end branch
 return 0
 exit function
 declare f
 init ConstVal(cv1) = 4
 assign Variable(result) = 0
 invoke f with arguments cv1 store to Variable(result)
 init ConstVal(cv2) = "%d"
 invoke printf with arguments cv2, result

Inlined intermediate codes:
 init ConstVal(cv1) = 4
 check cv1 == 0
 assign Variable(result) = 0
 init ConstVal(cv2) = "%d"
 invoke printf with arguments cv2, result

3. Semantic virtual machines

When the preparation stage is done, the semantics is collected for the whole program and intermediate codes are generated for each procedure. In most cases this is more than enough for performing Model Checking or Abstract Interpretation routines. These algorithms are out of the scope of this study; however, the framework for such processing is quite generic. It is based upon processing of virtual machine codes with a semantic storage behind them.

The hypervisor is the core of the framework and is directly connected to the semantic database. Having brief knowledge of the target program, it manages entry point selection and virtual machine startup. The virtual machine is more like a code interpreter than a physical machine emulator in the traditional sense: it executes the codes and applies static analysis algorithms, contacting the hypervisor for semantic data when needed. This process is called object lookup, and the core idea is that objects can be obtained from a local high-performance storage, from the global storage, or from node-specific local storages. If the object is not found at all, it is created with a deduced type and marked as non-existent; in well-formed programs, such objects always have existing counterparts.

The hypervisor runs virtual machines on entry points in generalization (intraprocedural) or instantiation (interprocedural) mode. The typically required semantics is preloaded into a virtual machine instance – e.g. semantics referenced in headers is definitely required, as is semantics introduced by the source file itself. Semantics introduced by other files is loaded upon request (lazy semantics loading). This keeps data highly local, with little penalty when other parts have to be loaded. Excessive data is unloaded to save memory (automatic semantics unloading). It is essential to simplify the code as much as possible during generalization or IR translation, as this dramatically improves performance. Another possible optimization is to allocate virtual machines on a per-module basis, especially taking network topology into account (per-module VM allocation).
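Before moving to the evaluation, the object lookup described in this section can be illustrated with a minimal C++ sketch. SemanticObject, SemanticStore and VirtualMachine::lookup are hypothetical names used only for illustration, not the analyzer's actual API; the point is the fallback chain from the local cache through the other storages to a placeholder object.

 #include <optional>
 #include <string>
 #include <unordered_map>
 #include <vector>
 // Hypothetical semantic object; real objects carry full semantics and are serializable.
 struct SemanticObject {
     std::string id;
     std::string deducedType;
     bool existent = true;
 };
 // Hypothetical storage interface shared by node-local stores and the global database.
 struct SemanticStore {
     virtual ~SemanticStore() = default;
     virtual std::optional<SemanticObject> find(const std::string &id) = 0;
 };
 class VirtualMachine {
 public:
     // Object lookup: local cache first, then the other storages (node-local, then global);
     // if nothing is found, a placeholder with a deduced type is created and marked non-existent.
     SemanticObject lookup(const std::string &id) {
         auto cached = localCache_.find(id);
         if (cached != localCache_.end())
             return cached->second;
         for (SemanticStore *store : stores_) {
             if (auto found = store->find(id)) {
                 localCache_[id] = *found;  // keep the object local for later accesses
                 return *found;
             }
         }
         SemanticObject placeholder{id, deduceType(id), /*existent=*/false};
         localCache_[id] = placeholder;
         return placeholder;
     }
 private:
     std::string deduceType(const std::string &) const { return "unknown"; }  // placeholder deduction
     std::unordered_map<std::string, SemanticObject> localCache_;
     std::vector<SemanticStore *> stores_;
 };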
4. Evaluation

The developed approach has been integrated into a static analyzer. For testing, we used our yet-to-be-published Verification Example Framework with many defects of various kinds; the results are outlined in Table 2. Previously, an algorithm without an intermediate code representation had been used, but we found it hard to make a language-agnostic analysis both fast and sufficiently general that way: without virtual machines, too many aspects had to be taken into account due to the redundancy of human-readable source code forms. Yet using an assembler-like representation with more atomic store operations would have made the analysis miss important facts about a program. A semantic virtual machine run definitely takes more time than a regular run would, but with the right degree of parallelism this is fully compensated, and the higher precision rate also justifies its usage.

Table 2. Test results

 Approach                           | Performance gain (%) | Precision (% of found issues)
 Semantic Virtual Machine           | 100                  | 100
 Distributed Semantic VM (4 nodes)  | 369                  | 100
 Distributed Semantic VM (8 nodes)  | 720                  | 100
 VM with assembler-like input       | 114                  | 63
 Generic semantics-driven analysis  | 176                  | 80

4.1. Problems observed during evaluation

During the evaluation we observed a few issues, mostly related to problems of the static analysis field rather than to the intercommunication mechanisms.

First, keeping line/column information for error reports is inherently inefficient, as each expression would have to allocate (file ID, location) pairs. We therefore used an LLVM/Clang-like solution in which only the raw position within a file is preserved, and the diagnostics engine caches files and lines before showing reports.
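The following C++ sketch illustrates that scheme (the LineTable name and interface are hypothetical, not our analyzer's or Clang's actual API): line-start offsets are cached once per file, and line/column are recovered from a raw offset only when a diagnostic is emitted.

 #include <algorithm>
 #include <cstddef>
 #include <string>
 #include <utility>
 #include <vector>
 // Caches line-start offsets of one file so that a raw byte offset
 // can be converted to a 1-based (line, column) pair on demand.
 class LineTable {
 public:
     explicit LineTable(const std::string &contents) {
         lineStarts_.push_back(0);
         for (std::size_t i = 0; i < contents.size(); ++i)
             if (contents[i] == '\n')
                 lineStarts_.push_back(i + 1);
     }
     std::pair<std::size_t, std::size_t> decode(std::size_t offset) const {
         // The first line start greater than the offset marks the next line.
         auto it = std::upper_bound(lineStarts_.begin(), lineStarts_.end(), offset);
         auto line = static_cast<std::size_t>(it - lineStarts_.begin());  // 1-based line number
         std::size_t column = offset - lineStarts_[line - 1] + 1;         // 1-based column
         return {line, column};
     }
 private:
     std::vector<std::size_t> lineStarts_;
 };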
Second, we found that low-level-like IR programs with immutable objects require excessive processing of loops. A single node can identify variables by their in-memory addresses, but loop unrolling or function inlining requires a "refresh" of the IR representation, replacing old variables and constant references or values with newly allocated counterparts. Persistent storage relies on object IDs but essentially has the same issue. Reallocation is not always needed, as an ID can be changed in place once its previous usage has been committed to the SMT program or abstract interpretation lattices. Clearly, this is the biggest issue of a semantic database-driven language.

The last problem is that the performance gain decreases as the code base becomes more cross-dependent. For intraprocedural analysis this is not a big problem, especially with resume-based verification, since the processing of models is done early, but interprocedural kinds of analysis still block in the process. Extensive use of global resources adds a lot to the problem; however, we consider such an architecture of a target program a problem in itself, and the analyzer's slowdown only a side effect.

5. Discussion

The resulting architecture is almost linearly scalable for some kinds of analysis and for most applications that do not involve intensive data exchange. The word "almost" deserves qualification: a significant footprint may be noticed if unstructured sources are provided. This issue is negligible given the way safety-critical software is developed – best practices prohibit relying on global variables – and, with the help of an internal command for cache prewarming, such projects should suffer only a negligible slowdown. Interprocedural analysis calls for dependency-analysis-based scheduling, which is part of additional research not covered by this paper.

The architecture is slower than the other tested designs in sequential mode: programs in intermediate codes are not generated for free. Still, it makes the analysis language-agnostic, indirectly improving code quality by reusing all of the verification sequences for all languages instead of one, so we find the usage of the approach justified. We still want to examine interprocedural analysis further in future research.

6. Related work

Most studies focus on analyzing distributed systems rather than on developing a distributed static analyzer. The main pattern used in this paper is based upon MapReduce [2], the well-known distributed model that reduces a problem to multiple chunks in a failsafe manner and then glues the results back together. We do not focus on technical details but rather describe the measures we had to take in order to perform such an analysis. In [3], the authors adapt the actor model to whole-program analysis, particularly testing it on call graph analysis. Our approach likewise aims at providing a framework rather than solving a single issue.

The DiVinE model checker [4] provides an efficient way to distribute analysis to multiple nodes using MPI; however, unlike our research, it uses LLVM [5] operational codes rather than its own representation – not without reason, as this quickly brings support for the languages handled by LLVM-based compilers. Our language can heavily exploit the database behind it in order to preserve memory and to ease distribution between nodes, which we see as an advantage over LLVM codes. The research [6] modifies a static analyzer for scalability via OpenMPI. It is based upon LLVM IR, and the results demonstrated by the authors show its efficiency only on large projects. Unlike that project, our analyzer is built with HPC in mind, although this was not a primary goal at the current stage.

Intermediate code representations are not uncommon in the static analysis field. The most widespread intermediate representation is LLVM IR [5], which is used in many compilers including Clang. Many analyzers use LLVM IR because it is easy to start with. However, our opinion (supported by the authors of [6]) is that this kind of IR requires significant effort to apply to static analysis and is often a reason for inefficiency. The Saturn project [7] introduces Calypso, a general-purpose logic programming language; judging by the research reports, it was created for similar reasons.

7. Conclusion and future work

The main result of the research is a model for distributing static analysis load across multiple nodes. A unique intermediate code representation language backed by a semantic database has been created with the purpose of simplifying the analysis stage. Once source programs are translated to this language, they are sent directly to virtual machines for execution using the suggested greedy file distribution algorithm, making the analysis stage language-agnostic in the described terms. The hypervisor's role is to control and orchestrate the virtual machines, which in turn can request non-local semantics from the hypervisor. The distributed object lookup algorithm has been designed to cope with large projects. Tests show that while the presented model implies a non-zero overhead due to the translation to IR, the analysis significantly benefits from load distribution and good semantics preservation, which results in improved performance with a few nodes compared to a single-node run (although among single-node runs the model is definitely not the fastest) and in better precision levels. The overall evaluation has shown that the model is quite flexible, so future work will concentrate on bringing more analysis types on board rather than on changing the model. Certain aspects, such as source distribution and semantics reduction, can be further improved by employing more HPC techniques.

References

[1] Menshchikov M.A., Lepikhin T.A. Applying MapReduce to static analysis // Proceedings of the XLVII International Scientific Conference on Control Processes and Stability (CPS'17). V. 4 (20). P. 433-444.
[2] Dean J., Ghemawat S. MapReduce: simplified data processing on large clusters // Communications of the ACM. 2008. V. 51. No. 1. P. 107-113.
[3] Garbervetsky D., Zoppi E., Livshits B. Toward full elasticity in distributed static analysis: the case of callgraph analysis // Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017. P. 442-453.
[4] Barnat J. et al. DiVinE: Parallel distributed model checker // Parallel and Distributed Methods in Verification, 2010 Ninth International Workshop on, and High Performance Computational Systems Biology, Second International Workshop on. IEEE, 2010. P. 4-7.
[5] Lattner C., Adve V. LLVM: A compilation framework for lifelong program analysis & transformation // Proceedings of the International Symposium on Code Generation and Optimization: feedback-directed and runtime optimization. IEEE Computer Society, 2004. P. 75.
[6] Abdullin A., Stepanov D., Akhin M. Distributed Analysis of the BMC Kind: Making It Fit the Tornado Supercomputer // International Conference on Tools and Methods for Program Analysis. Springer, Cham, 2017. P. 1-10.
[7] Aiken A. et al. An overview of the Saturn project // Proceedings of the 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering. ACM, 2007. P. 43-48.