=Paper=
{{Paper
|id=Vol-2953/SEIM_2021_paper_23
|storemode=property
|title=Generator of Automated Tools for Program Instrumentation
|pdfUrl=https://ceur-ws.org/Vol-2953/SEIM_2021_paper_23.pdf
|volume=Vol-2953
|authors=Mikhail Onischuck,Vladimir Itsykson
}}
==Generator of Automated Tools for Program Instrumentation==
Mikhail Onischuck, Vladimir Itsykson
Peter the Great St. Petersburg Polytechnic University, Saint-Petersburg, Russian Federation
orcid.org/0000-0001-5359-0161, orcid.org/0000-0003-0276-4517

Abstract. Instrumentation is one of the methods used in dynamic program analysis for assessing software performance. This paper proposes a technology for constructing software instrumentation tools for different programming languages. The instrumentation functions are described within this approach using several grammar-based DSLs. The obtained instrumentation toolkit is the result of generating a new system based on formal descriptions of the instrumentation process. The tracing functions are embedded into the original program using the TXL utility; a conversion program is also generated for this utility. The developed prototype was tested on four large projects written in different programming languages: Java, Python, C++ and Object Pascal. The tests confirmed the efficiency of the approach and the applicability of the developed prototype.

Keywords — instrumentation, instrumentation tools generation, formal language grammar, TXL

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

I. Introduction

Controlling software quality is a major challenge for the software industry. IT companies spend tremendous efforts and resources on diverse methods and practices for software quality assurance. The starting stage generally consists in software quality analysis: checking whether the software is compliant with the specifications and searching for errors. Instrumentation is a mechanism used for software analysis. Instrumentation is commonly understood as inserting additional code into the source program, which allows detecting and monitoring the parameters characterizing the software performance, with options for debugging and troubleshooting [1].

There are different approaches to program instrumentation. Manual instrumentation implies that the developers insert logging snippets on their own, relying on their understanding of the program logic. In automated instrumentation, depending on the goals set, the developers use some tool that inserts the logging code into the source program based on certain rules. Unfortunately, most of the existing tools come with limitations restricting the options for code instrumentation. What is more, such tools are generally hard-coded to work only with a specific programming language, even though instrumentation may be required for programs in any language.

In this paper, we propose a universal approach to solving the instrumentation problem. It incorporates a declarative framework based on the grammar of the target programming language and serves to generate automated systems for software instrumentation.

The paper is organized as follows. Section 2 describes the technology we have developed for generating instrumentation systems. Section 3 considers a prototype instrumentation tool generator. Section 4 presents the results of testing the approach on real software projects. Section 5 analyzes the existing solutions in automated instrumentation. The paper is concluded by summarizing the results and outlining the directions for further development.
II. Technology for generating instrumentation tools

In general, depending on the programming language and the runtime environment, either the source code or some intermediate representation (e.g., bytecode) can be instrumented. For example, [2] considers source code instrumentation in C++ for verifying potential vulnerabilities that can threaten system security. The technology developed in [3] introduces an instrumentation framework based on analysis and processing of an intermediate representation containing some form of an abstract syntax tree. In contrast, [4] and [5] discuss bytecode instrumentation for the Java virtual machine. Because our study deals with a universal instrumentation system independent of the programming language and runtime environment, the only viable way to organize such a system is source-code instrumentation.

Instrumentation consists in converting the source program into a new one, with special code added to it which enables tracing once the program is executed. This means that the task of instrumentation is reduced to converting (transforming) one source code into another.

A. Methodology

Since one of the requirements for the approach developed is the option to instrument programs written in different languages, it should incorporate a mechanism for generating instrumentation systems for different target languages. An obvious solution is to use the grammar of the target language as input for the instrumentation system. Since standard grammars are not originally designed for instrumentation, a grammar markup mechanism should be developed, adding semantics to the grammar for subsequent use in instrumentation rules. We achieve this by utilizing grammar annotations, created once for each target programming language. The actual instrumentation rules are developed by the user solving a specific applied problem. The general schematic for the approach developed is shown in Fig. 1.

Fig. 1. General schematic of the generator.

The module for generating the transformation rules takes as input the grammar of the target language, the grammar annotation for instrumentation, and the instrumentation rules; the module then generates transformation rules based on these data, feeding them to the input of the program transformation module together with the original source code. The program transformation module in turn generates an instrumented program semantically equivalent to the original one with the added instrumentation code.
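The data flow in Fig. 1 can be read as a two-stage pipeline: the generator consumes the grammar, its annotation and the user rules and produces transformation rules, which the program transformation module then applies to the original source code. The following C++ sketch is purely illustrative — every type and function name is hypothetical and not taken from the prototype — and only mirrors this composition of the two modules.

```cpp
#include <functional>
#include <string>

// Hypothetical placeholders for the input and intermediate artifacts of Fig. 1.
struct Grammar              { std::string text; };  // grammar of the target language
struct GrammarAnnotation    { std::string text; };  // instrumentation-oriented markup
struct InstrumentationRules { std::string text; };  // user-defined rules in the DSL
struct TxlRules             { std::string text; };  // generated transformation rules

// Stage 1: turn the formal descriptions into transformation rules.
using RuleGenerator = std::function<TxlRules(const Grammar&, const GrammarAnnotation&,
                                             const InstrumentationRules&)>;
// Stage 2: apply the generated rules to the original source code.
using ProgramTransformer = std::function<std::string(const TxlRules&, const std::string&)>;

// The overall pipeline: source code in, semantically equivalent instrumented code out.
std::string instrument(const RuleGenerator& generate, const ProgramTransformer& transform,
                       const Grammar& g, const GrammarAnnotation& a,
                       const InstrumentationRules& r, const std::string& source) {
    const TxlRules rules = generate(g, a, r);  // first stage (rule generation module)
    return transform(rules, source);           // second stage (program transformation module)
}
```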
B. Representation and processing of program source code

Source code transformation has been examined in great detail; a wide range of tools have been developed that can effectively solve this task using some form of a parse tree or a concrete syntax tree (CST) as the internal representation. The nodes of this tree represent syntactic structures and individual elements adopted in the target programming language, so making changes to the processed code (its transformation) can be distinguished as a separate stage of instrumentation (see Fig. 1). Such systems require a description of transformations at the level of individual parse tree nodes (and/or sets of nodes) to operate. It can be difficult to understand this description or to correlate it with the problem solved by the user; an additional complication is that it must strictly follow the grammar of the language used. These drawbacks lead us to identify another significant stage of instrumentation, namely preparing for the changes (generating a description of the transformations). The description of the changes can be simplified by introducing some primary representation that is limited in its functions but at the same time universal enough to be used for processing source code in different programming languages. Because there are no generally accepted standards for grammar formatting and structuring, this requirement means that the instructions describing the purpose of the changes (the instrumentation rules) should be separated from the instructions describing repeating primitive operations specific to the given grammar and the syntax elements used (the grammar annotation, Fig. 1).

Taking a CST as their internal representation, systems for source code transformation process syntactic structures by moving from higher levels (file, module, expression set) to lower ones (string literals, digits). The order in which CST nodes are visited and whether the same node can be revisited depend solely on the specifics of a particular tool and the required transformations. Therefore, the instrumentation task can be solved using a method of step-by-step recursive descent down the parse tree, which is modified along the way by inserting additional code (the top-down one-pass method).

Let us define a workspace where instrumentation will be done, i.e., the working context, as some subset of CST nodes given as Y = {g ∈ G | A(g)}, where A(g) is the condition (a logical statement about the properties) under which the element g belongs to the subset Y, while G is the entire set of nodes of a particular parse tree. At the same time, using a CST means dealing with a large number of intermediate nodes introduced by the grammar rules used in constructing it. Consequently, information should be collected during the top-down traversal of the parse tree (a), to be subsequently used (b) for deciding whether a node belongs to the instrumentation context before actually inserting the code.

Thus, we have formulated a method for processing parse tree nodes and a mechanism for controlling the workspace of this process.
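As a rough illustration of the top-down one-pass method described above, the sketch below collects ancestor information while descending the tree and evaluates the membership condition A(g) before inserting code. It is a conceptual C++ sketch only, made under the stated assumptions: in the actual system the traversal is performed by generated TXL functions, and every name here (CstNode, ContextPredicate, insertTracingCode) is hypothetical.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical CST node; the real representation lives inside the TXL environment.
struct CstNode {
    std::string kind;                 // grammar nonterminal, e.g. "class_body"
    std::string text;                 // identifier or literal text, if any
    std::vector<CstNode> children;
};

// A(g): the membership condition for the working context Y = {g in G | A(g)}.
// It also receives the information collected on the way down (the ancestor path).
using ContextPredicate =
    std::function<bool(const CstNode&, const std::vector<const CstNode*>&)>;

// Placeholder for the code insertion performed by the generated rules.
void insertTracingCode(CstNode& node) { node.text += " /* tracing code inserted */"; }

// Top-down one-pass traversal: (a) accumulate information while descending,
// (b) test the predicate before instrumenting, then recurse into the children.
void instrumentTopDown(CstNode& node, const ContextPredicate& A,
                       std::vector<const CstNode*>& path) {
    if (A(node, path))                // does the node belong to the working context Y?
        insertTracingCode(node);
    path.push_back(&node);            // information collected during the descent
    for (CstNode& child : node.children)
        instrumentTopDown(child, A, path);
    path.pop_back();
}
```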
C. Transformation of program source code

The source code of programs written in a wide range of programming languages can be processed either by developing specific tools finely tailored to the individual languages, or by taking an existing tool or language (with the appropriate runtime environment), such as, for example, The Meta-Environment [6], Stratego/XT [7], Rascal [8], TXL [9] or similar. We have chosen the TXL language to confirm the applicability of the approach.

TXL is a domain-specific language designed to support source analysis and processing through rule-based structural transformation [9]. The FreeTXL compiler/interpreter (referred to as the TXL utility from now on) is the official runtime environment for programs in the TXL language.

The TXL language is relatively simple and pure; furthermore, the TXL utility offers such benefits as binary executables available for various software platforms and easy integration (providing XML output). For these reasons, we decided to use the TXL utility to apply the transformations to the source code, the transformations themselves being produced by the instrumentation system generator. In view of this, the generated "instrumentation system" is understood in this study as the description of the transformations in the TXL language together with the grammar of the target programming language and the TXL runtime environment.

The where clause is used in TXL to constrain the application of transformation rules and functions [10], and sequences of such clauses are combined by conjunction. Applying logical disjunction to the comparison operators within a single where clause, in turn, allows introducing expressions in conjunctive normal form (CNF), obtained from the predicates constructed by users to describe instrumentation contexts.

Due to the functional nature of the TXL language, we decided to use the arguments/parameters of a chain of functions to pass the nodes collected during the traversal down the parse tree. In this case, the "chain" is a particular solution to the instrumentation problem, accounting for the above one-pass method of CST traversal.
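To make the CNF lowering described earlier in this section more concrete: a user context predicate can be reduced to a conjunction of clauses, each clause being a disjunction of elementary comparisons, so that each clause corresponds to one where clause and the conjunction corresponds to a sequence of where clauses. The following C++ sketch of such a predicate is a hypothetical model used only for illustration; it is not the data structure used by the generator, and the node properties shown are assumptions.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical properties extracted from a CST node for evaluating a context predicate.
struct NodeProperties { std::string className, methodName, structureKind; };

using Atom   = std::function<bool(const NodeProperties&)>;  // one elementary comparison
using Clause = std::vector<Atom>;    // atoms joined by logical OR  -> one where clause
using Cnf    = std::vector<Clause>;  // clauses joined by logical AND -> sequence of where clauses

bool matchesContext(const Cnf& predicate, const NodeProperties& node) {
    for (const Clause& clause : predicate) {     // every clause must hold...
        bool clauseHolds = false;
        for (const Atom& atom : clause)          // ...i.e. at least one atom in it holds
            if (atom(node)) { clauseHolds = true; break; }
        if (!clauseHolds) return false;
    }
    return true;
}

// Illustrative context "method main of class Main", as two single-atom clauses.
const Cnf mainOfMain = {
    { [](const NodeProperties& p) { return p.className  == "Main"; } },
    { [](const NodeProperties& p) { return p.methodName == "main"; } },
};
```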
D. Grammar annotation

It can be a challenge to extract the directed acyclic graph (DAG) representing the nesting hierarchy of syntactic structures from the grammar of an arbitrary programming language. This obstacle can be overcome either by manual grammar markup or by constructing heuristics sufficient for defining the scope of the instrumentation on the parse tree. For this purpose, we constructed an XML-based format for describing grammar annotations, including such information as (a possible in-memory model is sketched after the list):

● descriptions of the syntactic structures of the target programming language that are the most significant for the end user in accordance with the grammar, each including:
  ○ a text template describing the type of structure;
  ○ instrumentation points combined with a simplified instrumentation algorithm;
● the directed acyclic graph of the nesting hierarchy of syntactic structures;
● points of interest describing the text data to be retrieved from the parse tree nodes;
● auxiliary user-defined functions in the TXL language.
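The sketch below is one possible in-memory view of the annotation contents listed above, written in C++ purely for illustration. The prototype actually stores this information in an XML file whose exact schema is not reproduced here; all field and type names are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model of an instrumentation point declared in the annotation.
struct InstrumentationPoint {
    std::string id;          // identifier the user can refer to in the rules
    std::string algorithm;   // simplified insertion algorithm (search/replace pattern)
};

// Hypothetical model of a syntactic structure significant for the end user.
struct SyntacticStructure {
    std::string name;                          // nonterminal from the target grammar
    std::string textTemplate;                  // text template describing the structure
    std::vector<InstrumentationPoint> points;  // where code may be inserted
};

// Hypothetical aggregate mirroring the four kinds of annotation information.
struct AnnotationModel {
    std::vector<SyntacticStructure> structures;               // structures of interest
    std::map<std::string, std::vector<std::string>> nesting;  // DAG: structure -> nested structures
    std::map<std::string, std::string> pointsOfInterest;      // text data to extract from CST nodes
    std::vector<std::string> auxiliaryTxlFunctions;           // user-defined TXL helpers (verbatim)
};
```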
E. User description of the instrumentation process

The prototype uses a declarative domain-specific language (DSL) to describe user-defined rules; it is based on the languages of such projects as Annotation File Utilities [11] and AspectJ [12]. Fig. 2 shows an example of rules described in the developed language from the perspective of the system's end user, aimed at logging the first if-statement executed in the context of the "main" method contained in the "Main" class.

Fig. 2. End user description of the instrumentation rules.

The user description of a set of rules comprises the following main components (the number of each list item corresponds to the circled number in the figure):

1) a listing of the source code fragments used in this set of rules and their relative file paths;
2) a listing of the instrumentation contexts of interest to the user (both simple, i.e., a set of statements about the properties of a syntactic structure, and composite, i.e., several contexts joined by first-order logical operators);
3) a grouping of the instrumentation steps into named rules;
4) refinements of the instrumentation context using programming language keywords, modifiers (enclosed in square brackets) and text patterns, if any are required for the task to be solved by the user;
5) specific instrumentation points set by their identifiers;
6) user-defined variables created from text elements and constant values;
7) a list of the fragments to be inserted simultaneously into the same place, with parameters specified if any are required by the text of the snippet used.

The main benefits of the developed DSL are that it provides two methods for describing instrumentation contexts (item 2) and an option for successively refining the specified context (items 4, 5). This way, the user can considerably limit the workspace for transformations while the implementation details remain concealed. Nevertheless, the transformation descriptions output by the generator can be used as the initial step for more complex instrumentation routines.

F. Tool generation procedure

The input artifacts for the "generator and transformation tool" system are the following:

● the source code of the program to be instrumented;
● the grammar description of the target language that the source code is written in;
● the grammar annotation;
● the description of user-defined instrumentation rules;
● source code snippets (i.e., "fragments") in the target programming language to be inserted;
● additional startup and runtime environment parameters.

As an intermediate output, the generator provides the instructions for the transformations to be performed on a specific input file with source code, in accordance with the grammar given for the target programming language. The output artifact is source code in the target programming language that has been subjected to the required transformations.

The developed prototype automatically generates the transformation instructions as a set of interconnected TXL functions in the following order:

1) load, parse and check the dependencies of the source code fragments used, in accordance with the rules described by the end user;
2) calculate the maximum distances from the root node for each node of the DAG representing the key structures of the target language, in accordance with the grammar annotation provided (a sketch of this computation follows the list);
3) build wrappers over the standard comparison operators;
4) build the functions implementing the tasks assigned to the points of interest;
5) build the functions checking whether CST nodes belong to contexts;
6) build function chains in accordance with the instrumentation rules, allowing for the user's requirements;
7) build auxiliary TXL functions;
8) build user-defined functions;
9) build the main TXL function and apply the policies for additional user-defined functions;
10) update the states of the functions that are chain elements;
11) generate the TXL instructions for the required transformations and call the TXL utility.
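Step 2 above is a longest-path computation on the nesting DAG. A minimal sketch of such a computation is given below; the paper does not describe the generator's actual data structures, so the DAG representation (a name-to-children map) and the function names are assumptions, and how the resulting distances are used afterwards is not shown.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical nesting DAG: each key structure maps to the structures it may contain.
using NestingDag = std::map<std::string, std::vector<std::string>>;

// Maximum distance from the root structure to every node of the DAG.
// Well defined because the nesting hierarchy is acyclic.
std::map<std::string, int> maxDistancesFromRoot(const NestingDag& dag,
                                                const std::string& root) {
    std::map<std::string, int> dist;
    dist[root] = 0;
    // Depth-first relaxation: whenever a longer root-to-child path is found,
    // the improvement is propagated downwards; on a DAG this terminates.
    std::function<void(const std::string&)> visit = [&](const std::string& node) {
        auto it = dag.find(node);
        if (it == dag.end()) return;                 // leaf structure: nothing nested
        for (const std::string& child : it->second) {
            const int candidate = dist[node] + 1;
            auto found = dist.find(child);
            if (found == dist.end() || candidate > found->second) {
                dist[child] = candidate;             // found a longer path from the root
                visit(child);
            }
        }
    };
    visit(root);
    return dist;
}
```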
III. Prototype implementation

To test the efficiency of the above approach, we constructed a prototype instrumentation tool generator. It combines the TXL utility with a generator application written in C++, which uses the TinyXML2 [13] and Boost [14] libraries for parsing the grammar annotations and user descriptions and for interacting with the TXL utility, respectively. Fig. 4 shows the general schematic of the prototype together with the artifacts necessary to solve the problem posed: in accordance with the initial model (Fig. 1), the generator utility is a module producing an intermediate (optionally cached) description of instrumentation instructions in a format that can be used by the second part of the pair, i.e., the transformation system. The generator utility thus acts as the module producing the transformation rules, while the TXL utility performs the function of the program transformation module. The colors of the arrows in the figure correspond to different frequencies of analysis and processing of the artifacts by the generator and the TXL utility (orange is more frequent, purple is less frequent), while the colors of the input artifacts characterize how difficult they are for the user to create (red is very difficult, blue is moderately difficult, green is easy).

Fig. 4. General schematic of the prototype.

Function chains are generated for each individual chain expression containing a context refinement and the keyword "add" (see Fig. 2) together with a list of code fragments, as a separate rule (group of refinements), in accordance with the order of the description.

The general structure of a typical chain of calls to programmer-defined domain-specific functions is as follows (a conceptual sketch is given at the end of this section):

1) C-functions (collect) are rule-type TXL functions designed to accumulate information from the parse tree nodes. This information is later used to assess whether a node belongs to the chosen instrumentation context. Such functions operate by calling the next function from the chain and passing it all the values of the arguments received by the current function, together with the node under consideration.
2) The F-function (filter) is a function designed to filter CST nodes relative to the contexts described by the end user, by calling the auxiliary function that assesses whether a node belongs to the context and passing it the collected nodes.
3) R-functions (refine) are functions designed to refine the context to a limited subset of the required syntactic structures of the target programming language, in accordance with one of the implemented modifiers:
  a) "first" searches for and processes only the first node of the type specified in the grammar annotation encountered in the current subtree;
  b) "all" searches for and processes all nodes of the type specified in the annotation;
  c) "level" searches for and processes the nodes located at the same (first) nesting level. A schematic example illustrating how this modifier works is given in Fig. 3: different nesting levels of syntactic structures are shown bottom-up, and a sequence of structures from the standpoint of the source code is shown from left to right; the colors correspond to different types of CST nodes (nodes of the required type are colored red; nodes to be processed are colored dark red); the numbers indicate the order in which the pass is performed in the TXL environment.
4) I-functions (instrument) are functions designed directly for instrumentation, in accordance with the patterns (search and replace) specified in the annotation and the operation algorithm.

Fig. 3. Operation of the level modifier visualized.
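The collect/filter/refine/instrument chain can be pictured as a series of functions, each of which forwards everything it has received, plus its own contribution, to the next link. The C++ sketch below is conceptual only: in the prototype these are generated TXL functions, and every name in the sketch is hypothetical. Under these assumptions, the rule of Fig. 2 would correspond roughly to composing a collect step, a filter for the "main"-method context, a "first" refinement down to the if-statement, and a final insertion of the logging fragment.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical node type; the real nodes live inside the TXL environment.
struct Node { std::string kind, text; std::vector<Node*> children; };

using Collected = std::vector<Node*>;                 // nodes gathered on the way down
using Link = std::function<void(Node&, Collected&)>;  // one element of the chain

// C-function: collect information from the node, then call the next link,
// passing on everything received plus the node under consideration.
Link makeCollect(Link next) {
    return [next](Node& node, Collected& collected) {
        collected.push_back(&node);
        next(node, collected);
    };
}

// F-function: keep only nodes that belong to the user-described context.
Link makeFilter(std::function<bool(const Node&, const Collected&)> inContext, Link next) {
    return [inContext, next](Node& node, Collected& collected) {
        if (inContext(node, collected))
            next(node, collected);
    };
}

// R-function with a "first"-like modifier: hand on the first matching child
// (the real modifier searches the whole current subtree; this is simplified).
Link makeRefineFirst(std::string kind, Link next) {
    return [kind, next](Node& node, Collected& collected) {
        for (Node* child : node.children)
            if (child->kind == kind) { next(*child, collected); break; }
    };
}

// I-function: the end of the chain, where the insertion itself happens.
Link makeInstrument(std::string snippet) {
    return [snippet](Node& node, Collected&) { node.text = snippet + node.text; };
}
```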
IV. Prototype testing

The applicability of the developed instrumentation method and the functionality of the implemented prototype were verified using several industrial open-source projects. The projects and the source code samples were chosen based on the capabilities of the TXL grammars available at the time of this study [15]; notably, some elements of the grammars were slightly modified to better suit the described instrumentation approach (increased separation of syntactic structures). As a result, four projects were chosen for the experiments, written in four different programming languages:

● AspectJ [12] is a system and DSL designed to implement aspect-oriented programming principles within the Java language. Version 1.9.5 was chosen for the experiment.
● The Keras library [16] is an add-on library for high-level processing and construction of deep learning neural network models for the Python language. Version 2.3.1 was chosen for the experiment.
● The Boost library [14] is a multifunctional modular library for building software products in the C++ language. Version 1.72.0 was chosen for the experiment.
● The Lazarus IDE [17] is a graphical cross-platform environment for rapid application development in the Object Pascal language and its dialects. Version 2.0.8 was chosen for the experiment.

Different syntactic structures of the languages were instrumented in the experiments, such as import sections, class bodies, methods, branch operators, and loops, in particular making use of different nesting levels of programming language structures. The source code of the performed experiments is available in [18].

The approach was found to be the most efficient for instrumenting the AspectJ project written in Java. As for the other projects, we found both minor limitations associated with multivariate forms of some syntactic structures (for example, the import statement in Python), and major difficulties where the descriptive capabilities of the language grammars provided by the TXL developers and/or the community turned out to be insufficient. In particular, the Object Pascal grammar covered about 80% of the code base of the Lazarus IDE project at the time of the study, while the C++ grammar (designed for an older version of the C++ standard) covered only 21% of the Boost library code base. The simplified language grammars should be further refined, providing full support for the standards of these languages, before the developed prototype can be used industrially.

We can conclude from our findings that the proposed approach and the developed prototype generator are largely applicable for the tasks described. There are certain limitations because the existing grammars of the programming languages are imperfect; moreover, the constructed prototype has some drawbacks yet to be eliminated.

V. Comparison with counterparts

There is a wide range of software systems offering options for automating the instrumentation process to some degree. Examples of such systems include tools for test coverage analysis (GCC Gcov [19], Froglogic Squish Coco Coverage Scanner [20]), code analysis (Testwell CTC++ Preprocessor [21], Bullseye Coverage [22]), tracing and statistics calculation (Google Web Tracing Framework [23]), and vulnerability assessment (see [2] and [4]).

Each of these projects works with only a very small subset of programming languages, and they need to be further developed and adapted to work with new languages. For open-source projects, this can be done by forking the main code base, but this takes a lot of time and resources. Commercial tools (Testwell CTC++, Bullseye Coverage and Froglogic Squish Coco) can only be expanded by their developers.

Compared to such instrumentation methods as, for example, bytecode processing [5], application of the aspect-oriented programming paradigm [24], or reduction to a single intermediate representation with subsequent reconstruction [3], the main difference of the approach proposed in our study is that the user can flexibly control the instrumentation process and adapt the instrumentation to other programming languages by specifying instrumentation rules in terms of the target programming language, as well as independently creating and annotating grammars in terms of parsing texts in formal languages.

VI. Conclusion

The study presents an approach to software instrumentation, along with a developed prototype of a generator of automated instrumentation tools, for which a DSL for instrumentation rules and a format for writing annotations to formal grammars were described. We tested the prototype on several industrial open-source projects, confirming that it functions properly and that the approach can be applied successfully.

If a program is represented as a text in some formal language with a developed grammar describing the structures of this language, the approach can be considered sufficiently universal for solving the instrumentation problem. This, however, implies that the capabilities of such a generator mainly depend on the capabilities of the transformation system applied and on the grammar of the target language, which was confirmed experimentally. High performance is the most crucial factor for program execution: it can be achieved by making the program as close as possible to a machine-generated semi-structured format, removing most of the information that does not improve this indicator (i.e., information about the original structure of the program). All of this limits the potential applications of the approach to environments with access to structural information about the program.

The technique can be further developed by expanding the DSLs constructed, which may allow overcoming the existing context constraints, and by conducting more focused experimental studies for a wider range of programming languages.

VII. References

[1] Lomakina L., Vigura A. Testing of software systems based on computer algebra (in Russian). International Journal of Experimental Education. 2016, 10-1, 132–134.
[2] Li H. et al. Automated source code instrumentation for verifying potential vulnerabilities. IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, Cham. 2016, 211–226.
[3] Chittimalli P. K., Shah V. GEMS: a generic model based source code instrumentation framework. IEEE Fifth International Conference on Software Testing, Verification and Validation. 2012, 909–914.
[4] Chander A., Mitchell J. C., Shin I. Mobile code security by Java bytecode instrumentation. Proceedings DARPA Information Survivability Conference and Exposition II. DISCEX'01. 2001, Vol. 2, 27–40.
[5] Binder W., Hulaas J., Moret P. Advanced Java bytecode instrumentation. Proceedings of the 5th International Symposium on Principles and Practice of Programming in Java. 2007, 135–144.
[6] The Meta-Environment. Centrum Wiskunde & Informatica [online]. [viewed 13.02.2021]. Available from: http://www.meta-environment.org/
[7] Stratego Program Transformation Language [online]. [viewed 13.02.2021]. Available from: http://strategoxt.org/
[8] Rascal MPL - About. Centrum Wiskunde & Informatica [online]. [viewed 13.02.2021]. Available from: https://www.rascal-mpl.org/about/
[9] About TXL. Queen's University and the TXL Project [online]. [viewed 10.02.2021]. Available from: http://txl.ca/txl-abouttxl.html
[10] Cordy J. Excerpts from the TXL cookbook. International Summer School on Generative and Transformational Techniques in Software Engineering. Springer. 2009, 27–91.
[11] Annotation File Format Specification. The Checker Framework project website [online]. [viewed 10.02.2021]. Available from: https://checkerframework.org/annotation-file-utilities/annotation-file-format.html
[12] The AspectJ Project. The Eclipse Foundation [online]. [viewed 10.02.2021]. Available from: http://www.eclipse.org/aspectj/
[13] Lee Thomason. TinyXML2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs [online]. [viewed 10.02.2021]. Available from: https://github.com/leethomason/tinyxml2
[14] Boost C++ Libraries. Boost.org [online]. [viewed 10.02.2021]. Available from: https://www.boost.org/
[15] TXL World. Queen's University and the TXL Project [online]. [viewed 10.02.2021]. Available from: https://txl.ca/txl-resources.html
[16] Keras Team. Keras: the Python deep learning API [online]. [viewed 10.02.2021]. Available from: https://keras.io/
[17] Lazarus and Free Pascal Team. Lazarus Homepage [online]. [viewed 10.02.2021]. Available from: https://www.lazarus-ide.org/
[18] Onischuck Mikhail. Source Code Instrumentation System [online]. [viewed 14.02.2021]. Available from: https://github.com/dog-m/SCIS/tree/master/example/experiments_with_production_level_projects
[19] Gcov (Using the GNU Compiler Collection (GCC)). Free Software Foundation, Inc [online]. [viewed 10.02.2021]. Available from: https://gcc.gnu.org/onlinedocs/gcc/Gcov.html
[20] Squish Coco 4.3.2. Froglogic GmbH [online]. [viewed 10.02.2021]. Available from: https://doc.froglogic.com/squish-coco/latest/squishcoco.pdf
[21] Testwell CTC++ Description. Testwell [online]. [viewed 10.02.2021]. Available from: http://www.testwell.fi/ctcdesc.html
[22] BullseyeCoverage Measurement Technique. Bullseye Testing Technology [online]. [viewed 10.02.2021]. Available from: https://bullseye.com/measurementTechnique.html
[23] Instrumenting Code with tracing-framework by Google. Google Inc [online]. [viewed 10.02.2021]. Available from: https://google.github.io/tracing-framework/instrumenting-code.html
[24] Mahrenholz D., Spinczyk O., Schroder-Preikschat W. Program instrumentation for debugging and monitoring with AspectC++. Proceedings of the Fifth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2002). 2002, 249–256.