=Paper= {{Paper |id=Vol-2953/SEIM_2021_paper_23 |storemode=property |title=Generator of Automated Tools for Program Instrumentation |pdfUrl=https://ceur-ws.org/Vol-2953/SEIM_2021_paper_23.pdf |volume=Vol-2953 |authors=Mikhail Onischuck,Vladimir Itsykson }} ==Generator of Automated Tools for Program Instrumentation== https://ceur-ws.org/Vol-2953/SEIM_2021_paper_23.pdf
                   Generator of automated tools for program
                                instrumentation
                           Mikhail Onischuck                                                       Vladimir Itsykson

           Peter the Great St.Petersburg Polytechnic University                   Peter the Great St.Petersburg Polytechnic University
                   Saint-Petersburg, Russian Federation                                   Saint-Petersburg, Russian Federation
                     orcid.org/0000-0001-5359-0161                                          orcid.org/0000-0003-0276-4517


          Abstract. Instrumentation is one of the methods used in dynamic program analysis for assessing software performance. This
      paper proposes a technology for constructing software instrumentation tools for different programming languages. The
      instrumentation functions are described within this approach using several grammar-based DSLs. The obtained instrumentation
      toolkit is the result of generating a new system based on formal descriptions of the instrumentation process. The tracing functions
      are embedded into the original program using a TXL utility; a conversion program is also generated for this utility. The developed
      prototype was tested on 4 large projects written in different programming languages: Java, Python, C++ and Object Pascal. The
      tests confirmed the efficiency of the approach and the applicability of the developed prototype.
         Keywords — instrumentation, instrumentation tools generation, formal language grammar, TXL

                             I. Introduction
          Controlling software quality is a major challenge for the          II. Technology for generating instrumentation tools
      software industry. IT companies spend tremendous efforts
      and resources on diverse methods and practices for software            In general, depending on the programming language and
      quality assurance. The starting stage generally consists in        the runtime environment, either the source code or some
      software quality analysis, checking whether the software is        intermediate representation (e.g., bytecode) can be
      compliant with the specifications and searching for errors.        instrumented. For example, [2] considers source code
      Instrumentation is a mechanism used for software analysis.         instrumentation in C++ for verifying potential
      Instrumentation is commonly understood as inserting                vulnerabilities that can threaten the system security. A
      additional code into the source program, which allows              technology developed in [3] introduces an instrumentation
      detecting and monitoring the parameters characterizing the         framework based on analysis and processing of an
      software performance, with options for debugging and               intermediate representation containing some form of an
      troubleshooting [1].                                               abstract syntax tree. In contrast, [4] and [5] discuss bytecode
                                                                         instrumentation for a Java virtual machine.
          There are different approaches to program                          Because our study deals with a universal instrumentation
      instrumentation. Manual instrumentation implies that the           system independent of the programming language and
      developers independently log snippets, relying on their            runtime environment, the only way to organize this system
      understanding of the program logic. Depending on the goals         would have to be source-code instrumentation.
      set in automated instrumentation, the developers use some              Instrumentation consists in converting the source
      tool for inserting the logging code into the source program        program into a new one, with special code added to it which
      based on certain rules. Unfortunately, most of the existing        enables tracing once the program is executed. This means
      tools come with limitations restricting the options for code       that the task of instrumentation is reduced to converting
      instrumentation. What is more, such tools are generally            (transforming) one source code into another.
      hard-coded to only work with a specific programming
      language, even though instrumentation may be required for          A. Methodology
      programs in any language.
                                                                             Since one of the requirements to the approach developed
          In this paper, we propose a universal approach to solving      is the option to instrument programs written in different
      the instrumentation problem incorporating a declarative            languages, it should incorporate a mechanism for generating
      framework using the grammar of the target programming              instrumentation systems for different target languages. An
      language, serving to generate automated systems for                obvious solution is to use the grammar of the target
      software instrumentation.                                          language as input for the instrumentation system. Since
          The paper is organized as follows. Section 2 describes         standard grammars are not originally designed for
      the technology we have developed for generating                    instrumentation, a grammar markup mechanism should be
      instrumentation systems. Section 3 considers a prototype           developed, adding semantics to the grammar for subsequent
      instrumentation tool generator. Section 4 presents the results     use in instrumentation rules. We achieve this by utilizing
      of testing the approach on real software projects. Section 5       grammar annotations, created once for each target
      analyzes     the    existing     solutions     in    automated     programming language. The actual instrumentation rules are
      instrumentation. The paper is concluded by summarizing the         developed by the user solving a specific applied problem.
      results and outlining the directions for further development.         The general schematic for the approach developed is
                                                                         shown in Fig. 1.




Copyright © 2021 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                                                  descent down the parse tree to be modified by inserting
                                                                  additional code (the top-down one-pass method).
                                                                      Let us define a workspace where instrumentation will be
                                                                  done, i.e., the working context that is some subset of CST
                                                                  nodes given as Y = {g ∈ G | A(g)}, where A (g) is the
                                                                  condition (logical statement about the properties) that the
                                                                  element g belongs to some subset Y, while G is the entire set
                                                                  of nodes of a particular parse tree. At the same time, using a
                                                                  CST means using a large number of intermediate nodes in
                                                                  accordance with the grammar rules used constructing it.
                                                                  Consequently, information should be collected during the
                                                                  top-down traversal of the parse tree (a), to be subsequently
                                                                  used for (b) making a decision whether a node belongs to
                                                                  the instrumentation context before actually inserting the
               Fig. 1. General schematic of the generator.        code.

    The module for generating the transformation rules takes          Thus, we have formulated a method for processing parse
as input the grammar of the target language, the grammar          tree nodes and a mechanism for controlling the workspace
annotation for instrumentation, and the instrumentation           of this process.
rules; the module then generates transformation rules based
on these data, feeding them to the input of the program
transformation module together with the original source           C. Transformation of program source code
code. The program transformation module in turn generates            The source code of programs written in a wide range of
an instrumented program semantically equivalent to the            programming languages can be processed either by
original one with the added instrumentation code.                 developing specific tool finely tailored to the individual
                                                                  languages, or taking an existing tool or language (with the
                                                                  appropriate runtime environment), such as, for example,
B. Representation and processing of program source code           “The Meta-Environment” [6], Stratego/XT [7], Rascal [8],
    Source code transformation has been examined in great         TXL [9] or similar. We have chosen the TXL language to
detail; a wide range of tools have been developed that can        confirm the applicability of the approach.
effectively solve this task using some form of a parse tree or        TXL is a domain-specific language designed to support
a concrete syntax tree (CST) for internal representation. The     source analysis and processing through rule-based structural
nodes of this tree represent syntactic structures and             transformation [9]. The FreeTXL compiler/interpreter
individual elements adopted in the target programming             (referred to as the TXL utility from now on) is the official
language, so making changes to the processed code (its            runtime environment for programs in the TXL language.
transformation) can be distinguished as a separate stage of
instrumentation (see Fig. 1). Such systems require a                  The TXL language is relatively simple and pure;
description of transformations at the level of individual         furthermore, the TXL utility offers such benefits as binary
parse tree nodes (and/or sets of nodes) to operate. While it      executables available for various software platforms and
can be difficult to understand this description or correlate it   easy integration (providing XML output). For these reasons,
with the problem solved by the user, an additional                we decided to use the TXL utility for transformation over
complication is that it should strictly follow the grammar of     the source code which is in turn produced by the
the language used. These drawbacks lead us to identify            instrumentation system generator. In view of this, the
another significant stage of instrumentation that is preparing    generated 'instrumentation system' is understood in this
for changes (generating a description of the                      study as the description of transformations in the TXL
transformations). The description of the changes can be           language together with the grammar of the target
simplified by introducing some primary representation with        programming language and the TXL runtime environment.
limited functions but at the same time universal enough to            The where clause is used in TXL to constrain the
be used for processing source code in different programming       composition for transformation rules and functions [10],
languages. Because there are no generally accepted                while sequences of such expressions are combined by means
standards for grammar formatting and structuring, the above       of conjunction. On the other hand, applying logical
requirement means that the instructions describing the            disjunction to comparison operators from the start within
purpose of changes (instrumentation rules) should be              this syntactic structure allows introducing expressions in
separated from the instructions describing repeating              conjunctive normal form (CNF), obtained from predicates
primitive operations specific to the given grammar and the        constructed by users to describe instrumentation contexts.
syntax elements used (grammar annotation, Fig. 1).
                                                                      Due to the functional nature of the TXL language, we
    Taking CST as their internal representation, the systems      decided to use the arguments/parameters of the chain of
for source code transformation process syntactic structures       functions to pass the nodes collected during the traversal
by moving from higher (file, module, expression set) to           down the parse tree. In this case, the “chain” is a particular
lower levels (string literals, digits). The order in which CST    solution to the instrumentation problem, accounting for the
nodes are visited and whether the same node can be revisited      above one-pass method for CST traversal.
depends solely on the specifics of a particular tool and the
required transformations. Therefore, the instrumentation
task can be solved using a method for step-by-step recursive
D. Grammar annotation                                                      4) refining the instrumentation context using
                                                                       programming language keywords, modifiers (enclosed in
    It can be a challenge to extract the directed acyclic graph        square brackets) and text patterns, if any are required for the
(DAG) representing the nested syntactic hierarchy from the             task to be solved by the user;
grammar of an arbitrary programming language. This
obstacle can be overcome by either manual grammar                         5) setting specific instrumentation points by their
marking up or by constructing heuristics sufficient for the            identifiers;
purpose of defining the scope of the instrumentation on the
parse tree. For this purpose, we constructed an XML-based                 6) creating user-defined variables from text elements
format describing grammar annotations, including such                  and constant values;
information as:                                                            7) a namelist of fragments to be inserted
   ● description of the syntactic structures of the target             simultaneously into the same place, specifying the
     programming language that are the most significant                parameters, if any are required according to the text of the
     for the end user in accordance with the grammar, also             snippet used.
     including:                                                            The main benefits of the DSL developed is that it
      ○ a text template describing the type of structure;              provides two methods for describing instrumentation
      ○ instrumentation points combined with a                         contexts (item 2) and an option for successively refining the
           simplified instrumentation algorithm;                       context specified (items 4, 5). This way, the user can
   ● directed acyclic graph of the hierarchy of nesting                considerably limit the workspace for transformations while
     syntactic structures;                                             the implementation details remain concealed. Nevertheless,
   ● points of interest describing the text data to be                 the transformation descriptions output by the generator can
     retrieved from the parse tree nodes;                              be used as the initial step for more complex instrumentation
   ● auxiliary user-defined functions in the TXL                       routines.
     language.

                                                                       F.   Tool generation procedure
E. User description of the instrumentation process
                                                                           The input artifacts for the “generator and transformation
    The prototype uses a declarative domain-specific                   tool” system are the following:
language (DSL) to describe user-defined rules; it is based on
the languages of such projects as Annotation File Utilities                 ● source code of the program to be instrumented;
[11] and AspectJ [12]. Fig. 2 shows an example of the rules                 ● grammar description of the target language that the
described using the developed language from the system's                      source code is written in;
end user perspective, aimed at logging the first if-statement
executing in the context of the "main" method contained in                  ● grammar annotation;
the "Main" class.                                                           ● description of user-defined instrumentation rules;
                                                                            ● source code snippets (i.e. “fragments”) in the target
                                                                              programming language to be inserted;
                                                                            ● additional startup      and    runtime    environment
                                                                              parameters.
                                                                           As an intermediate output, the generator provides the
                                                                       instructions for the transformations to be performed with a
                                                                       specific input file with the source code in accordance with
                                                                       the grammar given for the target programming language.
                                                                          The output artifact is the source code in the target
                                                                       programming language, which has been subjected to the
                                                                       required transformations.

          Fig. 2. End user description of the instrumentation rules.       The developed prototype automatically generates the
                                                                       transformation instructions as a set of interconnected TXL
   The following main components comprise the user                     functions in the following order:
description of the set of rules are (the number of the list item
corresponds to the circled number in the figure):                         1) load, parse and check the dependences of the
                                                                       source code fragments used in accordance with the rules
    1) listing the source code fragments used in this set of           described by the end user;
rules and their relative file paths;
                                                                           2) calculate the maximum distances from the root
   2) listing the instrumentation contexts of interest to              node for each node of the DAG representing the key
the user (both simple, i.e., a set of statements about the             structures of the target language in accordance with the
properties of a syntactic structure, and composite, i.e.,              grammar annotation provided;
several contexts joined by first-order logical operators);
                                                                          3) build       wrappers    over    standard    comparison
   3)    grouping the instrumentation steps as named rules;            operators;
    4) build functions for implementing the tasks assigned
to the points of interest;
   5) build functions checking whether CST nodes
belong to contexts;
    6) build function chains in accordance with
instrumentation rules allowing for the user's requirements;
   7)    build auxiliary TXL functions;
   8)    build user-defined functions;
   9) build the main TXL function and apply policies for                      Fig. 3. Operation of level modifier visualized.
additional user-defined functions;                                 Function chains should be generated for each individual
   10) update the states of functions that are chain            expression containing a context refinement and the keyword
elements;                                                       "add" (see Fig. 2) together with a list of code fragments, as a
                                                                separate rule (group of refinements) in accordance with the
    11) generate TXL instructions for the required              description order.
transformations and call the TXL utility.
   The general structure of a typical chain of calls to
programmer-defined domain-specific functions:                                III. Prototype implementation
    1) C-functions (collect) are rule-type TXL functions,           To test the efficiency of the above approach, we
designed to accumulate information from the parse tree          constructed a prototype instrumentation tools generator
nodes. This information is later used to assess whether the     combining the generator application built based on the
node belongs to the chosen instrumentation context. Such        TinyXML2 [13] and Boost [14] libraries in C++ for the
functions operate by calling the next function from the chain   purpose of parsing grammar annotations along with user
and passing it all the values of the arguments that were        descriptions and interaction with the TXL utility,
received by the current function, together with the node        respectively, and the application itself. Fig. 4 shows the
considered.                                                     general schematic for the prototype together with the
    2) F-function (filter) is a function designed to filter     artifacts necessary to solve the problem posed: in
CST nodes relative to contexts described by the end user by     accordance with the initial model (Fig. 1), the generator
calling the auxiliary function for assessing whether the node   utility is a module producing an intermediate (optionally
belongs to the context and passing it the collected nodes.      cached) description of instrumentation instructions in a
    3) R-functions (refine) are functions designed to refine    format that can be used by the second part of the two, i.e.,
the context to a limited subset of some required syntactic      the transformation system. The generator utility acts as a
structures of the target programming language in accordance     module producing the transformation rules in this case,
with one of the implemented modifiers:                          while the TXL utility performs the function of the program
       a) "first" searches and processes only the first         transformation module. The colors of the arrows in the
encountered node from the current subtree in accordance         figure correspond to different frequencies of analysis and
with the type specified in the grammar annotation;              processing of artifacts by the generator and the TXL utility
       b) "all" searches and processes all nodes in             (orange is more frequent, purple is less frequent), the colors
accordance with the type specified in the annotation;           of the input artifacts characterize the degree to which it is
       c) "level" searches and processes the nodes located at   difficult for the user to create them (red is very difficult,
the same (first) nesting level. A schematic example             blue is moderately difficult, green is easy).
illustrating how this modifier works is given in Fig. 3:
different nesting levels of syntactic structures are shown
from the bottom-up, a sequence of structures from the
standpoint of source code is shown from left to right; the
colors correspond to different types of CST nodes (nodes of
the required type are colored in red; nodes to be processed
are colored in dark red); the numbers indicate the order in
which the pass is performed in the TXL environment.
    4) I-function (instrument) are functions designed
directly for instrumentation in accordance with the patterns
(search and replace) specified in the annotation and the
operation algorithm.




                                                                               Fig. 4. General schematic of the prototype.
               IV.     Prototype testing
    The applicability of the developed instrumentation                        V. Comparison with counterparts
method and the functionality of the implemented prototype
were verified using several industrial open source projects.          There is a wide range of different software systems
The projects and the source code samples were chosen based        offering options for automating the instrumentation process
on the capabilities of the TXL grammars available at the          to some degree. Examples of such systems include tools for
time of this study [15]; notably, some elements of the            test coverage analysis (GCC Gcov [19], Froglogic Squish
grammars were slightly modified to better suit the described      Coco Coverage Scanner [20]), code analysis (Testwell
instrumentation approach (increased separation of syntactic       CTC++ Preprocessor [21], Bullseye Coverage [22]), tracing
structures). As a result, four projects were chosen for the       and statistics calculations (Google Web Tracing Framework
experiments, written in four different programming                [23]), vulnerability assessment (see [2] and [4]).
languages:                                                            Each of these projects only works with a very small
   ● AspectJ [12] is a system and DSL designed to                 subset of programming languages, and they need to be
     implement aspect-oriented programming principles             further developed and adapted to work with new languages.
     within the Java language. Version 1.9.5 was chosen           For open source projects, this can be done by forking the
     for the experiment.                                          main code base but this takes a lot of time and resources.
                                                                  Commercial tools (Testwell CTC++, Bullseye Coverage and
   ● Keras library [16] is an add-on library for high-level       Froglogic Squish Coco) can only be expanded by the
     processing and construction of deep learning neural          developers.
     network models for the Python language. Version
     2.3.1 was chosen for the experiment.                             Compared to such instrumentation methods as, for
                                                                  example, bytecode processing [5], application of the
   ● Boost library [14] is a multifunctional modular              aspect-oriented programming paradigm [24], or reduction to
     library for building software products using the C++         a single intermediate representation with subsequent
     language. Version 1.72.0 was chosen for the                  reconstruction [3], the main difference of the approach
     experiment.                                                  proposed our study is that the user can flexibly control the
                                                                  instrumentation process and adapt the instrumentation for
   ● Lazarus IDE [17] is a graphical cross-platform               other programming languages by specifying instrumentation
     environment for rapid application development in the         rules in terms of the target programming language, also
     Object Pascal language and its dialects. Version 2.0.8       independently creating and annotating grammars in terms of
     was chosen for the experiment.                               parsing texts in formal languages.
    Different syntactic structures of the languages were
instrumented in the experiments, such as import sections,
class bodies, methods, branch operators, and loops, in                                VI. Conclusion
particular using such means as different nesting levels of
                                                                      The study presents an approach to software
programming language structures. The source codes of the
                                                                  instrumentation, with a prototype developed for a generator
performed experiments are available in [18].
                                                                  for automated instrumentation tools, for which the DSL of
    The approach was found to be efficient the most for           instrumentation rules and the format for writing annotations
instrumenting an AspectJ project written in Java. As for          for formal grammars were described. We tested the
other projects, we found both minor limitations associated        prototype on several industrial open-source projects,
with multivariate forms of some syntactic structures (for         confirming that it was functioning properly and that the
example, the import statement in Python), and major               approach could be applied successfully.
difficulties when the descriptive capabilities of the
                                                                      If a program is represented as a text in some formal
grammars of languages provided by the TXL developers
                                                                  language with a developed grammar describing the
and/or the community were found to be insufficient. In
                                                                  structures of this language, this approach can be considered
particular, the Object Pascal grammar covered about 80% of
                                                                  sufficiently universal for solving the instrumentation
the code base of the Lazarus IDE project at the time of the
                                                                  problem. This, however, implies that the capabilities of such
study, while the C++ grammar (specifically designed for the
                                                                  a generator mainly depend on the capabilities of the
older version of the C++ standard) covered only 21% of the
                                                                  transformation system applied and the grammar of the target
Boost library base. Simplified language grammars should be
                                                                  language, which was confirmed experimentally. High
further refined for the developed prototype to be used
                                                                  performance is the most crucial factor for program
industrially, providing full support for the standards of these
                                                                  execution: it can be achieved by making the program as
languages.
                                                                  close as possible to a machine-generated semi-structured
   We can conclude from our findings that the proposed            format, removing most of the information that does not
approach and the developed prototype generator are largely        improve this indicator (i.e., information about the original
applicable for the tasks described. There are certain             structure of the program). All of this limits the potential
limitations because existing grammars of the programming          applications of the approach to an environment with access
languages are imperfect; moreover, the prototype                  to structural information about the program.
constructed has some drawbacks yet to be eliminated.
                                                                      The technique can be further developed by expanding
                                                                  the DSLs constructed, which can allow overcoming the
                                                                  existing context constraints, and conducting more focused
                                                                  experimental studies for a wider range of programming
                                                                  languages.
                                                                                [22] BullseyeCoverage Measurement Technique. Bullseye Testing
                                                                                     Technology [online]. [viewed 10.02.2021]. Available from:
                         VII. References                                             https://bullseye.com/ measurementTechnique.html
                                                                                [23] Instrumenting Code with tracing-framework by Google. Google Inc
                                                                                     [online].       [viewed        10.02.2021].      Available      from:
[1] Ломакина Л., Вигура А. Тестирование программных систем на                        https://google.github.io/tracing-framework/instrumenting-code.html
    основе компьютерной алгебры. Международный журнал
    экспериментального образования. 2016, 10-1, 132–134.                        [24] Mahrenholz D., Spinczyk O., Schroder-Preikschat W. Program
                                                                                     instrumentation for debugging and monitoring with AspectC++.
[2] Li H. et al. Automated source code instrumentation for verifying                 Proceedings Fifth IEEE International Symposium on Object-Oriented
    potential vulnerabilities. IFIP International Conference on ICT                  Real-Time Distributed Computing. ISIRC 2002. 2002, 249-256.
    Systems Security and Privacy Protection. Springer, Cham. 2016,
    211-226.
[3] Chittimalli P. K., Shah V. GEMS: a generic model based source code
    instrumentation framework. IEEE Fifth International Conference on
    Software Testing, Verification and Validation. 2012, 909-914.
[4] Chander A., Mitchell J. C., Shin I. Mobile code security by Java
    bytecode instrumentation. Proceedings DARPA Information
    Survivability Conference and Exposition II. DISCEX'01. 2001, Т.2.,
    27-40.
[5] Binder W., Hulaas J., Moret P. Advanced Java bytecode
    instrumentation. Proceedings of the 5th international symposium on
    Principles and practice of programming in Java. 2007, 135-144.
[6] The Meta-Environment. Centrum Wiskunde & Informatica [online].
    [viewed           13.02.2021].       Available          from:
    http://www.meta-environment.org/
[7] Stratego Program Transformation Language [online]. [viewed
    13.02.2021]. Available from: http://strategoxt.org/
[8] Rascal MPL - About. Centrum Wiskunde & Informatica [online].
    [viewed            13.02.2021].    Available          from:
    https://www.rascal-mpl.org/about/
[9] About TXL. Queen’s University and the Txl Project [online]. [viewed
    10.02.2021]. Available from: http://txl.ca/txl-abouttxl.html
[10] Cordy J. Excerpts from the TXL cookbook. International Summer
     School on Generative and Transformational Techniques in Software
     Engineering. Springer. 2009, 27–91.
[11] Annotation File Format Specification. Веб-сайт проекта The Checker
     Framework [online]. [viewed 10.02.2021]. Available from:
     https://checkerframework.org/annotation-file-utilities/annotation-file-f
     ormat.html
[12] The AspectJ Project. The Eclipse Foundation [online]. [viewed
     10.02.2021]. Available from: http://www.eclipse.org/aspectj/
[13] Lee Thomason. TinyXML2 is a simple, small, efficient, C++ XML
     parser that can be easily integrated into other programs [online].
     [viewed             10.02.2021].         Available          from:
     https://github.com/leethomason/tinyxml2
[14] Boost C++ Libraries. Boost.org [online]. [viewed 10.02.2021].
     Available from: https://www.boost.org/
[15] TXL World. Queen’s University and the Txl Project [online]. [viewed
     10.02.2021]. Available from: https://txl.ca/txl-resources.html
[16] Keras Team. Keras: the Python deep learning API [online]. [viewed
     10.02.2021]. Available from: https://keras.io/
[17] Lazarus and Free Pascal Team. Lazarus Homepage [online]. [viewed
     10.02.2021]. Available from: https://www.lazarus-ide.org/
[18] Onischuck Mikhail. Source Code Instrumentation System [online].
     [viewed             14.02.2021].          Available          from:
     https://github.com/dog-m/SCIS/tree/master/example/experiments_wit
     h_production_level_projects
[19] Gcov (Using the GNU Compiler Collection (GCC)). Free Software
     Foundation, Inc [online]. [viewed 10.02.2021]. Available from:
     https://gcc.gnu.org/onlinedocs/gcc/Gcov.html
[20] Squish Coco 4.3.2. Froglogic GmbH [online]. [viewed 10.02.2021].
     Available                                                   from:
     https://doc.froglogic.com/squish-coco/latest/squishcoco.pdf
[21] Testwell CTC++ Description. Testwell [online]. [viewed 10.02.2021].
     Available from: http://www.testwell.fi/ctcdesc.html