=Paper=
{{Paper
|id=Vol-3071/paper17
|storemode=property
|title=Developing and Applying Custom Static Analysis Tools for Industrial Multi-Language Code Bases
|pdfUrl=https://ceur-ws.org/Vol-3071/paper17.pdf
|volume=Vol-3071
|authors=Dennis R. Dams,Jeroen Ketema,Pepijn Kramer,Arjan J. Mooij,Andrei Radulescu
|dblpUrl=https://dblp.org/rec/conf/benevol/DamsKKMR21
}}
==Developing and Applying Custom Static Analysis Tools for Industrial Multi-Language Code Bases==
<pdf width="1500px">https://ceur-ws.org/Vol-3071/paper17.pdf</pdf>
<pre>
Developing and Applying Custom Static Analysis
Tools for Industrial Multi-Language Code Bases
Dennis R. Dams (1), Jeroen Ketema (1), Pepijn Kramer (2), Arjan J. Mooij (1) and
Andrei Rădulescu (2)
(1) ESI (TNO), Eindhoven, The Netherlands
(2) Thermo Fisher Scientific, Eindhoven, The Netherlands


Abstract
Maintaining large, multi-language code bases is challenging because of their size and complexity. Hence,
tool support is desirable. Unfortunately, off-the-shelf tools fall short by aiming for genericity instead of
exploiting characteristics of the specific code bases and maintenance tasks. Our objective is to support
software maintenance by facilitating the development of custom tools for static code analysis.
   We report on a case study in which we developed and applied a custom static analysis tool to verify
2441 build dependencies between Visual Studio projects with C++ and IDL code.


1. Introduction
Embedded software of advanced industrial products often consists of large, multi-language
code bases that reflect not only the complexity of the product but also the accumulated effect of
decades of development. Maintaining the software is challenging due to its size and complexity.
Empirical evidence [1] also indicates that multi-language software development is problematic
for program understanding.
   Off-the-shelf analysis tools often fall short by aiming for genericity instead of exploiting
characteristics of specific code bases and maintenance tasks. In our experience, customization
is key to getting useful results (cf. [2, 3]). Even general software maintenance tasks may require
custom analysis tools due to the use of technologies that are either developed in-house or do not
come with good tool support. Our objective is to support software maintenance by facilitating
the development of custom tools for static code analysis.
   We present an exploratory case study [4] in which we developed and applied custom static
analysis tools to analyze build dependencies between Visual Studio projects with C++ and IDL
code (see Sect. 3). Initially, we developed the custom tools for a specific code base. Later on, we
investigated the reusability on another code base (see Sect. 6).
   Our analysis is split into a model extraction phase and a model analysis phase (cf. [5]). Our model
(or knowledge base) is represented as a directed graph (cf. [6]), and contains information on
high-level concepts that developers use to reason about their code; lower-level concepts used
by compilers are omitted. Our main contribution is a compositional approach to multi-language
and multi-archive model extraction (see Sects. 4 and 5). The results obtained by applying the
developed custom static analysis tools are presented in Sect. 7.

BENEVOL’21: The 20th Belgium-Netherlands Software Evolution Workshop, December 07–08, 2021, ’s-Hertogenbosch
(virtual), NL
dennis.dams@tno.nl (D. R. Dams); jeroen.ketema@tno.nl (J. Ketema); pepijn.kramer@thermofisher.com
(P. Kramer); arjan.mooij@tno.nl (A. J. Mooij); andrei.radulescu@thermofisher.com (A. Rădulescu)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Figure 1: Interplay between IDL and C++


2. Preliminaries
We discuss Visual Studio projects and Microsoft’s Interface Definition Language (IDL), as far as
relevant to our case study.

Visual Studio Projects Visual Studio projects form the basis of the Visual Studio build system
and contain information on dependencies on other projects, files to compile, libraries to link
against, files that will be generated (source files and binaries), and compiler flags.
  A build dependency occurs when building one project results in the generation of a source
or binary file needed to build another project. When one project has a build dependency on
another, the developer should declare this.

IDL The Component Object Model (COM) is an interface technology for software components [7].
To define interfaces, IDL is used. IDL language elements include libraries, which
contain other (non-library) elements, interfaces, which are collections of method signatures,
and coclasses, which are objects implementing interfaces. Libraries, interfaces, and coclasses are
identified by UUIDs (Universally Unique IDentifiers). IDL also has a cpp_quote construct that
allows for inserting C++ code fragments into the C++ files generated from the IDL files.
   Given an IDL file, the Microsoft IDL compiler (MIDL) generates both C++ files and a type
library binary (TLB). TLBs can be imported in C++ and IDL files via, respectively, the #import
and importlib statements. When the C++ compiler encounters a #import, it generates a type
library header (TLH) based on the imported TLB and #includes that header for further processing.
Fig. 1 depicts the interplay.


3. Industrial Application: Motivation
Our use case was driven by a desire for software architectural improvements and removing
technical debt. Architectural improvements are an enabler for incremental and distributed
builds, assuming correctly specified dependencies. Dependencies can be incorrectly specified in
one of two ways [8]:
    • under-declared dependencies are needed but undeclared dependencies, and may lead to
      builds that fail occasionally due to missing files (depending on the build order);
    • over-declared dependencies are unneeded but declared dependencies, and may restrict
      the build order unnecessarily, reducing build performance, or may prohibit the build
      altogether (due to dependency cycles).
  Verification of dependencies can be time-consuming [9]. Several tools exist that visualize
dependencies between Visual Studio projects [10, 11, 12, 13], but these do not verify correctness.

Analysis To verify correctness, we need to compare declared and actual build dependencies.
To identify actual dependencies, we look for evidence. A build dependency between two projects
is evidenced by a file that is generated by one project, and used by another project. In turn, the
need for a file is induced by another kind of dependency, which can also be over- or under-
declared. This latter dependency is evidenced by element usage: a dependency on a file is only
needed when an element defined in the file is used. Thus, we want to check the consistency
between declared build dependencies, used files, and used elements.
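
The comparison between declared and actual build dependencies can be illustrated with a minimal
Python sketch. The project and file names below are hypothetical toy data, not taken from the
case study, and the sketch only covers the file-level evidence described above (a file generated
by one project and used by another).

```python
# Hypothetical toy data; all project and file names are illustrative assumptions.
declared = {("app", "core"), ("app", "legacy")}   # (p, q): p declares a dependency on q
generates = {("core", "core.tlb"), ("ipc", "ipc.tlb"), ("legacy", "legacy.tlb")}  # (q, f)
uses = {("app", "core.tlb"), ("app", "ipc.tlb")}  # (p, f): p needs file f to build


def actual_dependencies(generates, uses):
    """A dependency p -> q is evidenced by a file generated by q and used by p."""
    producers = {}
    for q, f in generates:
        producers.setdefault(f, set()).add(q)
    return {(p, q) for p, f in uses for q in producers.get(f, ()) if p != q}


actual = actual_dependencies(generates, uses)
under_declared = actual - declared  # needed but not declared
over_declared = declared - actual   # declared but without evidence

print(sorted(under_declared))
print(sorted(over_declared))
```

In this toy setting, the dependency of app on ipc is under-declared and the dependency of app on
legacy is over-declared.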

Knowledge Base To perform the analysis, our knowledge base will need to contain several
types of relations. Some of these can be extracted from project files, e.g., build dependency
declarations, or files needed, compiled, and generated. Other relations can be extracted from
C++ and IDL files, e.g., #include/#import relations, or the definition and use of elements. Hence,
we need to handle multiple languages and combine information from various build stages.


4. Compositional Model Extraction
We next describe our compositional approach to multi-language and multi-archive model
– or knowledge base – extraction, as summarized in Fig. 2. Each extraction block in the
figure represents an extraction for a single language applied to a single archive. Merge Graphs
and Semantic Linking integrate the results step-wise into a knowledge base. Finally, Finalize
represents a finalization step that removes information that is used during integration, but is
unneeded otherwise.

Single Languages and Archives At the single language and single archive level, we extract
information from each individual file. As seen in Fig. 2, some of the extraction blocks also yield
information that is fed into other extraction blocks, e.g., Extract Project yields a list of C++ and
IDL files that is used to drive the extraction in Extract C++ and Extract IDL.

Multiple Languages and Archives Merge Graphs takes the (non-disjoint) union of the
graphs extracted during the earlier stages, while Semantic Linking adds cross-language edges
(cf. [14]). The cross-language edges represent the higher-level relations that developers use
to reason about their code. For example, they relate uses in C++ files, via TLBs or
generated C++ files, to the definitions of elements that originate in IDL files (see Fig. 1).

Figure 2: Compositional model extraction
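
As an illustration of Merge Graphs and Semantic Linking, the following Python sketch represents a
graph as a set of labeled edges. The edge labels (UsesGenerated, GeneratedFrom, Uses) and the node
names are assumptions made for this sketch; they are not the labels used in our tools.

```python
from typing import Set, Tuple

Edge = Tuple[str, str, str]  # (source node, edge label, target node)
Graph = Set[Edge]


def merge_graphs(*graphs: Graph) -> Graph:
    """Non-disjoint union: edges found by several extractions collapse into one."""
    merged: Graph = set()
    for graph in graphs:
        merged |= graph
    return merged


def semantic_linking(graph: Graph) -> Graph:
    """Add a cross-language Uses edge wherever a C++ file uses a generated
    element that is known to originate from an IDL element."""
    origin = {src: dst for src, label, dst in graph if label == "GeneratedFrom"}
    links = {
        (src, "Uses", origin[dst])
        for src, label, dst in graph
        if label == "UsesGenerated" and dst in origin
    }
    return graph | links


# Hypothetical per-language extraction results.
idl_graph = {
    ("file:api/foo.idl", "Defines", "uuid:0001"),
    ("gen:foo_h.h#IFoo", "GeneratedFrom", "uuid:0001"),
}
cpp_graph = {
    ("file:src/a.cpp", "Includes", "file:api/foo_h.h"),
    ("file:src/a.cpp", "UsesGenerated", "gen:foo_h.h#IFoo"),
}

knowledge_base = semantic_linking(merge_graphs(idl_graph, cpp_graph))
print(("file:src/a.cpp", "Uses", "uuid:0001") in knowledge_base)
```

Because nodes are identified by name only, merging reduces to a set union; this is one reason a
unique naming scheme matters (see Sect. 5.2).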


Finalization Extraction also yields information that is used for integration but not needed
for our maintenance task. When integration has been completed, finalization removes the
additional information, which ensures we do not have to deal with it during analysis. For
example, some links to C++ files generated from IDL files are removed; they are no longer
needed once links to the IDL files have been added.
  As mentioned, we remove information from the knowledge base that is not needed for our
maintenance task. This means, e.g., that we do not include full parse trees in our knowledge
base, as they contain a lot of information that we generally do not need. We also take an iterative
approach with regard to deciding what information to include in the knowledge base, as we
generally do not know beforehand what information is needed for our maintenance task.
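
A finalization step of this kind could look as follows in Python. This is only a sketch: the set of
integration-only edge labels is a hypothetical choice, and a real finalization may also have to
remove nodes that become unreachable.

```python
# Labels assumed (for this sketch only) to be needed during integration but not afterwards.
INTEGRATION_ONLY_LABELS = frozenset({"UsesGenerated", "GeneratedFrom"})


def finalize(graph, drop_labels=INTEGRATION_ONLY_LABELS):
    """Keep only edges whose label is still relevant for the maintenance task."""
    return {edge for edge in graph if edge[1] not in drop_labels}


integrated = {
    ("file:api/foo.idl", "Defines", "uuid:0001"),
    ("gen:foo_h.h#IFoo", "GeneratedFrom", "uuid:0001"),
    ("file:src/a.cpp", "UsesGenerated", "gen:foo_h.h#IFoo"),
    ("file:src/a.cpp", "Uses", "uuid:0001"),
}

knowledge_base = finalize(integrated)
print(sorted(knowledge_base))
```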


5. Model Extraction Challenges
Our main observation is that it is non-trivial to get the extraction details right, mostly due
to information originating from different sources. However, we did notice that imperfect
approximations can already provide useful information that is not easily available otherwise.
   Developing custom tools is both a challenge and a relief. It is a challenge when trying to cover
everything in full generality. It is a relief, as the customized context does not always require
full generality, and provides opportunities to exploit specific knowledge of the considered code.

5.1. Finding Appropriate Parsers
We discuss some specific concerns regarding parsing.

C++ Although commercial C++ parsers exist, we cannot readily experiment with them.
Furthermore, open source parsers usually do not handle all C++ dialects we encounter. For this
reason we select a parser with good error recovery, like Eclipse CDT [15]. This parser is also
used by [16, 2, 17], and, additionally, offers reference resolving [17].

Visual Studio Projects We initially developed our own parser. As project interpretation
turned out to be difficult, we migrated to the APIs offered by MSBuild (the tool driving the build
in Visual Studio). Although our custom parser was instrumental in moving forward during the
early stages of our case study, we consider the use of MSBuild to be preferred.
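
To give an impression of what a naive custom project parser involves, the Python sketch below
reads item declarations directly from the XML of a project file. It is not our parser; the sample
project and the helper name read_items are made up for illustration. The sketch deliberately
ignores Condition attributes, imported property sheets, and property expansion such as
$(SolutionDir), which are examples of what makes project interpretation hard without MSBuild.

```python
import xml.etree.ElementTree as ET

# XML namespace used by MSBuild project files.
MSBUILD_NS = {"msb": "http://schemas.microsoft.com/developer/msbuild/2003"}

# Hypothetical, heavily simplified project file.
SAMPLE_PROJECT = r"""
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup>
    <ClCompile Include="src\main.cpp" />
    <Midl Include="api\foo.idl" />
  </ItemGroup>
  <ItemGroup>
    <ProjectReference Include="..\core\core.vcxproj" />
  </ItemGroup>
</Project>
"""


def read_items(project_xml: str, item_type: str) -> list:
    """Return the Include attribute of every item of the given type.

    Naive on purpose: no conditions, no imports, no property expansion.
    """
    root = ET.fromstring(project_xml.strip())
    return [item.get("Include") for item in root.findall(f".//msb:{item_type}", MSBUILD_NS)]


print(read_items(SAMPLE_PROJECT, "ProjectReference"))
print(read_items(SAMPLE_PROJECT, "ClCompile"))
print(read_items(SAMPLE_PROJECT, "Midl"))
```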

IDL No open source parser exists that can parse IDL. Hence, we wrote our own. What makes
extraction from IDL tricky is the cpp_quote construct (see Sect. 2). In our case, the use of
cpp_quote was limited to constant definitions, where it was desired to link the definitions with
their uses in C++ code.
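
The following Python sketch shows how constant definitions could be recognized inside cpp_quote
constructs. The regular expressions and the sample IDL fragment are assumptions for illustration;
they cover only two simple definition patterns and do not reflect the patterns of any particular
code base.

```python
import re

# cpp_quote("...") with a possibly escaped string body.
CPP_QUOTE = re.compile(r'cpp_quote\s*\(\s*"((?:[^"\\]|\\.)*)"\s*\)')
# Two simple constant patterns: "#define NAME value" and "const T name = value;".
DEFINE = re.compile(r'#\s*define\s+(\w+)\s+(\S.*)')
CONST = re.compile(r'\bconst\s+\w+\s+(\w+)\s*=\s*([^;]+);')

# Hypothetical IDL fragment.
SAMPLE_IDL = r'''
cpp_quote("#define MAX_CHANNELS 8")
cpp_quote("const int kTimeoutMs = 250;")
cpp_quote("#include <vector>")
[object, uuid(00000000-0000-0000-0000-000000000001)]
interface IChannel : IUnknown { HRESULT Open(); };
'''


def quoted_constants(idl_text: str) -> dict:
    """Map constant names to their defining expressions, for constants
    that are defined inside cpp_quote constructs."""
    constants = {}
    for match in CPP_QUOTE.finditer(idl_text):
        fragment = match.group(1)
        hit = DEFINE.search(fragment) or CONST.search(fragment)
        if hit:
            constants[hit.group(1)] = hit.group(2).strip()
    return constants


print(quoted_constants(SAMPLE_IDL))
```

The extracted names can then serve as targets for links from uses in C++ code.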

5.2. Unique Naming Schemes
As our knowledge base is a flat graph, we have to uniquely name nodes. A proper naming
scheme also simplifies merging multiple graphs. Our scheme is based on the following, where
names are prefixed with a node type to avoid name clashes between elements with identical
names (such as archives and symbols, or C++ definitions and their (forward) declarations):

    • file paths relative to the root of the archive (e.g., for files);
    • names (e.g., for archives, symbols);
    • UUIDs (e.g., for IDL interfaces, coclasses, libraries);
    • hierarchical names relative to files (e.g., for C++ elements, IDL data types);
    • hierarchical names relative to UUIDs (e.g., for IDL data types).

The scheme is stable under small changes of the code base, which is convenient when continuing
or rerunning an analysis after updating the extracted models. Stability is also useful for assessing
code changes by comparing the graphs extracted before and after changes.
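
A naming scheme along these lines can be sketched in Python as follows. The type prefixes, the
separators, and the helper names are hypothetical; only the general idea (type prefix plus a
location-independent key) follows the scheme above.

```python
from pathlib import PureWindowsPath


def file_node(archive_root: str, path: str) -> str:
    """Name a file by its path relative to the archive root."""
    relative = PureWindowsPath(path).relative_to(PureWindowsPath(archive_root))
    return "file:" + relative.as_posix()


def named_node(node_type: str, name: str) -> str:
    """Name archives, symbols, etc. by type prefix and plain name."""
    return f"{node_type}:{name}"


def uuid_node(node_type: str, uuid: str) -> str:
    """Name IDL interfaces, coclasses, and libraries by their UUID."""
    return f"{node_type}:{uuid.lower()}"


def nested_node(node_type: str, owner: str, qualified_name: str) -> str:
    """Name an element hierarchically, relative to a file path or a UUID."""
    return f"{node_type}:{owner}#{qualified_name}"


def diff_graphs(before: set, after: set) -> dict:
    """With stable names, comparing two extractions reduces to set differences."""
    return {"added": after - before, "removed": before - after}


print(file_node(r"C:\work\archiveA", r"C:\work\archiveA\src\a.cpp"))
print(named_node("archive", "core"), named_node("symbol", "core"))
print(nested_node("cpp-class", "src/widget.h", "ui::Widget"))
```

Because the names do not depend on absolute paths or on extraction order, graphs extracted before
and after a code change can be compared edge by edge.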

5.3. Binary File Formats
To extract information from (generated) binary files (such as TLBs), we can either (1) decompile
the binary, (2) use available APIs to read the binary, (3) use textual artifacts derived from the
binary during compilation, or (4) use the source code from which the binary was constructed.
  We use the third approach for TLBs imported in C++. To this end, we first build the considered
archive, generating all TLHs (see Fig. 1), and then run our extraction on the TLHs.
  We use the fourth approach for TLBs imported in IDL files, as no textual artifacts are generated.
The sources of the TLBs may become available when extracting other archives. Hence, we store
extra information on IDL files and element uses, which is removed during finalization.

5.4. Dealing with the Preprocessor
To be able to properly parse C++ and IDL files, the C/C++ preprocessor needs to process the
files. The result of preprocessing may depend on compiler settings. Even for a simple analysis
question such as showing the #include relations between files, we can either (1) analyze the
code for a specific build configuration, or (2) analyze the code for all build configurations. We
use the first option, as it provides enough information for our maintenance task.
  Applying the preprocessor may expand simple, well-recognizable macros into complicated
code fragments. This may confuse developers, as they often reason about macros as they do
about functions. Hence, we include macro definitions and all references to them in our model.
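
As a rough illustration of recording macro definitions and references, consider the Python sketch
below. It is a simplification under several assumptions: it works on unpreprocessed text of a
single file, does not skip comments or string literals, and uses hypothetical edge labels. Our
actual extraction relies on the parser instead.

```python
import re

DEFINE = re.compile(r'^[ \t]*#[ \t]*define[ \t]+([A-Za-z_]\w*)', re.MULTILINE)
IDENTIFIER = re.compile(r'[A-Za-z_]\w*')

# Hypothetical C++ fragment.
SOURCE = """#define CHECK_HR(expr) if (FAILED(expr)) return E_FAIL
#define UNUSED_MACRO 1
HRESULT Run() {
    CHECK_HR(Init());
    return S_OK;
}
"""


def macro_edges(file_name: str, source: str) -> set:
    """Return DefinesMacro and ReferencesMacro edges for one file."""
    defined = set(DEFINE.findall(source))
    edges = {(file_name, "DefinesMacro", name) for name in defined}
    for line in source.splitlines():
        if DEFINE.match(line):
            continue  # do not count the defining line itself as a reference
        for identifier in IDENTIFIER.findall(line):
            if identifier in defined:
                edges.add((file_name, "ReferencesMacro", identifier))
    return edges


for edge in sorted(macro_edges("file:src/run.cpp", SOURCE)):
    print(edge)
```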


6. Model Extraction Reusability
As mentioned in Sect. 5, customized model extraction provides opportunities to avoid hard
problems. However, this may complicate reuse. To evaluate this, we applied our model extraction
tool to a Visual Studio code base from a different company. We discuss the observed differences.
Insight into these differences helps to separate the generic and specific aspects of the extraction.

Build Infrastructure Both code bases have their own custom build infrastructure on top of
Visual Studio projects. The configuration files from these infrastructures are easy to parse, but
specific to the code base.

File Locations The code bases use different conventions for the folders containing files shared
between projects and archives. However, in the case of sharing between archives we do observe
that the shared files always live in the same folder relative to the root of the archive, which
means that multi-archive merging can be kept generic.

Code Patterns The code bases use COM differently in relation to the two paths through the
middle layer of Fig. 1, both of which we support:

    • each IDL file is compiled by one project, which generates a TLB that can be imported by
      other projects;
    • each project that wants to use an IDL element from a given IDL file compiles that IDL file.

  In the case of cpp_quote, the code bases use slightly different patterns to define constants.
These are typically generic, but occasionally depend on code base-specific macros.


7. Industrial Application: Results
We describe the model analysis phase and the results obtained for the case from Sect. 3. We
illustrate the general line of the analysis and do not aim to be complete. We focus on the part of
the knowledge base represented by the schema of Fig. 3, where projects 𝑝 are related by:

    • ProjectDependsOn(𝑝, 𝑞): 𝑝 declares a build dependency on project 𝑞;
    • MidlGenerates(𝑝, 𝑓 ): 𝑝 invokes the MIDL compiler generating a TLB or C++ file 𝑓 ;
    • Compiles(𝑝, 𝑓 ): 𝑝 compiles file 𝑓 .

The other relations relate to IDL and C++ files 𝑓 :

    • Includes(𝑓, 𝑔): 𝑓 #includes an IDL or C++ file 𝑔;
    • Imports(𝑓, 𝑡): 𝑓 #imports a TLB 𝑡;
    • Defines(𝑓, 𝑒): 𝑓 defines an IDL element 𝑒;
    • Uses(𝑓, 𝑒): 𝑓 uses an IDL element 𝑒.

The relations can be combined via composition ($;$), inversion ($^{-1}$), reflexive closure ($^{?}$), and
reflexive, transitive closure ($^{*}$).

Figure 3: Partial schema of the knowledge base from our case study
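
The four relational operators can be sketched in Python over relations represented as sets of
pairs. The function names and the explicit node universe are choices made for this sketch; they do
not describe the implementation of our tools.

```python
def compose(r, s):
    """Composition r ; s: pairs (a, c) with (a, b) in r and (b, c) in s."""
    successors = {}
    for b, c in s:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r for c in successors.get(b, ())}


def inverse(r):
    """Inversion: swap source and target of every pair."""
    return {(b, a) for a, b in r}


def reflexive(r, nodes):
    """Reflexive closure: r plus the identity on the given nodes."""
    return set(r) | {(n, n) for n in nodes}


def reflexive_transitive(r, nodes):
    """Reflexive, transitive closure, computed as a fixed point."""
    closure = reflexive(r, nodes)
    while True:
        extended = closure | compose(closure, r)
        if extended == closure:
            return closure
        closure = extended


# Hypothetical toy relations.
compiles = {("p", "a.cpp")}
includes = {("a.cpp", "a.h"), ("a.h", "b.h")}
nodes = {"p", "a.cpp", "a.h", "b.h"}

# All files that project p compiles or (transitively) includes.
print(sorted(compose(compiles, reflexive_transitive(includes, nodes))))
```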

7.1. Verification Rules for Dependencies
Like [8], we first consider under-declared dependencies. Unlike [8], we distinguish file- and
element-level analyses.

Declared Dependencies vs. File References Declared build dependencies should be consistent
with file references, which requires relating projects 𝑝 with generated files 𝑓 :

\[
  \mathit{MidlReferences}(p, f) \;=\; \exists q.\, \mathit{MidlGenerates}(q, f) \,\land\, (\mathit{Compiles} \,;\, \mathit{Includes}^{*} \,;\, \mathit{Imports}^{?})(p, f)
\]

Consistency can now be expressed as an inference rule:
\[
  \frac{\mathit{MidlReferences}(p, f) \qquad \mathit{MidlGenerates}(q, f)}{\mathit{ProjectDependsOn}(p, q)}
\]

To detect under-declared dependencies we read the rule top-down, i.e., if we find a MidlReferences
and MidlGenerates pair, we expect a ProjectDependsOn. To detect over-declared dependencies
we read the rule bottom-up. This latter reading is also used to establish evidence for a declared
dependency.
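
The file-level rule can be evaluated on a toy knowledge base as in the Python sketch below. All
project and file names are hypothetical. Excluding self-dependencies (a project referencing its
own MIDL-generated files) is an assumption of the sketch, not part of the rule as stated.

```python
def compose(r, s):
    successors = {}
    for b, c in s:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r for c in successors.get(b, ())}


def optional(r, nodes):  # reflexive closure
    return set(r) | {(n, n) for n in nodes}


def star(r, nodes):  # reflexive, transitive closure
    closure = optional(r, nodes)
    while True:
        extended = closure | compose(closure, r)
        if extended == closure:
            return closure
        closure = extended


# Hypothetical toy knowledge base.
compiles = {("app", "app.cpp")}
includes = {("app.cpp", "foo_h.h")}
imports = {("app.cpp", "bar.tlb")}
midl_generates = {("foo", "foo_h.h"), ("bar", "bar.tlb")}
project_depends_on = {("app", "foo"), ("app", "old")}

nodes = {n for rel in (compiles, includes, imports, midl_generates) for pair in rel for n in pair}

# MidlReferences(p, f): f is MIDL-generated and reachable via Compiles ; Includes* ; Imports?
reachable = compose(compose(compiles, star(includes, nodes)), optional(imports, nodes))
generated = {f for _, f in midl_generates}
midl_references = {(p, f) for p, f in reachable if f in generated}

# Top-down reading of the rule: dependencies implied by the evidence.
# Assumption: a project needs no declared dependency on itself.
implied = {(p, q) for p, f in midl_references for q, g in midl_generates if f == g and p != q}

under_declared = implied - project_depends_on   # evidence without declaration
over_declared = project_depends_on - implied    # declaration without evidence

print(sorted(under_declared))
print(sorted(over_declared))
```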

File References vs. Element References File references should be consistent with element
references. This requires relating IDL elements 𝑒 to projects 𝑝 and files 𝑓 :

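\begin{align*}
  \mathit{ProjectUses}(p, e) &= (\mathit{Compiles} \,;\, \mathit{Includes}^{*} \,;\, \mathit{Uses})(p, e) \\
  \mathit{MidlDefines}(f, e) &= (\mathit{MidlGenerates}^{-1} \,;\, \mathit{Compiles} \,;\, \mathit{Includes}^{*} \,;\, \mathit{Defines})(f, e)
\end{align*}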

   When a project declares a dependency on a MIDL-generated file, it is expected that some file
from the project uses an IDL element from the generated file. This is captured by:
\[
  \frac{\mathit{ProjectUses}(p, e) \qquad \mathit{MidlDefines}(f, e)}{\mathit{MidlReferences}(p, f)}
\]

Figure 4: Evidence for the combined dependency analysis


Combining the Rules We can relate declared build dependencies and element references as
follows:
\[
  \frac{\dfrac{\mathit{ProjectUses}(p, e) \qquad \mathit{MidlDefines}(f, e)}{\mathit{MidlReferences}(p, f)} \qquad \mathit{MidlGenerates}(q, f)}{\mathit{ProjectDependsOn}(p, q)}
\]
An under-declared dependency evidenced by this rule is shown in Fig. 4, where 𝑝 and 𝑞 are
linked via 𝑓 and 𝑒, but where no ProjectDependsOn edge exists. An over-declared dependency
may, e.g., be the result of a developer not removing a project dependency after having removed
the use of an IDL element.
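
The combined rule can be evaluated in the same style. The Python sketch below collects, for each
missing declaration, the evidence (𝑝, 𝑞, 𝑓, 𝑒) that links the two projects. The data is a
hypothetical toy example, and skipping pairs with 𝑝 = 𝑞 is again an assumption of the sketch.

```python
def compose(r, s):
    successors = {}
    for b, c in s:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r for c in successors.get(b, ())}


def inverse(r):
    return {(b, a) for a, b in r}


def star(r, nodes):  # reflexive, transitive closure
    closure = set(r) | {(n, n) for n in nodes}
    while True:
        extended = closure | compose(closure, r)
        if extended == closure:
            return closure
        closure = extended


# Hypothetical toy knowledge base: project foo owns foo.idl, project app uses IFoo.
compiles = {("app", "app.cpp"), ("foo", "foo.idl")}
includes = {("app.cpp", "app.h")}
midl_generates = {("foo", "foo.tlb")}
defines = {("foo.idl", "uuid:IFoo")}
uses = {("app.h", "uuid:IFoo")}
project_depends_on = set()

nodes = {n for rel in (compiles, includes, midl_generates) for pair in rel for n in pair}
includes_star = star(includes, nodes)

# ProjectUses = Compiles ; Includes* ; Uses
project_uses = compose(compose(compiles, includes_star), uses)
# MidlDefines = MidlGenerates^-1 ; Compiles ; Includes* ; Defines
midl_defines = compose(
    compose(compose(inverse(midl_generates), compiles), includes_star), defines
)

# Evidence tuples (p, q, f, e) for which no ProjectDependsOn(p, q) is declared.
under_declared_evidence = sorted(
    (p, q, f, e)
    for p, e in project_uses
    for f, e2 in midl_defines
    if e == e2
    for q, f2 in midl_generates
    if f2 == f and p != q and (p, q) not in project_depends_on
)

print(under_declared_evidence)
```

Each reported tuple corresponds to a picture like Fig. 4: the two projects, the generated file,
and the IDL element that together evidence the dependency.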

7.2. Implementation and Results
Based on the above rules and several others, we developed Depanneur, a tool for analyzing build
dependencies. The tool queries the graph created during model extraction, and produces both
textual output and pictures like the one in Fig. 4.
   The code base we considered consists of about 1080 Visual Studio projects with 2441 build
dependency declarations (before any fixes). Depanneur reports 498 under-declared dependencies
via MIDL-generated files. This relatively high number may point to over-declared
MidlReferences, where no elements are actually used. We corrected all under-declared
dependencies by adding them using a custom tool.
   As noted by [8], the removal of over-declarations is best done after fixing under-declarations
to avoid a temporary increase of build failures. Depanneur reports 622 over-declarations. Based
on our discussions with the developers, these indeed seem to be over-declarations.


8. Related Work
Overviews of static analysis techniques for multi-language code bases can be found in [14, 18].
A generic approach to cross-language analysis and refactoring is described in [14]; as in our case,
language-specific meta-models are used. Our extraction approach resembles that of [19] for
finding JNI dependencies between Java and C/C++. First, languages are treated independently;
only later is integration considered.
   An overview of dependency analysis techniques can be found in [9]. Some more recent
approaches are [20, 21, 8, 22].
  The Bauhaus Tool Suite [23] focuses on typical kinds of program analyses and reverse
engineering, with professional services for customer-specific tailoring of the analyses. Our
focus is fully on customizability using open source tools.


9. Conclusions
We presented a case study around the industrial challenge of build dependencies. As no off-the-
shelf analysis tools were available, we proposed to develop and apply custom tools. To facilitate
the development of custom tools, our approach is three-fold: (1) compositional model extraction
to handle multi-language and multi-archive code bases, (2) exploiting specific characteristics of
code bases, and (3) graph-based analysis.


Acknowledgments
The authors wish to thank Piërre van de Laar from ESI (TNO) for valuable feedback.
  The research was carried out as part of the Renaissance program under the responsibility of ESI
(TNO) with Thermo Fisher Scientific as the carrying industrial partner. The Renaissance program
was supported by the Netherlands Ministry of Economic Affairs (Toeslag voor Topconsortia
voor Kennis en Innovatie).


References
 [1] P. Mayer, M. Kirsch, M. A. Le, On multi-language software development, cross-language
     links and accompanying tools: a survey of professional software developers, J. Software
     Eng. R&D 5 (2017) 1. doi:10.1186/s40411-017-0035-z.
 [2] D. Dams, A. J. Mooij, P. Kramer, A. Radulescu, J. Vanhara, Model-based software restructuring:
     Lessons from cleaning up COM interfaces in industrial legacy code, in: 25th
     International Conference on Software Analysis, Evolution and Reengineering, SANER
     2018, 2018, pp. 552–556. doi:10.1109/SANER.2018.8330258.
 [3] E. Hajiyev, M. Verbaere, O. de Moor, codeQuest: Scalable source code queries with datalog,
     in: 20th European Conference on Object-Oriented Programming, ECOOP 2006, volume
     4067 of LNCS, 2006, pp. 2–27. doi:10.1007/11785477_2.
 [4] S. Easterbrook, J. Singer, M.-A. Storey, D. Damian, Selecting empirical methods for software
     engineering research, in: Guide to Advanced Empirical Software Engineering, Springer,
     2008, pp. 285–311.
 [5] Y. Zhao, G. Chen, C. Liao, X. Shen, Towards ontology-based program analysis, in: 30th
     European Conference on Object-Oriented Programming, ECOOP 2016, volume 56 of LIPIcs,
     2016, pp. 26:1–26:25. doi:10.4230/LIPIcs.ECOOP.2016.26.
 [6] J. Ebert, V. Riediger, A. Winter, Graph technology in reverse engineering: The TGraph
     approach, in: 10th Workshop Software Reengineering, WSR 2008, volume 126 of LNI, 2008,
     pp. 67–81. URL: http://subs.emis.de/LNI/Proceedings/Proceedings126/article2088.html.
 [7] D. Box, Essential COM, Addison-Wesley Professional, 1998.
 [8] J. D. Morgenthaler, M. Gridnev, R. Sauciuc, S. Bhansali, Searching for build debt: experiences
     managing technical debt at Google, in: 3rd International Workshop on Managing
     Technical Debt, MTD 2012, 2012, pp. 1–6. doi:10.1109/MTD.2012.6225994.
 [9] T. B. C. Arias, P. van der Spek, P. Avgeriou, A practice-driven systematic review of
     dependency analysis solutions, Empirical Software Engineering 16 (2011) 544–586.
     doi:10.1007/s10664-011-9158-8.
[10] J. Wilmans, depcharter, accessed December 2021. URL:
     https://github.com/janwilmans/depcharter.
[11] S. Dahlbacka, dependencyvisualizer, accessed June 2021. URL:
     https://archive.codeplex.com/?p=dependencyvisualizer.
[12] J. Penny, Viewing dependencies between projects in Visual Studio, 2009. URL:
     http://www.jamiepenney.co.nz/2009/02/10/viewing-dependencies-between-projects-in-visual-studio/.
[13] iLFiS, Project dependency graph generator, 2004. URL:
     https://www.codeproject.com/Articles/8384/Project-dependency-graph-generator.
[14] P. Mayer, A. Schroeder, Cross-language code analysis and refactoring, in: 12th IEEE
     International Working Conference on Source Code Analysis and Manipulation, SCAM
     2012, 2012, pp. 94–103. doi:10.1109/SCAM.2012.11.
[15] Eclipse, C++ development tools (CDT), accessed December 2021. URL:
     http://www.eclipse.org/cdt/.
[16] R. Aarssen, J. J. Vinju, T. van der Storm, Concrete syntax with black box parsers,
     Programming Journal 3 (2019) 15. doi:10.22152/programming-journal.org/2019/3/15.
[17] D. Piatov, A. Janes, A. Sillitti, G. Succi, Using the Eclipse C/C++ development tooling as a
     robust, fully functional, actively maintained, open source C++ parser, in: 8th IFIP WG 2.13
     International Conference on Open Source Systems, OSS 2012, 2012, p. 399.
     doi:10.1007/978-3-642-33442-9_45.
[18] Z. Mushtaq, G. Rasool, B. Shehzad, Multilingual source code analysis: A systematic literature
     review, IEEE Access 5 (2017) 11307–11336. doi:10.1109/ACCESS.2017.2710421.
[19] D. L. Moise, K. Wong, Extracting and representing cross-language dependencies in diverse
     software systems, in: 12th Working Conference on Reverse Engineering, WCRE 2005,
     2005, pp. 209–218. doi:10.1109/WCRE.2005.19.
[20] LLVM Team, include-what-you-use, accessed December 2021. URL:
     https://include-what-you-use.org/.
[21] B. Cossette, R. J. Walker, Dsketch: lightweight, adaptable dependency analysis, in: 18th
     ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE
     2010, ACM, 2010, pp. 297–306. doi:10.1145/1882291.1882335.
[22] P. Wang, J. Yang, L. Tan, R. Kroeger, J. D. Morgenthaler, Generating precise dependencies
     for large software, in: 4th International Workshop on Managing Technical Debt, MTD
     2013, 2013, pp. 47–50. doi:10.1109/MTD.2013.6608678.
[23] A. Raza, G. Vogel, E. Plödereder, Bauhaus – A tool suite for program analysis and reverse
     engineering, in: 11th Ada-Europe International Conference on Reliable Software Technologies,
     Ada-Europe 2006, volume 4006 of LNCS, 2006, pp. 71–82. doi:10.1007/11767077_6.

</pre>