<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Integration of Incremental Build Systems Into Software Comprehension Tools</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Máté Cserép</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anett Fekete</string-name>
          <email>afekete@inf.elte.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eötvös Loránd University, Faculty of Informatics</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Standalone code comprehension tools and similar features of integrated development environments (IDE) both aim to support the development and the maintenance of large (legacy) software. When applied to actively developed projects, it is essential to process the most recent revision of the source code in real time. Since a complete analysis of the codebase might take up significant time (even hours), the inclusion of incremental parsing is indispensable. However, the utilized build system of a software project is tightly coupled with the source code: over the time not only the content of the source ifles can be amended, but translation units can be added or removed and the parameters of the existing build instructions might also change. This paper intends to describe how the incremental update of the build system of a software facilitates the maintenance of the software workspace database in a code comprehension tool by completing the workflow of incremental parsing. We describe why including the build system in incremental parsing is relevant as well as the actual method of parsing build commands. We show that updating the build system is more cost-efective to a ratio than disposing of the existing build command database. The paper also compares the incremental parsing of build systems to that of actual source code.</p>
      </abstract>
      <kwd-group>
        <kwd>code comprehension</kwd>
        <kwd>build system</kwd>
        <kwd>incremental parsing</kwd>
        <kwd>software maintenance</kwd>
        <kwd>static analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>When programmers change jobs, they usually join a long-running project with a
large codebase instead of starting a new project. This might cause a hard time
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
for the programmer since there can be millions of lines of code (LOC) that they
have to understand in order to be able to execute their programming tasks and
interactively join the project. A significant part of the software maintenance process
is program comprehension, which is an activity to understand software architectural
and programming decisions, choices and logic along with the source code itself.</p>
      <p>
        Although code comprehension is already a hard task, the lack of documentation
and other external representation of the code can make it even more dificult.
Several software tools exist in order to facilitate software understanding. Numerous
code editor features such as find all references or go to definition / declaration can
be used by the programmer for code comprehension tasks, but these
functionalities cannot replace standalone comprehension and software visualization tools like
CodeSurveyor [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], OpenGrok [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], CityVR [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or CodeCompass [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Standalone tools provide various textual and visual data about the source code
after processing them utilizing static and/or dynamic analysis. In order to be able
to make information available for the programmer, the comprehension software
needs to process – parse – the source code unit by unit where a unit represents a
source file or one line in a file. An actively developed project changes every day,
which means the information provided by a software comprehension tool on a day
might be incorrect the next day. This is why comprehension tools require frequent
reparsing in order to provide correct information. However, frequent reparsing of a
whole software project is quite costly, both in computational time and complexity.
Hence, incremental parsing, which means parsing only those parts of the project
that have been actually changed, is inevitable. Single code lines or entire files can
be parsed incrementally. Our method deals with the latter approach.</p>
      <p>
        By applying incremental parsing, severe computational cost can be spared. In
case of an actively developed project, the daily usage of the parsed code can be
significantly facilitated by incrementally parsing nightly builds, but real-time
incremental parsing can be achieved as well by integrating it with code editors [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>As a case-study we realized incremental parsing of build systems through
CodeCompass, a standalone code comprehension tool developed by Ericsson Hungary
and Eötvös Loránd University which already encompassed the incremental parsing
of source code. In order to test our method, we used the open-source LLVM project,
which is under continuous development and has a frequently changing build system.</p>
      <p>The rest of the paper is strcutured as follows. Section 2 describes how modern
build systems address the issue of incremental building. Section 3 introduces the
CodeCompass code comprehension tool, then Section 4 presents our methodology
of incremental parsing of build systems and its integration into CodeCompass.
Afterwards, Section 5 showcases a performance test on LLVM. Finally, Section 6
discusses the results and concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Build systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in general are standalone software whose goal is to provide
language-independent solution to define build rules and actions. Incremental
building is applied by several build systems such as Ant [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Maven [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], or GNU Make
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, most build systems do not determine changes in files by detailed,
lineby-line static analysis on the file content, nor do they integrate with any version
control system’s repository.
      </p>
      <p>For example, GNU Make builds the source files based on a file called Makefile ,
which contains rules by which the compilation is executed. We assume that the
project has already been built. In this case, we already have binaries that were
compiled from the source code using the Makefile. The build system examines the
last modified timestamp of every source file and the according target. Based on the
comparison, the source files are classified either as unbuilt (recently added),
obsolete (recently updated) or up-to-date. Other build rules include the examination
of dependencies between source files such as include relations. After this, target
binaries are built thus old binaries are updated. Figure 1 illustrates the incremental
build workflow executed by GNU Make.</p>
      <p>Another example for a build system with such an incremental approach is
CMake1, which is practically a generator for other build systems like GNU Make.
Its main purpose is to define the build rules and list them in the Makefile (or other
similar file). However, CMake is not only capable of timestamp comparison like
GNU Make but also examines compilation parameters. A source file is not only
recompiled when its content has been directly modified but when its build rules
(e.g. dependencies) have been changed as well.</p>
      <sec id="sec-2-1">
        <title>1CMake: Build with CMake: https://cmake.org/</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. CodeCompass</title>
      <p>
        CodeCompass [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is an open source, scalable code comprehension tool developed
by Ericsson Ltd. and the Eötvös Loránd University, Budapest to help
understanding large legacy software systems. Its web user interface provides rich textual
search and navigation functionalities and also a wide range of rule-based
visualization features [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The web user interface of CodeCompass is presented in
Figure 2, showcasing a special, interactive CodeBites diagram, which supports the
understanding of large call chains and type hierarchies. The code comprehension
capabilities of CodeCompass is not restricted to the existing code base, but
important architectural information is also gained from the build system by processing
the compilation database of the project [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the version control history of the
project when available.
      </p>
      <p>The C/C++ static analyzer component is based on the LLVM2 compiler
infrastructure and the Clang3 tooling infrastructure for C/C++. The parser stores
the position and type information of specific AST nodes in the project workspace
database together with further information collected during the parsing process
(e.g. the relations between files).</p>
      <sec id="sec-3-1">
        <title>2The LLVM Compiler Infrastructure: https://llvm.org/ 3Clang: a C language family frontend for LLVM: shttps://clang.llvm.org/</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Workflow</title>
      <p>
        In our previous work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] we have already described how CodeCompass was extended
with the concept of incremental parsing. The implemented feature successfully
distinguished added, deleted and modified files (including the traversal of indirectly
changed files through header inclusions) and carried out maintenance operations
for the database of the code comprehension tool in only the required cases.
Therefore, the required costs of parsing (both time and computational resources) of the
reanalysis was reduced by multiple magnitudes, by omitting unchanged files in the
project. We have also presented [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] how incremental parsing is crucial to provide
real-time reanalysis of the modified source code inside integrated development
environments (IDEs) by embedding the feature set of code comprehension tools into
code editors through remote procedure calls defined by the open-source Language
Server Protocol (LSP).
      </p>
      <p>For software projects under active development, not only the content of the
source files may change frequently, but the configuration or rules of the utilized
build system are also modified regularly. While the initial tests evaluated on the
Xerces-C++ and LLVM go-to test projects of CodeCompass were promising, the
challenge of detecting changes in the build system or the compilation commands
were yet unhandled.</p>
      <p>The methodology of incremental parsing in CodeCompass was inspired by the
incremental strategy of build systems presented in Section 2. However, while build
systems usually depend on comparing source file and target binary timestamps
along the defined ruleset, the static analyzer component in CodeCompass can only
use the source files as input, since the goal of the process is not building the
analyzed software and the previously built binaries are not required to be available –
if built at all in the first place. The incremental parsing workflow of CodeCompass
is depicted in Figure 3. Our approach depends on two input sources:  ) the actual
version of the software, labeled as the project folder ; and  ) the last analyzed state
of the software stored in a relational database, labeled as the workspace database.
In order to detect changes in the build system, compilation commands are stored
upon the analysis of a software project and compared between the current and the
last analyzed revision. Compilation commands can be defined in the JSON
export format of CMake. To support other build systems, CodeCompass includes a
logger component which can embed the building process of the target project,
logging all compilation commands in the required format, thus making CodeCompass
compatible with all C/C++ projects, regardless of the build system utilized.</p>
      <p>Based on their content and the compilation commands, source files are classified
into five categories as follows:
1. Files with diferent content on in the workspace database and the project
folder were classified as content modified . Files with indirectly modified
content through header inclusion also fall into this category.
2. Files with no content modification, but with a changed compilation
command (even a flag) in the workspace database and the project folder were
categorized as build action changed.
3. Files present in the workspace database, but not in the project folder were
grouped as deleted.
4. Files present in the project folder, but not in the workspace folder were
marked as new.
5. Files present both in the workspace database and in the project folder, with an
unchanged content and compilation command were classified as unchanged.</p>
      <p>The maintenance of the workspace database is carried out in two phases. First,
a cleanup process is executed, removing all related information from the database
for the source files categorized as deleted, content modified or build action changed.
Afterwards, the new content is parsed during the update process, which will include
the source files marked as either new, content modified or build action changed. For
unchanged files, no task has to be executed.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>To evaluate the performance of the incremental parsing methodology described in
Section 4, the prototype implementation in CodeCompass was tested against the
open-source LLVM project. LLVM is a large-scale go-to test project for
CodeCompass, which contains over 5000 C/C++ source files with more than 1.6 million
lines of code; and is actively developed with usually multiple dozens of commits
daily afecting hundreds of source files. The testing environment was an average
personal computer with 4 CPU cores and 16GB RAM, the incremental parsing was
performed as a single thread process.</p>
      <p>The goal of the test was to compare the required build and parse time for LLVM
in case of  ) a full clean build and  −  ) on-demand incremental rebuild after code
changes were carried out on the original codebase. The  −  ) cases were executed
for three changesets: containing 1, 7 and 17 changed C++ source files respectively.
Table 1 describes the results in detail. Number of files written in the Afected files
column include both the directly or indirectly content modified C/C++ files and
the files afected by compilation command alteration.</p>
      <sec id="sec-5-1">
        <title>Afected files</title>
      </sec>
      <sec id="sec-5-2">
        <title>Execution time Task</title>
        <p>build
parse
build
parse
build
parse
build
parse
140 min
387 min
129 sec
89 sec
112 sec
110 sec
183 sec
169 sec</p>
      </sec>
      <sec id="sec-5-3">
        <title>Test case</title>
        <p>Full clean build
Incremental rebuild #1
Incremental rebuild #2
Incremental rebuild #3
all
1
7
17</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and conclusion</title>
      <p>In our paper we presented our method on how to determine the changes in a
software project’s source code or in its compilation commands, refining our previous
approach on incremental parsing. Our solution described in Section 4 eliminates
the required presence of target binaries, on which build systems with incremental
building capability usually depend on, and solely depends on the comparison of
the source of the project at two diferent revisions, categorizing which files have to
be reanalyzed. The implementation was carried out as part of the CodeCompass
code comprehension tool.</p>
      <p>The results presented in Section 5 proved that implementing incremental
parsing in a code comprehension tool is capable of processing code changes on the
magnitude of a usual version control commit size or local modification in a couple
of minutes even on a standard desktop computer. By delivering results in such a
significantly reduced time, our approach enables the integration of IDEs and code
comprehension tools and utilizing incremental parsing in everyday code
development.</p>
      <p>Acknowledgements. This work was supported by the construction
EFOP-3.6.3VEKOP-16-2017-00002. The project was supported by the European Union,
coifnanced by the European Social Fund.</p>
      <p>Computer Code Availability. The source code of the CodeCompass Project
is available on GitHub at https://github.com/Ericsson/CodeCompass. A live
demonstration of the parsed master branch of the LLVM project is showcased at
https://codecompass.zolix.hu/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Brunner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cserép</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Rule based graph visualization for software systems</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Applied Informatics</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>121</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Cserép</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Krupp</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Visualization Techniques of Components for Large Legacy C/C++ software</article-title>
          .
          <source>Studia Universitatis Babes-Bolyai, Informatica</source>
          <volume>59</volume>
          (
          <year>2014</year>
          ),
          <fpage>59</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Fekete</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cserép</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Incremental Parsing of Large Legacy C/C++ Software</article-title>
          .
          <source>In 21th International Multiconference on Information Society (IS)</source>
          ,
          <source>Collaboration, Software and Services in Information Society (CSS)</source>
          (
          <year>2018</year>
          ), vol. G, pp.
          <fpage>51</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Free</given-names>
            <surname>Software Foundation. GNU Make</surname>
          </string-name>
          <article-title>Manual</article-title>
          .
          <source>Tech. Rep. 0</source>
          .
          <issue>74</issue>
          , 5
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hawes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Anslow</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Codesurveyor: Mapping large-scale software to aid in code comprehension</article-title>
          .
          <source>In 2015 IEEE 3rd Working Conference on Software Visualization (VISSOFT)</source>
          (
          <year>2015</year>
          ), IEEE, pp.
          <fpage>96</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Jameson</surname>
            ,
            <given-names>K. W.</given-names>
          </string-name>
          <article-title>Collection makefile generator</article-title>
          ,
          <source>Feb. 21 2006. US Patent 7</source>
          ,
          <issue>003</issue>
          ,
          <fpage>759</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>McIntosh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hassan</surname>
            ,
            <given-names>A. E.</given-names>
          </string-name>
          <article-title>The evolution of ant build systems</article-title>
          .
          <source>In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR</source>
          <year>2010</year>
          )
          <article-title>(</article-title>
          <year>2010</year>
          ), IEEE, pp.
          <fpage>42</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Merino</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghafari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anslow</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Nierstrasz</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <article-title>Cityvr: Gameful software visualization</article-title>
          .
          <source>In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME)</source>
          (
          <year>2017</year>
          ), IEEE, pp.
          <fpage>633</fpage>
          -
          <lpage>637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mészáros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cserép</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fekete</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Delivering comprehension features into source code editors through lsp</article-title>
          .
          <source>In 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)</source>
          (
          <year>2019</year>
          ), IEEE, pp.
          <fpage>1581</fpage>
          -
          <lpage>1586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Oracle</surname>
          </string-name>
          .
          <article-title>OpenGrok: A wicked fast source browser</article-title>
          . https://oracle.github.io/ opengrok/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Porkoláb</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krupp</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Csordás</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Codecompass: An open software comprehension framework for industrial usage</article-title>
          .
          <source>In Proceedings of the 26th Conference on Program Comprehension</source>
          (New York, NY, USA,
          <year>2018</year>
          ),
          <source>ICPC '18</source>
          , ACM, pp.
          <fpage>361</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Raemaekers</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Deursen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Visser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Semantic versioning versus breaking changes: A study of the maven repository</article-title>
          .
          <source>In 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation</source>
          (
          <year>2014</year>
          ), IEEE, pp.
          <fpage>215</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Szalay</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porkoláb</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Krupp</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Towards better symbol resolution for C/C++ programs: A cluster-based solution</article-title>
          .
          <source>In IEEE 17th International Working Conference on Source Code Analysis and Manipulation (SCAM)</source>
          (
          <year>2017</year>
          ), IEEE, pp.
          <fpage>101</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>