<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applied static firmware anlysis in an CI environment for increased safety and code review insights</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul Würtz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julius Kolb</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Eckardt</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Although formal tests of program logic matching functional requirements are often automated into the CI of a project with i.e. unit test, this is often not the case for other non-functional constraints, like memory usage and process time[1]. Not activly watching these constraints during development can end in mission-critical failures if not detected before testing on-device, and delay the insight that a controller's computational and memory capacity might not be suficient for a given application. Evaluating these constraints at runtime demands a physical setup of all possible built targets with triggers for all possible events. Such setup, it's maintainance and throughput-limits add time- and require many resources to a project. Static analysis can yield a good approximation of these metrics for task timing and memory use without the costs of a physical setup and can be integrated into a continuous integration environment, similar to a functional test. Tools for static firmware analysis are mostly proprietary and behind big pricewalls limited to few security critical industries, there is a big gap to open and free tools. One open and free candidate in that category that shows a lot of potential is puncover. In this work we want to present an extended use-case of how we can track memory constraints, especially in stack usage and other insights over the lifetime of a firmware, for various targets at build time and increase sensibility for firmware size metrics during code reviews and continuous integration runs. Furthermore, we want to take an outlock for further applications for RTOS and library development and take an outlock to more fields to analyze.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;code review</kwd>
        <kwd>continous integration</kwd>
        <kwd>firmware</kwd>
        <kwd>static analysis</kwd>
        <kwd>stack size</kwd>
        <kwd>RTOS</kwd>
        <kwd>tooling</kwd>
        <kwd>Zephyr</kwd>
        <kwd>ZephyrRTOS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Manual memory layout and management in resource-constrained embedded devices, like
microcontrollers with only a few kilobytes of Flash and RAM, using real-time operating systems is a typical use
case. For flash memory - storing the program code - and global static RAM usage, the developer gets
feedback after the firmware is compiled and linked as shown in figure 1. If any of the global memories
are not suficient in size the build will fail with a linker error. Then for cutting down on memory usage
a more detailed overview can be helpful. On the other hand, if a task configured a too small stack size
is configured for a task, this will often be discovered painfully during runtime. With static analysis and
the knowledge of the configured stack size a stack overflow can be detected on the build process and an
early failure could avoid a long error search. Since on-device runtime errors can be less clear, an early
build error with an explicit error message of exceeded stack capacity supports the development process
and also reduces later execution test eforts.</p>
      <p>A considerable part of RAM can be used by tasks stack sizes. Knowing an exact limit of how much
memory a task consumes on the stack lets the developer allocate the least amount of RAM possible
configuring a suficient but not wasteful stack size. A viable stack size is often determined on the target
device at run-time. For runtime evaluation, it is important to enter all program branches. Triggers to all
possible events to enter all use cases of the controller are needed. This process needs to be repeated
for continuous feedback on each task memory footprint. Typically, the complexity to reproduce all
programm conditions increases with the development time. Often this elaborate setup and experiment
is avoided, and stack sizes are chosen as defaults, by experience, or raised by arbitrary amounts after
crashes caused by stack overflows. This either leads to wasted memory or can cause crashes in the field.</p>
      <p>In this paper we want to explore how to tackle both issues, detecting stack overflows early in the
build process and giving the developer feedback about sensible stack size by the results of static analysis.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Current static analysis in puncover</title>
      <p>Static program analysis is a wide field with diverse applications. Without executing a program, but by
just analyzing the program source code or binary data, combined with knowledge of the programming
languague and/or target computing platform, certain statements can be proven and runtime conditions
can be formally verified, like correctness, runtime or memory consumption.</p>
      <p>For this work we are interested in the memory footprint of the program - especially the space a task
consumes on a stack. On an embedded controller running a RTOS usually several tasks run concurrently.
Each task execution state is saved on a seperate thread with an associated stack in the controllers RAM.
An illustration of a controllers RAM layout with such a memory layout is shown in figure 2. In this
section, methods used to analyze a firmware build that give us the possibility to get estimation for each
tasks stack usage are described.</p>
      <sec id="sec-2-1">
        <title>2.1. Binary analysis by dissassembly</title>
        <p>To use puncover the firmware binary with symbols as an ELF file and a compiler toolchain for the
target are needed. The ELF is parsed and broken down into individual variables and functions and the
program flow by all static calls between functions is extracted from the ELF. This is done by iterating
line-by-line and do pattern-matching on each line of the ELF file’s disassembly. Functions and variables
are saved in a list of symbols. If they cannot be reconstructed by the regex-pattern the symbol is not
added currently.</p>
        <p>For each symbol its name, the address in memory, whether it is a variable or a function, the assembly
instructions, the associated source file and linenumber where the symbol was defined are extracted by
one regex. For these the compiler toolchain obtains the disassembly by calling objdump -dslw ELF_FILE.</p>
        <p>The symbol sizes in program memory (typically flash memory) are extracted with another regular
expression, applied on the output of a call to the compiler toolchain nm -Sl ELF_FILE. Each line-entry in
the output of nm is iterated and matched and added to existing symbols or creates a new entry if new
symbols are yielded by this step.</p>
        <p>Further steps of enhancing the symbol data are:
• the disassembly calls from one function to others are found by also trying to match them with
regular expressions to already found symbols - constructing a call tree
• if a build directory is given, .su files are searched and if found, each functions stack-usage and
type is extracted - an example of a .su file is given in listing 1 - this requires build options and
artifacts, described in the next section
• normalization of file paths and demangling of C++ symbol names
• finding neighbouring symbols - that are next to each other in the firmware address space and
linking them on a list by association of the next_symbol</p>
        <p>This results in a list of symbols - functions and variables, and a mapping of the elements in the source
ifles to their compiled equivalents by address and size in the binary with a static call graph between
function.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Additional compiler options for functions static worst-case stack-usage</title>
        <p>The stack-usage of a function in the current implementation of puncover is not retreived by analyzing
the ELF file assembly. It would be possible, but as a simpler method the output of the compiler is used.
This needs specific compiler options and the user to provide the path to the build directory.</p>
        <p>
          Listing 1: Stack usage file as implemented in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
f s u . c : 4 : f o o 1 0 4 0 s t a t i c
f s u . c : 7 : b a r 1 6 dynamic
f s u . c : 1 0 : main 3 2 dynamic , bounded
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] the introduction of the -fstack-usage flag is described. During the compilation, the size of
all variables that get pushed onto the stack to generate the instructions that pass variables between
functions and initialize stack-local variables must be known. With the said compiler option, this internal
calculation can be output to files. Stack usage can be either static, dynamic,bounded or dynamic - to
come to a worst-case memory usage everything but dynamic gives us a finite worst-case limit. For each
compilation unit a .su file with one line for each contained function with its name, the stack size a call
to this function consumes and the type of stack usage.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Known limitations and sources of errors</title>
        <p>Missmatches in regular expressions are not handled or raised as warnings. A test regression suit, against
a variety of firmwares and tracking the number of warnings already generated remains an open point.</p>
        <p>
          The analysis time for certain projects can be very long: https://github.com/HBehrens/puncover/
issues/101. In the original paper[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] -fcallgraph-info is introduced to get all the information out of GCC to
calculate worst-case stack usage just by the output of GCC. In puncover the results of the dissassembly
of the ELF file is used so far. Using the GCC provided output in puncover might be a way to improve
analysis runtime and could in the future be added as an alternative provider for call information, in
case the complete build environment is available additionally to just the ELF file.
        </p>
        <p>Path matching is not working reliably with all forms of relative paths, and some symbols result with
unknown stack size. Executing puncover this yields in warning messages WARNING Couldn’t find
symbol for FILE LINENUMBER SYMBOL_NAME.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Stack Worst-Case Scenarios calculation</title>
      </sec>
      <sec id="sec-2-5">
        <title>3.1. RTOS awareness - defining task entries</title>
        <p>Calculating the worst-case stack scenarios on any firmware is already useful feedback for the developer.
In the context of a multitasking operating system, this can be refined and applied to each task and the
defined stack size of it. The memory layout of the RAM in such a system is shown in figure 2. If any
thread stacks overflow during runtime, data in other stacks or content of variables can be over-written
as an undesired consequence and the application state be damaged. This leads to undefined behavior
and most often to a reboot by some hardware fault. Discovering the reason of the fault can be non-trivial
as not the overflow itself but the efect of over-written memory can cause the reset. By knowing the
size and entry point of each task by calculating the stack’s worst-case size a tasks stack overflow can be
detected at build time and the user can be informed directly with a descriptive error message.
To calculate the worst or biggest use of stack,
that a function is involved in a recursive search
through the functions callers and callees is
performed. The sum of both maxima and the
function’s own stack size yields the worst case.</p>
        <p>An examplary vizualisation of some task’s
call-tree is shown in figure 3. For the right
branches of task C stack-size are rendered with
each leave representing decending callee paths
branching out from the task.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Additions to puncover</title>
      <p>A positive of puncover is it can be called with
few prerequisits. When in a more defined
programming environment, we can add more
information to get more context out of the
analysis results. Some additions where implemented
in this work and are presented in this
section. A config via yaml, taking already existing
and newly added arguments to puncover was
added. An example of such config is shown in
listing 2. elf_file, build_dir, src_root and port
were already options, the others were added.</p>
      <p>To handle this kind of warning and errors related to stack overflows a list of task entries and configured
stack sizes can be passed in the puncover conig as report-max-static-stack-usage added as an argument
to puncover. For it to be easily usable on the command line the name of the function is serpated by :::
to the tasks stack size.</p>
      <p>Listing 2: Puncover config file for the canectivity firmware</p>
      <sec id="sec-3-1">
        <title>3.2. Early warnings and errors in continous integration</title>
        <p>To enable an error when any task exceeds it’s configured stack size the
error_on_exceeded_stack_usageoption was added to throw an error with returncode 1. To warn developers earlier
warn_threshold_size_for_max_static_stack_usage can set to a value in bytes. When in any task this
amount or less of space is left as the diference of the static analysis worst-case stack-usage and the
configured stack size an warning is triggered by returncode 79. To leaverage this to an CI pipeline an
example step for a gitlab CI file is given listing 3.</p>
        <p>Listing 3: Fragment of a gitlab-ci file to configure warnings and and errors after the build step
1 t e s t _ j o b _ 1 :
2 s c r i p t :
3 − p u n c o v e r − c c o n f i g . yaml
4 a l l o w _ f a i l u r e :
5 e x i t _ c o d e s : 79
6 c a c h e :
7 − key :
8 f i l e s :
9 − s t a t i c R e p o r t . j s o n</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. User-defined additional information - dynamic callbacks</title>
        <p>In many places dynamic function calls like callbacks in driver stacks are common in firmware. Especially
in the zephyrRTOS driver stack this is a common abstraction interface for implementation stacks. The
lack of dynamic callbacks limits the result of static analysis, hiding many call paths, that the firmware
will be executing on the device. The developer configuring these callbacks can supply the static analysis
with this information to improve the analysis.</p>
        <p>To do so add-dynamic-calls was added as an argument to the puncover config. Each entry to this
list is a string with the calling functions, followed by a ’-&gt;’ and the callbacks function name like:
z_impl_can_send-&gt;can_mcan_send from listing 2 line 16. This call relationship will then be added
by puncover and explored for the worst case stack calculation.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Seperating data generation and presentation</title>
        <p>The development loop present in puncover started the analysis and served the presentation of the result
on a live webserver. This made testing new presentational improvements time intensive in testing,
since disassembly had to be repeated and took upto 30 minutes for some ELF files. Therefore a symbol
and analysis result export to JSON was defined, allowing for running an analysis once and iterate on
the presentation quicker with svelte as a frontend that supports also hot-reloading, minimizing the
restart time to see code changes refelct on the UI representation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis presentation in pexplorer</title>
      <sec id="sec-4-1">
        <title>4.1. Compare memory usage diferences across build versions</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Detailed memory reports and trends</title>
        <p>As shown in figure 1 overall memory usage is an essential health feedback developers get when building
a firmware. This metric is important to understand during the development process, since the choice for
a particular controller is a reflection of a projects requirements and choosing a controller with suficient
program and RAM memory is not a decision not fully informed based on assumptions and experience.
To compare two build
direct changes on the
binary level a dif-view
was added. All changes
between the two
compared versions global
lfash, RAM and
cumulative stack-usage are
listed. The changes to
each tasks worst-stack
size usages are listed as
well. And a detailed list
of symbols that were,
added, delted or changed
in size of flash or stack
size between the two
versions are rendered in
two sortable tables, one
for functions and another
for variables.</p>
        <p>If these planning assumptions for the right controller turn out while development processes and
a product matures. Also changing requirements can increase compute and memory prerequisits. If
code reuse or libraries and RTOSes are used for the project knowing more detailed flash and RAM
requirements is key for good early estimations. For later stage size optimization and code triage knowing
the memory footprint of each software component is essential data for informed choices.</p>
        <p>To give development and project managment a better overview of memory footprints an hierarchical
representation in a sunburst chart is proposed, as can be seen in figure 5. This reflects the current
memory consumtion for flash and RAM usage.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Other stats and trends</title>
        <p>For progressing
development and release trends
of memory usage trends
of the memory usage over
a series of firmware
releases is displayed. In
ifgure 6 a trend of static
stack usage over some
releases is shown in the
upper plot. On the lower
plot a trend for dynamic
allocation are displayed,
as these were of
engineering interest, since
dynamic allocations were
undesired in the project
where the idea for this
paper came from and many
dynamic allocation were
included involuntarily by
the C++ std-library. The trend of these and other metrics are possible by analyzing several static analysis
results together and can help engineers to keep track on development goals by defining simialar metrics,
tracking them and integrate the analysis into the CI pipeline for releases and merge requests.</p>
        <p>As a addition to the textual worst-stack usage of the current firmware a historical overview of all
loaded firmware analysos in one graphic for each task and the stacked changes of it’s worst case call
path with changing size information of each function is displayed - an example is shown in figure 7.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Outlock</title>
      <p>This paper only scratches the possibilities of using static analysis to benefit firmware development.</p>
      <p>Some issues are raised in Section 2.3.</p>
      <p>The quality and correctness of produced analysis with puncover is though interesting and useful up
for formal debate, as no reference test suit and comparison with measured dynamic stack usage has
formally been done and remains an open task.</p>
      <p>The introduced user-defined dynamic callbacks could be automated partially in build environments
like Zephyr’s West for parts of driver abstractions in various stacks - so that e.g. configured driver
implementations known at compile time but implemented as dynamic function calls could generate a
configuration file some dynamic callbacks present this way. Also, the user could be guided and warned
about jump instructions leading outside of functions that could not be mapped to any static functions
to be potential dynamic calls and a user interface to associate them with a function, to make the manual
part easier and less error-prone.</p>
      <p>Extending and integrating stack usage into the twister build helper, especially the –footprint-report
and –footprint-threshold options are interesting candidates, so that the reports for many build targets
could be generated in one call and the key results be evaluated automatically by elevating existing
program options.</p>
      <p>Further ideas to process and condition the data for data vizualisation for further insight over the
development cycle are open to be explored.</p>
      <p>
        Adding Embedded Software Timing as another analysis item could be an attractive option. The
results of further static analysis tools such as platin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] could be compiled into a common extended
report format to be displayed in the presented pexplorer as a single interface.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Thanks to the sonnen GmbH for the self-study time to think of ways to cut debugging time and search
for ideas how to automatically identify memory problems early, to enhance puncover and to watch
certain firmware metrics more easily.</p>
      <p>Thank you Tobi and Stephan for encouraging to further work and present the topic.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Gliwa</surname>
          </string-name>
          , Embedded software timing (
          <year>2021</year>
          ). URL: https://www.gliwa.com/book/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Botcazou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Comar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hainque</surname>
          </string-name>
          ,
          <article-title>Compile-time stack requirements analysis with gcc motivation, development, and experiments results (</article-title>
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Maroun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dengler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dietrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hepp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Herzog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Knoop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wiltsche-Prokesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Puschner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rafeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schoeberl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wägemann</surname>
          </string-name>
          ,
          <article-title>The Platin Multi-Target Worst-Case Analysis Tool</article-title>
          , in: T. Carle (Ed.),
          <source>22nd International Workshop on Worst-Case Execution Time Analysis (WCET</source>
          <year>2024</year>
          ), volume
          <volume>121</volume>
          of Open Access Series in Informatics (OASIcs),
          <source>Schloss Dagstuhl - Leibniz-Zentrum für Informatik</source>
          , Dagstuhl, Germany,
          <year>2024</year>
          , pp.
          <volume>2</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          :
          <fpage>14</fpage>
          . URL: https://drops.dagstuhl. de/entities/document/10.4230/OASIcs.WCET.
          <year>2024</year>
          .
          <article-title>2</article-title>
          . doi:
          <volume>10</volume>
          .4230/OASIcs.WCET.
          <year>2024</year>
          .
          <volume>2</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>