Inferring The Best Static Analysis Tool for Null Pointer Dereference in Java Source Code

Midya Alqaradaghi 1,2,∗, Tamás Kozsik 1

1 Department of Programming Languages and Compilers, ELTE Eötvös Loránd University, Budapest, Hungary
2 Technical College of Kirkuk, Northern Technical University, Kirkuk, Iraq

∗ Corresponding author. Email: alqaradaghi.midya@inf.elte.hu (M. Alqaradaghi); kto@inf.elte.hu (T. Kozsik)
ORCID: 0000-0001-9881-5854 (M. Alqaradaghi); 0000-0003-4484-9172 (T. Kozsik)
SQAMIA 2022: Workshop on Software Quality, Analysis, Monitoring, Improvement, and Applications, September 11–14, 2022, Novi Sad, Serbia

Abstract
Finding software bugs and security vulnerabilities with static source code analysis is a viable approach. Static source code analysis techniques are sufficiently mature for industrial use, and numerous tools have been developed to aid the automatic detection of software faults. In this paper, the capabilities of three static source code analysis tools are investigated with respect to identifying null pointer dereference in Java source code. Our research uses artificial test cases as a benchmark. The study reports performance results based on five metrics. The experiments show that Facebook Infer outperforms the other tools in identifying null pointer dereference.

Keywords
Static analysis, Facebook Infer, SonarQube, SpotBugs, null pointer dereference, CWE476

1. Introduction

Static analysis technologies uncover security flaws early in the software development phase, which saves time and cost and ensures quick feedback for the responsible programmer. The vulnerabilities and bugs that can be identified with static analysis cover a wide range, from simple programming errors to more complex issues such as access control problems [1].

Null pointer dereference is a memory access error [2]. It arises when a program follows a pointer that is supposed to refer to a valid object but is actually null; this results in a program crash or an exception. In the Java programming language, this issue causes a NullPointerException to be thrown. Null pointer dereferences typically originate from faulty assumptions made by the programmer. The majority of null pointer problems result in general software reliability issues, but if an attacker can provoke one deliberately, they may be able to use the resulting exception to bypass security checks or to make the application reveal debugging data that is valuable for planning further attacks [3]. Null pointer dereference is considered one of the most common programming errors in the Java programming language [2].
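To make the weakness concrete, the following minimal Java fragment (our own schematic example, not taken from any benchmark) dereferences a reference that can be null and therefore throws a NullPointerException at run time:

    public class NullDereferenceExample {

        // Returns null for unknown keys; the caller below forgets to handle this case.
        static String lookup(String key) {
            return "host".equals(key) ? "localhost" : null;
        }

        public static void main(String[] args) {
            String value = lookup("port");      // returns null for this key
            // POTENTIAL FLAW: value may be null, so the call below throws
            // a NullPointerException at run time.
            System.out.println(value.length());
        }
    }

A guarded variant, e.g. checking value != null before the call, removes the flaw; this distinction between flawed and guarded code is exactly what the benchmark introduced in Section 2 encodes.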
There are many static analysis tools – with varying capabilities – which can detect null pointer dereference. It is known that static analysis for an “interesting” problem is undecidable, i.e., it is not possible to build an algorithm that produces an accurate answer in each and every case [1]. As a result, static analysis tools are prone to reporting findings in source code that are not vulnerabilities (false positives), and they may fail to detect some of the vulnerabilities (false negatives). Static analysis tools would, in an ideal world, uncover as many vulnerabilities as possible, optimally all, with as few false positives as possible, ideally none.

With the goal of better understanding their strengths and shortcomings, this paper focuses on an empirical assessment of three static analysis tools and their performance in identifying null pointer dereference in Java source code. For this purpose, we used the CWE476 Null Pointer Dereference test cases of the Juliet Test Suite [4] to benchmark the Facebook Infer [5], SonarQube [6] and SpotBugs [7] tools.

The main contributions of this work are as follows.

• We present quantitative results on the performance of three well-known, free, and open-source static analysis tools in identifying null pointer dereference in Java source code, based on the Juliet Test Suite.
• We present the numbers and types of shared, unique, and missed detections of these tools.
• We report on five performance metrics – recall, false alarm rate, precision, G-score, and F-measure – for the analyzed test cases.

The main empirical observations of this work are the following.

• None of the tools was able to detect all null pointer dereference errors in the test cases of the Juliet Test Suite; specifically, 8% of the flawed constructs were missed by all of them.
• 14% of the vulnerabilities were detected only by Facebook Infer – these have been completely missed by the other two tools. Moreover, Infer gave the highest calculated recall, G-score, and F-measure.
• SpotBugs gave the highest calculated precision and the lowest (best) false alarm rate.

The remainder of the paper is structured as follows. Information on the static analysis tools and the Juliet benchmark, as well as some preliminaries of the research, are presented in Section 2. Then, Section 3 introduces the applied research method. The results of analyzing the null pointer dereference test cases with the three tools are given in Section 4. We discuss related work in Section 5. Finally, Section 6 draws the conclusions.

2. Background

Null pointer dereference is a major source of bugs and vulnerabilities in programming languages without static typing support for nullable and non-nullable references [8]. It is also the main memory-related vulnerability in Java, a strongly typed, garbage-collected language.

The Common Weakness Enumeration (CWE) “is a community-developed list of software and hardware weakness types. It serves as a common language, a measuring stick for security tools, and as a baseline for weakness identification, mitigation, and prevention efforts” [9]. It lists null pointer dereference under the identifier CWE476 [10].

The Juliet Test Suite [4] is a set of artificial test cases with predetermined outcomes for evaluating the effectiveness of software-assurance techniques, including static analysis tools, in discovering various software faults and vulnerabilities. These test cases were developed to enable the automatic evaluation of static analysis tools. The Java test cases are divided into 112 weakness categories, including CWE476 Null Pointer Dereference, which contains positive test cases (flawed constructs with the word bad in their names; these are supposed to be reported) and negative test cases (unflawed constructs with the word good in their names; these are not supposed to be reported). These two groups of test cases are called positives and negatives, respectively, throughout the paper.
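To illustrate the shape of such test cases, the following schematic pair is our own simplified sketch (real Juliet test cases carry long CWE-prefixed class names and use helper classes): the bad method is a positive that should be reported, while the good method is a negative that should not be.

    public class Cwe476SchematicPair {

        /* Positive (flawed) construct: the tools are supposed to report it. */
        public void bad() {
            String data = null;                  /* POTENTIAL FLAW: data is null */
            System.out.println(data.length());   /* null dereference happens here */
        }

        /* Negative (unflawed) construct: the tools are supposed to stay silent. */
        public void good() {
            String data = "benign value";
            if (data != null) {                  /* guard prevents dereferencing null */
                System.out.println(data.length());
            }
        }
    }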
When a static analysis tool correctly reports a positive, this is referred to as a true positive. Conversely, when a static analysis tool mistakenly reports a negative, this is referred to as a false positive. A false negative is the case when the tool misses one of the positives. Finally, a true negative is the case when a negative is, correctly, not reported.

The versions of the tools used in this study are Infer 1.1.0, SonarQube 8.8.0.42792 (Community Edition) with SonarScanner 4.6.0.2311, and SpotBugs 4.2.3. Moreover, we used the most up-to-date version of Juliet Java, which is version 1.3.

3. Research Method

Let us first explain the design of the experiment, including the metrics used. Then we present the details of the implementation of the experiment.

3.1. Experiment Design

Our study follows a two-factorial experiment design aimed at exploring the abilities of three selected static analysis tools in the detection of null pointer dereference in the Juliet Java test suite. The two factors (i.e., independent variables) are the following: (1) static analysis tool, and (2) type of vulnerability.

The levels of the first factor are the specific tools used for evaluation. We started the tool selection process by compiling a survey of twelve commercial tools and eight open-source tools. The main characteristics of each tool were extracted from their documentation. The selection criteria were as follows: (1) the tool has to be free and widely used; (2) it should specifically identify null pointer dereference; (3) it should support Java. Therefore, we excluded all the commercial tools, and from the list of the eight open-source tools, we first excluded the tools that did not support Java and the tools that did not include specific rules for the identification of null pointer dereference. Hence we ended up choosing Facebook Infer, SonarQube, and SpotBugs. For the second factor, there is a single level: we investigate null pointer dereference only, which is CWE476.

As response variables (i.e., dependent variables) we used five metrics: recall, false alarm rate, precision, G-score, and F-measure; these reflect several dimensions of the performance of the tools. To compute these response variables, we start by calculating true positives (TP) and false positives (FP) from the reports of each tool. These, along with the number of positives and negatives, are then used to compute the metrics below.

$$\mathrm{Recall} = \frac{TP}{\mathit{Positives}} \tag{1}$$

$$\mathrm{False\ alarm\ rate} = \frac{FP}{\mathit{Negatives}} \tag{2}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{G\text{-}score} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Specificity}}{\mathrm{Recall} + \mathrm{Specificity}} \tag{4}$$

$$\mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{5}$$

Recall describes the capability of accurately detecting vulnerabilities; it is defined as the ratio of true positives to the total number of positives (Eq. (1)). Hence, recall is restricted to flawed constructs and shows the percentage of flawed constructs successfully recognized by a tool.

False alarm rate is exclusively concerned with unflawed test cases, and it reflects the percentage of unflawed test cases that have been misidentified as flawed ones. It is the ratio of false positives to the total number of negatives, i.e., the rate at which unflawed constructs are mistakenly reported as flawed. It is given by Eq. (2).

Specificity (used in the computation of the G-score) is directly related to the false alarm rate. It is the ratio of true negatives to all negatives (hence it is equal to 1 − False alarm rate).

Precision, given by Eq. (3), is the ratio of true positives to the sum of true positives and false positives. Hence precision is concerned with all the reported test cases; it shows the proportion of successfully detected flawed constructs among all (flawed and unflawed) constructs reported by a tool.

Eq. (4) defines the G-score, which is the harmonic mean of recall and specificity. It enables us to merge two key metrics into a single one. Similarly, the F-measure is the harmonic mean of recall and precision (Eq. (5)). They both describe the accuracy of the analysis. High recall, precision, G-score, and F-measure, as well as low false alarm rate values, indicate higher performance, and they are all in the interval [0, 1].
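The computation of these response variables from the raw counts is straightforward; the following sketch (our own illustration with hypothetical counts and variable names) mirrors Eqs. (1)–(5):

    public class MetricsSketch {
        public static void main(String[] args) {
            // Hypothetical counts; in the study they come from the tool reports and the benchmark.
            int positives = 200, negatives = 500;   // ground-truth flawed / unflawed constructs
            int tp = 170, fp = 20;                  // reported flawed / mistakenly reported unflawed

            double recall      = (double) tp / positives;                           // Eq. (1)
            double falseAlarm  = (double) fp / negatives;                           // Eq. (2)
            double precision   = (double) tp / (tp + fp);                           // Eq. (3)
            double specificity = 1.0 - falseAlarm;
            double gScore      = 2 * recall * specificity / (recall + specificity); // Eq. (4)
            double fMeasure    = 2 * recall * precision / (recall + precision);     // Eq. (5)

            System.out.printf("recall=%.2f  false alarm rate=%.2f  precision=%.2f  G=%.2f  F=%.2f%n",
                    recall, falseAlarm, precision, gScore, fMeasure);
        }
    }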
3.2. Experiment Execution

The execution of our experiment can be structured into six major steps, as described in the flowchart shown in Figure 1.

Figure 1: Flow chart of our experiment.

Step 1: Evaluate CWE476 test cases. According to previous research [11], Juliet's test cases are not perfect and may contain some issues. Therefore, we manually reviewed all the positive and negative test cases of the CWE476 group of Juliet Java in order to evaluate and validate their effectiveness. Only the valid test cases were passed to the next step.

Step 2: Determine the analyzer(s) of Facebook Infer which target null pointer dereference. Based on the documentation of Infer, we determined the analyzers which target null pointer dereference. This step is important because some checkers are not turned on by default. The relevant analyzers are Pulse and Biabduction. The latter runs by default when Infer is invoked, while Pulse has to be explicitly activated. Note that the related analyzers of SonarQube and SpotBugs are all activated by default.

Step 3: Run Infer on the Java CWE476 test cases and get the output report. Both analyzers of Infer mentioned in Step 2 were run on all the test cases that passed Step 1. This resulted in an output report which contains all detections. More details about the commands used and other technicalities can be found in the first author's GitHub repository [12].

Step 4: Calculate TP, FP, and FN. True positives, false positives, and false negatives were counted for Facebook Infer, and the findings were verified with a manual review. (A schematic sketch of this tallying is given at the end of this section.)

• If the tool reported a positive test case (a bad method), it was counted as a true positive (TP). Only one TP was counted for each detected bad method, regardless of the number of reports on a single test case.
• If the tool reported a negative test case (a good method), it was counted as a false positive (FP).
• If the tool did not report a positive test case (a bad method), it was counted as a false negative (FN).

Table 1: TP and FP results of running the three tools on CWE476 of the Juliet Test Suite (181 positives, 466 negatives).

Tool            TP    FP
Infer           166    12
SonarQube       118    80
SpotBugs(h)     106     0
SpotBugs(h&n)   129     0
SpotBugs(all)   166   264

Step 5: Calculate and analyze shared, unique & missed detections. For calculating the shared, unique, and missed detections, we used the results obtained for Infer from Step 4, and we also brought in the analysis results for SonarQube and SpotBugs from our previous study [13] – with some adjustments (see Section 4).

Step 6: Calculate metrics: recall, false alarm rate, precision, G-score, and F-measure. We computed true positives and false positives in Step 4, as listed in Table 1. Those values, together with the numbers of positives and negatives from Step 1, are used in this step to compute the metrics defined in Section 3.1.
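The per-method counting rules of Step 4 amount to a simple classification. The sketch below is our own simplified reconstruction (the names are ours, and it is not the actual evaluation script, which is available in the first author's GitHub repository [12]); each bad or good method is counted at most once, regardless of how many findings a tool emits for it.

    import java.util.Set;

    public class TallySketch {

        /** Classify one benchmark method, given the set of methods the tool reported at least once. */
        static String classify(String methodName, Set<String> reportedMethods) {
            boolean flawed = methodName.contains("bad");              // positive (flawed) test case
            boolean reported = reportedMethods.contains(methodName);  // counted once per method
            if (flawed && reported)  return "TP";  // detected flawed construct
            if (!flawed && reported) return "FP";  // unflawed construct reported as flawed
            if (flawed)              return "FN";  // flawed construct missed by the tool
            return "TN";                           // unflawed construct correctly not reported
        }

        public static void main(String[] args) {
            Set<String> reported = Set.of("bad", "good1");        // hypothetical tool output
            System.out.println(classify("bad", reported));        // TP
            System.out.println(classify("good1", reported));      // FP
            System.out.println(classify("badSink", reported));    // FN
            System.out.println(classify("good2", reported));      // TN
        }
    }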
4. Results and Discussion

We now present the quantitative results of the experiment described in Section 3, together with their qualitative analysis.

In Step 1 of the experiment, the test cases in Juliet Java 1.3 were thoroughly investigated. It turned out that although the documentation of Juliet states that the number of positives for CWE476 is 198, 17 of them are incorrectly labeled as positives (they do not result in a null pointer dereference). Consequently, we have excluded them from the positive test cases. Moreover, since the number of negatives is not mentioned in the documentation, we simply counted the number of "good" methods (which is 496) – however, after the manual review, 30 of them turned out to be incorrectly labeled as good, and hence they had to be excluded from the negatives. The final numbers of positives and negatives used in this study are 181 and 466, respectively. More details about the excluded test cases and the script used for counting negatives are provided in the first author's GitHub repository [12].

4.1. True Positives and False Positives

The results of running the three tools on Juliet's null pointer dereference test cases are shown in Table 1. The figures presented here for SonarQube and SpotBugs originate from our previous research [13], but they are adjusted to the slightly different methodology applied in this study; true positives and false positives have been recalculated for SonarQube and SpotBugs to ensure that a uniform data collection principle is followed for all three tools. More information on this can be found in the first author's GitHub repository [12].

In SpotBugs, detected bugs have priorities and can be shown at three levels: (i) high, (ii) high & normal, and (iii) all priorities. These priorities represent confidence levels. The corresponding configurations are named SpotBugs(h), SpotBugs(h&n), and SpotBugs(all) in Table 1. In our analysis below we rely on the SpotBugs(h&n) configuration because it gives the best results among the three options, that is, the best balance between a high number of true positives and a low number of false positives.

We can see from Table 1 that the best-performing tool for this particular problem is currently Facebook Infer. This tool gave the highest number of true positives with a reasonably low number of false positives. The low false positive rate of Infer is due to its ability to avoid infeasible control paths.
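To illustrate what avoiding infeasible control paths means here, consider the following good-style sketch (our own example, not an actual Juliet test case): the null assignment lies on a branch that can never execute, so a path-sensitive analysis does not report a dereference at the last line, whereas a more conservative, path-insensitive check might.

    public class InfeasiblePathSketch {

        private static final boolean ALWAYS_FALSE = false;

        public void good() {
            String data;
            if (ALWAYS_FALSE) {
                data = null;            // dead branch: this assignment can never be executed
            } else {
                data = "initialized";   // the only feasible path assigns a non-null value
            }
            // Reporting a null dereference here would be a false positive,
            // because data is non-null on every feasible path.
            System.out.println(data.length());
        }
    }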
4.2. Shared, Unique and Missed Detections

For a better understanding of how well the tools perform in the detection of CWE476, it is important to further analyze the results of Section 4.1 with the help of a manual review, and to investigate the number and characteristics of detected test cases shared among some, or all, of the tools. It is also instructive to look at the true and false positives which are detected by a single tool, and at the cases which have been missed by all the tools (viz. the false negatives). For more details see the first author's GitHub repository [12].

Shared and Unique Detections. Figure 2 presents the number of shared and unique detections by the tools.

Figure 2: Shared and unique detections. Left: true positives; Right: false positives.

The most important observations are the following:

1. There are 25 positives (15% of the 166 true positive detections, and 14% of all 181 positives) which were detected only by Infer, and not by the other tools. These test cases require data flow analysis among methods located in different classes and files (as an example, see Listing 1).
2. The three tools detected 106 positives in common (64% of the 166 true positive detections, and 59% of the 181 positives). These test cases of the Juliet Test Suite are straightforward cases of null pointer dereference, and the dereference occurs on clearly feasible execution paths. According to the terminology of the Juliet Test Suite, these test cases belong to the baseline and control flow categories, and they require control flow and data flow analysis between different methods of the same class definition.
3. There is not a single tool that is able to detect all of the positives.
4. The tools together have detected 92% of the positives. In particular, Infer was able to detect this 92% of the positives by itself, i.e., it could detect everything that the other tools could detect (disregarding SpotBugs(all) here, as that configuration yields an unacceptable number of false positives).
5. The tools have reported 92 false positives in total. Infer and SonarQube have 8 of them in common. These negatives rely on the value of a modifiable instance field; the tools behave conservatively by assuming that such a field might be changed somewhere (a sketch of this pattern is shown after Listing 1). One can also observe that SpotBugs(h&n) performed extremely well in this respect by not giving any false positives.

Listing 1: An example of a test case detected only by Infer, from Juliet Java 1.3

public class CWE476_NULL_Pointer_Dereference__int_array_22a extends AbstractTestCase {
    /* The public static variable below is used to drive control flow in the sink function.
     * The public static variable mimics a global variable in the C/C++ language family. */
    public static boolean badPublicStatic = false;

    public void bad() throws Throwable {
        int[] data = null;
        data = null; /* POTENTIAL FLAW: data is null */
        badPublicStatic = true;
        (new CWE476_NULL_Pointer_Dereference__int_array_22b()).badSink(data);
    }
}

public class CWE476_NULL_Pointer_Dereference__int_array_22b {
    public void badSink(int[] data) throws Throwable {
        if (CWE476_NULL_Pointer_Dereference__int_array_22a.badPublicStatic) {
            IO.writeLine("" + data.length); /* POTENTIAL FLAW: null dereference will occur if data is null */
        } else {
            /* INCIDENTAL: CWE 561 Dead Code, the code below will never run
             * but ensure data is initialized before the Sink to avoid compiler errors */
            data = null;
        }
    }
}
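The shared false positives mentioned in observation 5 follow the pattern sketched below (our own simplification of the described construct): the guard is stored in a modifiable instance field, and although the field is never set to true, a tool that conservatively assumes the field could be changed elsewhere reports a possible null dereference.

    public class FieldGuardedSketch {

        // Modifiable (non-final) instance field that drives the control flow.
        private boolean useNull = false;

        public void good() {
            String data;
            if (useNull) {
                data = null;            // only reachable if useNull were ever set to true
            } else {
                data = "initialized";
            }
            // useNull stays false in this class, so data cannot be null here;
            // a conservative tool still reports a (false positive) dereference below.
            System.out.println(data.length());
        }
    }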
Missed Test Cases (False Negatives). We mentioned earlier that the tools of this study detected 92% of the positives. This means that only 8% of the positives were missed by Infer, and these have been missed by all the other tools as well. The missed test cases all rely on passing a data structure (an array, a Vector, or a LinkedList) storing a null value, or a serialized object having a field with a null value, as a parameter to a method. The detection of vulnerabilities of this type may require built-in (lexical) knowledge about the standard library (the logic of the data structures and the behavior of serialization), which – considering the volume of the Java standard library – is indeed a really demanding requirement.

Listing 2 presents a test case missed by Infer and the other two tools. The test case illustrates a vulnerability that involves more than one method. A static analyzer is expected to detect the bug either at line 10, or at line 19, or at both. Identifying this bug requires analysis that crosses method boundaries, which may be achieved by global analysis approaches. Infer can successfully detect such positive test cases in general. However, this test case requires maintaining (and passing around) information about an element of an array, which proved to be too sophisticated for Infer. Similar positives are missed by Infer when an element of some other data structure is set to null.

Listing 2: An example of a test case missed by all three tools, from Juliet Java 1.3

1   import java.util.Vector;
2   public class CWE476_NULL_Pointer_Dereference__int_array_72a extends AbstractTestCase {
3       public void bad() throws Throwable {
4           int[] data;
5           data = null; /* POTENTIAL FLAW: data is null */
6           Vector<int[]> dataVector = new Vector<int[]>(5);
7           dataVector.add(0, data);
8           dataVector.add(1, data);
9           dataVector.add(2, data);
10          (new CWE476_NULL_Pointer_Dereference__int_array_72b()).badSink(dataVector);
11      }
12  }
13
14  import java.util.Vector;
15  public class CWE476_NULL_Pointer_Dereference__int_array_72b {
16      public void badSink(Vector<int[]> dataVector) throws Throwable {
17          int[] data = dataVector.remove(2);
18          /* POTENTIAL FLAW: null dereference will occur if data is null */
19          IO.writeLine("" + data.length);
20      }
21  }

4.3. Performance of the Tools

The last step of our experiment described in Section 3 is to calculate the specified metrics. Figure 3 presents these for the three tools, characterizing their performance in the classification of the null pointer dereference test cases of the Juliet Test Suite. As mentioned in Section 4.1, in the case of SpotBugs, the results of the high & normal priority configuration have been considered in this research (i.e., SpotBugs(h&n)).

Figure 3: Calculated metrics for the tools.

The recall values are shown in the first column. Recall represents how well the positives are identified, without taking into account faulty reports on negatives. Higher recall indicates better performance. The findings revealed that SonarQube and SpotBugs had comparable recall values, whereas Infer had the best recall.

Column 2 displays the values for the false alarm rate. Smaller values here indicate better performance, since the false alarm rate captures unflawed constructs being mistakenly classified as flawed. It can be observed from Figure 3 that SpotBugs had the lowest false alarm rate among the three tools, but Infer also performed quite well in this respect.

The third column in Figure 3 presents the precision of the three tools in identifying null pointer dereference. Precision measures how well flawed methods are identified relative to all reports. Higher precision indicates better performance. SpotBugs has the highest precision, followed closely by Infer.

The G-score and the F-measure enable us to combine two metrics into one, as discussed in Section 3: recall and specificity (the complement of the false alarm rate) are combined in the G-score, and recall and precision are combined in the F-measure. For both metrics, values close to 1 indicate that the tool can detect a specific weakness well, with no or very few false positives. This can be observed in the case of Infer (0.95 and 0.92, respectively). SpotBugs and SonarQube performed significantly worse: both the G-score and the F-measure for SpotBugs(h&n) are 0.83, while SonarQube yields 0.73 and 0.62, respectively.
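For illustration, substituting the Infer counts from Table 1 (TP = 166 and FP = 12, with 181 positives and 466 negatives) into Eqs. (1)–(3) and (5) yields, rounded to two decimals,

$$\mathrm{Recall} = \frac{166}{181} \approx 0.92, \qquad \mathrm{False\ alarm\ rate} = \frac{12}{466} \approx 0.03,$$

$$\mathrm{Precision} = \frac{166}{166 + 12} \approx 0.93, \qquad \mathrm{F\text{-}measure} = \frac{2 \cdot 0.92 \cdot 0.93}{0.92 + 0.93} \approx 0.92,$$

in line with the values for Infer quoted in Section 5.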
5. Related Work

A variety of methodologies for evaluating static analysis tools are in use today. For benchmarking purposes, some researchers work with real code bases, while others rely on synthetic ones, or even both. Our study falls into the second category.

In a prior study [13], we compared the performance of four state-of-the-art static analysis tools (SonarQube, SpotBugs, Find Security Bugs, and PMD) in detecting six security vulnerabilities in Java source code, including null pointer dereference. The study found that SonarQube had the best aggregated performance and that all of the evaluated tools need to be improved to better address the highlighted weaknesses. The current research reveals the superiority of Facebook Infer for a very important vulnerability, null pointer dereference. We have excluded Find Security Bugs and PMD from this investigation because they provide no support for detecting null pointer dereference (although PMD can detect null pointer assignments).

Lipp et al. [14] evaluated the vulnerability detection capabilities of six static C code analyzers, including Infer, against 192 real-world vulnerabilities in free and open-source programs. When applied to real-world software projects, the evaluated static analyzers turned out to be rather ineffective; most of the known vulnerabilities were missed. Our research suggests that Infer is fairly effective (while SpotBugs and SonarQube are moderately effective) when analyzing a particular weakness, null pointer dereference, in Java code. This may be due to a better analyzability of Java compared to C, or to the fact that we carried out the experiment on a synthetic test suite.

Goseva-Popstojanova and Perhinschi [15] used 19 weakness categories, including null pointer dereference, to evaluate three tools with statistical methods. They used both real-world programs and (a previous version of) the Juliet Test Suite (for Java and C/C++) to benchmark the tools. They also claim that the tools were ineffective at detecting security flaws. They measured approximately 0.48 and 0.64 as the G-score for Juliet/Java/CWE476 using two (unspecified) tools, and their third tool was not able to detect CWE476 at all. Compared to our results, this suggests that the capabilities of static analysis tools have improved significantly over the past seven years.

Wouter Stikkelorum [11] provided a thorough assessment of Infer (version v0.8.1), which included benchmarking it against the Juliet Test Suite, running it on a large number of open-source projects, and on industrial code. According to his research, Infer performed well on industrial software, and the results are promising – which is in line with the opinion of other researchers [16, 17]. In particular, the results for the CWE476 category in the Java test cases of Juliet are quite similar to ours, although he used earlier versions of both Infer and Juliet: recall, precision, and F-measure were measured at 0.85, 0.95, and 0.90 in his experiment, compared to 0.92, 0.93, and 0.92 in ours. These values show a moderate increase in sensitivity and a slight decrease in precision for Infer.

6. Conclusion

The results of this research revealed that Facebook Infer is quite reliable when used for identifying null pointer dereference in the Juliet Java test suite. We can also conclude that SpotBugs is fairly good at detecting this vulnerability with the "high & normal" configuration, while SonarQube performed worst, mostly due to its high number of false positives.

• 14% of the flawed constructs were detected only by Infer, and not by the other tools.
• All the flawed constructs that have been detected by SonarQube and SpotBugs have also been detected by Infer, with a negligible number of false positives.
• SpotBugs is another very good static analysis tool, which, with well-chosen settings, gives highly accurate results.
• There is still room for improvement in static analysis tools.

Acknowledgments

Midya Alqaradaghi has been supported by the Stipendium Hungaricum program. The research of Tamás Kozsik has been supported by project no. TKP2021-NVA-29, which is implemented with the support provided by the Ministry of Innovation and Technology of Hungary from the National Research, Development, and Innovation Fund, financed under the TKP2021-NVA funding scheme.

References

[1] B. Chess, J. West, Secure Programming with Static Analysis, Software Security Series, Addison-Wesley, 2007.
[2] D. Hovemeyer, J. Spacco, W. Pugh, Evaluating and tuning a static analysis to find null pointer bugs, SIGSOFT Softw. Eng. Notes 31 (2005) 13–19. doi:10.1145/1108768.1108798.
[3] The OWASP Foundation, Null Dereference, https://owasp.org/www-community/vulnerabilities/Null_Dereference, Accessed: July 2022.
[4] National Institute of Standards and Technology, Juliet Java 1.3, https://samate.nist.gov/SRD/test-suites/111, 2017.
[5] Facebook Infer, A tool to detect bugs in Java and C/C++/Objective-C code, https://fbinfer.com/, Accessed: July 2022.
[6] SonarQube, Code quality and code security, https://www.sonarqube.org/, Accessed: July 2022.
[7] SpotBugs, Find bugs in Java programs, https://spotbugs.github.io/, Accessed: July 2022.
[8] T. Hoare, Null references: The billion dollar mistake, Talk at QCon London, 2009.
[9] The MITRE Corporation, The Common Weakness Enumeration Initiative, https://cwe.mitre.org/, Accessed: July 2022.
[10] The MITRE Corporation, CWE476 Null Pointer Dereference, https://cwe.mitre.org/data/definitions/476.html, Accessed: July 2022.
[11] W. Stikkelorum, M. Bruntink, Challenges of using sound and complete static code analysis tools in industrial software, Master's thesis, University of Amsterdam, Faculty of Science, Mathematics and Computer Science, 2016.
[12] M. Alqaradaghi, Technical report of null pointer dereference analysis, https://github.com/Midya-ELTE/Technical_Report_of_NullPointerDereference_Analysis, Accessed: July 2022.
[13] M. Alqaradaghi, G. Morse, T. Kozsik, Detecting security vulnerabilities with static analysis – a case study, Pollack Periodica 17 (2021). doi:10.1556/606.2021.00454.
[14] S. Lipp, S. Banescu, A. Pretschner, An empirical study on the effectiveness of static C code analyzers for vulnerability detection, in: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '22), 2022, p. 12.
[15] K. Goseva-Popstojanova, A. Perhinschi, On the capability of static code analysis to detect security vulnerabilities, Information and Software Technology (2015). doi:10.1016/j.infsof.2015.08.002.
[16] D. Distefano, M. Fahndrich, F. Logozzo, P. W. O'Hearn, Scaling static analyses at Facebook, Communications of the ACM 62 (2019) 62–70. doi:10.1145/3338112.
[17] L. H. Newman, How Facebook catches bugs in its 100 million lines of code, Communications of the ACM (2019).