Assessing Test Suite Effectiveness Using Static Metrics

Paco van Beckhoven (1,2), Ana Oprescu (1), and Magiel Bruntink (2)
(1) University of Amsterdam
(2) Software Improvement Group

Copyright (c) by the paper's authors. Copying permitted for private and academic purposes.
Proceedings of the Seminar Series on Advanced Techniques and Tools for Software Evolution, SATToSE 2017 (sattose.org), 07-09 June 2017, Madrid, Spain.
Abstract

With the increasing amount of automated tests, we need ways to measure test effectiveness. The state-of-the-art technique for assessing test effectiveness, mutation testing, is too slow and cumbersome to be used in large-scale evolution studies or code audits by external companies. In this paper we investigate two alternatives, namely code coverage and assertion count. We discovered that code coverage outperforms assertion count by showing a relation with test suite effectiveness for all analysed projects; assertion count displays such a relation in only one of the analysed projects. Further analysis of the relationship between assertion count, coverage and test suite effectiveness would allow us to circumvent some of the problems of mutation testing.
1 Introduction

Software testing is an important part of the software engineering process. It is widely used in industry for quality assurance, as tests can tackle software bugs early in the development process and also serve regression purposes [20]. Part of the software testing process is covered by developers writing automated tests such as unit tests. This process is supported by testing frameworks such as JUnit [19]. Monitoring the quality of the test code has been shown to provide valuable insight when maintaining high quality-assurance standards [18]. Previous research shows that as the size of production code grows, the size of test code grows along [43]. Quality control on test suites is therefore important, as the maintenance of tests can be difficult and generate risks if done incorrectly [22]. Typically, such risks are related to growing size and complexity, which consequently lead to incomprehensible tests. An important risk is the occurrence of test bugs, i.e., tests that fail although the program is correct (false positive) or, even worse, tests that do not fail when the program is not working as desired (false negative). Especially the latter is a problem when breaking changes are not detected by the test suite. This issue can be addressed by measuring the fault-detecting capability of a test suite, i.e., test suite effectiveness. Test suite effectiveness is measured by the number of faulty versions of a System Under Test (SUT) that are detected by a test suite. However, as real faults are unknown in advance, mutation testing is applied as a proxy measurement. It has been shown that mutant detection correlates with real fault detection [26].

Mutation testing tools generate faulty versions of the program and then run the tests to determine if the fault was detected. These faults, called mutants, are created by so-called mutators which mutate specific statements in the source code. Each mutant represents a very small change, to prevent changing the overall functionality of the program. Some examples of mutators are: replacing operands or operators in an expression, removing statements, or changing the returned values. A mutant is killed if it is detected by the test suite, either because the program fails to execute (due to exceptions) or because the results are not as expected. If a large set of mutants survives, it might be an indication that the test quality is insufficient, as programming errors may remain undetected.
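As an illustration (a hypothetical example of ours, not taken from the analysed projects), PIT's conditionals-boundary mutator replaces a relational operator with its boundary counterpart:

    class Account {
        // Original production method.
        static boolean isAdult(int age) {
            return age >= 18;
        }

        // The conditionals-boundary mutant of the same method replaces
        // ">=" with ">". A test suite kills this mutant only if some test
        // exercises the boundary, e.g., asserts that isAdult(18) is true.
        static boolean isAdultMutant(int age) {
            return age > 18;
        }
    }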
1.1 Problem statement

Mutation analysis is used to measure the test suite effectiveness of a project [26]. However, mutation testing techniques have several drawbacks, such as limited availability across programming languages and being resource expensive [46, 25]. Furthermore, mutation testing often requires compilation of the source code, and it requires running tests which often depend
on other systems that might not be available, rendering it impractical for external analysis. External analysis is often applied in industry by companies such as the Software Improvement Group (SIG) to advise companies on the quality of their software. All these issues are compounded when performing software evolution analysis on large-scale legacy or open source projects. Therefore, our research goal has both industry and research relevance.

1.2 Research questions and method

To tackle these issues, our goal is to understand to what extent metrics obtained through static source code analysis relate to test suite effectiveness as measured with mutation testing.

Preliminary research [40] on static test metrics highlighted two promising candidates: assertion count and static coverage. We structure our analysis on the following research questions:

RQ 1 To what extent is assertion count a good predictor for test suite effectiveness?
RQ 2 To what extent is static coverage a good predictor for test suite effectiveness?

We select our test suite effectiveness metric and mutation tool based on state-of-the-art literature. Next, we study existing test quality models to inspect which static metrics can be related to test suite effectiveness. Based on these results we implement a set of metrics using only static analysis.

To answer the research questions, we implement a simple tool that reads a project's source files and calculates the metric scores using static analysis.

Finally, we evaluate the individual metrics' suitability as indicators for effectiveness by performing a case study using our tool on three projects: Checkstyle, JFreeChart and JodaTime. The projects were selected from related research, based on the size and structure of their respective test suites. We focus on Java projects as Java is one of the most popular programming languages [15] and forms the subject of many recent research papers surrounding test effectiveness. We rely on JUnit [7] as the unit testing framework; JUnit is the most used unit testing framework for Java [44].

1.3 Contributions

In an effort to tackle the drawbacks of using mutation testing to measure test suite effectiveness, our research makes the following contributions: 1. An in-depth analysis of the relation between test effectiveness, assertion count and coverage as measured using static metrics for three large real-world projects. 2. A set of scenarios which influence the results of the static metrics, and their sources of imprecision. 3. A tool to measure static coverage and assertion count using only static analysis.

Outline. Section 2 revisits background concepts. Section 3 introduces the design of the static metrics that will be investigated, together with an effectiveness metric and a mutation tool. Section 4 describes the empirical method of our research. Results are shown in Section 5 and discussed in Section 6. Section 7 summarises related work and Section 8 presents the conclusion and future work.

2 Background

First, we introduce some basic terminology. Next, we describe a test quality model used as input for the design of our static metrics. We briefly introduce mutation testing and compare mutation tools. Finally, we summarise test effectiveness measures and describe mutation analysis.

2.1 Terminology

We define several terms used in this paper:

Test (case/method) An individual JUnit test.
Test suite A set of tests.
Test suite size The number of tests in a test suite.
Master test suite All tests of a given project.
Dynamic metrics Metrics that can only be measured by, e.g., running a test suite. When we state that something is measured dynamically, we refer to dynamic metrics.
Static metrics Metrics measured by analysing the source code of a project. When we state that something is measured statically, we refer to static metrics.

2.2 Measuring test code quality

Athanasiou et al. introduced a Test Quality Model (TQM) based on metrics obtained through static analysis of production and test code [18]. This TQM consists of the following static metrics:

Code coverage is the percentage of code tested, implemented via static call graph analysis [16].
Assertion-McCabe ratio indicates the tested decision points in the code; computed as the total number of assertion statements in the test code divided by the McCabe cyclomatic complexity score [33] of the production code.
Assertion Density indicates the ability to detect defects; computed as the number of assertions divided by the Lines Of Test Code (TLOC).
Directness indicates the ability to detect the location of a defect's cause when a test fails. Similar to code coverage, except that only methods directly called from a test are counted.
Maintainability is based on an existing maintainability model [21], adapted for test suites. The model consists of the following metrics for test code: Duplication, Unit Size, Unit Complexity and Unit Dependency.
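Stated as formulas (our restatement of the two assertion-based metrics above):

\[
\text{Assertion-McCabe ratio} = \frac{\#\,\text{assertions in test code}}{\text{McCabe complexity of production code}}
\qquad
\text{Assertion Density} = \frac{\#\,\text{assertions in test code}}{\text{TLOC}}
\]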




2.3 Mutation testing

Test effectiveness is measured by the number of mutants that are killed by a test suite. Recent research introduced a variety of effectiveness measures and mutant types. We describe the different types of mutants, mutation tools, types of effectiveness measures, and work on mutation analysis.

2.3.1 Mutant types

Not all mutants are equally easy to detect. Easy or weak mutants are killed by many tests and are thus often easy to detect. Hard-to-kill mutants can only be killed by very specific tests and often subsume other mutants. Below is an overview of the different types of mutants in the literature:

Mutant represents a small change to the program, i.e., a modified version of the SUT.
Equivalent mutants do not change the outcome of a program, i.e., they cannot be detected. Consider a loop that breaks if i == 10, where i increments by 1: a mutant changing the condition to i >= 10 remains undetected, as the loop still breaks when i becomes 10 (written out in the sketch after this list).
Subsuming mutants are sole contributors to the effectiveness scores [36]. If mutants are subsumed, they are often killed "collaterally" together with the subsuming mutant. Killing these collateral mutants does not lead to more effective tests, but they influence the test effectiveness score calculation.
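The loop example written out in Java (our illustration; process is a hypothetical workload):

    class LoopExample {
        static void original() {
            for (int i = 0; i < 100; i++) {
                if (i == 10) break;   // original condition
                process(i);
            }
        }

        // Equivalent mutant: "i == 10" becomes "i >= 10". Since i increases
        // by 1 starting from 0, the loop still exits exactly at i == 10,
        // so no test can observe a behavioural difference.
        static void mutant() {
            for (int i = 0; i < 100; i++) {
                if (i >= 10) break;   // mutated condition
                process(i);
            }
        }

        static void process(int i) { /* hypothetical workload */ }
    }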
2.3.2 Comparison of mutation tools

Three criteria were used to compare mutation tools for Java: 1. Effectiveness of the mutation adequate test suite of each tool. A mutation adequate test suite kills all the mutants generated by a mutation tool. Each test of this test suite contributes to the effectiveness score, i.e., if one test is removed, less than a 100% effectiveness score is achieved. A cross-testing technique is applied to evaluate the effectiveness of each tool's mutation adequate test suite: the adequate test suite of each tool is run on the set of mutants generated by the other tools. If the mutation adequate test suite for tool A would detect all the mutants of tool B, but the suite of tool B would not detect all the mutants of tool A, then tool A would subsume tool B. 2. The tool's application cost, in terms of the number of test cases that need to be generated and the number of equivalent mutants that would have to be inspected. 3. The execution time of each tool.

Kintis et al. analysed and compared the effectiveness of PIT, muJava and Major [27]. Each tool was evaluated using the cross-testing technique on twelve methods of six Java projects. They found that the mutation adequate test suite of muJava was the most effective, followed by Major and PIT. The ordering in terms of application cost was different: PIT required the fewest test cases and generated the smallest set of equivalent mutants.

Marki and Lindstrom performed similar research on the same mutation tools [32]. They used three small Java programs popular in the literature. They found that none of the mutation tools subsumed each other. muJava generated the strongest mutants, followed by Major and PIT; however, muJava generated significantly more equivalent mutants and was slower than Major and PIT.

Laurent et al. introduced PIT+, an improved version of PIT with an extended set of mutators [31]. They combined the test suites generated by Kintis et al. [27] into a mutation adequate test suite that would detect the combined set of mutants generated by PIT, muJava and Major. A mutation adequate test suite was also generated for PIT+. The set of mutants generated by PIT+ was equally strong as the combined set of mutants.

2.3.3 Effectiveness measures

We found three types of effectiveness measures:

Normal effectiveness is calculated as the number of killed mutants divided by the total number of non-equivalent mutants.
Normalised effectiveness is calculated as the number of killed mutants divided by the number of covered mutants, i.e., mutants located in code executed by the test suite. Intuitively, test suites killing more mutants while covering less code are more thorough than test suites killing the same number of mutants in a larger piece of source code [24].
Subsuming effectiveness is the percentage of killed subsuming mutants. Intuitively, strong mutants, i.e., subsuming mutants, are not equally distributed [36], which could lead to skewed effectiveness results.
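Stated as formulas (our restatement of the first two measures):

\[
\text{normal effectiveness} = \frac{|\text{killed mutants}|}{|\text{non-equivalent mutants}|}
\qquad
\text{normalised effectiveness} = \frac{|\text{killed mutants}|}{|\text{covered mutants}|}
\]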
2.3.4 Mutation analysis

In this section, we describe research conducted on mutation analysis that underpins our approach.

Mutants and real faults. Just et al. investigated whether generated faults are a correct representation of real faults [26]. Statistically significant evidence shows that mutant detection correlates with real fault detection. They could relate 73% of the real faults to common mutators. Of the remaining 27%, 10% can be detected by enhancing the set of commonly used mutators. They used Major for generating mutations. Equivalent mutants were ignored, as mutation scores were only compared for subsets of a project's test suite.

Code coverage and effectiveness. Inozemtseva and Holmes analysed the correlation between code coverage and test suite effectiveness [24] in twelve studies. They found three main shortcomings: 1. Studies did not control for suite size. As code coverage relates to test suite size (more coverage is achieved by adding more tests), it remains unclear whether the correlation with effectiveness was due to the size or the coverage of the test suite. 2. Small or synthetic programs limit generalisation to industry. 3. Comparing only test suites that fully satisfy a certain coverage criterion; they argue that these results can be generalised to more realistic test suites. Eight studies showed a correlation between some coverage type and effectiveness independently of size; the strength varied, in some studies appearing only for high coverage.

They also conducted an experiment on five large open source Java projects. All mutants undetected by the master test suite were marked equivalent. To control for size, fixed-size test suites were generated by randomly selecting tests from the master test suite. Coverage was measured using CodeCover [3] on the statement, decision and modified condition levels. Effectiveness was measured using normal and normalised effectiveness. They found a low to moderate correlation between coverage and normal effectiveness when controlling for size. The coverage type had little impact on the correlation strength, and only a weak correlation was found for normalised effectiveness.

Assertions and effectiveness. Zhang and Mesbah studied the relationship between assertions and test suite effectiveness [45]. Their experiment used five large open source Java projects, similarly to Inozemtseva and Holmes [24]. They found a strong correlation between assertion count and test effectiveness, even when test suite size was controlled for. They also found that some assertion types are more effective than others, e.g., boolean and object assertions are more effective than string and numeric assertions.

3 Metrics and mutants

Our goal is to investigate to what extent static analysis based metrics are related to test suite effectiveness. First, we need to select a set of static metrics. Secondly, we need a tool to measure these metrics. Thirdly, we need a way to measure test effectiveness.

3.1 Metric selection

We choose two static analysis-based metrics that could predict test suite effectiveness. We analyse the state-of-the-art TQM by Athanasiou et al. [18] because it is already based on static source code analysis. Furthermore, the TQM was developed in collaboration with SIG, the host company of this thesis, which means that knowledge of the model is directly available. This TQM consists of the following static metrics: Code Coverage, Assertion-McCabe ratio, Assertion Density, Directness and Test Code Maintainability (see also Section 2.2).

Test code maintainability relates to code readability and understandability, indicating how easily we can make changes. We drop maintainability as a candidate metric, as we consider it the least related to the completeness or effectiveness of tests.

The model also contains two assertion-based and two coverage-based metrics. Based on preliminary results we found that the number of assertions had a stronger correlation with test effectiveness than the two assertion-based TQM metrics for all analysed projects. Similarly, static code coverage performed better than directness in the correlation test with test effectiveness. To get a more qualitative analysis, we focus on one assertion-based metric and one coverage-based metric, respectively assertion count and static coverage.

Furthermore, coverage was shown to be related to test effectiveness [24, 35]. Others found a relation between assertions and fault density [28] and between assertions and test suite effectiveness [45].

3.2 Tool implementation

In this section, we explain the foundation of the tool and the details of the implemented metrics.

3.2.1 Tool architecture

Figure 1 presents the analysis steps. The rectangles are artefacts that form the in/output for the two processing stages.

The first processing step is performed by the Software Analysis Toolkit (SAT) [29], which constructs a call graph using only static source code analysis. Our analysis tool uses the call graph to measure both assertion count and static method coverage.

The SAT analyses source code and computes several metrics, e.g., Lines of Code (LOC), McCabe complexity [33] and code duplication, which are stored in a source graph. This graph contains information on the structure of the project, such as which packages contain which classes, which classes contain which methods, and the call relations between these methods. Each node is annotated with information such as lines of code. This graph is designed such that it can be used for many programming languages. By implementing our metrics on top of the SAT, we can do measurements for different programming languages.

3.2.2 Code coverage

Alves and Visser designed an algorithm for measuring method coverage using static source code analysis [16]. The algorithm takes as input a call graph obtained by static source code analysis.
         Figure 1: Analysis steps to statically measure coverage and assertion count.
The calls from test to production code are counted by slicing the source graph and counting the methods. This includes indirect calls, e.g., from one production method to another. Additionally, the constructor of each called method's class is included. They found a strong correlation between static and dynamic coverage (the mean of the difference between static and dynamic coverage was 9%). We use this algorithm with the call graph generated by the SAT to calculate the static method coverage.

However, the static coverage algorithm has four sources of imprecision [16]. The first is conditional logic, e.g., a switch statement that invokes a different method for each case. The second is dynamic dispatch (virtual calls), e.g., a parent class with two subclasses that both override a method that is called on the parent. The third is library/framework calls, e.g., java.util.List.contains() invokes the .equals() method of each object in the list; the source code of third-party libraries is not included in the analysis, making it impossible to trace which methods are called from the framework. The fourth is the use of Java reflection, a technique to invoke methods dynamically at runtime without knowledge of these methods or classes at compile time.

For the first two sources of imprecision, an optimistic approach is chosen, i.e., all possible paths are considered covered. Consequently, the coverage is overestimated. Invocations by the latter two sources of imprecision remain undetected, leading to underestimating the coverage.
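A minimal sketch of the slicing step (our illustration; the CallGraph structure and method names are assumptions, not the SAT's actual API):

    import java.util.*;

    class CallGraph {
        // method -> methods it (potentially) calls; virtual-call edges are
        // included optimistically, as in the algorithm described above.
        final Map<String, Set<String>> callees = new HashMap<>();

        // Static method coverage: the fraction of production methods
        // reachable from any test method by following call edges.
        double staticMethodCoverage(Set<String> testMethods,
                                    Set<String> productionMethods) {
            Set<String> visited = new HashSet<>();
            Deque<String> work = new ArrayDeque<>(testMethods);
            while (!work.isEmpty()) {
                String method = work.pop();
                if (visited.add(method)) {
                    work.addAll(callees.getOrDefault(method, Set.of()));
                }
            }
            visited.retainAll(productionMethods); // keep production methods only
            return (double) visited.size() / productionMethods.size();
        }
    }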
3.2.3 Assertions

We measure the number of assertions using the same call graph as the static method coverage algorithm. For each test, we follow the call graph through the test code to include all direct and indirect assertion calls. Indirect calls are important because test classes often contain some utility method for asserting the correctness of an object. Additionally, we take into account the number of times a method is invoked, to approximate the number of executed assertions. Only assertions that are part of JUnit are counted.

Identifying tests. By counting assertions based on the number of invocations from tests, we should also be able to identify these tests statically. We use the SAT to identify all invocations of assertion methods and then slice the call graph backwards, following all call and virtual call edges. All nodes within scope that have no parameters and no incoming edges are marked as tests.
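A sketch of this backward slice (our illustration; the data structures are assumptions, not the SAT's actual API):

    import java.util.*;
    import java.util.function.Predicate;

    class TestIdentifier {
        // callers: the reverse call graph, mapping a method to the methods
        // that invoke it (call and virtual-call edges alike).
        static Set<String> identifyTests(Map<String, Set<String>> callers,
                                         Set<String> assertionMethods,
                                         Predicate<String> isParameterlessRoot) {
            Set<String> visited = new HashSet<>();
            Deque<String> work = new ArrayDeque<>(assertionMethods);
            while (!work.isEmpty()) {            // backward slice from assertions
                String method = work.pop();
                if (visited.add(method)) {
                    work.addAll(callers.getOrDefault(method, Set.of()));
                }
            }
            Set<String> tests = new HashSet<>();
            for (String method : visited) {
                // no parameters and no incoming edges -> marked as a test
                if (isParameterlessRoot.test(method)) {
                    tests.add(method);
                }
            }
            return tests;
        }
    }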
Assertion content types. Zhang and Mesbah found a significant difference between the effectiveness of assertions depending on the type of objects they assert [45]. Four assertion content types were classified: numeric, string, object and boolean. They found that object and boolean assertions are more effective than string and numeric assertions. The type of objects in an assertion can thus give insights into the strength of the assertion. We will include the distribution of these content types in the analysis.

We use the SAT to analyse the type of objects in an assertion. The SAT is unable to detect the type of an operator expression used inside a method invocation, e.g., assertTrue(a >= b);, resulting in unknown assertion content types. Also, fail statements are put in a separate category, as these are a special type of assertion without any content type.

3.3 Mutation analysis

In this section we discuss our choice of the mutation tool and the test effectiveness measure.

3.3.1 Mutation tool

We presented four candidate mutation tools for our experiment in Section 2.3.2: Major, muJava, PIT and PIT+. MuJava has not been updated in the last two years and does not support JUnit 4 and Java versions above 1.6 [9]. Conforming to these requirements would decrease the set of projects we could use in our experiment, as both JUnit 4 and Java 1.7 have been around for quite some time. Major does support JUnit 4 and has recently been updated [8]; however, it only works in Unix environments [32]. PIT targets industry [27], is open source and is actively developed [12]. Furthermore, it supports a wide range of build tooling and is significantly faster than the other tools. PIT+ is based on a two-year-old branched version of PIT and was only recently made available [10]. Its documentation is very sparse and its source code is missing. However, PIT+ generates a stronger set of mutants than the other three tools, whereas PIT generates the weakest set of mutants.

Based on these observations we decided that PIT+ would be the best choice for measuring test effectiveness. Unfortunately, PIT+ was not available at the start of our research. We first did the analysis based on PIT and later switched to PIT+. Because we first used PIT, we selected projects that used Maven as a build tool. PIT+ is based on an old version of PIT, 1.1.5, which did not yet support Maven. To enable using the features of PIT's new version, we merged the mutators provided by PIT+ into the regular version of PIT [11].
3.3.2 Dealing with equivalent mutants

Equivalent mutants are mutants that do not change the outcome of the program. Manually removing equivalent mutants is time-consuming, and detecting them is generally undecidable [35]. A commonplace solution is to mark all the mutants that are not killed by the project's test suite as equivalent. The resulting non-equivalent mutants are always detected by at least one test. The disadvantage of this approach is that many mutants might be falsely marked as equivalent. The number of false positives depends, for example, on the coverage of the tests: if the mutated code is not covered by any of the tests, the mutant will never be detected and will consequently be marked as equivalent. Another cause of false positives could be the lack of assertions in tests, i.e., not checking the correctness of the program's result. The percentage of equivalent mutants thus expresses to some extent the test effectiveness of the project's test suite.

With this approach, the complete test suite of each project will always kill all the remaining non-equivalent mutants. As the number of non-equivalent mutants heavily relies on the quality of a project's test suite, we cannot use these effectiveness scores to compare between different projects. To compensate for that, we will compare sub test suites within the same project.
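This marking step amounts to a simple filter (our illustration; the names are assumptions):

    import java.util.Set;
    import java.util.stream.Collectors;

    class EquivalenceHeuristic {
        // Keep only the mutants killed by the master test suite; all other
        // mutants are marked equivalent under the commonplace heuristic above.
        static Set<String> nonEquivalent(Set<String> allMutants,
                                         Set<String> killedByMaster) {
            return allMutants.stream()
                             .filter(killedByMaster::contains)
                             .collect(Collectors.toSet());
        }
    }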
3.3.3 Test effectiveness measure

Next, we evaluate both normalised and subsuming effectiveness in the subsections below and describe our choice of an effectiveness measure.

Normalised effectiveness. Normalised effectiveness is calculated by dividing the number of killed mutants by the number of non-equivalent mutants present in the code executed by the tests.

Consider the following example, in which there are two tests, T1 and T2, for method M1. Suppose M1 is only covered by T1 and T2. In total, five mutants Mu1..5 are generated for M1. T1 detects Mu1 and T2 detects Mu2. As T1 and T2 are the only tests covering M1, the mutants Mu3..5 remain undetected and are marked as equivalent. Both tests only cover M1 and each detects one of the two non-equivalent mutants, resulting in a normal effectiveness score of 0.5. A test suite consisting of only these two tests would detect all mutants in the covered code, resulting in a normalised effectiveness score of 1.

We notice that the normalised effectiveness score heavily relies on how mutants are marked as equivalent. Suppose the mutants marked as equivalent were valid mutants, but the tests failed to detect them (false positives), e.g., due to missing assertions. In this scenario, the (normalised) effectiveness score suggests that a bad test suite is actually very effective. Projects that have ineffective tests will only detect a small portion of the mutants. As a result, a large percentage will be marked as equivalent. This increases the chances of false positives, which decrease the reliability of the normalised effectiveness score.

Consider a project of which only a portion of the code base is thoroughly tested. There is a high probability that the equivalent mutants are not equally distributed among the code base. Code covered by poor tests is more likely to contain false positives than thoroughly tested code. The poor tests scramble the results, e.g., a test with no assertions can be incorrectly marked as very effective.

Normalised effectiveness is intended to compare the thoroughness of two test suites, i.e., to penalise test suites that cover lots of code but kill only a small number of mutants. We believe that it is less suitable as a replacement for normal effectiveness.

We consider normal effectiveness scores more reliable when studying the relation with our metrics. Normal effectiveness is positively influenced by the breadth of a test and penalises small test suites, as a score of 1.0 can only be achieved if all mutants are found. However, this is less of a problem when comparing test suites of equal sizes.

Subsuming effectiveness. Current algorithms for identifying subsuming mutants are influenced by the overlap between tests. Suppose there are five mutants, Mu1..5, for method M1. There are five tests, T1..5, that kill Mu1..4, and one test, T6, that kills all five mutants.

Ammann et al. defined subsuming mutants as follows: "one mutant subsumes a second mutant if every test that kills the first mutant is guaranteed also to kill the second" [17]. According to this definition, Mu5 subsumes Mu1..4, because the set of tests that kill Mu5 is a subset of the tests that kill Mu1..4: {T6} ⊂ {T1..6}. The tests T1..5 will have a subsuming effectiveness score of 0.

Our goal is to identify properties of test suites that determine their effectiveness. If we would measure the subsuming effectiveness, T1..5 would be significantly less effective. This would suggest that the assertion count or coverage of these tests did not contribute to the effectiveness, even though they still detected 80% of all mutants.

Another vulnerability of this approach is that it is sensitive to changes in the test set. If we remove T6, the mutants previously marked as "subsumed" are now subsuming, because Mu5 is no longer detected. Consequently, T1..5 now detect all the subsuming mutants. In this scenario, we decreased the quality of the master test suite by removing a single test, which leads to a significant increase in the subsuming effectiveness score of the tests T1..5. This can lead to strange results over time, as the addition of tests can lead to drops in the effectiveness of others.
Choice of effectiveness measure. Normalised effectiveness loses precision when large amounts of mutants are incorrectly marked as equivalent. Furthermore, normalised effectiveness is intended as a measurement of the thoroughness of a test suite, which is different from our definition of effectiveness. Subsuming effectiveness scores change when tests are added or removed, which makes the measure very sensitive to change. Furthermore, subsuming effectiveness penalises tests that do not kill a subsuming mutant.

We choose to apply normal effectiveness, as this measure is more reliable. It also allows for comparison with similar research on effectiveness and assertions/coverage [24, 45]. We also refer to test suite effectiveness as normal effectiveness.

4 Are static metrics related to test suite effectiveness?

Mutation tooling is resource expensive and requires running the test suites, i.e., dynamic analysis. To address these problems, we investigate to what extent static metrics are related to test suite effectiveness. In this section, we describe how we will measure whether static metrics are a good predictor of test suite effectiveness.

4.1 Measuring the relationship between static metrics and test effectiveness

We consider two static metrics, assertion count and static method coverage, as candidates for predicting test suite effectiveness.

4.1.1 Assertion count

We hypothesise that assertion count is related to test effectiveness. Therefore, we first measure assertion count by following the call graph from all tests. As our context is static source code analysis, we should be able to identify the tests statically. Thus, we compare the following approaches:

Static approach We use static call graph slicing (Section 3.2.3) to identify all tests of a project and measure the total assertion count for the identified tests.
Semi-dynamic approach We use Java reflection (Section 4.3) to identify all the tests and measure the total assertion count for these tests.

Finally, we inspect the type of the asserted object as input for the analysis of the relationship between assertion count and test effectiveness.

4.1.2 Static method coverage

We hypothesise that static method coverage is related to test effectiveness. To test this hypothesis, we measure the static method coverage using static call graph slicing. We include dynamic method coverage as input for our analysis to a) inspect the accuracy of the static method coverage algorithm and b) verify whether a correlation between method coverage and test suite effectiveness exists.

4.2 Case study setup

We study our selected projects using an experiment design based on work by Inozemtseva and Holmes [24]. They surveyed similar studies on the relation between test effectiveness and coverage and found that most studies implemented the following procedure: 1. Create faulty versions of one or more programs. 2. Create or generate many test suites. 3. Measure the metric scores of each suite. 4. Determine the effectiveness of each suite. We describe our approach for each step in the following subsections.

4.2.1 Generating faults

We employ mutation testing as a technique for generating faulty versions, mutants, of the different projects that will be analysed. We employ PIT as the mutation tool. Mutants are generated using the default set of mutators (see http://pitest.org/quickstart/mutators/). All mutants that are not detected by the master test suite are removed.

4.2.2 Project selection

We have chosen three projects for our analysis based on the following requirements: the projects had on the order of hundreds of thousands of LOC and thousands of tests.

Based on these criteria we selected the following projects: Checkstyle [1], JFreeChart [5] and JodaTime [6]. Table 1 shows properties of the projects. Java LOC and TLOC are generated using David A. Wheeler's SLOCCount [14].

Checkstyle is a static analysis tool that checks whether Java code and Javadoc comply with a set of coding rules, implemented in checker classes. Java and Javadoc grammars are used to generate Abstract Syntax Trees (ASTs). The checker classes visit the AST, generating messages if violations occur. The core logic is in the com.puppycrawl.tools.checkstyle.checks package, representing 71% of the project's size. Checkstyle is the only project that used continuous integration and quality reports on GitHub to enforce quality, e.g., the build that is triggered by a commit would break if coverage or effectiveness dropped below a certain threshold. We decided to use the build tooling's class exclusion filters to get more representative results. These quality measures are needed as several developers have contributed to the project; the project currently has five active team members [2].
JFreeChart is a chart library for Java. The project is split into two parts: the logic used for data and data processing, and the code focussed on the construction and drawing of plots. Most notable are the classes for the different plots in the org.jfree.chart.plot package, which contains 20% of the production code. JFreeChart is built and maintained by one developer [5].

JodaTime is a very popular date and time library. It provides functionality for calculations with dates and times in terms of periods, durations or intervals, while supporting many different date formats, calendar systems and time zones. The structure of the project is relatively flat, with only five different packages that are all at the root level. Most of the logic is related to either formatting dates or date calculation. Around 25% of the code is related to date formatting and parsing. JodaTime was created by two developers, only one of whom is maintaining the project [6].

4.2.3 Composing test suites

It has been shown that test suite size influences the relation with test effectiveness [35]. When a test is added to a test suite, it can never decrease the effectiveness, assertion count or coverage. Therefore, we will only compare test suites of equal sizes, similar to previous work [24, 45, 35].

We compose test suites of relative sizes, i.e., test suites that contain a certain percentage of all tests in the master test suite. For each size, we generate 1000 test suites. We selected the following range of relative suite sizes: 1%, 4%, 9%, 16%, 25%, 36%, 49%, 64% and 81%. Larger test suites were not included because the differences between the generated test suites would become too small. Additionally, we found that this sequence had the least overlap in effectiveness scores for the different suite sizes, while still including a wide spread of the test effectiveness across different test suites.

Our approach differs from existing research [24], which used suites of sizes 3, 10, 30, 100, 300, 1000 and 3000. A disadvantage of that approach is that the number of test suites for JodaTime would be larger than for the other projects, because JodaTime is the only project that has more than 3000 tests. Another disadvantage is that a test suite with 300 tests might be 50% of the master test suite for one project and only 10% of another project's test suite. Additionally, most composed test suites in that approach represent only a small portion of the master test suite. With our approach, we can more precisely study the behaviour of the metrics as the suites grow in size. Furthermore, we found that test suites with 16% of all tests already dynamically covered 50% to 70% of the methods covered by the master test suite.
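A minimal sketch of the suite-composition step (our illustration; names are assumptions):

    import java.util.*;

    class SuiteComposer {
        // Compose `count` random test suites, each containing `fraction`
        // (e.g., 0.01, 0.04, ..., 0.81) of the master test suite.
        static List<List<String>> compose(List<String> masterSuite,
                                          double fraction, int count, long seed) {
            Random random = new Random(seed);
            int size = (int) Math.round(masterSuite.size() * fraction);
            List<List<String>> suites = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                List<String> shuffled = new ArrayList<>(masterSuite);
                Collections.shuffle(shuffled, random);
                // random subset without replacement
                suites.add(new ArrayList<>(shuffled.subList(0, size)));
            }
            return suites;
        }
    }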




4.2.4 Measuring metric scores and effectiveness

For each test suite, we measure the effectiveness, assertion count and static method coverage. The dynamic equivalents of both coverage metrics are included to enable their comparison. We obtain the dynamic coverage metrics using JaCoCo [4].

4.2.5 Statistical analysis

To determine how we will calculate the correlation with effectiveness, we analyse related work on the relation between test effectiveness and assertion count [45] and coverage [24]. Both works have similar experimental set-ups, in which they generated sub test suites of fixed sizes and calculated metric and effectiveness scores for these suites. Furthermore, both studies used a parametric and a nonparametric correlation test, respectively Pearson and Kendall. We will also consider the Spearman rank correlation test, another nonparametric test, as it is commonly used in the literature. A parametric test assumes the underlying data to be normally distributed, whereas nonparametric tests do not.

The Pearson correlation coefficient is based on the covariance of two variables, i.e., the metric and effectiveness scores, divided by the product of their standard deviations. Assumptions for Pearson include the absence of outliers, the normality of the variables, and linearity. Kendall's Tau rank correlation coefficient is a rank-based test used to measure the extent to which the rankings of two variables are similar. Spearman is a rank-based version of the Pearson correlation test, commonly used as its computation is more lightweight than Kendall's. However, our data set leads to similar computation times for Spearman and Kendall.

We discard Pearson because we cannot make assumptions about our data distribution. Moreover, Kendall "is a better estimate of the corresponding population parameter and its standard error is known" [23]. As the advantages of Spearman over Kendall do not apply in our case and Kendall has advantages over Spearman, we choose Kendall's Tau rank correlation test. The correlation coefficient is calculated with R's "Kendall" package [13]. We use the Guilford scale (Table 2) for verbal descriptions of the correlation strength [35].
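For illustration, a naive O(n^2) computation of Kendall's tau-a over paired metric and effectiveness scores (the R "Kendall" package additionally applies tie corrections, i.e., tau-b):

    class KendallTau {
        // x: metric scores, y: effectiveness scores of the generated suites.
        static double tauA(double[] x, double[] y) {
            int n = x.length, concordant = 0, discordant = 0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double s = (x[i] - x[j]) * (y[i] - y[j]);
                    if (s > 0) concordant++;      // pair ranked the same way
                    else if (s < 0) discordant++; // pair ranked oppositely
                    // ties (s == 0) are ignored here; tau-b corrects for them
                }
            }
            return (concordant - discordant) / (0.5 * n * (n - 1));
        }
    }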
Table 1: Characteristics of the selected projects. Total Java LOC is the sum of the production LOC and TLOC.

          Property                          Checkstyle            JFreeChart           JodaTime
          Total Java LOC                        73,244               134,982             84,035
          Production LOC                        32,041                95,107             28,724
          TLOC                                  41,203                39,875             55,311
          Number of tests                        1,875                 2,138              4,197
          Method Coverage                          98%                   62%                90%
          Date cloned from GitHub              4/30/17               4/25/17            3/23/17
          Citations in literature             [43, 39]  [45, 24, 31, 26, 16]   [24, 31, 26, 39]
          Number of generated mutants           95,185               310,735            100,893
          Number of killed mutants              80,380                80,505             69,615
          Number of equivalent mutants          14,805               230,230             31,278
          Equivalent mutants (%)                 15.6%                 74.1%              31.0%

Table 2: Guilford scale for the verbal description of correlation coefficients.

          Correlation coefficient    below 0.4    0.4 to 0.7    0.7 to 0.9    above 0.9
          Verbal description         low          moderate      high          very high




 Figure 2: Overview of the experiment set-up to obtain the relevant metrics for each test.
4.3 Evaluation tool

We compose 1000 test suites of nine different sizes for each project. Running PIT+ on the master test suite took from 0.5 to 2 hours, depending on the project. As we would have to calculate the effectiveness of 27,000 test suites, this approach would take too much time. Our solution is to measure the test effectiveness of each test only once. We then combine the results of different sets of tests to simulate test suites. To get the scores for a test suite with n tests, we combine the coverage results, assertion counts and killed mutants of its tests. Similarly, we calculate the static metrics and dynamic coverage only once for each test.
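A sketch of how per-test results can be combined into suite-level scores (our illustration; names are assumptions):

    import java.util.*;

    class SuiteAggregator {
        // killedBy maps each test to the mutants it kills. Suite-level
        // results are set unions, so mutants killed by several tests are
        // not double-counted; the same applies to covered methods.
        static int killedMutants(List<String> suite,
                                 Map<String, Set<String>> killedBy) {
            Set<String> killed = new HashSet<>();
            for (String test : suite) {
                killed.addAll(killedBy.getOrDefault(test, Set.of()));
            }
            return killed.size();
        }
    }

Assertion counts, by contrast, are simply summed per suite, as described below.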
Detecting individual tests. We use a reflection library to detect both JUnit 3 and JUnit 4 tests for each project, according to the following definitions:

JUnit 3 All methods in non-abstract subclasses of JUnit's TestCase class. Each method should have a name starting with "test", be public, return void and have no parameters.
JUnit 4 All public methods annotated with JUnit's @Test annotation.

We verified the number of detected tests against the number of executed tests reported by each project's build tool.
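These definitions condense to a check like the following (our sketch using plain Java reflection; the actual implementation uses a reflection library):

    import java.lang.reflect.Method;
    import java.lang.reflect.Modifier;

    class TestDetector {
        static boolean isTestMethod(Class<?> cls, Method m) {
            if (m.isAnnotationPresent(org.junit.Test.class)) {
                return true;                                    // JUnit 4
            }
            return junit.framework.TestCase.class.isAssignableFrom(cls)
                    && !Modifier.isAbstract(cls.getModifiers()) // non-abstract subclass
                    && Modifier.isPublic(m.getModifiers())
                    && m.getName().startsWith("test")
                    && m.getReturnType() == void.class
                    && m.getParameterCount() == 0;              // JUnit 3
        }
    }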
We also need to include the set-up and tear-down logic of each test. We use JUnit's test runner API to execute individual tests; this API ensures execution of the corresponding set-up and tear-down logic. This extra test logic should also be included in the static coverage metric to get similar results. With JUnit 3, the extra logic is defined by overriding TestCase.setUp() or TestCase.tearDown(). JUnit 4 uses the @Before or @After annotations. However, the SAT does not provide information on the annotations used. A common practice is to still name these methods setUp or tearDown. We therefore include methods that are named setUp or tearDown and are located in the same class as the tests in the coverage results.

Aggregating metrics. To aggregate effectiveness, we need to know which mutants are detected by each test, as the sets of detected mutants can overlap. However, PIT does not provide a list of killed mutants. We solved this issue by creating a custom reporter using PIT's plug-in system to export the list of killed mutants.

The coverage of two tests can also overlap. Thus, we need information on the methods covered by each test. JaCoCo exports this information in a jacoco.exec report file, a binary file containing all the information required for aggregation. We aggregate these files via JaCoCo's API. For the static coverage metric, we export the list of covered methods in our analysis tool.

The assertion count of a test suite is simply calculated as the sum of each test's assertion count.

Figure 2 provides an overview of the tools involved and the data they generate. The evaluation tool's input is the raw test data and the sizes
of the test suites to create. We then compose test            count. These coefficients for each set of test suites
suites by randomly selecting a given number of                of a given project and relative size are shown in
tests from the master test suite. The output of the           Table 4. We highlight statistically significant cor-
analysis tool is a data set containing the scores on          relations that have a p-value < 0.005 with two
the dynamic and static metrics for each test suite.           asterisks (**), and results with a p-value < 0.01
5     Results

We first present the results of our analysis on the assertion count metric, followed by the results of our analysis on code coverage.
   Table 3 provides an overview of the assertion count, static and dynamic method coverage, and the percentage of mutants that were marked as equivalent for the master test suite of each project.

5.1   Assertion count
Figure 3 shows the distribution of the number of assertions for each test of each project.
   We notice some tests with exceptionally high assertion counts. We manually checked these tests and found that the assertion count was correct for the outliers. We briefly explain a few of them:
TestLocalDateTime_Properties.testPropertyRoundHour (140 asserts) checks the correctness of rounding 20 times, with 7 assertions per check on year, month, week, etc.
TestPeriodFormat.test_wordBased_pl_regEx (140 asserts) calls and asserts the results of the Polish regex parser 140 times.
TestGJChronology.testDurationFields (57 asserts) tests for each duration field whether the field names are correct and if some flags are set correctly.
CategoryPlotTest.testEquals (114 asserts) incrementally tests all variations of the equals method of a plot object. The other tests with more than 37 assertions are similar tests for the equals methods of other types of plots.
   Figure 4 shows the relation between the assertion count and normal effectiveness. Each dot represents a generated test suite; the colour of the dot represents the size of the suite relative to the total number of tests. The normal effectiveness, i.e., the percentage of mutants killed by a given test suite, is shown on the y-axis. The normalised assertion count is shown on the x-axis. We normalised the assertion count as the percentage of the total number of assertions for a given project. For example, as Checkstyle has 3,819 assertions (see Table 3), a test suite with 100 assertions would have a normalised assertion count of 100/3819 × 100 ≈ 2.6%.
   We observe that test suites of the same relative size are clustered. For each group of test suites, we calculated the Kendall correlation coefficient between normal effectiveness and assertion count. These coefficients for each set of test suites of a given project and relative size are shown in Table 4. We highlight statistically significant correlations that have a p-value < 0.005 with two asterisks (**), and results with a p-value < 0.01 with a single asterisk (*).
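   The coefficients themselves can be reproduced with an off-the-shelf implementation; as an illustration, a sketch using Apache Commons Math (our choice of library, with toy numbers rather than our data):

import org.apache.commons.math3.stat.correlation.KendallsCorrelation;

// Toy illustration: Kendall's tau between normalised assertion counts (%)
// and effectiveness scores (%) of five hypothetical test suites.
public class KendallSketch {
    public static void main(String[] args) {
        double[] assertions    = {2.6, 3.1, 4.0, 4.8, 5.5};
        double[] effectiveness = {10.0, 12.5, 11.8, 15.2, 16.0};
        double tau = new KendallsCorrelation().correlation(assertions, effectiveness);
        System.out.println("tau = " + tau);
    }
}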
   We observe a statistically significant, low to moderate correlation for nearly all groups of test suites for JFreeChart. For JodaTime and Checkstyle, we notice significant but weaker correlations: 0.08-0.2 compared to JFreeChart's 0.14-0.4.
   Table 5 shows the results of the two test identification approaches for the assertion count metric (see Section 4.1.1). False positives are methods that were incorrectly marked as tests. False negatives are tests that were not detected.
   Figure 5 shows the distribution of asserted object types. Assertions for which we could not detect the content type are categorised as unknown.

5.2   Code coverage
Figure 6 shows the relation between static method coverage and normal effectiveness. A dot represents a test suite and its colour, the relative test suite size. Table 6 shows the Kendall correlation coefficients between static coverage and normal effectiveness for each set of test suites. We highlight statistically significant correlations that have a p-value < 0.005 with two asterisks (**), and results with a p-value < 0.01 with a single asterisk (*).

5.2.1   Static vs. dynamic method coverage
To evaluate the quality of the static method coverage algorithm, we compare static coverage with its dynamic counterpart for each suite (Figure 7). A dot represents a test suite; colours represent the size of a suite relative to the total number of tests. The black diagonal line illustrates the ideal line: all test suites below this line overestimate the coverage and all test suites above it underestimate the coverage. Table 7 shows the Kendall correlations between static and dynamic method coverage for the different projects and suite sizes. Each correlation coefficient maps to a set of test suites of the corresponding suite size and project. Coefficients with one asterisk (*) have a p-value < 0.01 and coefficients with two asterisks (**) have a p-value < 0.005. We observe a statistically significant, low to moderate correlation for all sets of test suites for JFreeChart and JodaTime.

5.2.2   Dynamic coverage and test suite effectiveness
Figure 8 shows the relation between dynamic method coverage and normal effectiveness. Each dot represents a test suite; its colour represents the size of that suite relative to the total number




Table 3: Results for the master test suite of each project.
Project       Assertions   Static coverage   Dynamic coverage   Equivalent mutants
Checkstyle         3,819               85%                98%                15.6%
JFreeChart         9,030               60%                62%                74.1%
JodaTime          23,830               85%                90%                31.0%


Figure 3: Distribution of the assertion count among individual tests per project (one row per project: JodaTime, JFreeChart, Checkstyle; x-axis: number of assertions, 0-140).




Figure 4: Relation between assertion count and test suite effectiveness.

Table 4: Kendall correlations between assertion count and test suite effectiveness.
Project      Relative test suite size
             1%       4%       9%       16%      25%      36%      49%      64%      81%
Checkstyle   -0.04    0.08**   0.13**   0.18**   0.20**   0.16**   0.16**   0.12**   0.10**
JFreeChart    0.03    0.14**   0.23**   0.32**   0.34**   0.35**   0.39**   0.40**   0.36**
JodaTime      0.05    0.11**   0.13**   0.13**   0.07**   0.09**   0.07**   0.10**   0.06*

Table 5: Comparison of different approaches to identify tests for the assertion count metric.
Project      Semi-static approach        Static approach
             Number     Assertion        Number of      Assertion          False       False
             of tests   count            tests (diff)   count (diff)       positives   negatives
Checkstyle   1,875      3,819            1,821 (-54)    3,826 (+0.18%)     5           59
JFreeChart   2,138      9,030            2,172 (+34)    9,224 (+2.15%)     39          7
JodaTime     4,197      23,830           4,180 (-17)    23,943 (+0.47%)    15          32

Figure 5: The distribution of assertion content types for the analysed projects (stacked bars per project; content types: fail, boolean, string, numeric, object, unknown; x-axis: percentage of total assertion count).




Figure 6: Relation between static coverage and test suite effectiveness.

Table 6: Kendall correlations between static method coverage and test suite effectiveness.
Project      Relative test suite size
             1%       4%       9%       16%      25%      36%      49%      64%      81%
Checkstyle   -0.05    -0.01    -0.02    -0.02    0.00     -0.04    -0.01    0.00     0.01
JFreeChart   0.49**   0.28**   0.23**   0.26**   0.27**   0.28**   0.31**   0.31**   0.26**
JodaTime     0.13**   0.28**   0.32**   0.28**   0.24**   0.25**   0.23**   0.20**   0.21**




Figure 7: Relation between static and dynamic method coverage. Static coverage of test
suites below the black line is overestimated, above is underestimated.

Table 7: Kendall correlation between static and dynamic method coverage.
Project      Relative test suite size
             1%       4%       9%       16%      25%      36%      49%      64%      81%
Checkstyle   -0.03    -0.01    0.01     -0.02    0.00     0.00     0.05     0.10**   0.15**
JFreeChart   0.67**   0.33**   0.28**   0.31**   0.33**   0.35**   0.43**   0.45**   0.44**
JodaTime     0.35**   0.44**   0.48**   0.47**   0.51**   0.51**   0.52**   0.54**   0.59**
of tests. Table 8 shows the Kendall correlations between dynamic method coverage and normal effectiveness for the different groups of test suites for each project. Similarly to the other tables, two asterisks indicate that the correlation is statistically significant with a p-value < 0.005.

6    Discussion

We structure our discussion as follows: first, for each metric, we compare the results across all projects, perform an in-depth analysis on some of the projects, and then answer the corresponding research question. Next, we describe the practicality of this research and the threats to validity.

6.1   Assertions and test suite effectiveness
We observe that test suites of the same relative size form groups in the plots in Figure 4, i.e., the assertion count and effectiveness scores of same-size test suites are relatively close to each other.
   For JFreeChart, groups of test suites with a relative size >= 9% exhibit a diagonal shape. This shape is ideal as it suggests that test suites with more assertions are more effective. These groups also show the strongest correlation between assertion count and effectiveness (Table 4).




Figure 8: Relation between dynamic method coverage and test suite effectiveness.

Table 8: Kendall correlation between dynamic method coverage and test suite effectiveness.
Project      Relative test suite size
             1%       4%       9%       16%      25%      36%      49%      64%      81%
Checkstyle   0.67**   0.71**   0.68**   0.59**   0.45**   0.36**   0.33**   0.31**   0.36**
JFreeChart   0.65**   0.59**   0.52**   0.48**   0.44**   0.47**   0.47**   0.49**   0.45**
JodaTime     0.48**   0.49**   0.53**   0.51**   0.48**   0.52**   0.48**   0.47**   0.44**
   We notice that the normalised assertion count of a test suite is close to the relative suite size; e.g., suites with a relative size of 81% have a normalised assertion count between 77% and 85%. The difference between the relative suite size and the normalised assertion count is directly related to the variety in assertion count per test. More variety means that a test suite could exist with only below-average assertion counts, resulting in a normalised assertion count below 80%.
   We analyse each project to find to what extent assertion count could predict test effectiveness.

6.1.1   Checkstyle
We notice a very low, statistically significant correlation between assertion count and test suite effectiveness for most of Checkstyle's test suite groups.
   Most of Checkstyle's tests target the different checks in Checkstyle. Out of the 1875 tests, 1503 (80%) belong to a class that extends the BaseCheckTestSupport class. The BaseCheckTestSupport class contains a set of utility methods for creating a checker, executing the checker and verifying the messages generated by the checker. We notice a large variety in test suite effectiveness among the tests that extend this class. Similarly, we would expect the same variety in assertion counts. However, the assertion count is the same for at least 75% of these tests.
   We found that 1156 of these tests (62% of the master test suite) use the BaseCheckTestSupport.verify method for asserting the checker's results. The verify method iterates over the expected violation messages, which are passed as a parameter. This iteration hides the actual number of executed assertions. Consequently, we detect only two assertions for tests which might execute many assertions at runtime. In addition to the verify method, we found 60 tests that directly applied assertions inside for loops.
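   To make the imprecision concrete, the following sketch mimics the shape of such a helper (our own simplified illustration, not Checkstyle's actual code). A static call graph sees exactly two assertEquals call sites, which matches the two assertions we detect, no matter how many messages are checked at runtime.

import static org.junit.Assert.assertEquals;

// Simplified illustration of a verify-style helper with assertions in a loop.
public class VerifySketch {
    static void verify(String[] actualMessages, String... expectedMessages) {
        assertEquals(expectedMessages.length, actualMessages.length);
        for (int i = 0; i < expectedMessages.length; i++) {
            // One call site, but executed once per expected message at runtime.
            assertEquals(expectedMessages[i], actualMessages[i]);
        }
    }
}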
   Finding 1: Assertions within an iteration block skew the estimated assertion count. These iterations are a source of imprecision because the actual number of assertions could be much higher than the assertion count we measured.

   Another consequence of the high usage of verify is that these 1156 tests all have the same assertion count. Figure 3 shows similar results for the distribution of assertions for Checkstyle's tests.
   The effectiveness scores for these 1156 tests range from 0% to 11% (the highest effectiveness score of an individual test). This range shows that the group of tests with two assertions includes both the most and least effective tests. There are approximately 1200 tests for which we detect exactly two assertions. As this concerns 64% of all tests, we state there is too little variety in the assertion count to make predictions on the effectiveness.

   Finding 2: 64% of Checkstyle's tests have identical assertion counts. Variety in the assertion count is needed to distinguish between the effectiveness of different tests.

6.1.2   JFreeChart
JFreeChart is the only project exhibiting a low to moderate correlation for most groups of test suites.




   We found many strong assertions in JFreeChart's tests. By strong, we mean that two large objects, e.g., plots, are compared in an assertion. Such an assertion uses the object's equals implementation. In this equals method, around 50 lines long, many fields of the plot, such as Paint or RectangleInsets, are compared, in turn relying on their respective equals implementations. We also notice that most outliers for JFreeChart in Figure 3 are tests for the equals methods, which suggests that the equals methods contain much logic.
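   The contrast can be sketched as follows (JFreeChart class names, but an invented example rather than code from its test suite):

import static org.junit.Assert.assertEquals;
import org.jfree.chart.plot.CategoryPlot;

// Both statements count as one assertion, but the first compares entire
// plot objects through CategoryPlot.equals(), covering many properties,
// whereas the second checks a single property.
public class AssertionStrengthSketch {
    static void compare(CategoryPlot expected, CategoryPlot actual) {
        assertEquals(expected, actual);                                    // strong
        assertEquals(expected.getOrientation(), actual.getOrientation()); // weak
    }
}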
   Finding 3: Not all assertions are equally strong. Some only cover a single property, e.g., a string or a number, whereas others compare two objects, potentially covering many properties. For JFreeChart, we notice a large number of assertions that compare plot objects with many properties.

   Next, we searched for the combination of loops and assertions that could skew the results, and found no such occurrences in the tests.

6.1.3   JodaTime
The correlations between assertion count and test suite effectiveness for JodaTime are similar to those of Checkstyle, and much lower than those of JFreeChart. We further analyse JodaTime to find a possible explanation for the weak correlation.
   Assertions in for loops. We searched for test utility methods similar to the verify method of Checkstyle, i.e., methods that have assertions inside an iteration and are used by several tests. We observe that the four most effective tests, shown in Table 9, all call testForwardTransitions and/or testReverseTransitions, both utility methods of the TestBuilder class. The rank columns contain the rank relative to the other tests, to provide some context on how they compare. Ranks are calculated based on the descending order of effectiveness or assertion count. If multiple tests have the same score, we show the average rank. Note that the utility methods are different from the tests in the top 4 that share the same name. The top 4 tests are the only tests calling these utility methods. Both methods iterate over a two-dimensional array containing a set of approximately 110 date-time transitions. For each transition, 4 to 7 assertions are executed, resulting in more than 440 executed assertions.
   Additionally, we found 22 tests that combined iterations and assertions. Of these 22 tests, at least 12 contained fixed-length iterations, e.g., for(int i = 0; i < 10; i++), that could be evaluated using other forms of static analysis.
   In total, we found only 26 tests of the master test suite (0.6%) that were directly affected by assertions in for loops. Thus, for JodaTime, assertions in for loops do not explain the weak correlation between assertion count and effectiveness.
   Assertion strength. JodaTime has significantly more assertions than JFreeChart and Checkstyle. We observe many assertions on numeric values, as one might expect from a library that is mostly about calculations on dates and times. For example, we noticed many utility methods that checked the properties of Date, DateTime or Duration objects. Each of these utility methods asserts the number of years, months, weeks, days, hours, etc. This large number of numeric assertions corresponds with the observation that 47% of the assertions are on numeric types (Figure 5).
   However, the above is not always the case. For example, we found many tests, related to parsing dates or times from a string or to formatters, that only had 1 or 2 assertions while still being in the top half of most effective tests.
   We distinguish between two types of tests: a) tests related to the arithmetic aspect, with many assertions, and b) tests related to formatting, with only a few assertions. We find that assertion count does not work well as a predictor for test suite effectiveness, since the assertion count of a test does not directly relate to how effective the test is.

   Finding 4: Almost half of JodaTime's assertions are on numeric types. These assertions often occur in groups of 3 or more to assert a single result. However, a large number of effective tests only contain a small number of mostly non-numeric assertions. This mix leads to poor predictions.

6.1.4   Test identification
We measure the assertion count by following the static call graph for each test. As our context is static source code analysis, we also need to be able to identify the individual tests in the test code. We compare our static approach with a semi-static approach that uses Java reflection to identify tests.
   Table 5 shows that the assertion count obtained with the static approach is closer to the dynamic approach than the assertion count obtained through the semi-static approach.
   For all projects the assertion count of the static approach is higher. If the static algorithm does not identify tests, there are no call edges between the tests and the assertions. The absence of edges implies that these tests either have no assertions or that an edge in the call graph was missing. Such tests do not contribute to the assertion count.




Table 9: JodaTime's four most effective tests.
Test                                    Normal effectiveness   Assertions
                                        Score     Rank         Score   Rank
TestCompiler.testCompile()              17.23%    1            13      361.5
TestBuilder.testSerialization()         14.61%    2            13      361.5
TestBuilder.testForwardTransitions()    12.94%    3            7       1,063.5
TestBuilder.testReverseTransitions()    12.93%    4            4       1,773.0
   We notice that the methods that were incorrectly marked as tests, false positives, are methods used for debugging purposes or methods that were missing the @Test annotation. The latter is most noticeable for JFreeChart. We identified 39 tests that were missing the @Test annotation. Of these 39 tests, 38 executed correctly when the @Test annotation was added. According to the repository's owner, these are valid tests (https://github.com/jfree/jfreechart/issues/57).
   Based on the results of these three projects, we also show that the use of call graph slicing gives accurate results on a project level.

6.1.5   Assertion count as a predictor for test effectiveness
We found that the correlation for Checkstyle and JodaTime is weaker than for JFreeChart. Our analysis indicates that the correlation for Checkstyle is less strong because of a combination of assertions in for loops (Finding 1) and the assertion distribution (Finding 2). However, this does not explain the weak correlation for JodaTime. As shown in Figure 3, JodaTime has a much larger spread in the assertion count of each test. Furthermore, we observe that the assertion-iteration combination does not have a significant impact on the relationship with test suite effectiveness compared to Checkstyle. We notice a set of strong assertions for JFreeChart (Finding 3), whereas JodaTime has mostly weak assertions (Finding 4).

RQ 1: To what extent is assertion count a good predictor for test suite effectiveness?
   Assertion count has potential as a predictor for test suite effectiveness because assertions are directly related to the detection of mutants. However, more work on assertions is needed, as the correlation with test suite effectiveness is often weak or statistically insignificant.
   For all three projects (Table 3), we observe different assertion counts. Checkstyle and JodaTime are of similar size and quality, but Checkstyle only has 16% of the assertions JodaTime has. JFreeChart has more assertions than Checkstyle, but the production code base that should be tested is also three times bigger. A test quality model that includes the assertion count should incorporate information about the strength of the assertions, either by incorporating assertion content types, assertion coverage [45] or the size of the asserted object. Furthermore, such a model should also include information about the size of a project.
   If assertion count were used, we should measure the presence of its sources of imprecision to judge the reliability. This measurement should also include the intensity of the usage of erroneous methods. For example, we found hundreds of methods and tests with assertions in for loops. However, only a few methods that were often used had a significant impact on the results.

6.2   Coverage and effectiveness
We observe a diagonal-like shape for most groups of same-size test suites in Figure 6. This shape is ideal as it suggests that, within such a group, test suites with more static coverage are more effective. These groups also show the strongest correlation between static coverage and test suite effectiveness, as shown in Table 6.
   Furthermore, we notice a difference in the spread of the static coverage on the horizontal axis. For example, coverage for Checkstyle's test suites can be split into three groups: around 30%, 70% and 80% coverage. JFreeChart shows a relatively large spread of coverage for smaller test suites, ranging between 18% and 45% coverage, but the coverage converges as test suites grow in size. JodaTime is the only project for which there is no split in the coverage scores of same-size test suites. We consider these differences in the spread of coverage a consequence of the quality of the static coverage algorithm. These differences are further explored in Section 6.2.1. We perform an in-depth analysis on Checkstyle in Section 6.2.2 because it is the only project which exhibits neither a statistically significant correlation between static coverage and test effectiveness, nor one between static coverage and dynamic method coverage.

6.2.1   Static vs. dynamic method coverage
When comparing dynamic and static coverage in Figure 7, we notice that the degree of over- or underestimation of the coverage depends on the project and test suite size. Smaller test suites tend to overestimate, whereas larger test suites underestimate. The quality of the static




coverage for the Checkstyle project is significantly different compared to the other projects. Checkstyle is discussed in Section 6.2.2.
   Overestimating coverage. The static coverage for the smaller test suites is significantly higher than the real coverage, as measured with dynamic analysis. Suppose a method M1 has a switch statement that, based on its input, calls one of the following methods: M2, M3, M4. There are three tests, T1, T2, T3, that each call M1 with one of the three options for the switch statement in M1 as a parameter. Additionally, there is a test suite TS1 that consists of T1, T2, T3. Each test covers M1 and one of M2, M3, M4; all tests combined in TS1 cover all 4 methods. The static coverage algorithm does not evaluate the switch statement and detects for each test that 4 methods are covered. This shows that static coverage is not very accurate for individual tests. However, the static coverage for TS1 matches the dynamic coverage. This example illustrates why the loss in accuracy caused by overestimating the coverage decreases as test suites grow in size. The paths detected by the static and dynamic method coverage will eventually overlap once a test suite is created that contains all tests for a given function. The amount of overestimated coverage depends on how well the tests cover the different code paths.

   Finding 5: The degree of overestimation by the static method coverage algorithm depends on the real coverage and the amount of conditional logic and inheritance in the function under test.
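   The example can be pictured with a small sketch (our own illustration of the scenario above): a call-graph-based analysis records edges from m1 to all three callees, so each individual test appears to cover all four methods, although at runtime it covers only two of them.

// The M1/M2/M3/M4 example in code form.
public class SwitchCoverageExample {
    static void m1(int option) {
        switch (option) {
            case 0:  m2(); break;
            case 1:  m3(); break;
            default: m4(); break;
        }
    }
    static void m2() { }
    static void m3() { }
    static void m4() { }

    // Three "tests", each exercising one branch (T1, T2, T3 in the text).
    static void t1() { m1(0); }
    static void t2() { m1(1); }
    static void t3() { m1(2); }
}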
                                                                    Correlation between dynamic and static
                                                                 method coverage.              Table 4 shows, for
   Underestimating coverage. We observe that for larger test suites the coverage is often underestimated; see Figure 7. Similarly, the underestimation is also visible in the difference between static and dynamic method coverage of the different master test suites, as shown in the project results overview in Table 3.
   A method that is called through reflection or by an external library is not detected by the static coverage algorithm. Smaller test suites do not suffer from this issue, as the number of overestimated methods is often significantly larger than the number of underestimated methods.
   We observe different tipping points between overestimating and underestimating for JFreeChart and JodaTime. For JFreeChart the tipping point is visible for test suites with a relative size of 81%, whereas JodaTime reaches the tipping point at a relative size of 25%. We assume this is caused by the relatively low "real" coverage of JFreeChart. We notice that many of JFreeChart's methods that were overestimated by the static coverage algorithm are not covered.
   We illustrate the overlap between over- and underestimation with a small synthetic example. Consider a project with 100 methods and a test suite T. We divide these methods into three groups:
1. Group A, with 60 methods that are all covered by T, as measured with dynamic coverage.
2. Group B, with 20 methods that are only called through the Java Reflection API, all covered by T similarly to Group A.
3. Group C, with 20 methods that are not covered by T.
The dynamic coverage for T consists of the 80 methods in groups A and B. The static method coverage for T also consists of 80 methods. However, the coverage for Group C is overestimated, as these methods are not covered, and the coverage for Group B is underestimated, as these methods are not detected by the static coverage algorithm.
   JFreeChart has a relatively low coverage score compared to the other projects. It is likely that the parts of the code that are deemed covered by static and dynamic coverage will not overlap. However, it should be noted that low coverage does not imply that more methods are overestimated. When parts of the code base are completely uncovered, the static method coverage might also not detect any calls to the code base.

   Finding 6: The degree of underestimation by the static coverage algorithm partially depends on the number of overestimated methods, as this will compensate for the underestimated methods, and on the number of methods that were called by reflection or external libraries.

   Correlation between dynamic and static method coverage. Table 7 shows, for JFreeChart and JodaTime, statistically significant correlations that increase from a low correlation for smaller suites to a moderate correlation for larger suites. One exception is the correlation for JFreeChart's test suites with 1% relative size. We could not find an explanation for this exception.
   We expected that the tipping point between static and dynamic coverage would also be visible in the correlation table. However, this is not the case. Our rank correlation test checks whether two variables follow the same ordering, i.e., if one variable increases, the other also increases. Underestimating the coverage does not influence the correlation when the degree of underestimation is similar for all test suites. As test suites grow in size, they become more similar in terms of included tests. Consequently, the chances of test suites forming an outlier decrease as the size increases.

   Finding 7: As test suites grow, the correlation between static and dynamic method coverage increases from low to moderate.




6.2.2   Checkstyle
Figures 6 and 7 show that the static coverage results for Checkstyle's test suites are significantly different from those of JFreeChart and JodaTime. For Checkstyle, all groups of test suites with a relative size of 49% and lower are split into three subgroups that have around 30%, 70% and 80% coverage. In the following subsections, we analyse the quality of the static coverage for Checkstyle and the predictability of test suite effectiveness.
   Quality of static coverage algorithm. To analyse the static coverage algorithm for Checkstyle, we compare the static coverage with the dynamic coverage for individual tests (Figure 9a), and inspect the distribution of the static coverage among the different tests (Figure 9b).
   We regard the different groupings of test suites in the static coverage spread as a consequence of the few tests with high static method coverage.
   Checker tests. Figure 9b shows 1104 tests scoring 30% to 32.5% coverage. Furthermore, dynamic coverage only varied between 31.3% and 31.6%, and nearly all tests are located in the com.puppycrawl.tools.checkstyle.checks package. We call these tests checker tests, as they are all focussed on the checks. A small experiment in which we combined the coverage of all 1104 tests resulted in 31.8% coverage, indicating that all these checker tests almost completely overlap.
   Listing 1 shows the structure typical for checker tests: the logic is mostly located in utility methods. Once the configuration for the checker is created, verify is called with the files that will be checked and the expected messages of the checker.
@Test
public void testCorrect() throws Exception {
  final DefaultConfiguration checkConfig =
      createCheckConfig(AnnotationLocationCheck.class);
  final String[] expected = CommonUtils.EMPTY_STRING_ARRAY;
  verify(checkConfig,
      getPath("InputCorrectAnnotationLocation.java"),
      expected);
}

Listing 1: Test in AnnotationLocationCheckTest
   Finding 8: Most of Checkstyle's tests are focussed on the checker logic. Although these tests vary in effectiveness, they cover an almost identical set of methods as measured with the static coverage algorithm.
   Coverage subgroups and outliers. We notice three vertical groups for Checkstyle in Figure 7, starting around 31%, 71% and 78% static coverage and then slowly curving to the right. These groupings are a result of how test suites are composed and the coverage of the included tests.
   The coverage of the individual tests is shown in Figure 9a. We notice a few outliers at 48%, 58%, 74% and 75% coverage. We construct test suites by randomly selecting tests. A test suite's coverage is never lower than the highest coverage among its individual tests. For example, every time a test with 74% coverage is included, the test suite's coverage will jump to at least that percentage. As test suites grow in size, the chance of including a positive outlier increases. We notice that the outliers do not exactly match the coverage of the vertical groups. The second vertical group for Checkstyle in Figure 7 starts around 71% coverage. We found that if the test with 47.5% coverage, AbstractCheckTest.testVisitToken, is combined with a 30% coverage test (any of the checker tests), it results in 71% coverage. This shows that only 6.5% coverage overlaps between both tests. We observe that all test suites in the vertical group at 71% include at least one checker test and AbstractCheckTest.testVisitToken, and that they do not include any of the other outliers with more than 58%. The right-most vertical group starts at 79% coverage. This coverage is achieved by combining any of the tests with more than 50% coverage with a single checker test.
   The groupings in Checkstyle's coverage scores are a consequence of the few coverage outliers. We show that these outliers can have a significant impact on a project's coverage score. Without these few outliers, the static coverage for Checkstyle's master test suite would only be 50%.
   Test suites with low coverage. Figure 9b shows that more than half of the tests have at least 30% coverage. Similarly, Figure 7 shows that all test suites cover at least 31% of the methods. However, there are 763 tests with less than 30% coverage, and no test suites with less than 30% coverage. We explain this using probability theory. The smallest test suite for Checkstyle has a relative size of 1%, which is 19 tests. The chance of only including tests with less than 31% coverage is 763/1875 × 762/1874 × ... × 745/1857 ≈ 3 × 10^-8. These chances are negligible, even without considering that a combination of the selected tests might still lead to a coverage above 31%.
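   The product can be verified with a few lines of code (a sketch of the same computation):

// Probability that a random 19-test suite draws only tests with <31%
// coverage: sampling without replacement from 763 of 1875 tests.
public class LowCoverageChance {
    public static void main(String[] args) {
        double p = 1.0;
        for (int i = 0; i < 19; i++) {
            p *= (763.0 - i) / (1875.0 - i);
        }
        System.out.println(p);  // prints roughly 3e-8
    }
}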
   Missing coverage. We found that AbstractCheckTest.testVisitToken scores 47.5% static method coverage, although it only tests the AbstractCheck.visitToken method. Therefore any test calling the visitToken method will have at least 47.5% static method coverage.
   160 classes extend AbstractCheck, of which 123 override the visitToken method. The static method coverage algorithm includes 123 virtual calls when AbstractCheck.visitToken is




(a) Static and dynamic method coverage of individual tests. Static coverage of tests below the black line is overestimated, above is underestimated. (b) Distribution of the tests over the different levels of static method coverage.
Figure 9: Static method coverage scores for individual tests of Checkstyle.
called. The coverage of all visitToken overrides combined is 47.5%. Note that the static coverage algorithm also considers constructor calls and static blocks as covered when a method of a class is invoked. We found that only 6.5% of the total method coverage overlaps with testVisitToken.
   This small overlap between both tests suggests that visitToken is not called by any of the check tests. However, we found that the verify method indirectly calls visitToken. The call to process(File, FileText) is not matched with AbstractFileSetCheck.process(File, List). The parameter of type FileText extends AbstractList, which is part of the java.util package. During the construction of the static call graph, it was not detected that AbstractList is an implementation of the List interface, because only Checkstyle's source code was inspected. If these calls were detected, the coverage of all checker tests would increase to 71%, filling the gap between the two right-most vertical groups in the plots for Checkstyle in both Figures 6 and 7.

   Finding 9: Our static coverage algorithm fails to detect a set of calls in the tests for the substantial group of checker tests due to shortcomings in the static call graph. If these calls were correctly detected, the static coverage for test suites of the same size would be grouped more closely, possibly resulting in a more significant correlation.
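   A minimal sketch of the missed edge (types heavily simplified; our own reconstruction of the situation): because the call graph is built from Checkstyle's sources only, the analysis does not know that java.util.AbstractList implements List, so the call below is never matched to the List-typed overload.

import java.util.AbstractList;
import java.util.List;

// Simplified stand-in for Checkstyle's FileText.
class FileText extends AbstractList<String> {
    @Override public String get(int index) { return ""; }
    @Override public int size() { return 0; }
}

class CallGraphSketch {
    static void process(java.io.File file, List<String> lines) { }

    static void caller(java.io.File file, FileText text) {
        process(file, text);  // edge to process(File, List) is missed statically
    }
}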
   High reflection usage. Checkstyle applies a visitor pattern on an AST for the different code checks. The AbstractCheck class forms the basis of this visitor and is extended by 160 checker classes. These classes contain the core functionality of Checkstyle and consist of 2090 methods (63% of all methods), according to SAT. Running our static coverage algorithm on the master test suite missed calls to 328 methods. Of these methods, 248 (7.5% of all methods) are setter methods. Further inspection showed that checkers are configured using reflection, based on a configuration file with properties that match the setters of the checkers. This large group of methods missed by the static coverage algorithm partially explains the difference between static and dynamic method coverage of Checkstyle's master test suite.

   Finding 10: The large gap between static and dynamic method coverage for Checkstyle is caused by a significant number of setter methods for the checker classes that are called through reflection.

   Relation with effectiveness. Checkstyle is the only project for which there is no statistically significant correlation between static method coverage and test suite effectiveness.
   We notice a large distance, regarding invocations in the call hierarchy, between most checkers and their tests. There are 9 invocations between visitToken and the much-used verify method.
   In addition to the actual checker logic, a lot of infrastructure is included in each test: for example, instantiating the checkers and their properties based on a reflection framework, parsing the files and creating an AST, traversing the AST, and collecting and converting all messages of the checkers.
   These characteristics seem to match those of integration tests. Zaidman et al. studied the evolution of the Checkstyle project and arrived at similar findings: "Moreover, there is a thin line between unit tests and integration tests. The Checkstyle developers see their tests more as I/O integration tests, yet associate individual test cases with a single production class by name" [43].
   Directness. We implemented the directness measure to inspect whether it would reflect the




Directness. We implemented the directness measure to inspect whether it would reflect the presence of mostly integration-like tests. The directness is based on the percentage of methods that are directly called from a test. The master test suites of Checkstyle, JFreeChart and JodaTime cover respectively 30%, 26% and 61% of all methods directly. Since Checkstyle's overall static coverage is significantly higher than that of JFreeChart, Checkstyle covers, relative to its total coverage, the smallest portion of methods directly from tests. Given that unit tests should be focused on small functional units, we expected a relatively high directness measure for the test suites.

   Finding 11: Many of Checkstyle's tests are integration-like tests that have a large distance between the test and the logic under test. Consequently, only a small portion of the code is covered directly.

To make matters worse, the integration-like tests were mixed with actual unit tests. We argue that integration tests have different properties compared to unit tests: they often cover more code and have fewer assertions, but the assertions have a higher impact, e.g., comparing all the reported messages. These differences can lead to a skew in the effectiveness results.

6.2.3   Dynamic method coverage and effectiveness

We observe in Figure 8 that, within groups of test suites of the same size, test suites with more dynamic coverage are also more effective. Similarly, we observe a moderate correlation between dynamic method coverage and normal effectiveness for all three projects in Table 8.

When comparing test suite effectiveness with static method coverage, we observe a low to moderate correlation for JFreeChart and JodaTime when accounting for size in Table 6, but no statistically significant correlation for Checkstyle. Similarly, only the Checkstyle project does not show a statistically significant correlation between static and dynamic method coverage, as shown in Table 7. We believe this is a consequence of the integration-like test characteristics of the Checkstyle project. Due to the large distance between tests and code, and the abstractions used in between, the static coverage is not very accurate.

The moderate correlation between dynamic method coverage and effectiveness suggests there is a relation between method coverage and normal effectiveness. However, the static method coverage does not show a statistically significant correlation with normal effectiveness for Checkstyle. We conclude that our static method coverage metric is not accurate enough for the Checkstyle project.

6.2.4   Method coverage as a predictor for test suite effectiveness

We found a statistically significant, low correlation between test suite effectiveness and static method coverage for JFreeChart and JodaTime. We evaluated the static coverage algorithm and found that smaller test suites typically overestimate the coverage (Finding 5), whereas for larger test suites the coverage is often underestimated (Finding 6). The tipping point depends on the real coverage of the project. We also found that static coverage correlates better with dynamic coverage as test suites increase in size (Finding 7).

An exception to these observations is Checkstyle, the only project without a statistically significant correlation between static method coverage and both test suite effectiveness and dynamic method coverage. Most of Checkstyle's tests have nearly identical coverage results (Finding 8), although their effectiveness varies. The SAT could calculate static code coverage; however, it is less suitable for more complex projects. The large distance, in terms of call hierarchy, between tests and tested functionality in the Checkstyle project (Finding 11) led to skewed results, as some of the most used calls were not resolved (Finding 9). This can be partially mitigated by improving the call resolving.

We consider the inaccurate results of the static coverage algorithm a consequence of the quality of the call graph and the frequent use of Java reflection (Finding 10). Furthermore, the unit tests for Checkstyle show similarities with integration tests.

   RQ 2: To what extent is static coverage a good predictor for test suite effectiveness?

First, we found a moderate to high correlation between dynamic method coverage and effectiveness for all analysed projects, which suggests that method coverage is a suitable indicator. The projects that showed a statistically significant correlation between static and dynamic method coverage also showed a significant correlation between static method coverage and test suite effectiveness. Although the correlation between test suite effectiveness and static coverage was not statistically significant for Checkstyle, the coverage score on project level provided a relatively good indication of the project's real coverage. Based on these observations we consider coverage suitable as a predictor for test effectiveness.
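To make the two measures concrete, the sketch below computes static method coverage and the directness measure discussed above from a call graph. The Map-based graph representation and all names are our own simplification for illustration; they do not mirror the SAT's actual interface.

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch: static method coverage as call-graph reachability.
// callGraph maps each method to the methods it statically calls.
final class StaticCoverage {

    // Fraction of production methods reachable from any test method.
    static double methodCoverage(Map<String, Set<String>> callGraph,
                                 Set<String> testMethods,
                                 Set<String> productionMethods) {
        Set<String> reached = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(testMethods);
        while (!work.isEmpty()) {
            String method = work.pop();
            for (String callee : callGraph.getOrDefault(method,
                    Collections.emptySet())) {
                if (reached.add(callee)) {
                    work.push(callee);
                }
            }
        }
        reached.retainAll(productionMethods);
        return (double) reached.size() / productionMethods.size();
    }

    // Directness: only methods invoked from a test body itself count,
    // i.e. reachability is cut off after one call edge.
    static double directness(Map<String, Set<String>> callGraph,
                             Set<String> testMethods,
                             Set<String> productionMethods) {
        Set<String> direct = new HashSet<>();
        for (String test : testMethods) {
            direct.addAll(callGraph.getOrDefault(test,
                    Collections.emptySet()));
        }
        direct.retainAll(productionMethods);
        return (double) direct.size() / productionMethods.size();
    }
}

Note that any call the analysis fails to resolve, e.g., a reflective call, simply has no edge in callGraph, which is one way the static variant comes to underestimate the real coverage.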
6.3   Practicality

A test quality model based on the current state of the metrics would not be sufficiently accurate. Although there is evidence of a correlation between assertion count and effectiveness, the assertion count of each project's master test suite did not map to the relative effectiveness of each project. Each of the analysed projects had, on average, a different number of assertions per test. Further improvements to the assertion count metric, e.g., including the strength of the assertions, are needed to get more usable results.

The static method coverage could be used to evaluate effectiveness to a certain extent. We found a low to moderate correlation between effectiveness and static method coverage for two of the projects. Furthermore, we found a similar correlation between static and dynamic method coverage. The quality of the static call graph should be improved to better estimate the real coverage.

We did not investigate the quality of these metrics for other programming languages. However, the SAT supports call graph analysis and identifying assertions for a large range of programming languages, facilitating future experiments.

We encountered scenarios for which the static metrics gave imprecise results. If these sources of imprecision were translated into metrics, they could indicate the quality of the static metrics. An indication of low quality could suggest that more manual inspection is needed.

6.4   Internal threats to validity

Static call graph. We use the static call graph constructed by the SAT for both metrics. We found several occurrences where the SAT did not correctly resolve the call graph. We fixed some of the issues encountered during our analysis. However, as we did not manually analyse all the calls, this remains a threat to validity.

Equivalent mutants. We treated all mutants that were not detected by the master test suite as equivalent mutants, an approach often used in the literature [35, 24, 45]. There is a high probability that this resulted in overestimating the number of equivalent mutants, especially for JFreeChart, where a large part of the code is simply not tested. In principle, this is not a problem, as we only compare the effectiveness of sub test suites. However, our statement on the ordering of the master test suites' effectiveness is vulnerable to this threat, as we did not manually inspect each mutant for equivalence.

Accuracy of analysis. We manually inspected large parts of the Java code of each project. Most of the inspections were done by a single person with four years of experience in Java. Also, we did not inspect all the tests. Most tests were selected on a statistics-driven basis, i.e., we looked at tests that showed high effectiveness but low coverage, or tests with a large difference between static and dynamic coverage. To mitigate this, we also verified randomly selected tests. However, the chance of missing relevant sources of imprecision remains a threat to validity.

6.5   External threats to validity

We study three open source Java projects. Our results are not generalisable to projects using other programming languages. Also, we only included assertions provided by JUnit. Although JUnit is the most popular testing library for Java, there are other testing libraries, possibly using different assertions [44]. We also ignored mocking libraries in our analysis. Mocking libraries provide a form of assertions based on the behaviour of the units under test. These assertions are ignored by our analysis, even though they can lead to an increase in effectiveness.

6.6   Reliability

Tengeri et al. compared different instrumentation techniques and found that JaCoCo produces inaccurate results, especially when mapped back to source code [39]. The main problem was that JaCoCo did not include coverage between two different sub-modules in a Maven project. For example, a call from sub-module A to sub-module B is not registered by JaCoCo because JaCoCo only analyses coverage on a module level. As the projects analysed in this paper do not contain sub-modules, this JaCoCo issue does not apply to our work.

7     Related work

We group related work as follows: test quality models, standalone test metrics, code coverage and effectiveness, and assertions and effectiveness.

7.1   Test quality models

We compare the TQM [18] we used, as described in Section 2.2, with two other test quality models. We first describe the other models, followed by a motivation for our choice of model.

STREW. Nagappan introduced the Software Testing and Reliability Early Warning (STREW) metric suite to provide “an estimate of post-release field quality early in software development phases [34].” The STREW metric suite consists of nine static source and test code metrics, divided into three categories: test quantification, complexity and OO-metrics, and size adjustment. The test quantification metrics are the following: 1. the number of assertions per line of production code; 2. the number of tests per line of production code; 3. the number of assertions per test; 4. the ratio between lines of test code and production code, divided by the ratio of test and production classes.
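The four test quantification ratios are straightforward to compute once the underlying counts are available; the sketch below restates them as code. The ProjectCounts type and its field names are our own, introduced only for illustration.

// Sketch of STREW's four test quantification ratios [34]. The input
// counts and their names are our own simplification.
record ProjectCounts(int assertions, int tests,
                     int testLoc, int productionLoc,
                     int testClasses, int productionClasses) {

    // 1. Number of assertions per line of production code.
    double assertionsPerProductionLine() {
        return (double) assertions / productionLoc;
    }

    // 2. Number of tests per line of production code.
    double testsPerProductionLine() {
        return (double) tests / productionLoc;
    }

    // 3. Number of assertions per test.
    double assertionsPerTest() {
        return (double) assertions / tests;
    }

    // 4. Ratio of test to production LOC, divided by the ratio of
    //    test to production classes.
    double relativeTestSize() {
        return ((double) testLoc / productionLoc)
                / ((double) testClasses / productionClasses);
    }
}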
TAIME. Tengeri et al. introduced a systematic approach for test suite assessment with a focus on code coverage [38]. Their approach, the Test Suite Assessment and Improvement Method (TAIME), is intended to find improvement points and guide the improvement process. In this iterative process, first, both the test code and production code are split into functional groups and paired together. The second step is to determine the granularity of the measures, starting with coarse metrics on procedure level and repeating on statement level in later iterations. Based on these functional groups they define the following set of metrics:

Code coverage: calculated on both procedure and statement level.
Partition metric: “The Partition Metric (PART) characterizes how well a set of test cases can differentiate between the program elements based on their coverage information [38]”.
Tests per Program: how many tests have been created on average for a functional group.
Specialisation: how many tests for a functional group are in the corresponding test group.
Uniqueness: what portion of covered functionality is covered only by a particular test group.

STREW, TAIME and TQM are models for assessing aspects of test quality. STREW and TQM are both based on static source code analysis. However, STREW lacks coverage-related metrics compared to TQM. TAIME differs from the other two models as it does not depend on a specific programming language or xUnit framework. Furthermore, TAIME is more an approach than a simple metric model. It is an iterative process that requires user input to identify functional groups. The required user input makes it less suitable for automated analysis or large-scale studies.

7.2   Standalone test metrics

Bekerom investigated the relation between test smells and test bugs [41]. He built a tool using the SAT to detect a set of test smells: Eager Test, Lazy Test, Assertion Roulette, Sensitive Equality and Conditional Test Logic. He showed that classes affected by test bugs score higher on the presence of test smells. Additionally, he predicted which classes have test bugs based on the Eager Test smell, with a precision of 7%, which was better than random. However, the recall was very low, which led to the conclusion that smells are not yet usable to predict test bugs.

Ramler et al. implemented 42 new rules for the static analysis tool PMD to evaluate JUnit code [37]. They defined four key problem areas that should be analysed: usage of the xUnit test framework, implementation of the unit test, maintainability of the test suite, and testability of the SUT. The rules were applied to the JFreeChart project and resulted in 982 violations, of which one-third were deemed to be a symptom of problems in the underlying code.

7.3   Code coverage and effectiveness

Namin et al. studied how coverage and size independently influence effectiveness [35]. Their experiment used seven Siemens suite programs which varied between 137 and 513 LOC and had between 1000 and 5000 test cases. Four types of code coverage were measured: block, decision, C-Use and P-Use. Size was defined as the number of tests, and effectiveness was measured using mutation testing. Test suites of fixed sizes and different coverage levels were randomly generated to measure the correlation between coverage and effectiveness. They showed that both coverage and size independently influence test suite effectiveness.

Another study on the relation between test effectiveness and code coverage was performed by Inozemtseva and Holmes [24]. They conducted an experiment on a set of five large open source Java projects and accounted for the size of the different test suites. Additionally, they introduced a novel effectiveness metric, normalized effectiveness. They found moderate correlations between coverage and effectiveness when size was accounted for. However, the correlation was low for normalized effectiveness.

The main difference with our work is that we used static source code analysis to calculate method coverage. Our experiment set-up is similar to that of Inozemtseva and Holmes, except that we chose a different set of data points, which we showed to be more representative.

7.4   Assertions and effectiveness

Kudrjavets et al. investigated the relation between assertions and fault density [28]. They measured the assertion density, i.e., the number of assertions per thousand lines of code, for two components of Microsoft Visual Studio written in C and C++. Additionally, real faults were taken from an internal bug database and converted to fault density. Their results showed a negative relation between assertion density and fault density, i.e., code with a higher assertion density has a lower fault density. Instead of assertion density we focussed on the assertion count of Java projects and used artificial faults, i.e., mutants.

Zhang and Mesbah [45] investigated the relationship between assertions and test suite effectiveness. They found that, even when test suite size was controlled for, there was a strong correlation between assertion count and test effectiveness. Our results overlap with their work, as we both found a correlation between assertion count and effectiveness for the JFreeChart project. However, we showed that this correlation is not always present, as both Checkstyle and JodaTime showed different results.
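The distinction between assertion count and assertion density is small but easy to get wrong, so we sketch both below. The textual scan for JUnit-style assert calls is a deliberate simplification standing in for the SAT's AST-based detection; the class and method names are ours.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Simplified illustration of assertion count vs. assertion density.
// A textual scan for JUnit-style assert* calls stands in for the
// AST-based detection used in the paper.
final class AssertionMetrics {

    private static final Pattern ASSERT_CALL =
            Pattern.compile("\\bassert[A-Z]\\w*\\s*\\(");

    // Counts lines in the test sources containing at least one
    // assert* call (a single-assertion-per-line simplification).
    static long assertionCount(Path testSources) throws IOException {
        try (Stream<Path> files = Files.walk(testSources)) {
            return files.filter(p -> p.toString().endsWith(".java"))
                        .flatMap(AssertionMetrics::lines)
                        .filter(l -> ASSERT_CALL.matcher(l).find())
                        .count();
        }
    }

    // Assertions per thousand lines of code, as in Kudrjavets et al. [28].
    static double assertionDensity(long assertions, long linesOfCode) {
        return 1000.0 * assertions / linesOfCode;
    }

    private static Stream<String> lines(Path file) {
        try {
            return Files.lines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

A textual scan of this kind misses assertions hidden behind helper methods and counts at most one assertion per line; an AST-based analysis does not share these limitations.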
8     Conclusion

We analysed the relation between test suite effectiveness and two metrics, assertion count and static method coverage, for three large Java projects: Checkstyle, JFreeChart and JodaTime. Both metrics were measured using static source code analysis. We found a low correlation between test suite effectiveness and static method coverage for JFreeChart and JodaTime, and a low to moderate correlation with assertion count for JFreeChart. We found that the strength of the correlation depends on the characteristics of the project. The absence of a correlation does not imply that the metrics are not useful for a TQM.

Our current implementation of the assertion count metric only shows promising results when predicting test suite effectiveness for JFreeChart. We found that simply counting the assertions for each project gives results that do not align with the relative effectiveness of the projects. The project with the most effective master test suite had a significantly lower assertion count than the other projects. Even for the sub test suites of most projects, the assertion count did not correlate with test effectiveness. Incorporating the strength of an assertion could lead to better predictions.

Static method coverage is a good candidate for predicting test suite effectiveness. We found a statistically significant, low correlation between static method coverage and test suite effectiveness for most analysed projects. Furthermore, the coverage algorithm is consistent in its predictions on a project level, i.e., the ordering of the projects based on the coverage matched their relative ranking in terms of test effectiveness.

8.1   Future work

Static coverage. Landman et al. investigated the challenges for static analysis of Java reflection [30]. They showed that it is at least possible to identify and measure hard-to-resolve uses of reflection. Measuring reflection usage could give an indication of the degree to which coverage is underestimated. Similarly, we would like to investigate whether we can give an indication of the degree of overestimation for a project.

Assertion count. We would like to investigate further whether we can measure the strength of an assertion. Zhang and Mesbah included assertion coverage and measured the effectiveness of different assertion types [45]. We would like to incorporate this knowledge into the assertion count. This could result in an assertion count that is more comparable across projects.

Deursen et al. described a set of test smells, including the eager test: a test that verifies too much functionality of the tested function [42]. We found a large number of tests in the JodaTime project that called the function under test several times. For example, JodaTime's test_wordBased_pl_regEx test checks 140 times whether periods are formatted correctly in Polish. These eager tests should be split into separate cases that test the specific scenarios.
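One way to perform such a split is sketched below, using JUnit 4's Parameterized runner against Joda-Time's word-based period formatter. The three cases and their expected strings are invented placeholders; we did not take them from JodaTime's actual test data.

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.Locale;

import org.joda.time.Period;
import org.joda.time.format.PeriodFormat;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

// Sketch: splitting one eager test into parameterized cases, so each
// scenario fails (and is reported) independently. The expected strings
// are hypothetical placeholders, not JodaTime's actual test data.
@RunWith(Parameterized.class)
public class WordBasedPolishFormatTest {

    @Parameterized.Parameters(name = "{0} is formatted as {1}")
    public static Iterable<Object[]> cases() {
        return Arrays.asList(new Object[][] {
            { Period.years(1), "1 rok" },   // hypothetical expectation
            { Period.years(2), "2 lata" },  // hypothetical expectation
            { Period.years(5), "5 lat" },   // hypothetical expectation
        });
    }

    @Parameterized.Parameter(0)
    public Period period;

    @Parameterized.Parameter(1)
    public String expected;

    @Test
    public void formatsPeriodInPolish() {
        assertEquals(expected,
                PeriodFormat.wordBased(new Locale("pl")).print(period));
    }
}

Each case then appears as a separate test result, so a regression in one plural form no longer hides behind 139 passing checks inside the same method.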
8.2   Acknowledgements

We would like to thank Prof. Serge Demeyer for his elaborate and insightful feedback on our paper.

References

 [1] Checkstyle. https://github.com/checkstyle/checkstyle. Accessed: 2017-07-15.
 [2] Checkstyle team. http://checkstyle.sourceforge.net/team-list.html. Accessed: 2017-11-19.
 [3] CodeCover. http://codecover.org/. Accessed: 2017-07-15.
 [4] JaCoCo. http://www.jacoco.org/. Accessed: 2017-07-15.
 [5] JFreeChart. https://github.com/jfree/jfreechart. Accessed: 2017-07-15.
 [6] JodaTime. https://github.com/jodaorg/joda-time. Accessed: 2017-07-15.
 [7] JUnit. http://junit.org/. Accessed: 2017-07-15.
 [8] MAJOR mutation tool. http://mutation-testing.org/. Accessed: 2017-07-15.
 [9] muJava mutation tool. https://cs.gmu.edu/~offutt/mujava/. Accessed: 2017-07-15.
[10] PIT+. https://github.com/LaurentTho3/ExtendedPitest. Accessed: 2017-07-15.
[11] PIT fork. https://github.com/pacbeckh/pitest. Accessed: 2017-07-15.
[12] PIT mutation tool. http://pitest.org/. Accessed: 2017-07-15.
[13] R's Kendall package. https://cran.r-project.org/web/packages/Kendall/Kendall.pdf. Accessed: 2017-07-15.
[14] SLOCCount. https://www.dwheeler.com/sloccount/. Accessed: 2017-07-15.
[15] TIOBE Index. https://www.tiobe.com/tiobe-index/. Accessed: 2017-07-15.
[16] Tiago L. Alves and Joost Visser. Static estimation of test coverage. In Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2009, Edmonton, Alberta, Canada, September 20-21, 2009, pages 55-64, 2009.
[17] Paul Ammann, Márcio Eduardo Delamaro, and Jeff Offutt. Establishing theoretical minimal sets of mutants. In Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014, March 31 - April 4, 2014, Cleveland, Ohio, USA, pages 21-30, 2014.
[18] Dimitrios Athanasiou, Ariadi Nugroho, Joost Visser, and Andy Zaidman. Test code quality and its relation to issue handling performance. IEEE Trans. Software Eng., 40(11):1100-1125, 2014.
[19] Kent Beck and Erich Gamma. Test infected: Programmers love writing tests. Java Report, 3(7):37-50, 1998.
[20] Antonia Bertolino. Software testing research: Achievements, challenges, dreams. In International Conference on Software Engineering, ICSE 2007, Workshop on the Future of Software Engineering, FOSE 2007, May 23-25, 2007, Minneapolis, MN, USA, pages 85-103, 2007.
[21] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A practical model for measuring maintainability. In Quality of Information and Communications Technology, 6th International Conference on the Quality of Information and Communications Technology, QUATIC 2007, Lisbon, Portugal, September 12-14, 2007, Proceedings, pages 30-39, 2007.
[22] Ferenc Horváth, Bela Vancsics, László Vidács, Árpád Beszédes, Dávid Tengeri, Tamás Gergely, and Tibor Gyimóthy. Test suite evaluation using code coverage based metrics. In Proceedings of the 14th Symposium on Programming Languages and Software Tools (SPLST'15), Tampere, Finland, October 9-10, 2015, pages 46-60, 2015.
[23] David C. Howell. Statistical methods for psychology. Cengage Learning, 2012.
[24] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In 36th International Conference on Software Engineering, ICSE '14, Hyderabad, India, May 31 - June 07, 2014, pages 435-445, 2014.
[25] Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Trans. Software Eng., 37(5):649-678, 2011.
[26] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16-22, 2014, pages 654-665, 2014.
[27] Marinos Kintis, Mike Papadakis, Andreas Papadopoulos, Evangelos Valvis, and Nicos Malevris. Analysing and comparing the effectiveness of mutation testing tools: A manual study. In 16th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2016, Raleigh, NC, USA, October 2-3, 2016, pages 147-156, 2016.
[28] Gunnar Kudrjavets, Nachiappan Nagappan, and Thomas Ball. Assessing the relationship between software assertions and faults: An empirical investigation. In 17th International Symposium on Software Reliability Engineering (ISSRE 2006), 7-10 November 2006, Raleigh, North Carolina, USA, pages 204-212, 2006.
[29] Tobias Kuipers and Joost Visser. A tool-based methodology for software portfolio monitoring. In Software Audit and Metrics, Proceedings of the 1st International Workshop on Software Audit and Metrics, SAM 2004, in conjunction with ICEIS 2004, Porto, Portugal, April 2004, pages 118-128, 2004.
[30] Davy Landman, Alexander Serebrenik, and Jurgen J. Vinju. Challenges for static analysis of Java reflection: Literature review and empirical study. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, pages 507-518, 2017.
[31] Thomas Laurent, Mike Papadakis, Marinos Kintis, Christopher Henard, Yves Le Traon, and Anthony Ventresque. Assessing and improving the mutation testing practice of PIT. In 2017 IEEE International Conference on Software Testing, Verification and Validation, ICST 2017, Tokyo, Japan, March 13-17, 2017, pages 430-435, 2017.
[32] András Márki and Birgitta Lindström. Mutation tools for Java. In Proceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Morocco, April 3-7, 2017, pages 1364-1415, 2017.
[33] Thomas J. McCabe. A complexity measure. IEEE Trans. Software Eng., 2(4):308-320, 1976.
[34] Nachiappan Nagappan. A Software Testing and Reliability Early Warning (STREW) Metric Suite. PhD thesis, North Carolina State University, 2005.
[35] Akbar Siami Namin and James H. Andrews. The influence of size and coverage on test suite effectiveness. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ISSTA 2009, Chicago, IL, USA, July 19-23, 2009, pages 57-68, 2009.
[36] Mike Papadakis, Christopher Henard, Mark Harman, Yue Jia, and Yves Le Traon. Threats to the validity of mutation-based test assessment. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Germany, July 18-20, 2016, pages 354-365, 2016.
[37] Rudolf Ramler, Michael Moser, and Josef Pichler. Automated static analysis of unit test code. In First International Workshop on Validating Software Tests, VST@SANER 2016, Osaka, Japan, March 15, 2016, pages 25-28, 2016.
[38] Dávid Tengeri, Árpád Beszédes, Tamás Gergely, László Vidács, David Havas, and Tibor Gyimóthy. Beyond code coverage - an approach for test suite assessment and improvement. In Eighth IEEE International Conference on Software Testing, Verification and Validation, ICST 2015 Workshops, Graz, Austria, April 13-17, 2015, pages 1-7, 2015.
[39] Dávid Tengeri, Ferenc Horváth, Árpád Beszédes, Tamás Gergely, and Tibor Gyimóthy. Negative effects of bytecode instrumentation on Java source code coverage. In IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016, Volume 1, pages 225-235, 2016.
[40] Paco van Beckhoven. Assessing test suite effectiveness using static analysis. Master's thesis, University of Amsterdam, 2017.
[41] Kevin van den Bekerom. Detecting test bugs using static analysis tools. Master's thesis, University of Amsterdam, 2016.
[42] Arie van Deursen, Leon Moonen, Alex van den Bergh, and Gerard Kok. Refactoring test code. In Proceedings of the 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2001), pages 92-95, 2001.
[43] Andy Zaidman, Bart Van Rompaey, Serge Demeyer, and Arie van Deursen. Mining software repositories to study co-evolution of production & test code. In First International Conference on Software Testing, Verification, and Validation, ICST 2008, Lillehammer, Norway, April 9-11, 2008, pages 220-229, 2008.
[44] Ahmed Zerouali and Tom Mens. Analyzing the evolution of testing library usage in open source Java projects. In IEEE 24th International Conference on Software Analysis, Evolution and Reengineering, SANER 2017, Klagenfurt, Austria, February 20-24, 2017, pages 417-421, 2017.
[45] Yucheng Zhang and Ali Mesbah. Assertions are strongly correlated with test suite effectiveness. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015, pages 214-224, 2015.
[46] Hong Zhu, Patrick A. V. Hall, and John H. R. May. Software unit test coverage and adequacy. ACM Comput. Surv., 29(4):366-427, 1997.