Assessing Test Suite Effectiveness Using Static Metrics

Paco van Beckhoven1,2, Ana Oprescu1, and Magiel Bruntink2
1 University of Amsterdam
2 Software Improvement Group

Copyright © by the paper's authors. Copying permitted for private and academic purposes. Proceedings of the Seminar Series on Advanced Techniques and Tools for Software Evolution, SATToSE 2017 (sattose.org), 07-09 June 2017, Madrid, Spain.

Abstract

With the increasing amount of automated tests, we need ways to measure test effectiveness. The state-of-the-art technique for assessing test effectiveness, mutation testing, is too slow and cumbersome to be used in large-scale evolution studies or code audits by external companies. In this paper we investigate two alternatives, namely code coverage and assertion count. We discovered that code coverage outperforms assertion count by showing a relation with test suite effectiveness for all analysed projects. Assertion count only displays such a relation in one of the analysed projects. Further analysing this relationship between assertion count, coverage and test effectiveness would allow us to circumvent some of the problems of mutation testing.

1 Introduction

Software testing is an important part of the software engineering process. It is widely used in industry for quality assurance, as tests can tackle software bugs early in the development process and also serve regression purposes [20]. Part of the software testing process is covered by developers writing automated tests such as unit tests. This process is supported by testing frameworks such as JUnit [19]. Monitoring the quality of the test code has been shown to provide valuable insight when maintaining high quality-assurance standards [18]. Previous research shows that as the size of production code grows, the size of test code grows along with it [43]. Quality control on test suites is therefore important, as the maintenance of tests can be difficult and generate risks if done incorrectly [22]. Typically, such risks are related to growing size and complexity, which consequently lead to incomprehensible tests. An important risk is the occurrence of test bugs, i.e., tests that fail although the program is correct (false positives) or, even worse, tests that do not fail when the program is not working as desired (false negatives). Especially the latter is a problem when breaking changes are not detected by the test suite. This issue can be addressed by measuring the fault-detecting capability of a test suite, i.e., test suite effectiveness. Test suite effectiveness is measured by the number of faulty versions of a System Under Test (SUT) that are detected by a test suite. However, as real faults are unknown in advance, mutation testing is applied as a proxy measurement. It has been shown that mutant detection correlates with real fault detection [26].

Mutation testing tools generate faulty versions of the program and then run the tests to determine whether the fault was detected. These faults, called mutants, are created by so-called mutators which mutate specific statements in the source code. Each mutant represents a very small change, to prevent changing the overall functionality of the program. Some examples of mutators are: replacing operands or operators in an expression, removing statements, or changing returned values. A mutant is killed if it is detected by the test suite, either because the program fails to execute (due to exceptions) or because the results are not as expected. If a large set of mutants survives, it might be an indication that the test quality is insufficient, as programming errors may remain undetected.
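To make the idea concrete, the sketch below shows a hypothetical production method, a mutant that a boundary-style mutator could produce from it, and a JUnit test that kills that mutant. The class and method names are our own illustration and are not taken from the tools or subject systems discussed in this paper.

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    // Hypothetical example of a mutant and a killing test.
    public class DiscountTest {

        // Original production method.
        static boolean eligible(int age) {
            return age >= 65;
        }

        // What a "conditionals boundary" style mutator could produce: >= becomes >.
        static boolean eligibleMutant(int age) {
            return age > 65;
        }

        // This test kills the mutant: run against the mutated method, the boundary
        // value 65 yields false and the assertion fails.
        @Test
        public void boundaryValueIsEligible() {
            assertTrue(eligible(65));
        }

        // A test that only exercises non-boundary values would leave the mutant alive.
        @Test
        public void youngCustomerIsNotEligible() {
            assertFalse(eligible(30));
        }
    }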
1.1 Problem statement

Mutation analysis is used to measure the test suite effectiveness of a project [26]. However, mutation testing techniques have several drawbacks, such as limited availability across programming languages and being resource expensive [46, 25]. Furthermore, mutation testing often requires compilation of the source code and running the tests, which often depend on other systems that might not be available, rendering it impractical for external analysis. External analysis is often applied in industry by companies such as the Software Improvement Group (SIG) to advise companies on the quality of their software. All these issues are compounded when performing software evolution analysis on large-scale legacy or open source projects. Therefore our research goal has both industry and research relevance.

1.2 Research questions and method

To tackle these issues, our goal is to understand to what extent metrics obtained through static source code analysis relate to test suite effectiveness as measured with mutation testing. Preliminary research [40] on static test metrics highlighted two promising candidates: assertion count and static coverage. We structure our analysis around the following research questions:

RQ 1 To what extent is assertion count a good predictor for test suite effectiveness?

RQ 2 To what extent is static coverage a good predictor for test suite effectiveness?

We select our test suite effectiveness metric and mutation tool based on state-of-the-art literature. Next, we study existing test quality models to inspect which static metrics can be related to test suite effectiveness. Based on these results we implement a set of metrics using only static analysis. To answer the research questions, we implement a simple tool that reads a project's source files and calculates the metric scores using static analysis. Finally, we evaluate the individual metrics' suitability as indicators for effectiveness by performing a case study using our tool on three projects: Checkstyle, JFreeChart and JodaTime. The projects were selected from related research, based on the size and structure of their respective test suites. We focus on Java projects as Java is one of the most popular programming languages [15] and forms the subject of many recent research papers on test effectiveness. We rely on JUnit [7] as the unit testing framework; JUnit is the most used unit testing framework for Java [44].

1.3 Contributions

In an effort to tackle the drawbacks of using mutation testing to measure test suite effectiveness, our research makes the following contributions: 1. An in-depth analysis of the relation between test effectiveness, assertion count and coverage as measured using static metrics for three large real-world projects. 2. A set of scenarios which influence the results of the static metrics and their sources of imprecision. 3. A tool to measure static coverage and assertion count using only static analysis.

Outline. Section 2 revisits background concepts. Section 3 introduces the design of the static metrics that will be investigated, together with an effectiveness metric and a mutation tool. Section 4 describes the empirical method of our research. Results are shown in Section 5 and discussed in Section 6. Section 7 summarises related work and Section 8 presents the conclusion and future work.
2 Background

First, we introduce some basic terminology. Next, we describe a test quality model used as input for the design of our static metrics. We briefly introduce mutation testing and compare mutation tools. Finally, we summarise test effectiveness measures and describe work on mutation analysis.

2.1 Terminology

We define several terms used in this paper:

Test (case/method) An individual JUnit test.
Test suite A set of tests.
Test suite size The number of tests in a test suite.
Master test suite All tests of a given project.
Dynamic metrics Metrics that can only be measured by, e.g., running a test suite. When we state that something is measured dynamically, we refer to dynamic metrics.
Static metrics Metrics measured by analysing the source code of a project. When we state that something is measured statically, we refer to static metrics.

2.2 Measuring test code quality

Athanasiou et al. introduced a Test Quality Model (TQM) based on metrics obtained through static analysis of production and test code [18]. This TQM consists of the following static metrics:

Code coverage The percentage of code tested, implemented via static call graph analysis [16].
Assertion-McCabe ratio Indicates tested decision points in the code; computed as the total number of assertion statements in the test code divided by the McCabe cyclomatic complexity score [33] of the production code.
Assertion density Indicates the ability to detect defects; computed as the number of assertions divided by the Lines Of Test Code (TLOC).
Directness Indicates the ability to detect the location of a defect's cause when a test fails. Similar to code coverage, except that only methods directly called from a test are counted.
Maintainability Based on an existing maintainability model [21], adapted for test suites. The model consists of the following metrics for test code: Duplication, Unit Size, Unit Complexity and Unit Dependency.
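As a minimal sketch of the two assertion-based TQM ratios (our own helper, not the TQM implementation), the metrics reduce to simple project-level ratios once the counts have been collected:

    // Sketch of the assertion-based TQM ratios; class and method names are ours.
    public class AssertionRatios {

        public static double assertionMcCabeRatio(int totalAssertions,
                                                  int totalMcCabeComplexity) {
            // Tested decision points: assertions per unit of cyclomatic complexity.
            return (double) totalAssertions / totalMcCabeComplexity;
        }

        public static double assertionDensity(int totalAssertions, int testLinesOfCode) {
            // Defect detection ability: assertions per line of test code (TLOC).
            return (double) totalAssertions / testLinesOfCode;
        }

        public static void main(String[] args) {
            // Example with the Checkstyle figures reported later (3,819 assertions,
            // 41,203 TLOC); the McCabe total is not reported in this paper.
            System.out.println(assertionDensity(3819, 41203)); // ~0.093
        }
    }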
2.3 Mutation testing

Test effectiveness is measured by the number of mutants that are killed by a test suite. Recent research introduced a variety of effectiveness measures and mutants. We describe the different types of mutants, mutation tools, types of effectiveness measures, and work on mutation analysis.

2.3.1 Mutant types

Not all mutants are equally easy to detect. Easy or weak mutants are killed by many tests and are thus often easy to detect. Hard-to-kill mutants can only be killed by very specific tests and often subsume other mutants. Below is an overview of the different types of mutants in the literature:

Mutant A small change to the program, i.e., a modified version of the SUT.
Equivalent mutants Mutants that do not change the outcome of a program, i.e., they cannot be detected. Consider a loop that breaks if i == 10, where i increments by 1. A mutant changing the condition to i >= 10 remains undetected, as the loop still breaks when i becomes 10 (see the sketch after this list).
Subsuming mutants The sole contributors to the effectiveness scores [36]. If mutants are subsumed, they are often killed "collaterally" together with the subsuming mutant. Killing these collateral mutants does not lead to more effective tests, but they influence the test effectiveness score calculation.
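The equivalent-mutant example from the list above can be made concrete with a small, hypothetical sketch: the mutated condition behaves identically for every execution, so no test can tell the two versions apart.

    // Hypothetical illustration of an equivalent mutant; the methods are ours.
    public class EquivalentMutantExample {

        // Original: stops as soon as i equals 10.
        static int original() {
            int i = 0;
            while (true) {
                if (i == 10) break;
                i++;
            }
            return i;
        }

        // Mutant: the condition is changed to i >= 10. Because i only ever counts
        // up from 0 in steps of 1, the first value satisfying i >= 10 is exactly 10,
        // so both versions always return 10 and the mutant cannot be killed.
        static int mutant() {
            int i = 0;
            while (true) {
                if (i >= 10) break;
                i++;
            }
            return i;
        }
    }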
2.3.2 Comparison of mutation tools

Three criteria were used to compare mutation tools for Java: 1. The effectiveness of the mutation adequate test suite of each tool. A mutation adequate test suite kills all the mutants generated by a mutation tool. Each test of this suite contributes to the effectiveness score, i.e., if one test is removed, less than a 100% effectiveness score is achieved. A cross-testing technique is applied to evaluate the effectiveness of each tool's mutation adequate test suite: the adequate test suite of each tool is run on the set of mutants generated by the other tools. If the mutation adequate test suite of tool A detects all the mutants of tool B, but the suite of tool B does not detect all the mutants of tool A, then tool A subsumes tool B. 2. The tool's application cost, in terms of the number of test cases that need to be generated and the number of equivalent mutants that would have to be inspected. 3. The execution time of each tool.

Kintis et al. analysed and compared the effectiveness of PIT, muJava and Major [27]. Each tool was evaluated using the cross-testing technique on twelve methods of six Java projects. They found that the mutation adequate test suite of muJava was the most effective, followed by Major and PIT. The ordering in terms of application cost was different: PIT required the fewest test cases and generated the smallest set of equivalent mutants.

Marki and Lindstrom performed similar research on the same mutation tools [32]. They used three small Java programs popular in the literature. They found that none of the mutation tools subsumed each other. muJava generated the strongest mutants, followed by Major and PIT; however, muJava generated significantly more equivalent mutants and was slower than Major and PIT.

Laurent et al. introduced PIT+, an improved version of PIT with an extended set of mutators [31]. They combined the test suites generated by Kintis et al. [27] into a mutation adequate test suite that would detect the combined set of mutants generated by PIT, muJava and Major. A mutation adequate test suite was also generated for PIT+. The set of mutants generated by PIT+ was equally strong as the combined set of mutants.

2.3.3 Effectiveness measures

We found three types of effectiveness measures:

Normal effectiveness Calculated as the number of killed mutants divided by the total number of non-equivalent mutants.
Normalised effectiveness Calculated as the number of killed mutants divided by the number of covered mutants, i.e., mutants located in code executed by the test suite. Intuitively, test suites killing more mutants while covering less code are more thorough than test suites killing the same number of mutants in a larger piece of source code [24] (see the sketch after this list).
Subsuming effectiveness The percentage of killed subsuming mutants. Intuitively, strong mutants, i.e., subsuming mutants, are not equally distributed [36], which could lead to skewed effectiveness results.
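As a minimal sketch (our own helper, not part of any of the tools above), the first two measures reduce to two ratios over the mutant counts:

    // Sketch of normal and normalised effectiveness; names are ours.
    public class EffectivenessScores {

        // Normal effectiveness: killed mutants over all non-equivalent mutants.
        public static double normal(int killedMutants, int nonEquivalentMutants) {
            return (double) killedMutants / nonEquivalentMutants;
        }

        // Normalised effectiveness: killed mutants over the mutants located in
        // code that the test suite actually executes ("covered" mutants).
        public static double normalised(int killedMutants, int coveredMutants) {
            return (double) killedMutants / coveredMutants;
        }
    }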
2.3.4 Mutation analysis

In this section, we describe research conducted on mutation analysis that underpins our approach.

Mutants and real faults. Just et al. investigated whether generated faults are a correct representation of real faults [26]. Statistically significant evidence shows that mutant detection correlates with real fault detection. They could relate 73% of the real faults to common mutators. Of the remaining 27%, 10% can be detected by enhancing the set of commonly used mutators. They used Major for generating mutations. Equivalent mutants were ignored, as mutation scores were only compared for subsets of a project's test suite.

Code coverage and effectiveness. Inozemtseva and Holmes analysed the correlation between code coverage and test suite effectiveness [24] in twelve studies. They found three main shortcomings: 1. Studies did not control for suite size. As code coverage relates to test suite size (more coverage is achieved by adding more tests), it remains unclear whether the correlation with effectiveness was due to the size or the coverage of the test suite. 2. Small or synthetic programs limit generalisation to industry. 3. Comparing only test suites that fully satisfy a certain coverage criterion, while arguing that the results can be generalised to more realistic test suites. Eight studies showed a correlation between some coverage type and effectiveness independently of size; the strength varied, in some studies appearing only for high coverage. They also conducted an experiment on five large open source Java projects. All mutants undetected by the master test suite were marked equivalent. To control for size, fixed-size test suites were generated by randomly selecting tests from the master test suite. Coverage was measured using CodeCover [3] on statement, decision and modified condition levels. Effectiveness was measured using normal and normalised effectiveness. They found a low to moderate correlation between coverage and normal effectiveness when controlling for size. The coverage type had little impact on the correlation strength, and only a weak correlation was found for normalised effectiveness.

Assertions and effectiveness. Zhang and Mesbah studied the relationship between assertions and test suite effectiveness [45]. Their experiment used five large open source Java projects, similarly to Inozemtseva and Holmes [24]. They found a strong correlation between assertion count and test effectiveness, even when test suite size was controlled for. They also found that some assertion types are more effective than others, e.g., boolean and object assertions are more effective than string and numeric assertions.

3 Metrics and mutants

Our goal is to investigate to what extent static analysis based metrics are related to test suite effectiveness. First, we need to select a set of static metrics. Secondly, we need a tool to measure these metrics. Thirdly, we need a way to measure test effectiveness.

3.1 Metric selection

We choose two static analysis-based metrics that could predict test suite effectiveness. We analyse the state-of-the-art TQM by Athanasiou et al. [18] because it is already based on static source code analysis. Furthermore, the TQM was developed in collaboration with SIG, the host company of this thesis, which means that knowledge of the model is directly available. This TQM consists of the following static metrics: Code Coverage, Assertion-McCabe ratio, Assertion Density, Directness and Test Code Maintainability (see also Section 2.2).

Test code maintainability relates to code readability and understandability, indicating how easily we can make changes. We drop maintainability as a candidate metric as we consider it the least related to the completeness or effectiveness of tests. The model also contains two assertion-based and two coverage-based metrics. Based on preliminary results we found that the number of assertions had a stronger correlation with test effectiveness than the two assertion-based TQM metrics for all analysed projects. Similarly, static code coverage performed better than directness in the correlation test with test effectiveness. To get a more qualitative analysis, we focus on one assertion-based metric and one coverage-based metric, respectively assertion count and static coverage.

Furthermore, coverage was shown to be related to test effectiveness [24, 35]. Others found a relation between assertions and fault density [28] and between assertions and test suite effectiveness [45].
3.2 Tool implementation

In this section, we explain the foundation of the tool and the details of the implemented metrics.

3.2.1 Tool architecture

Figure 1 presents the analysis steps. The rectangles are artefacts that form the input/output for the two processing stages.

Figure 1: Analysis steps to statically measure coverage and assertion count.

The first processing step is performed by the Software Analysis Toolkit (SAT) [29]; it constructs a call graph using only static source code analysis. Our analysis tool uses the call graph to measure both assertion count and static method coverage.

The SAT analyses source code and computes several metrics, e.g., Lines of Code (LOC), McCabe complexity [33] and code duplication, which are stored in a source graph. This graph contains information on the structure of the project, such as which packages contain which classes, which classes contain which methods, and the call relations between these methods. Each node is annotated with information such as lines of code. The graph is designed such that it can be used for many programming languages. By implementing our metrics on top of the SAT, we can do measurements for different programming languages.

3.2.2 Code coverage

Alves and Visser designed an algorithm for measuring method coverage using static source code analysis [16]. The algorithm takes as input a call graph obtained by static source code analysis. The calls from test to production code are counted by slicing the source graph and counting the methods. This includes indirect calls, e.g., from one production method to another. Additionally, the constructor of each called method's class is included. They found a strong correlation between static and dynamic coverage (the mean of the difference between static and dynamic coverage was 9%). We use this algorithm with the call graph generated by the SAT to calculate the static method coverage.

However, the static coverage algorithm has four sources of imprecision [16]. The first is conditional logic, e.g., a switch statement that invokes a different method for each case. The second is dynamic dispatch (virtual calls), e.g., a parent class with two subclasses that both override a method that is called on the parent. The third is library/framework calls, e.g., java.util.List.contains() invokes the .equals() method of each object in the list; the source code of third-party libraries is not included in the analysis, making it impossible to trace which methods are called from the framework. The fourth is the use of Java reflection, a technique to invoke methods dynamically at runtime without knowledge of these methods or classes at compile time. For the first two sources of imprecision, an optimistic approach is chosen, i.e., all possible paths are considered covered. Consequently, the coverage is overestimated. Invocations by the latter two sources of imprecision remain undetected, leading to underestimating the coverage.
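A minimal sketch of the slicing step, under the assumption that the call graph is available as an adjacency map from method identifiers to their callees (the graph representation and names are ours, not the SAT's API): starting from the test methods, we follow call edges transitively and count how many production methods are reached.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch of static method coverage via call graph slicing.
    public class StaticCoverage {

        public static double methodCoverage(Map<String, Set<String>> callGraph,
                                            Set<String> testMethods,
                                            Set<String> productionMethods) {
            Set<String> reached = new HashSet<>();
            Deque<String> work = new ArrayDeque<>(testMethods);
            while (!work.isEmpty()) {
                String method = work.pop();
                if (!reached.add(method)) continue;            // already visited
                for (String callee : callGraph.getOrDefault(method, Set.of())) {
                    work.push(callee);                         // follow direct and indirect calls
                }
            }
            reached.retainAll(productionMethods);              // keep covered production methods only
            return (double) reached.size() / productionMethods.size();
        }
    }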
3.2.3 Assertions

We measure the number of assertions using the same call graph as the static method coverage algorithm. For each test, we follow the call graph through the test code to include all direct and indirect assertion calls. Indirect calls are important because test classes often contain a utility method for asserting the correctness of an object. Additionally, we take into account the number of times a method is invoked to approximate the number of executed assertions. Only assertions that are part of JUnit are counted.

Identifying tests. By counting assertions based on the number of invocations from tests, we should also be able to identify these tests statically. We use the SAT to identify all invocations of assertion methods and then slice the call graph backwards, following all call and virtual call edges. All nodes within scope that have no parameters and no incoming edges are marked as tests.

Assertion content types. Zhang and Mesbah found a significant difference between the effectiveness of assertions and the type of objects they assert [45]. Four assertion content types were classified: numeric, string, object and boolean. They found that object and boolean assertions are more effective than string and numeric assertions. The type of objects in an assertion can give insight into the strength of the assertion. We will include the distribution of these content types in the analysis. We use the SAT to analyse the type of objects in an assertion. The SAT is unable to detect the type of an operator expression used inside a method invocation, e.g., assertTrue(a >= b);, resulting in unknown assertion content types. Also, fail statements are put in a separate category, as these are a special type of assertion without any content type.
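The assertion-counting step described at the start of this subsection can be sketched as a recursive walk over the same call graph, weighting each callee by how often it is invoked. The graph shape and names below are assumptions of this sketch rather than the tool's actual data model, and the sketch assumes the sliced graph is acyclic.

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of assertion counting along the static call graph. A call site is a
    // (callee, invocationCount) pair; JUnit assertion methods count directly,
    // other callees are expanded recursively so indirect assertions in test
    // utility methods are included. All names are illustrative.
    public class AssertionCounter {

        record Call(String callee, int invocations) {}

        public static long countAssertions(String method,
                                           Map<String, List<Call>> callGraph,
                                           Set<String> junitAssertMethods) {
            long total = 0;
            for (Call call : callGraph.getOrDefault(method, List.of())) {
                if (junitAssertMethods.contains(call.callee())) {
                    total += call.invocations();               // direct assertion call
                } else {
                    // Indirect: approximate by multiplying the callee's own count
                    // by the number of times it is invoked from this method.
                    total += call.invocations()
                            * countAssertions(call.callee(), callGraph, junitAssertMethods);
                }
            }
            return total;
        }
    }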
3.3 Mutation analysis

In this section we discuss our choice of mutation tool and test effectiveness measure.

3.3.1 Mutation tool

We presented four candidate mutation tools for our experiment in Section 2.3.2: Major, muJava, PIT and PIT+. MuJava has not been updated in the last two years and does not support JUnit 4 or Java versions above 1.6 [9]. Conforming to these requirements would decrease the set of projects we could use in our experiment, as both JUnit 4 and Java 1.7 have been around for quite some time. Major does support JUnit 4 and has recently been updated [8]; however, it only works in Unix environments [32]. PIT targets industry [27], is open source and actively developed [12]. Furthermore, it supports a wide range of build tooling and is significantly faster than the other tools. PIT+ is based on a two-year-old branched version of PIT and was only recently made available [10]. Its documentation is very sparse and the source code is missing. However, PIT+ generates a stronger set of mutants than the other three tools, whereas PIT generates the weakest set of mutants.

Based on these observations we decided that PIT+ would be the best choice for measuring test effectiveness. Unfortunately, PIT+ was not available at the start of our research. We first did the analysis based on PIT and then later switched to PIT+. Because we first used PIT, we selected projects that used Maven as a build tool. PIT+ is based on an old version, 1.1.5, which does not yet support Maven. To enable using the features of PIT's new version, we merged the mutators provided by PIT+ into the regular version of PIT [11].

3.3.2 Dealing with equivalent mutants

Equivalent mutants are mutants that do not change the outcome of the program. Manually removing equivalent mutants is time-consuming and generally undecidable [35]. A commonplace solution is to mark all the mutants that are not killed by the project's test suite as equivalent. The resulting non-equivalent mutants are always detected by at least one test. The disadvantage of this approach is that many mutants might be falsely marked as equivalent. The number of false positives depends, for example, on the coverage of the tests: if the mutated code is not covered by any of the tests, it will never be detected and will consequently be marked as equivalent. Another cause of false positives could be a lack of assertions in tests, i.e., not checking the correctness of the program's result. The percentage of equivalent mutants thus expresses to some extent the test effectiveness of the project's test suite.

With this approach, the complete test suite of each project will always kill all the remaining non-equivalent mutants. As the number of non-equivalent mutants heavily relies on the quality of a project's test suite, we cannot use these effectiveness scores to compare different projects. To compensate for that, we will compare sub test suites within the same project.
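As a sketch of this filtering step (our own helper, not part of PIT): every mutant that the master test suite fails to kill is treated as equivalent and dropped before any effectiveness score is computed.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the "mark unkilled mutants as equivalent" convention. Mutant
    // identifiers are plain strings here; the real data comes from the mutation
    // tool's report.
    public class EquivalentMutantFilter {

        public static Set<String> nonEquivalentMutants(Set<String> generatedMutants,
                                                       Set<String> killedByMasterSuite) {
            // Mutants never killed by the master test suite are considered
            // equivalent and excluded from all further effectiveness calculations.
            Set<String> nonEquivalent = new HashSet<>(generatedMutants);
            nonEquivalent.retainAll(killedByMasterSuite);
            return nonEquivalent;
        }
    }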
3.3.3 Test effectiveness measure

Next, we evaluate both normalised and subsuming effectiveness in the paragraphs below and describe our choice of an effectiveness measure.

Normalised effectiveness. Normalised effectiveness is calculated by dividing the number of killed mutants by the number of non-equivalent mutants present in the code executed by the tests. Consider the following example, in which there are two tests, T1 and T2, for method M1. Suppose M1 is only covered by T1 and T2. In total, there are five mutants Mu1..5 generated for M1. T1 detects Mu1 and T2 detects Mu2. As T1 and T2 are the only tests that cover M1, the mutants Mu3..5 remain undetected and are marked as equivalent. Both tests only cover M1 and each detects one of the two non-equivalent mutants, resulting in a normal effectiveness score of 0.5. A test suite consisting of only the above tests would detect all mutants in the covered code, resulting in a normalised effectiveness score of 1.

We notice that the normalised effectiveness score heavily relies on how mutants are marked as equivalent. Suppose the mutants marked as equivalent were valid mutants but the tests failed to detect them (false positives), e.g., due to missing assertions. In this scenario, the (normalised) effectiveness score suggests that a bad test suite is actually very effective. Projects that have ineffective tests will only detect a small portion of the mutants. As a result, a large percentage will be marked as equivalent. This increases the chance of false positives, which decreases the reliability of the normalised effectiveness score.

Consider a project of which only a portion of the code base is thoroughly tested. There is a high probability that the equivalent mutants are not equally distributed across the code base. Code covered by poor tests is more likely to contain false positives than thoroughly tested code. The poor tests scramble the results, e.g., a test with no assertions can be incorrectly marked as very effective.

Normalised effectiveness is intended to compare the thoroughness of two test suites, i.e., to penalise test suites that cover lots of code but kill only a small number of mutants. We believe that it is less suitable as a replacement for normal effectiveness. We consider normal effectiveness scores more reliable when studying the relation with our metrics. Normal effectiveness is positively influenced by the breadth of a test and penalises small test suites, as a score of 1.0 can only be achieved if all mutants are found. However, this is less of a problem when comparing test suites of equal sizes.

Subsuming effectiveness. Current algorithms for identifying subsuming mutants are influenced by the overlap between tests. Suppose there are five mutants, Mu1..5, for method M1. There are five tests, T1..5, that kill Mu1..4, and one test, T6, that kills all five mutants. Ammann et al. defined subsuming mutants as follows: "one mutant subsumes a second mutant if every test that kills the first mutant is guaranteed also to kill the second" [17]. According to this definition, Mu5 subsumes Mu1..4 because the set of tests that kill Mu5 ({T6}) is a subset of the set of tests that kill Mu1..4 ({T1, ..., T6}). The tests T1..5 will therefore have a subsuming effectiveness score of 0.

Our goal is to identify properties of test suites that determine their effectiveness. If we measured subsuming effectiveness, T1..5 would be significantly less effective. This would suggest that the assertion count or coverage of these tests did not contribute to the effectiveness, even though they still detected 80% of all mutants.

Another weakness of this approach is that it is vulnerable to changes in the test set. If we remove T6, the mutants previously marked as "subsumed" become subsuming because Mu5 is no longer detected. Consequently, T1..5 now detect all the subsuming mutants. In this scenario, we decreased the quality of the master test suite by removing a single test, which leads to a significant increase in the subsuming effectiveness score of tests T1..5. This can lead to strange results over time, as the addition of tests can lead to drops in the effectiveness of others.

Choice of effectiveness measure. Normalised effectiveness loses precision when large numbers of mutants are incorrectly marked as equivalent. Furthermore, normalised effectiveness is intended as a measurement of the thoroughness of a test suite, which is different from our definition of effectiveness. Subsuming effectiveness scores change when tests are added or removed, which makes the measure very sensitive to change. Furthermore, subsuming effectiveness penalises tests that do not kill a subsuming mutant. We choose to apply normal effectiveness, as this measure is more reliable. It also allows for comparison with similar research on effectiveness and assertions/coverage [24, 45]. We refer to test suite effectiveness also as normal effectiveness.
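For completeness, the subsumption relation used in the example above can be checked mechanically from a kill matrix. The sketch below is our own illustration, not one of the "current algorithms" referenced above: each mutant is represented by the set of tests that kill it, and a mutant is subsumed when another killed mutant's test set is a strict subset of its own.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch: identify subsumed mutants from a kill matrix mapping each mutant to
    // the set of tests that kill it.
    public class SubsumptionCheck {

        /** Mutant a subsumes mutant b if every test that kills a also kills b. */
        static boolean subsumes(Set<String> killersOfA, Set<String> killersOfB) {
            return !killersOfA.isEmpty() && killersOfB.containsAll(killersOfA);
        }

        /** Returns the mutants that are strictly subsumed by some other killed mutant. */
        public static Set<String> subsumedMutants(Map<String, Set<String>> killMatrix) {
            Set<String> subsumed = new HashSet<>();
            for (var a : killMatrix.entrySet()) {
                for (var b : killMatrix.entrySet()) {
                    if (!a.getKey().equals(b.getKey())
                            && subsumes(a.getValue(), b.getValue())
                            && !a.getValue().equals(b.getValue())) {
                        subsumed.add(b.getKey());
                    }
                }
            }
            return subsumed;
        }
    }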
4 Are static metrics related to test suite effectiveness?

Mutation tooling is resource expensive and requires running the test suites, i.e., dynamic analysis. To address these problems, we investigate to what extent static metrics are related to test suite effectiveness. In this section, we describe how we measure whether static metrics are a good predictor for test suite effectiveness.

4.1 Measuring the relationship between static metrics and test effectiveness

We consider two static metrics, assertion count and static method coverage, as candidates for predicting test suite effectiveness.

4.1.1 Assertion count

We hypothesise that assertion count is related to test effectiveness. Therefore, we first measure assertion count by following the call graph from all tests. As our context is static source code analysis, we should be able to identify the tests statically. Thus, we compare the following approaches:

Static approach We use static call graph slicing (Section 3.2.3) to identify all tests of a project and measure the total assertion count for the identified tests.
Semi-dynamic approach We use Java reflection (Section 4.3) to identify all the tests and measure the total assertion count for these tests.

Finally, we inspect the type of the asserted object as input for the analysis of the relationship between assertion count and test effectiveness.

4.1.2 Static method coverage

We hypothesise that static method coverage is related to test effectiveness. To test this hypothesis, we measure the static method coverage using static call graph slicing. We include dynamic method coverage as input for our analysis to: a) inspect the accuracy of the static method coverage algorithm and b) verify whether a correlation between method coverage and test suite effectiveness exists.

4.2 Case study setup

We study our selected projects using an experiment design based on work by Inozemtseva and Holmes [24]. They surveyed similar studies on the relation between test effectiveness and coverage and found that most studies implemented the following procedure: 1. Create faulty versions of one or more programs. 2. Create or generate many test suites. 3. Measure the metric scores of each suite. 4. Determine the effectiveness of each suite. We describe our approach for each step in the following subsections.

4.2.1 Generating faults

We employ mutation testing as a technique for generating faulty versions, mutants, of the different projects that will be analysed. We employ PIT as the mutation tool. Mutants are generated using the default set of mutators (http://pitest.org/quickstart/mutators/). All mutants that are not detected by the master test suite are removed.

4.2.2 Project selection

We have chosen three projects for our analysis based on the following requirement: the projects had in the order of hundreds of thousands of LOC and thousands of tests. Based on these criteria we selected the following projects: Checkstyle [1], JFreeChart [5] and JodaTime [6]. Table 1 shows properties of the projects. Java LOC and TLOC are generated using David A. Wheeler's SLOCCount [14].

Table 1: Characteristics of the selected projects. Total Java LOC is the sum of the production LOC and TLOC.

Property                       Checkstyle   JFreeChart            JodaTime
Total Java LOC                 73,244       134,982               84,035
Production LOC                 32,041       95,107                28,724
TLOC                           41,203       39,875                55,311
Number of tests                1,875        2,138                 4,197
Method coverage                98%          62%                   90%
Date cloned from GitHub        4/30/17      4/25/17               3/23/17
Citations in literature        [43, 39]     [45, 24, 31, 26, 16]  [24, 31, 26, 39]
Number of generated mutants    95,185       310,735               100,893
Number of killed mutants       80,380       80,505                69,615
Number of equivalent mutants   14,805       230,230               31,278
Equivalent mutants (%)         15.6%        74.1%                 31.0%

Checkstyle is a static analysis tool that checks whether Java code and Javadoc comply with a set of coding rules, implemented in checker classes. Java and Javadoc grammars are used to generate Abstract Syntax Trees (ASTs). The checker classes visit the AST, generating messages if violations occur. The core logic is in the com.puppycrawl.tools.checkstyle.checks package, representing 71% of the project's size. Checkstyle is the only project that used continuous integration and quality reports on GitHub to enforce quality, e.g., the build triggered by a commit would break if coverage or effectiveness dropped below a certain threshold. We decided to use the build tooling's class exclusion filters to get more representative results. These quality measures are needed as several developers have contributed to the project. The project currently has five active team members [2].

JFreeChart is a chart library for Java. The project is split into two parts: the logic used for data and data processing, and the code focussed on the construction and drawing of plots. Most notable are the classes for the different plots in the org.jfree.chart.plot package, which contains 20% of the production code. JFreeChart is built and maintained by one developer [5].

JodaTime is a very popular date and time library. It provides functionality for calculations with dates and times in terms of periods, durations or intervals, while supporting many different date formats, calendar systems and time zones. The structure of the project is relatively flat, with only five different packages that are all at the root level. Most of the logic is related to either formatting dates or date calculation. Around 25% of the code is related to date formatting and parsing. JodaTime was created by two developers, only one of whom is maintaining the project [6].
4.2.3 Composing test suites

It has been shown that test suite size influences the relation with test effectiveness [35]. When a test is added to a test suite, it can never decrease the effectiveness, assertion count or coverage. Therefore, we only compare test suites of equal sizes, similar to previous work [24, 45, 35].

We compose test suites of relative sizes, i.e., test suites that contain a certain percentage of all tests in the master test suite. For each size, we generate 1000 test suites. We selected the following range of relative suite sizes: 1%, 4%, 9%, 16%, 25%, 36%, 49%, 64% and 81%. Larger test suites were not included because the differences between the generated test suites would become too small. Additionally, we found that this sequence had the least overlap in effectiveness scores between the different suite sizes, while still including a wide spread of test effectiveness across the different test suites.

Our approach differs from existing research [24], which used suites of absolute sizes: 3, 10, 30, 100, 300, 1000 and 3000 tests. A disadvantage of that approach is that the number of test suite sizes for JodaTime would be larger than for the other projects, because JodaTime is the only project that has more than 3000 tests. Another disadvantage is that a test suite with 300 tests might be 50% of the master test suite for one project and only 10% of another project's test suite. Additionally, most composed test suites in that approach represent only a small portion of the master test suite. With our approach, we can more precisely study the behaviour of the metrics as the suites grow in size. Furthermore, we found that test suites with 16% of all tests already dynamically covered 50% to 70% of the methods covered by the master test suite.

4.2.4 Measuring metric scores and effectiveness

For each test suite, we measure the effectiveness, assertion count and static method coverage. The dynamic equivalents of both coverage metrics are included to enable a comparison between the two. We obtain the dynamic coverage metrics using JaCoCo [4].

4.2.5 Statistical analysis

To determine how to calculate the correlation with effectiveness, we analyse related work on the relation between test effectiveness and assertion count [45] and coverage [24]. Both works have similar experiment set-ups, in which they generated sub test suites of fixed sizes and calculated metric and effectiveness scores for these suites. Furthermore, both studies used a parametric and a nonparametric correlation test, respectively Pearson and Kendall. We also consider the Spearman rank correlation test, another nonparametric test, as it is commonly used in the literature. A parametric test assumes the underlying data to be normally distributed, whereas nonparametric tests do not.

The Pearson correlation coefficient is based on the covariance of two variables, i.e., the metric and effectiveness scores, divided by the product of their standard deviations. Assumptions for Pearson include the absence of outliers, normality of the variables and linearity. Kendall's Tau rank correlation coefficient is a rank-based test used to measure the extent to which the rankings of two variables are similar. Spearman is a rank-based version of the Pearson correlation test, commonly used because its computation is more lightweight than Kendall's. However, our data set leads to similar computation times for Spearman and Kendall.

We discard Pearson because we cannot make assumptions about our data distribution. Moreover, Kendall "is a better estimate of the corresponding population parameter and its standard error is known" [23]. As the advantages of Spearman over Kendall do not apply in our case and Kendall has advantages over Spearman, we choose Kendall's Tau rank correlation test. The correlation coefficient is calculated with R's "Kendall" package [13]. We use the Guilford scale (Table 2) for verbal descriptions of the correlation strength [35].

Table 2: Guilford scale for the verbal description of correlation coefficients.

Correlation coefficient   below 0.4   0.4 to 0.7   0.7 to 0.9   above 0.9
Verbal description        low         moderate     high         very high
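For illustration, Kendall's Tau can be sketched in a few lines by counting concordant and discordant pairs of (metric, effectiveness) observations. The version below is the simple tau-a without the tie correction applied by R's Kendall package, so it is a sketch of the statistic rather than of the package we actually used.

    // Simplified Kendall tau-a over paired observations, e.g., assertion counts
    // and effectiveness scores of the same test suites. Tie handling (tau-b) is
    // omitted for brevity.
    public class KendallTau {

        public static double tauA(double[] x, double[] y) {
            int concordant = 0, discordant = 0;
            int n = x.length;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double sign = (x[i] - x[j]) * (y[i] - y[j]);
                    if (sign > 0) concordant++;
                    else if (sign < 0) discordant++;
                    // pairs tied in x or y are ignored in this simplified version
                }
            }
            return (concordant - discordant) / (n * (n - 1) / 2.0);
        }
    }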
4.3 Evaluation tool

We compose 1000 test suites of nine different sizes for each project. Running PIT+ on the master test suite took 0.5 to 2 hours, depending on the project. As we have to calculate the effectiveness of 27,000 test suites, this approach would take too much time. Our solution is to measure the test effectiveness of each test only once. We then combine the results of different sets of tests to simulate test suites. To get the scores for a test suite with n tests, we combine the coverage results, assertion counts and killed mutants of its tests. Similarly, we calculate the static metrics and dynamic coverage only once for each test.

Detecting individual tests. We use a reflection library to detect both JUnit 3 and 4 tests for each project, according to the following definitions:

JUnit 3 All methods in non-abstract subclasses of JUnit's TestCase class. Each method should have a name starting with "test", be public, return void and have no parameters.
JUnit 4 All public methods annotated with JUnit's @Test annotation.

We verified the number of detected tests against the number of executed tests reported by each project's build tool.

We also need to include the set-up and tear-down logic of each test. We use JUnit's test runner API to execute individual tests. This API ensures execution of the corresponding set-up and tear-down logic. This extra test logic should also be included in the static coverage metric to get similar results. With JUnit 3 the extra logic is defined by overriding TestCase.setUp() or TestCase.tearDown(). JUnit 4 uses the @Before or @After annotations. However, the SAT does not provide information on the annotations used. A common practice is to still name these methods setUp or tearDown. We therefore include methods that are named setUp or tearDown and are located in the same class as the tests in the coverage results.

Aggregating metrics. To aggregate effectiveness, we need to know which mutants are detected by each test, as the sets of detected mutants can overlap. However, PIT does not provide a list of killed mutants per test. We solved this issue by creating a custom reporter using PIT's plug-in system to export the list of killed mutants. The coverage of two tests can also overlap. Thus, we need information on the methods covered by each test. JaCoCo exports this information in a jacoco.exec report file, a binary file containing all the information required for aggregation. We aggregate these files via JaCoCo's API. For the static coverage metric, we export the list of covered methods from our analysis tool. The assertion count of a test suite is simply calculated as the sum of each test's assertion count.

Figure 2 provides an overview of the tools involved and the data they generate. The evaluation tool's input is the raw test data and the sizes of the test suites to create. We then compose test suites by randomly selecting a given number of tests from the master test suite. The output of the analysis tool is a data set containing the scores on the dynamic and static metrics for each test suite.

Figure 2: Overview of the experiment set-up to obtain the relevant metrics for each test.
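The suite-composition step described above can be sketched as follows, under the assumption that each test's killed mutants, covered methods and assertion count have already been collected: a suite of size n is a random sample of tests, its mutant and method sets are unioned, and its assertion counts are summed. All types and names here are illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    // Sketch of composing a random test suite and aggregating per-test results.
    public class SuiteComposer {

        // Per-test results, assumed to be loaded from the mutation, coverage and
        // static-analysis reports.
        record TestResult(String testName, Set<String> killedMutants,
                          Set<String> coveredMethods, int assertionCount) {}

        record SuiteScore(int killedMutants, int coveredMethods, int assertionCount) {}

        public static SuiteScore composeSuite(List<TestResult> allTests, int size, Random rng) {
            List<TestResult> shuffled = new ArrayList<>(allTests);
            Collections.shuffle(shuffled, rng);
            List<TestResult> suite = shuffled.subList(0, size);   // random suite of n tests

            Set<String> killed = new HashSet<>();
            Set<String> covered = new HashSet<>();
            int assertions = 0;
            for (TestResult t : suite) {
                killed.addAll(t.killedMutants());      // union: overlapping kills counted once
                covered.addAll(t.coveredMethods());    // union: overlapping coverage counted once
                assertions += t.assertionCount();      // assertion counts simply add up
            }
            return new SuiteScore(killed.size(), covered.size(), assertions);
        }
    }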
5 Results

We first present the results of our analysis of the assertion count metric, followed by the results of our analysis of code coverage. Table 3 provides an overview of the assertion count, static and dynamic method coverage, and the percentage of mutants that were marked as equivalent for the master test suite of each project.

Table 3: Results for the master test suite of each project.

Project      Assertions   Static coverage   Dynamic coverage   Equivalent mutants
Checkstyle   3,819        85%               98%                15.6%
JFreeChart   9,030        60%               62%                74.1%
JodaTime     23,830       85%               90%                31.0%

5.1 Assertion count

Figure 3 shows the distribution of the number of assertions for each test of each project.

Figure 3: Distribution of the assertion count among individual tests per project. (Plot not reproduced; one row per project, x-axis: number of assertions, 0 to 140.)

We notice some tests with exceptionally high assertion counts. We manually checked these tests and found that the assertion count was correct for the outliers. We briefly explain a few outliers:

TestLocalDateTime_Properties.testPropertyRoundHour (140 asserts) checks the correctness of rounding 20 times, with 7 assertions on year, month, week, etc. for each check.
TestPeriodFormat.test_wordBased_pl_regEx (140 asserts) calls and asserts the results of the Polish regex parser 140 times.
TestGJChronology.testDurationFields (57 asserts) tests for each duration field whether the field names are correct and whether some flags are set correctly.
CategoryPlotTest.testEquals (114 asserts) incrementally tests all variations of the equals method of a plot object. The other tests with more than 37 assertions are similar tests for the equals methods of other types of plots.

Figure 4 shows the relation between the assertion count and normal effectiveness. Each dot represents a generated test suite, and the colour of the dot represents the size of the suite relative to the total number of tests. The normal effectiveness, i.e., the percentage of mutants killed by a given test suite, is shown on the y-axis. The normalised assertion count is shown on the x-axis. We normalised the assertion count as the percentage of the total number of assertions for a given project. For example, as Checkstyle has 3,819 assertions (see Table 3), a test suite with 100 assertions would have a normalised assertion count of 100 / 3,819 * 100 ≈ 2.6%.

Figure 4: Relation between assertion count and test suite effectiveness.

We observe that test suites of the same relative size are clustered. For each group of test suites, we calculated the Kendall correlation coefficient between normal effectiveness and assertion count. These coefficients for each set of test suites of a given project and relative size are shown in Table 4. We highlight statistically significant correlations that have a p-value < 0.005 with two asterisks (**), and results with a p-value < 0.01 with a single asterisk (*).

Table 4: Kendall correlations between assertion count and test suite effectiveness.

Project      1%      4%      9%      16%     25%     36%     49%     64%     81%
Checkstyle   -0.04   0.08**  0.13**  0.18**  0.20**  0.16**  0.16**  0.12**  0.10**
JFreeChart   0.03    0.14**  0.23**  0.32**  0.34**  0.35**  0.39**  0.40**  0.36**
JodaTime     0.05    0.11**  0.13**  0.13**  0.07**  0.09**  0.07**  0.10**  0.06*

We observe a statistically significant, low to moderate correlation for nearly all groups of test suites for JFreeChart. For JodaTime and Checkstyle, we notice significant but weaker correlations: 0.08-0.2 compared to JFreeChart's 0.14-0.4.

Table 5 shows the results of the two test identification approaches for the assertion count metric (see Section 4.1.1). False positives are methods that were incorrectly marked as tests. False negatives are tests that were not detected.

Table 5: Comparison of different approaches to identify tests for the assertion count metric.

             Semi-dynamic approach               Static approach
Project      Number of tests  Assertion count    Number of tests (diff)  Assertion count (diff)  False positives  False negatives
Checkstyle   1,875            3,819              1,821 (-54)             3,826 (+0.18%)          5                59
JFreeChart   2,138            9,030              2,172 (+34)             9,224 (+2.15%)          39               7
JodaTime     4,197            23,830             4,180 (-17)             23,943 (+0.47%)         15               32

Figure 5 shows the distribution of asserted object types. Assertions for which we could not detect the content type are categorised as unknown.

Figure 5: The distribution of assertion content types for the analysed projects. (Stacked bar chart per project; content types: fail, boolean, string, numeric, object, unknown; x-axis: percentage of total assertion count.)
5.2 Code coverage

Figure 6 shows the relation between static method coverage and normal effectiveness. A dot represents a test suite and its colour the relative test suite size. Table 6 shows the Kendall correlation coefficients between static coverage and normal effectiveness for each set of test suites. We highlight statistically significant correlations that have a p-value < 0.005 with two asterisks (**), and results with a p-value < 0.01 with a single asterisk (*).

Figure 6: Relation between static coverage and test suite effectiveness.

Table 6: Kendall correlations between static method coverage and test suite effectiveness.

Project      1%      4%      9%      16%     25%     36%     49%     64%     81%
Checkstyle   -0.05   -0.01   -0.02   -0.02   0.00    -0.04   -0.01   0.00    0.01
JFreeChart   0.49**  0.28**  0.23**  0.26**  0.27**  0.28**  0.31**  0.31**  0.26**
JodaTime     0.13**  0.28**  0.32**  0.28**  0.24**  0.25**  0.23**  0.20**  0.21**

5.2.1 Static vs. dynamic method coverage

To evaluate the quality of the static method coverage algorithm, we compare static coverage with its dynamic counterpart for each suite (Figure 7). A dot represents a test suite; colours represent the size of a suite relative to the total number of tests. The black diagonal line illustrates the ideal line: all test suites below this line overestimate the coverage and all test suites above it underestimate the coverage. Table 7 shows the Kendall correlations between static and dynamic method coverage for the different projects and suite sizes. Each correlation coefficient maps to a set of test suites of the corresponding suite size and project. Coefficients with one asterisk (*) have a p-value < 0.01 and coefficients with two asterisks (**) have a p-value < 0.005. We observe a statistically significant, low to moderate correlation for all sets of test suites for JFreeChart and JodaTime.

Figure 7: Relation between static and dynamic method coverage. Static coverage of test suites below the black line is overestimated, above it is underestimated.

Table 7: Kendall correlation between static and dynamic method coverage.

Project      1%      4%      9%      16%     25%     36%     49%     64%     81%
Checkstyle   -0.03   -0.01   0.01    -0.02   0.00    0.00    0.05    0.10**  0.15**
JFreeChart   0.67**  0.33**  0.28**  0.31**  0.33**  0.35**  0.43**  0.45**  0.44**
JodaTime     0.35**  0.44**  0.48**  0.47**  0.51**  0.51**  0.52**  0.54**  0.59**

5.2.2 Dynamic coverage and test suite effectiveness

Figure 8 shows the relation between dynamic method coverage and normal effectiveness. Each dot represents a test suite; its colour represents the size of that suite relative to the total number of tests. Table 8 shows the Kendall correlations between dynamic method coverage and normal effectiveness for the different groups of test suites for each project. Similarly to the other tables, two asterisks indicate that the correlation is statistically significant with a p-value < 0.005.

Figure 8: Relation between dynamic method coverage and test suite effectiveness.

Table 8: Kendall correlation between dynamic method coverage and test suite effectiveness.

Project      1%      4%      9%      16%     25%     36%     49%     64%     81%
Checkstyle   0.67**  0.71**  0.68**  0.59**  0.45**  0.36**  0.33**  0.31**  0.36**
JFreeChart   0.65**  0.59**  0.52**  0.48**  0.44**  0.47**  0.47**  0.49**  0.45**
JodaTime     0.48**  0.49**  0.53**  0.51**  0.48**  0.52**  0.48**  0.47**  0.44**
6 Discussion

We structure our discussion as follows: first, for each metric, we compare the results across all projects, perform an in-depth analysis of some of the projects and then answer the corresponding research question. Next, we describe the practicality of this research and the threats to validity.

6.1 Assertions and test suite effectiveness

We observe that test suites of the same relative size form groups in the plots in Figure 4, i.e., the assertion count and effectiveness score of same-size test suites are relatively close to each other. For JFreeChart, groups of test suites with a relative size >= 9% exhibit a diagonal shape. This shape is ideal as it suggests that test suites with more assertions are more effective. These groups also show the strongest correlation between assertion count and effectiveness (Table 4).

We notice that the normalised assertion count of a test suite is close to the relative suite size, e.g., suites with a relative size of 81% have a normalised assertion count between 77% and 85%. The difference between the relative suite size and the normalised assertion count is directly related to the variety in assertion count per test. More variety means that a test suite could exist with only below-average assertion counts, resulting in a normalised assertion count below 80%.

We analyse each project to find to what extent assertion count could predict test effectiveness.

6.1.1 Checkstyle

We notice a very low, statistically significant correlation between assertion count and test suite effectiveness for most of Checkstyle's test suite groups.

Most of Checkstyle's tests target the different checks in Checkstyle. Out of the 1,875 tests, 1,503 (80%) belong to a class that extends the BaseCheckTestSupport class. The BaseCheckTestSupport class contains a set of utility methods for creating a checker, executing the checker and verifying the messages generated by the checker. We notice a large variety in test suite effectiveness among the tests that extend this class. Similarly, we would expect the same variety in assertion counts. However, the assertion count is the same for at least 75% of these tests.

We found that 1,156 of these tests (62% of the master test suite) use the BaseCheckTestSupport.verify method for asserting the checker's results. The verify method iterates over the expected violation messages, which are passed as a parameter. This iteration hides the actual number of executed assertions. Consequently, we detect only two assertions for tests which might execute many assertions at runtime. In addition to the verify method, we found 60 tests that directly applied assertions inside for loops.

Finding 1: Assertions within an iteration block skew the estimated assertion count. These iterations are a source of imprecision because the actual number of assertions could be much higher than the assertion count we measured.

Another consequence of the heavy usage of verify is that these 1,156 tests all have the same assertion count. Figure 3 shows similar results for the distribution of assertions for Checkstyle's tests. The effectiveness scores for these 1,156 tests range from 0% to 11% (the highest effectiveness score of an individual test). This range shows that the group of tests with two assertions includes both the most and the least effective tests. There are approximately 1,200 tests for which we detect exactly two assertions. As this concerns 64% of all tests, we state that there is too little variety in the assertion count to make predictions on the effectiveness.

Finding 2: 64% of Checkstyle's tests have identical assertion counts. Variety in the assertion count is needed to distinguish between the effectiveness of different tests.
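The pattern behind Finding 1 can be illustrated with a small, hypothetical utility in the style of such a verify method (this is not Checkstyle's actual BaseCheckTestSupport.verify implementation): statically, the test below appears to contain a single assertion call site, while at runtime one assertion is executed per expected message.

    import static org.junit.Assert.assertEquals;
    import java.util.List;
    import org.junit.Test;

    // Hypothetical illustration of Finding 1.
    public class VerifyStyleTest {

        // Utility method with an assertion inside a loop: a static count sees one
        // assertEquals call site, but n assertions run at runtime.
        static void verify(List<String> expected, List<String> actual) {
            for (int i = 0; i < expected.size(); i++) {
                assertEquals(expected.get(i), actual.get(i));
            }
        }

        @Test
        public void reportsAllViolations() {
            List<String> expected = List.of("line 3: missing javadoc",
                                            "line 9: unused import");
            List<String> actual = List.of("line 3: missing javadoc",
                                          "line 9: unused import");
            // Two assertions execute here, but a call-graph-based count that
            // ignores loop bounds can only attribute one assertion call to verify.
            verify(expected, actual);
        }
    }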
6.1.2 JFreeChart

JFreeChart is the only project exhibiting a low to moderate correlation for most groups of test suites.

We found many strong assertions in JFreeChart's tests. By strong, we mean that two large objects, e.g., plots, are compared in an assertion. Such an assertion uses the object's equals implementation. In this equals method, around 50 lines long, many fields of the plot, such as Paint or RectangleInsets, are compared, again relying on their respective equals implementations. We also notice that most outliers for JFreeChart in Figure 3 are tests for the equals methods, which suggests that the equals methods contain much logic.

Finding 3: Not all assertions are equally strong. Some only cover a single property, e.g., a string or a number, whereas others compare two objects, potentially covering many properties. For JFreeChart, we notice a large number of assertions that compare plot objects with many properties.

Next, we searched for the combination of loops and assertions that could skew the results, and found no such occurrences in the tests.

6.1.3 JodaTime

The correlations between assertion count and test suite effectiveness for JodaTime are similar to those of Checkstyle, and much lower than those of JFreeChart. We further analyse JodaTime to find a possible explanation for the weak correlation.

Assertions in for loops. We searched for test utility methods similar to the verify method of Checkstyle, i.e., a method that has assertions inside an iteration and is used by several tests. We observe that the four most effective tests, shown in Table 9, all call testForwardTransitions and/or testReverseTransitions, both utility methods of the TestBuilder class. The rank columns contain the rank relative to the other tests, to provide some context on how they compare. Ranks are calculated based on the descending order of effectiveness or assertion count. If multiple tests have the same score, we show the average rank. Note that the utility methods are different from the tests in the top 4 that share the same name. The top 4 tests are the only tests calling these utility methods. Both methods iterate over a two-dimensional array containing a set of approximately 110 date-time transitions. For each transition, 4 to 7 assertions are executed, resulting in more than 440 executed assertions.

Table 9: JodaTime's four most effective tests.

                                        Normal effectiveness     Assertions
Test                                    Score     Rank           Score   Rank
TestCompiler.testCompile()              17.23%    1              13      361.5
TestBuilder.testSerialization()         14.61%    2              13      361.5
TestBuilder.testForwardTransitions()    12.94%    3              7       1,063.5
TestBuilder.testReverseTransitions()    12.93%    4              4       1,773.0

Additionally, we found 22 tests that combined iterations and assertions. Out of these 22 tests, at least 12 contained fixed-length iterations, e.g., for(int i = 0; i < 10; i++), that could be evaluated using other forms of static analysis. In total, we found only 26 tests of the master test suite (0.6%) that were directly affected by assertions in for loops. Thus, for JodaTime, assertions in for loops do not explain the weak correlation between assertion count and effectiveness.

Assertion strength. JodaTime has significantly more assertions than JFreeChart and Checkstyle. We observe many assertions on numeric values, as one might expect from a library that is mostly about calculations on dates and times. For example, we noticed many utility methods that checked the properties of Date, DateTime or Duration objects. Each of these utility methods asserts the number of years, months, weeks, days, hours, etc. This large number of numeric assertions corresponds with the observation that 47% of the assertions are on numeric types (Figure 5). However, the above is not always the case. For example, we found many tests related to parsing dates or times from a string, or tests for formatters, that only had one or two assertions while still being in the top half of the most effective tests.

We distinguish between two types of tests: a) tests related to the arithmetic aspect, with many assertions, and b) tests related to formatting, with only a few assertions. We find that assertion count does not work well as a predictor for test suite effectiveness here, since the assertion count of a test does not directly relate to how effective the test is.

Finding 4: Almost half of JodaTime's assertions are on numeric types. These assertions often occur in groups of 3 or more to assert a single result. However, a large number of effective tests contain only a small number of mostly non-numeric assertions. This mix leads to poor predictions.
6.1.3 JodaTime

The correlations between assertion count and test suite effectiveness for JodaTime are similar to those of Checkstyle, and much lower than those of JFreeChart. We further analyse JodaTime to find a possible explanation for the weak correlation.

Assertions in for loops. We searched for test utility methods similar to the verify method of Checkstyle, i.e., a method that has assertions inside an iteration and is used by several tests. We observe that the four most effective tests, shown in Table 9, all call testForwardTransitions and/or testReverseTransitions, both utility methods of the TestBuilder class. The rank columns contain the rank relative to the other tests, to provide some context on how they compare. Ranks are calculated based on the descending order of effectiveness or assertion count. If multiple tests have the same score, we show the average rank. Note that the utility methods are different from the tests in the top 4 that share the same name. The top 4 tests are the only tests calling these utility methods. Both methods iterate over a two-dimensional array containing a set of approximately 110 date time transitions. For each transition, 4 to 7 assertions are executed, resulting in more than 440 executed assertions.

Additionally, we found 22 tests that combined iterations and assertions. Out of these 22 tests, at least 12 tests contained fixed-length iterations, e.g., for(int i = 0; i < 10; i++), that could be evaluated using other forms of static analysis.

In total, we found only 26 tests of the master test suite (0.6%) that were directly affected by assertions in for loops. Thus, for JodaTime, assertions in for loops do not explain the weak correlation between assertion count and effectiveness.

Assertion strength. JodaTime has significantly more assertions than JFreeChart and Checkstyle. We observe many assertions on numeric values, as one might expect from a library that is mostly about calculations on dates and times. For example, we noticed many utility methods that checked the properties of Date, DateTime or Duration objects. Each of these utility methods asserts the number of years, months, weeks, days, hours, etc. This large number of numeric assertions corresponds with the observation that 47% of the assertions are on numeric types (Figure 5). However, the above is not always the case. For example, we found many tests, related to parsing dates or times from a string or tests for formatters, that only had 1 or 2 assertions while still being in the top half of the most effective tests. We distinguish between two types of tests: a) tests related to the arithmetic aspect with many assertions and b) tests related to formatting with only a few assertions. We find that assertion count does not work well as a predictor for test suite effectiveness since the assertion count of a test does not directly relate to how effective the test is.

Finding 4: Almost half of JodaTime's assertions are on numeric types. These assertions often occur in groups of 3 or more to assert a single result. However, a large number of effective tests only contains a small number of mostly non-numeric assertions. This mix leads to poor predictions.

6.1.4 Test identification

We measure the assertion count by following the static call graph for each test. As our context is static source code analysis, we also need to be able to identify the individual tests in the test code. We compare our static approach with a semi-static approach that uses Java reflection to identify tests.

Table 5 shows that the assertion count obtained with the static approach is closer to the dynamic approach than the assertion count obtained through the semi-static approach.

For all projects the assertion count of the static approach is higher. If the static algorithm does not identify tests, there are no call edges between the tests and the assertions. The absence of edges implies that these tests either have no assertions or an edge in the call graph was missing. These tests do not contribute to the assertion count.

Table 9: JodaTime's four most effective tests

Test                                    Normal effectiveness    Assertions
                                        Score      Rank         Score   Rank
TestCompiler.testCompile()              17.23%     1            13      361.5
TestBuilder.testSerialization()         14.61%     2            13      361.5
TestBuilder.testForwardTransitions()    12.94%     3            7       1,063.5
TestBuilder.testReverseTransitions()    12.93%     4            4       1,773.0

We notice that the methods that were incorrectly marked as tests, false positives, are methods used for debugging purposes or methods that were missing the @Test annotation. The latter is most noticeable for JFreeChart. We identified 39 tests that were missing the @Test annotation. Of these 39 tests, 38 tests correctly executed when the @Test annotation was added. According to the repository's owner, these tests are valid tests.² Based on the results of these three projects, we also show that the use of call graph slicing gives accurate results on a project level.

² https://github.com/jfree/jfreechart/issues/57
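The sketch below illustrates the annotation issue with a hypothetical JUnit 4 test class (all names are invented): the second method contains assertions but is silently skipped by JUnit and, depending on how tests are identified, may also be missed by a reflection-based approach.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class DurationFormatterTest {

        @Test
        public void testFormatSeconds() {        // identified as a test and executed
            assertEquals("45s", format(45));
        }

        public void testFormatMinutes() {        // missing @Test: JUnit 4 never runs it
            assertEquals("2m", format(120));
        }

        private static String format(int seconds) {
            return seconds < 60 ? seconds + "s" : (seconds / 60) + "m";
        }
    }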
6.1.5 Assertion count as a predictor for test effectiveness

We found that the correlation for Checkstyle and JodaTime is weaker than for JFreeChart. Our analysis indicates that the correlation for Checkstyle is less strong because of a combination of assertions in for loops (Finding 1) and the assertion distribution (Finding 2). However, this does not explain the weak correlation for JodaTime. As shown in Figure 3, JodaTime has a much larger spread in the assertion count of each test. Furthermore, we observe that the assertion-iteration combination does not have a significant impact on the relationship with test suite effectiveness compared to Checkstyle. We notice a set of strong assertions for JFreeChart (Finding 3) whereas JodaTime has mostly weak assertions (Finding 4).

RQ 1: To what extent is assertion count a good predictor for test suite effectiveness? Assertion count has potential as a predictor for test suite effectiveness because assertions are directly related to the detection of mutants. However, more work on assertions is needed as the correlation with test suite effectiveness is often weak or statistically insignificant.

For all three projects (Table 3), we observe different assertion counts. Checkstyle and JodaTime are of similar size and quality, but Checkstyle only has 16% of the assertions JodaTime has. JFreeChart has more assertions than Checkstyle, but the production code base that should be tested is also three times bigger. A test quality model that includes the assertion count should incorporate information about the strength of the assertions, either by incorporating assertion content types, assertion coverage [45] or the size of the asserted object. Furthermore, such a model should also include information about the size of a project.

If assertion count were to be used, we should measure the presence of its sources of imprecision to judge the reliability. This measurement should also include the intensity of the usage of erroneous methods. For example, we found hundreds of methods and tests with assertions in for loops. However, only a few methods that were often used had a significant impact on the results.

6.2 Coverage and effectiveness

We observe a diagonal-like shape for most groups of same-size test suites in Figure 6. This shape is ideal as it suggests that within this group, test suites with more static coverage are more effective. These groups also show the strongest correlation between static coverage and test suite effectiveness, as shown in Table 6.

Furthermore, we notice a difference in the spread of the static coverage on the horizontal axis. For example, coverage for Checkstyle's test suites can be split into three groups: around 30%, 70% and 80% coverage. JFreeChart shows a relatively large spread of coverage for smaller test suites, ranging between 18% and 45% coverage, but the coverage converges as test suites grow in size. JodaTime is the only project for which there is no split in the coverage scores of same-size test suites. We consider these differences in the spread of coverage a consequence of the quality of the static coverage algorithm. These differences are further explored in Section 6.2.1. We perform an in-depth analysis on Checkstyle in Section 6.2.2 because it is the only project which does not exhibit either a statistically significant correlation between static coverage and test effectiveness, or one between static coverage and dynamic method coverage.

6.2.1 Static vs. dynamic method coverage

When comparing dynamic and static coverage in Figure 7, we notice that the degree of over- or underestimation of the coverage depends on the project and test suite size. Smaller test suites tend to overestimate, whereas larger test suites underestimate. We observe that the quality of the static coverage for the Checkstyle project is significantly different compared to the other projects. Checkstyle is discussed in Section 6.2.2.

Overestimating coverage. The static coverage for the smaller test suites is significantly higher than the real coverage, as measured with dynamic analysis. Suppose a method M1 has a switch statement that, based on its input, calls one of the following methods: M2, M3, M4. There are three tests, T1, T2, T3, that each call M1, with one of the three options for the switch statement in M1 as a parameter. Additionally, there is a test suite TS1 that consists of T1, T2, T3. Each test covers M1 and one of M2, M3, M4; all tests combined in TS1 cover all 4 methods. The static coverage algorithm does not evaluate the switch statement and detects for each test that 4 methods are covered. This shows that static coverage is not very accurate for individual tests. However, the static coverage for TS1 matches the dynamic coverage. This example illustrates why the loss in accuracy, caused by overestimating the coverage, decreases as test suites grow in size. The paths detected by the static and dynamic method coverage will eventually overlap once a test suite is created that contains all tests for a given function. The amount of overestimated coverage depends on how well the tests cover the different code paths.
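The example from the text can be written down as the following minimal Java sketch (the method and test names follow the M1-M4 and T1-T3 used above). A call-graph-based static analysis marks m2, m3 and m4 as covered by every test, whereas dynamically each test reaches only one of them.

    // Production code: only one branch is taken per call.
    static void m1(int option) {
        switch (option) {
            case 1: m2(); break;
            case 2: m3(); break;
            default: m4(); break;
        }
    }
    static void m2() { }
    static void m3() { }
    static void m4() { }

    // Tests T1-T3: together they dynamically cover all four methods,
    // which is what the static estimate already reports for each test alone.
    @Test public void t1() { m1(1); }
    @Test public void t2() { m1(2); }
    @Test public void t3() { m1(3); }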
Finding 5: The degree of overestimation by the static method coverage algorithm depends on the real coverage and the amount of conditional logic and inheritance in the function under test.

Underestimating coverage. We observe that for larger test suites the coverage is often underestimated, see Figure 7. Similarly, the underestimation is also visible in the difference between static and dynamic method coverage of the different master test suites, as shown in the project results overview in Table 3. A method that is called through reflection or by an external library is not detected by the static coverage algorithm. Smaller test suites do not suffer from this issue as the number of overestimated methods is often significantly larger than the number of underestimated methods.

We observe different tipping points between overestimating and underestimating for JFreeChart and JodaTime. For JFreeChart the tipping point is visible for test suites with a relative size of 81%, whereas JodaTime reaches the tipping point at a relative size of 25%. We assume this is caused by the relatively low "real" coverage of JFreeChart. We notice that many of JFreeChart's methods that were overestimated by the static coverage algorithm are not covered.

We illustrate the overlap between over- and underestimation with a small synthetic example. Given a project with 100 methods and a test suite T, we divide these methods into three groups: 1. Group A, with 60 methods that are all covered by T, as measured with dynamic coverage. 2. Group B, with 20 methods that are only called through the Java Reflection API, all covered by T similar to Group A. 3. Group C, with 20 methods that are not covered by T. The dynamic coverage for T consists of the 80 methods in groups A and B. The static method coverage for T also consists of 80 methods. However, the coverage for Group C is overestimated as these methods are not covered, and the coverage for Group B is underestimated as these methods are not detected by the static coverage algorithm.

JFreeChart has a relatively low coverage score compared to the other projects. It is likely that the parts of the code that are deemed covered by static and dynamic coverage will not overlap. However, it should be noted that low coverage does not imply more methods are overestimated. When parts of the code base are completely uncovered, the static method coverage might also not detect any calls to the code base.

Finding 6: The degree of underestimation by the static coverage algorithm partially depends on the number of overestimated methods, as this will compensate for the underestimated methods, and on the number of methods that were called by reflection or external libraries.

Correlation between dynamic and static method coverage. Table 4 shows, for JFreeChart and JodaTime, statistically significant correlations that increase from a low correlation for smaller suites to a moderate correlation for larger suites. One exception is the correlation for JFreeChart's test suites with 1% relative size. We could not find an explanation for this exception.
We expected that the tipping point between static and dynamic coverage would also be visible in the correlation table. However, this is not the case. Our rank correlation test checks whether two variables follow the same ordering, i.e., if one variable increases, the other also increases. Underestimating the coverage does not influence the correlation when the degree of underestimation is similar for all test suites. As test suites grow in size, they become more similar in terms of included tests. Consequently, the chances of test suites forming an outlier decrease as the size increases.

Finding 7: As test suites grow, the correlation between static and dynamic method coverage increases from low to moderate.

6.2.2 Checkstyle

Figures 6 and 7 show that the static coverage results for Checkstyle's test suites are significantly different from JFreeChart and JodaTime. For Checkstyle, all groups of test suites with a relative size of 49% and lower are split into three subgroups that have around 30%, 70% and 80% coverage. In the following subsections, we analyse the quality of the static coverage for Checkstyle and the predictability of test suite effectiveness.

Quality of static coverage algorithm. To analyse the static coverage algorithm for Checkstyle we compare the static coverage with the dynamic coverage for individual tests (Figure 9a), and inspect the distribution of the static coverage among the different tests (Figure 9b). We regard the different groupings of test suites in the static coverage spread as a consequence of the few tests with high static method coverage.

Figure 9: Static method coverage scores for individual tests of Checkstyle. (a) Static and dynamic method coverage of individual tests; static coverage of tests below the black line is overestimated, above it is underestimated. (b) Distribution of the tests over the different levels of static method coverage.

Checker tests. Figure 9b shows 1104 tests scoring 30% to 32.5% coverage. Furthermore, dynamic coverage only varied between 31.3% and 31.6% coverage and nearly all tests are located in the com.puppycrawl.tools.checkstyle.checks package. We call these tests checker tests, as they are all focussed on the checks. A small experiment where we combined the coverage of all 1104 tests resulted in 31.8% coverage, indicating that all these checker tests almost completely overlap.

Listing 1 shows the structure typical for checker tests: the logic is mostly located in utility methods. Once the configuration for the checker is created, verify is called with the files that will be checked and the expected messages of the checker.

    @Test
    public void testCorrect() throws Exception {
        final DefaultConfiguration checkConfig =
            createCheckConfig(AnnotationLocationCheck.class);
        final String[] expected = CommonUtils.EMPTY_STRING_ARRAY;
        verify(checkConfig,
            getPath("InputCorrectAnnotationLocation.java"),
            expected);
    }

Listing 1: Test in AnnotationLocationCheckTest

Finding 8: Most of Checkstyle's tests are focussed on the checker logic. Although these tests vary in effectiveness, they cover an almost identical set of methods as measured with the static coverage algorithm.
Coverage subgroups and outliers. We notice three vertical groups for Checkstyle in Figure 7, starting around 31%, 71% and 78% static coverage and then slowly curving to the right. These groupings are a result of how test suites are composed and the coverage of the included tests.

The coverage of the individual tests is shown in Figure 9a. We notice a few outliers at 48%, 58%, 74% and 75% coverage. We construct test suites by randomly selecting tests. A test suite's coverage is never lower than the highest coverage among its individual tests. For example, every time a test with 74% coverage is included, the test suite's coverage will jump to at least that percentage. As test suites grow in size, the chances of including a positive outlier increase. We notice that the outliers do not exactly match the coverage of the vertical groups. The second vertical group for Checkstyle in Figure 7 starts around 71% coverage. We found that if the test with 47.5% coverage, AbstractCheckTest.testVisitToken, is combined with a 30% coverage test (any of the checker tests), it results in 71% coverage. This shows that only 6.5% coverage is overlapping between both tests. We observe that all test suites in the vertical group at 71% include at least one checker test and AbstractCheckTest.testVisitToken, and that they do not include any of the other outliers with more than 58%. The right-most vertical group starts at 79% coverage. This coverage is achieved by combining any of the tests with more than 50% coverage with a single checker test.

The groupings in Checkstyle's coverage scores are a consequence of the few coverage outliers. We show that these outliers can have a significant impact on a project's coverage score. Without these few outliers, the static coverage for Checkstyle's master test suite would only be 50%.

Test suites with low coverage. Figure 9b shows that more than half of the tests have at least 30% coverage. Similarly, Figure 7 shows that all test suites cover at least 31% of the methods. However, there are 763 tests with less than 30% coverage, and no test suites with less than 30% coverage. We explain this using probability theory. The smallest test suite for Checkstyle has a relative size of 1%, which is 19 tests. The chance of only including tests with less than 31% coverage is 763/1875 x 762/1874 x ... x 745/1857 ≈ 3 x 10^-8. These chances are negligible, even without considering that a combination of the selected tests might still lead to a coverage above 31%.
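For readers who want to reproduce this estimate, a minimal sketch of the computation (using the counts from the text: 19 draws without replacement from 1875 tests, of which 763 have less than 31% coverage) is:

    public class LowCoverageSuiteProbability {
        public static void main(String[] args) {
            double p = 1.0;
            for (int i = 0; i < 19; i++) {
                p *= (763.0 - i) / (1875.0 - i);   // probability that the i-th drawn test is also a low-coverage test
            }
            System.out.println(p);                 // about 3.8e-8, i.e., in the order of 3 x 10^-8 as stated above
        }
    }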
Missing coverage. We found that AbstractCheckTest.testVisitToken scores 47.5% static method coverage, although it only tests the AbstractCheck.visitToken method. Therefore any test calling the visitToken method will have at least 47.5% static method coverage. 160 classes extend AbstractCheck, of which 123 override the visitToken method. The static method coverage algorithm includes 123 virtual calls when AbstractCheck.visitToken is called. The coverage of all visitToken overrides combined is 47.5%. Note that the static coverage algorithm also considers constructor calls and static blocks as covered when a method of a class is invoked. We found that only 6.5% of the total method coverage overlaps with testVisitToken. This small overlap between both tests suggests that visitToken is not called by any of the check tests. However, we found that the verify method indirectly calls visitToken. The call process(File, FileText) is not matched with AbstractFileSetCheck.process(File, List). The parameter of type FileText extends AbstractList, which is part of the java.util package. During the construction of the static call graph, it was not detected that AbstractList is an implementation of the List interface because only Checkstyle's source code was inspected. If these calls were detected, the coverage of all checker tests would increase to 71%, filling the gap between the two right-most vertical groups in the plots for Checkstyle in both Figures 6 and 7.

Finding 9: Our static coverage algorithm fails to detect a set of calls in the tests for the substantial group of checker tests due to shortcomings in the static call graph. If these calls were correctly detected, the static coverage for test suites of the same size would be grouped more closely, possibly resulting in a more significant correlation.

High reflection usage. Checkstyle applies a visitor pattern on an AST for the different code checks. The AbstractCheck class forms the basis of this visitor and is extended by 160 checker classes. These classes contain the core functionality of Checkstyle and consist of 2090 methods (63% of all methods), according to the SAT. Running our static coverage algorithm on the master test suite missed calls to 328 methods. Of these methods, 248 (7.5% of all methods) are setter methods. Further inspection showed that checkers are configured using reflection, based on a configuration file with properties that match the setters of the checkers. This large group of methods missed by the static coverage algorithm partially explains the difference between static and dynamic method coverage of Checkstyle's master test suite.

Finding 10: The large gap between static and dynamic method coverage for Checkstyle is caused by a significant number of setter methods for the checker classes that are called through reflection.
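The following sketch shows the kind of reflective property wiring described above; the names are illustrative assumptions, not Checkstyle's actual implementation. Because the setter is resolved from a string at runtime, a source-level call graph contains no edge to it, so the method appears uncovered.

    import java.lang.reflect.Method;

    // Hypothetical configuration step: the property "max" from a configuration file
    // becomes a call to setMax("120").
    static void applyProperty(Object check, String property, String value) throws Exception {
        String setter = "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        Method m = check.getClass().getMethod(setter, String.class);
        m.invoke(check, value);   // invisible to a static call graph built from source code
    }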
Relation with effectiveness. Checkstyle is the only project for which there is no statistically significant correlation between static method coverage and test suite effectiveness.

We notice a large distance, in terms of invocations in the call hierarchy, between most checkers and their tests. There are 9 invocations between visitToken and the much used verify method. In addition to the actual checker logic, a lot of infrastructure is included in each test: for example, instantiating the checkers and their properties based on a reflection framework, parsing the files and creating an AST, traversing the AST, and collecting and converting all messages of the checkers. These characteristics seem to match those of integration tests. Zaidman et al. studied the evolution of the Checkstyle project and arrived at similar findings: "Moreover, there is a thin line between unit tests and integration tests. The Checkstyle developers see their tests more as I/O integration tests, yet associate individual test cases with a single production class by name" [43].

Directness. We implemented the directness measure to inspect whether it would reflect the presence of mostly integration-like tests. The directness is based on the percentage of methods that are directly called from a test. The master test suites of Checkstyle, JFreeChart and JodaTime cover respectively 30%, 26% and 61% of all methods directly. As Checkstyle's static coverage is significantly higher than that of JFreeChart, Checkstyle covers, relative to its total coverage, the smallest portion of methods directly from tests. Given that unit tests should be focused on small functional units, we expected a relatively high directness measure for the test suites.

Finding 11: Many of Checkstyle's tests are integration-like tests that have a large distance between the test and the logic under test. Consequently, only a small portion of the code is covered directly.

To make matters worse, the integration-like tests were mixed with actual unit tests. We argue that integration tests have different properties compared to unit tests: they often cover more code and have fewer assertions, but the assertions have a higher impact, e.g., comparing all the reported messages. These differences can lead to a skew in the effectiveness results.
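A minimal sketch of how such a directness measure can be computed from a static call graph is given below; the call-graph representation is an assumption for illustration, not the SAT's actual data model.

    import java.util.Map;
    import java.util.Set;

    final class Directness {
        // directness = production methods called directly from at least one test / all production methods
        static double of(Map<String, Set<String>> callGraph,      // caller -> direct callees (method ids)
                         Set<String> testMethods,
                         Set<String> productionMethods) {
            long direct = productionMethods.stream()
                    .filter(m -> testMethods.stream()
                            .anyMatch(t -> callGraph.getOrDefault(t, Set.of()).contains(m)))
                    .count();
            return (double) direct / productionMethods.size();
        }
    }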
6.2.3 Dynamic method coverage and effectiveness

We observe in Figure 8 that, within groups of test suites of the same size, test suites with more dynamic coverage are also more effective. Similarly, we observe a moderate correlation between dynamic method coverage and normal effectiveness for all three projects in Table 8.

When comparing test suite effectiveness with static method coverage, we observe a low to moderate correlation for JFreeChart and JodaTime when accounting for size in Table 6, but no statistically significant correlation for Checkstyle. Similarly, only the Checkstyle project does not show a statistically significant correlation between static and dynamic method coverage, as shown in Table 7. We believe this is a consequence of the integration-like test characteristics of the Checkstyle project. Due to the large distance between tests and code and the abstractions used in between, the static coverage is not very accurate.

The moderate correlation between dynamic method coverage and effectiveness suggests there is a relation between method coverage and normal effectiveness. However, the static method coverage does not show a statistically significant correlation with normal effectiveness for Checkstyle. We state that our static method coverage metric is not accurate enough for the Checkstyle project.

6.2.4 Method coverage as a predictor for test suite effectiveness

We found a statistically significant, low correlation between test suite effectiveness and static method coverage for JFreeChart and JodaTime. We evaluated the static coverage algorithm and found that smaller test suites typically overestimate the coverage (Finding 5), whereas for larger test suites the coverage is often underestimated (Finding 6). The tipping point depends on the real coverage of the project. We also found that static coverage correlates better with dynamic coverage as test suites increase in size (Finding 7).

An exception to these observations is Checkstyle, the only project without a statistically significant correlation between static method coverage and both test suite effectiveness and dynamic method coverage. Most of Checkstyle's tests have nearly identical coverage results (Finding 8), albeit the effectiveness varies. The SAT could calculate static code coverage; however, it is less suitable for more complex projects. The large distance between tests and tested functionality (Finding 11) in the Checkstyle project in terms of call hierarchy led to skewed results as some of the most used calls were not resolved (Finding 9). This can be partially mitigated by improving the call resolving. We consider the inaccurate results of the static coverage algorithm a consequence of the quality of the call graph and the frequent use of Java reflection (Finding 10). Furthermore, the unit tests for Checkstyle show similarities with integration tests.

RQ 2: To what extent is static coverage a good predictor for test suite effectiveness? First, we found a moderate to high correlation between dynamic method coverage and effectiveness for all analysed projects, which suggests that method coverage is a suitable indicator. The projects that showed a statistically significant correlation between static and dynamic method coverage also showed a significant correlation between static method coverage and test suite effectiveness. Although the correlation between test suite effectiveness and static coverage was not statistically significant for Checkstyle, the coverage score on project level provided a relatively good indication of the project's real coverage. Based on these observations we consider coverage suitable as a predictor for test effectiveness.

6.3 Practicality

A test quality model based on the current state of the metrics would not be sufficiently accurate. Although there is evidence of a correlation between assertion count and effectiveness, the assertion count of each project's master test suite did not map to the relative effectiveness of each project. Each of the analysed projects had on average a different number of assertions per test. Further improvements to the assertion count metric, e.g., including the strength of the assertions, are needed to get more usable results.

The static method coverage could be used to evaluate effectiveness to a certain extent. We found a low to moderate correlation for two of the projects between effectiveness and static method coverage. Furthermore, we found a similar correlation between static and dynamic method coverage. The quality of the static call graph should be improved to better estimate the real coverage.

We did not investigate the quality of these metrics for other programming languages. However, the SAT supports call graph analysis and identifying assertions for a large range of programming languages, facilitating future experiments.

We encountered scenarios for which the static metrics gave imprecise results. If these sources of imprecision were translated to metrics, they could indicate the quality of the static metrics. An indication of low quality could suggest that more manual inspection is needed.
6.4 Internal threats to validity

Static call graph. We use the static call graph constructed by the SAT for both metrics. We found several occurrences where the SAT did not correctly resolve the call graph. We fixed some of the issues encountered during our analysis. However, as we did not manually analyse all the calls, this remains a threat to validity.

Equivalent mutants. We treated all mutants that were not detected by the master test suite as equivalent mutants, an approach often used in the literature [35, 24, 45]. There is a high probability that this resulted in overestimating the number of equivalent mutants, especially for JFreeChart where a large part of the code is simply not tested. In principle, this is not a problem as we only compare the effectiveness of sub test suites. However, our statement on the ordering of the master test suites' effectiveness is vulnerable to this threat as we did not manually inspect each mutant for equivalence.

Accuracy of analysis. We manually inspected large parts of the Java code of each project. Most of the inspections were done by a single person with four years of experience in Java. Also, we did not inspect all the tests. Most tests were selected on a statistics-driven basis, i.e., we looked at tests that showed high effectiveness but low coverage, or tests with a large difference between static and dynamic results. To mitigate this, we also verified randomly selected tests. However, the chance of missing relevant sources of imprecision remains a threat to validity.

6.5 External threats to validity

We study three open source Java projects. Our results are not generalisable to projects using other programming languages. Also, we only included assertions provided by JUnit. Although JUnit is the most popular testing library for Java, there are testing libraries possibly using different assertions [44]. We also ignored mocking libraries in our analysis. Mocking libraries provide a form of assertions based on the behaviour of units under test. These assertions are ignored by our analysis, albeit they can lead to an increase in effectiveness.

6.6 Reliability

Tengeri et al. compared different instrumentation techniques and found that JaCoCo produces inaccurate results, especially when mapped back to source code [39]. The main problem was that JaCoCo did not include coverage between two different sub-modules in a Maven project. For example, a call from sub-module A to sub-module B is not registered by JaCoCo because JaCoCo only analyses coverage on a module level. As the projects analysed in this paper do not contain sub-modules, this JaCoCo issue is not applicable to our work.
7 Related work

We group related work as follows: test quality models, standalone test metrics, code coverage and effectiveness, and assertions and effectiveness.

7.1 Test quality models

We compare the TQM [18] we used, as described in Section 2.2, with two other test quality models. We first describe the other models, followed by a motivation for the choice of a model.

STREW. Nagappan introduced the Software Testing and Reliability Early Warning (STREW) metric suite to provide "an estimate of post-release field quality early in software development phases" [34]. The STREW metric suite consists of nine static source and test code metrics. The metric suite is divided into three categories: test quantification, complexity and OO-metrics, and size adjustment. The test quantification metrics are the following: 1. Number of assertions per line of production code. 2. Number of tests per line of production code. 3. Number of assertions per test. 4. The ratio between lines of test code and production code, divided by the ratio of test and production classes.

TAIME. Tengeri et al. introduced a systematic approach for test suite assessment with a focus on code coverage [38]. Their approach, Test Suite Assessment and Improvement Method (TAIME), is intended to find improvement points and guide the improvement process. In this iterative process, first, both the test code and the production code are split into functional groups and paired together. The second step is to determine the granularity of the measures: start with coarse metrics on procedure level and in later iterations repeat on statement level. Based on these functional groups they define the following set of metrics:
Code coverage: calculated on both procedure and statement level.
Partition metric: "The Partition Metric (PART) characterizes how well a set of test cases can differentiate between the program elements based on their coverage information [38]".
Tests per Program: how many tests have been created on average for a functional group.
Specialisation: how many tests for a functional group are in the corresponding test group.
Uniqueness: what portion of covered functionality is covered only by a particular test group.

STREW, TAIME and TQM are models for assessing aspects of test quality. STREW and TQM are both based on static source code analysis. However, STREW lacks coverage related metrics compared to TQM. TAIME is different from the other two models as it does not depend on a specific programming language or xUnit framework. Furthermore, TAIME is more an approach than a simple metric model. It is an iterative process that requires user input to identify functional groups. The required user input makes it less suitable for automated analysis or large-scale studies.

7.2 Standalone test metrics

Bekerom investigated the relation between test smells and test bugs [41]. He built a tool using the SAT to detect a set of test smells: Eager Test, Lazy Test, Assertion Roulette, Sensitive Equality and Conditional Test Logic. He showed that classes affected by test bugs score higher on the presence of test smells. Additionally, he predicted classes that have test bugs based on the eager smell with a precision of 7%, which was better than random. However, the recall was very low, which led to the conclusion that it is not yet usable to predict test bugs with smells.

Ramler et al. implemented 42 new rules for the static analysis tool PMD to evaluate JUnit code [37]. They defined four key problem areas that should be analysed: usage of the xUnit test framework, implementation of the unit test, maintainability of the test suite and testability of the SUT. The rules were applied to the JFreeChart project and resulted in 982 violations, of which one-third was deemed to be some symptom of problems in the underlying code.

7.3 Code coverage and effectiveness

Namin et al. studied how coverage and size independently influence effectiveness [35]. Their experiment used seven Siemens suite programs which varied between 137 and 513 LOC and had between 1000 and 5000 test cases. Four types of code coverage were measured: block, decision, C-Use and P-Use. The size was defined by the number of tests and effectiveness was measured using mutation testing. Test suites of fixed sizes and different coverage levels were randomly generated to measure the correlation between coverage and effectiveness. They showed that both coverage and size independently influence test suite effectiveness.

Another study on the relation between test effectiveness and code coverage was performed by Inozemtseva and Holmes [24]. They conducted an experiment on a set of five large open source Java projects and accounted for the size of the different test suites. Additionally, they introduced a novel effectiveness metric, normalized effectiveness. They found moderate correlations between coverage and effectiveness when size was accounted for. However, the correlation was low for normalized effectiveness.

The main difference with our work is that we used static source code analysis to calculate method coverage. Our experiment set-up is similar to that of Inozemtseva and Holmes, except that we chose a different set of data points which we showed to be more representative.
7.4 Assertions and effectiveness

Kudrjavets et al. investigated the relation between assertions and fault density [28]. They measured the assertion density, i.e., the number of assertions per thousand lines of code, for two components of Microsoft Visual Studio written in C and C++. Additionally, real faults were taken from an internal bug database and converted to fault density. Their results showed a negative relation between assertion density and fault density, i.e., code that had a higher assertion density had a lower fault density. Instead of assertion density we focussed on the assertion count of Java projects and used artificial faults, i.e., mutants.

Zhang and Mesbah [45] investigated the relationship between assertions and test suite effectiveness. They found that, even when test suite size was controlled for, there was a strong correlation between assertion count and test effectiveness. Our results overlap with their work as we both found a correlation between assertion count and effectiveness for the JFreeChart project. However, we showed that this correlation is not always present, as both Checkstyle and JodaTime showed different results.

8 Conclusion

We analysed the relation between test suite effectiveness and two metrics, assertion count and static method coverage, for three large Java projects: Checkstyle, JFreeChart and JodaTime. Both metrics were measured using static source code analysis. We found a low correlation between test suite effectiveness and static method coverage for JFreeChart and JodaTime and a low to moderate correlation with assertion count for JFreeChart. We found that the strength of the correlation depends on the characteristics of the project. The absence of a correlation does not imply that the metrics are not useful for a TQM.

Our current implementation of the assertion count metric only shows promising results when predicting test suite effectiveness for JFreeChart. We found that simply counting the assertions for each project gives results that do not align with the relative effectiveness of the projects. The project with the most effective master test suite had a significantly lower assertion count than the other projects. Even for sub test suites of most projects, the assertion count did not correlate with test effectiveness. Incorporating the strength of an assertion could lead to better predictions.

Static method coverage is a good candidate for predicting test suite effectiveness. We found a statistically significant, low correlation between static method coverage and test suite effectiveness for most analysed projects. Furthermore, the coverage algorithm is consistent in its predictions on a project level, i.e., the ordering of the projects based on the coverage matched the relative ranking in terms of test effectiveness.

8.1 Future work

Static coverage. Landman et al. investigated the challenges for static analysis of Java reflection [30]. They identified that it is at least possible to identify and measure the use of hard-to-resolve reflection usage. Measuring reflection usage could give an indication of the degree of underestimated coverage. Similarly, we would like to investigate whether we can give an indication of the degree of overestimation of the project.

Assertion count. We would like to investigate further whether we can measure the strength of an assertion. Zhang and Mesbah included assertion coverage and measured the effectiveness of different assertion types [45]. We would like to incorporate this knowledge into the assertion count. This could result in a more comparable assertion count on project level.

Deursen et al. described a set of test smells, including the eager test, a test that verifies too much functionality of the tested function [42]. We found a large number of tests in the JodaTime project that called the function under test several times. For example, JodaTime's test_wordBased_pl_regEx test checks 140 times whether periods are formatted correctly in Polish. These eager tests should be split into separate cases that test the specific scenarios.
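As a hedged sketch of the suggested refactoring (the formatPolish helper, the test names and the expected strings below are invented; we do not reproduce JodaTime's actual test data), an eager test that loops over many scenarios can be split into focused cases:

    // Eager: one test, many scenarios, and the first failure hides all the others.
    @Test
    public void testWordBasedPolishPeriods() {
        String[][] cases = { {"1 rok", "P1Y"}, {"2 lata", "P2Y"} /* ... many more cases ... */ };
        for (String[] c : cases) {
            assertEquals(c[0], formatPolish(c[1]));
        }
    }

    // Split: each scenario becomes an independent, named test case.
    @Test public void testOneYearInPolish()  { assertEquals("1 rok",  formatPolish("P1Y")); }
    @Test public void testTwoYearsInPolish() { assertEquals("2 lata", formatPolish("P2Y")); }

    // Stub for illustration only; a real test would call the library's formatter.
    private static String formatPolish(String isoPeriod) {
        return isoPeriod.equals("P1Y") ? "1 rok" : "2 lata";
    }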
8.2 Acknowledgements

We would like to thank Prof. Serge Demeyer for his elaborate and insightful feedback on our paper.

References

[1] Checkstyle. https://github.com/checkstyle/checkstyle. Accessed: 2017-07-15.
[2] Checkstyle team. http://checkstyle.sourceforge.net/team-list.html. Accessed: 2017-11-19.
[3] CodeCover. http://codecover.org/. Accessed: 2017-07-15.
[4] JaCoCo. http://www.jacoco.org/. Accessed: 2017-07-15.
[5] JFreeChart. https://github.com/jfree/jfreechart. Accessed: 2017-07-15.
[6] JodaTime. https://github.com/jodaorg/joda-time. Accessed: 2017-07-15.
[7] JUnit. http://junit.org/. Accessed: 2017-07-15.
[8] MAJOR mutation tool. http://mutation-testing.org/. Accessed: 2017-07-15.
[9] muJava mutation tool. https://cs.gmu.edu/~offutt/mujava/. Accessed: 2017-07-15.
[10] PIT+. https://github.com/LaurentTho3/ExtendedPitest. Accessed: 2017-07-15.
[11] PIT fork. https://github.com/pacbeckh/pitest. Accessed: 2017-07-15.
[12] PIT mutation tool. http://pitest.org/. Accessed: 2017-07-15.
[13] R's Kendall package. https://cran.r-project.org/web/packages/Kendall/Kendall.pdf. Accessed: 2017-07-15.
[14] SLOCCount. https://www.dwheeler.com/sloccount/. Accessed: 2017-07-15.
[15] TIOBE index. https://www.tiobe.com/tiobe-index/. Accessed: 2017-07-15.
[16] Tiago L. Alves and Joost Visser. Static estimation of test coverage. In Proceedings of SCAM 2009, pages 55-64, 2009.
[17] Paul Ammann, Marcio Eduardo Delamaro, and Jeff Offutt. Establishing theoretical minimal sets of mutants. In Proceedings of ICST 2014, pages 21-30, 2014.
[18] Dimitrios Athanasiou, Ariadi Nugroho, Joost Visser, and Andy Zaidman. Test code quality and its relation to issue handling performance. IEEE Transactions on Software Engineering, 40(11):1100-1125, 2014.
[19] Kent Beck and Erich Gamma. Test infected: Programmers love writing tests. Java Report, 3(7):37-50, 1998.
[20] Antonia Bertolino. Software testing research: Achievements, challenges, dreams. In Future of Software Engineering (FOSE 2007), pages 85-103, 2007.
[21] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A practical model for measuring maintainability. In Proceedings of QUATIC 2007, pages 30-39, 2007.
[22] Ferenc Horvath, Bela Vancsics, Laszlo Vidacs, Arpad Beszedes, David Tengeri, Tamas Gergely, and Tibor Gyimothy. Test suite evaluation using code coverage based metrics. In Proceedings of SPLST 2015, pages 46-60, 2015.
[23] David C. Howell. Statistical Methods for Psychology. Cengage Learning, 2012.
[24] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of ICSE 2014, pages 435-445, 2014.
[25] Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5):649-678, 2011.
[26] Rene Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In Proceedings of FSE 2014, pages 654-665, 2014.
[27] Marinos Kintis, Mike Papadakis, Andreas Papadopoulos, Evangelos Valvis, and Nicos Malevris. Analysing and comparing the effectiveness of mutation testing tools: A manual study. In Proceedings of SCAM 2016, pages 147-156, 2016.
[28] Gunnar Kudrjavets, Nachiappan Nagappan, and Thomas Ball. Assessing the relationship between software assertions and faults: An empirical investigation. In Proceedings of ISSRE 2006, pages 204-212, 2006.
[29] Tobias Kuipers and Joost Visser. A tool-based methodology for software portfolio monitoring. In Proceedings of SAM 2004, pages 118-128, 2004.
[30] Davy Landman, Alexander Serebrenik, and Jurgen J. Vinju. Challenges for static analysis of Java reflection: Literature review and empirical study. In Proceedings of ICSE 2017, pages 507-518, 2017.
[31] Thomas Laurent, Mike Papadakis, Marinos Kintis, Christopher Henard, Yves Le Traon, and Anthony Ventresque. Assessing and improving the mutation testing practice of PIT. In Proceedings of ICST 2017, pages 430-435, 2017.
[32] Andras Marki and Birgitta Lindstrom. Mutation tools for Java. In Proceedings of SAC 2017, pages 1364-1415, 2017.
[33] Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308-320, 1976.
[34] Nachiappan Nagappan. A Software Testing and Reliability Early Warning (STREW) Metric Suite. PhD thesis, North Carolina State University, 2005.
[35] Akbar Siami Namin and James H. Andrews. The influence of size and coverage on test suite effectiveness. In Proceedings of ISSTA 2009, pages 57-68, 2009.
[36] Mike Papadakis, Christopher Henard, Mark Harman, Yue Jia, and Yves Le Traon. Threats to the validity of mutation-based test assessment. In Proceedings of ISSTA 2016, pages 354-365, 2016.
[37] Rudolf Ramler, Michael Moser, and Josef Pichler. Automated static analysis of unit test code. In Proceedings of VST@SANER 2016, pages 25-28, 2016.
[38] David Tengeri, Arpad Beszedes, Tamas Gergely, Laszlo Vidacs, David Havas, and Tibor Gyimothy. Beyond code coverage: An approach for test suite assessment and improvement. In Proceedings of the ICST 2015 Workshops, pages 1-7, 2015.
[39] David Tengeri, Ferenc Horvath, Arpad Beszedes, Tamas Gergely, and Tibor Gyimothy. Negative effects of bytecode instrumentation on Java source code coverage. In Proceedings of SANER 2016, pages 225-235, 2016.
[40] Paco van Beckhoven. Assessing test suite effectiveness using static analysis. Master's thesis, University of Amsterdam, 2017.
[41] Kevin van den Bekerom. Detecting test bugs using static analysis tools. Master's thesis, University of Amsterdam, 2016.
[42] Arie van Deursen, Leon Moonen, Alex van den Bergh, and Gerard Kok. Refactoring test code. In Proceedings of XP 2001, pages 92-95, 2001.
[43] Andy Zaidman, Bart Van Rompaey, Serge Demeyer, and Arie van Deursen. Mining software repositories to study co-evolution of production & test code. In Proceedings of ICST 2008, pages 220-229, 2008.
[44] Ahmed Zerouali and Tom Mens. Analyzing the evolution of testing library usage in open source Java projects. In Proceedings of SANER 2017, pages 417-421, 2017.
[45] Yucheng Zhang and Ali Mesbah. Assertions are strongly correlated with test suite effectiveness. In Proceedings of ESEC/FSE 2015, pages 214-224, 2015.
[46] Hong Zhu, Patrick A. V. Hall, and John H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):366-427, 1997.