Attention, Test Code is Low-quality!

Xinye Tang
State Key Laboratory of Computer Science
Institute of Software, Chinese Academy of Sciences
tangxinye@nfs.iscas.ac.cn

ABSTRACT
Software testing is an essential process during software development and maintenance for improving software quality. Test code, the artefact produced during software testing, is widely used in many software quality assurance techniques. Traditionally, software quality assurance techniques, e.g., automatic bug repair, fault localization, test case prioritization, and mining API usage from test code, are based on the hypothesis that the test code is of sound quality. However, through an empirical study on four open source projects, we found that the quality of test code is quite low compared with the corresponding source code, which might hurt the above software quality assurance techniques.

We studied more than 140,000 lines of test code (LOC) from four large-scale and widely used open source projects and found that it is common for test code in open source projects to be unregulated and of low quality. First, the comment clone ratio, unreleased resource ratio, and clone code ratio of test code are much higher than those of the corresponding source code; second, bug-fixed coverage drops to 0. The lessons learned are that the quality of test code is quite low compared with the corresponding source code, and that low-quality test code may misguide existing software quality assurance techniques.

Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging

General Terms
Experimentation, Measurement.

Keywords
Test code quality, empirical study, testing, software quality assurance.

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.
1. INTRODUCTION
Software testing is an essential process during software development and maintenance. Test code, the artefact of software testing, is widely used for ensuring software quality. It is therefore critical to maintain high-quality test code. Existing work [3] revealed that high-quality test code can improve a development team's performance; in particular, it reported a significant positive correlation between test code quality and the throughput and productivity of issue handling.

In this paper, we conduct a pilot study that examines test code quality from a different aspect: we study the potential relation between test code quality and several software quality assurance techniques, e.g., automatic bug repair, fault localization, test case prioritization, and mining API usage from test code.

Traditionally, software quality assurance techniques, e.g., automatic bug repair [4, 9, 14], fault localization [11, 15], test case prioritization [6], and mining APIs from test code [8, 16], are based on the hypothesis that the test code is of sound quality. Studies on automatic bug repair leverage test code to measure their performance. For fault localization, test code is required to improve accuracy and to establish lower and upper bounds. Test case prioritization techniques rearrange the execution order of test cases, which presumes sound test code. Moreover, test code is essential for mining API usage examples, which help developers learn and understand the correct usage of library APIs.

Through an empirical study on four large-scale and widely used open source projects, we found that the quality of test code is quite low compared with the corresponding source code, which might negatively impact the above software quality assurance techniques. We studied more than 140,000 LOC of test code from these four projects and found that it is common for test code in open source projects to be unregulated and of low quality. Specifically, first, the comment clone ratio, unreleased resource ratio, and clone code ratio of test code are much higher than those of the corresponding source code; second, bug-fixed coverage drops to 0. Such low-quality test code may misguide existing software quality assurance techniques.

The main contributions of this work include:

1. We proposed five criteria for the measurement of test code quality.

2. Based on the proposed criteria, we measured the test code quality of four large-scale, widely used open source projects. Results show that the quality of test code is quite low compared with the corresponding source code. We further discuss the potential impact of low-quality test code on existing software quality assurance techniques. To the best of our knowledge, this is the first work to report that test code is low-quality and untrustworthy, which should be taken seriously.

In the remainder of this paper, Section 2 presents the points on which the author would most like advice; Section 3 presents essential background and related work; Section 4 describes our motivation; Section 5 presents the three categories for the measurement of test code quality; Section 6 explains how we conducted our empirical study; Section 7 discusses the threats to this work; Section 8 concludes the paper and discusses future work.

2. ADVICE WANTED
As for the points on which we would like to get the most advice, we are considering how to take this research further. Specifically, we plan to conduct quantitative and qualitative studies to explore how exactly low-quality test code impacts software quality assurance techniques, i.e., automatic bug repair, fault localization, test case prioritization, and mining API usage. We would appreciate insightful suggestions from mentors on whether the work is valuable and how it could be done effectively. Any feedback on the structure and content of the paper would also be welcome.

3. BACKGROUND AND RELATED WORK
Automatic Bug Repair is the process of automatically generating patches for repairing bugs, and many studies have addressed it. Weimer et al. [14] present a fully automated technique for repairing bugs that uses one part of a program as a template to repair another part. Kim et al. [9] proposed a patch generation approach learned from human-written patches, identifying common fix patterns for automatic patch generation. Tan et al. [13] propose an approach for the automated repair of software regression bugs. Test code is used to evaluate the effectiveness of these approaches: given a bug, if a generated repair patch passes all test cases, the patch is treated as an effective repair for that bug.

Fault Localization is the indispensable process of identifying exactly where bugs are before fixing them. Xuan et al. [15] pointed out that the effectiveness of fault localization depends on the quantity of test code and proposed an approach that improves fault localization with test case purification. Steimann et al. [11] empirically explored the performance of existing fault locators, and their results show that the quality of test code is a key factor for fault locators. Campos et al. [5] proposed entropy-based test generation to improve fault localization.

Test Case Prioritization aims to rearrange the execution order of test cases to maximize specific objectives. Elbaum et al. [6] compare different test case prioritization techniques for regression testing in terms of how well they improve the rate of fault detection.
Mining API Usage from Test Code: Understanding and learning the correct usage of library APIs are significant but complex activities for developers. Ghafari et al. [8] described an approach to code recommendation in which examples are obtained by mining and manipulating the unit tests of an API. Zhu et al. [16] proposed an approach that mines API usage examples from test code and applies clustering to improve the representativeness of the extracted examples. Nasehi et al. [10] proposed to supplement standard API documentation with relevant examples taken from unit tests. The quality of API usage in test code is therefore critical to this topic.

Test Code Quality: It is critical to assess the quality of test code; however, little work has been done in this field. Athanasiou et al. [3] revealed that high-quality test code of a software system can improve the development team's performance.

4. MOTIVATION
As mentioned above, test code is closely related to software quality assurance techniques, e.g., automatic bug repair, fault localization, test case prioritization, and mining API usage from test code. These techniques are based on the hypothesis that the test code is of sound quality. Thus, the quality of test code is critical for their performance. In this study, we conduct a pilot study to explore the quality of test code according to five criteria.

5. TEST CODE QUALITY
In this section, we present the three categories for the measurement of test code quality.

5.1 Incorrectness
This category consists of test criteria that focus on measuring the error detection ability of the code. Unreleased resources and code clones are the main criteria in this category.

5.1.1 Unreleased resource
An unreleased resource is a kind of incorrect API use: it occurs when developers fail to release a resource such as a File or a ResultScanner. When developers finish input and output operations on a file object, they should close the file and release the resource; otherwise there is a potential memory leak vulnerability. Moreover, developers should close the resource in a finally clause to ensure that it is released no matter what happens in the try block.

The following real test case from Ant 1.9.4 is an example in which the method does not close the file object it opens:

    public void testPassFile() throws Exception {
        buildRule.executeTarget("test3");
        File f = new File(
            buildRule.getProject().getBaseDir(),
            "testpassfile.tmp");
        assertTrue(...);
        assertEquals(...);
    }
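For contrast, the following is a minimal sketch of the release pattern advocated here. It is a hypothetical example, not the actual Ant fix: the class name, the reader, and what is read are placeholders.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    public class ReleaseResourceExample {
        // Close the reader in a finally block so the handle is
        // released even if reading throws.
        static String readFirstLine(File f) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(f));
            try {
                return reader.readLine();
            } finally {
                reader.close(); // runs no matter what happens in the try block
            }
        }

        // On Java 7+, try-with-resources gives the same guarantee more concisely.
        static String readFirstLineTwr(File f) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader(f))) {
                return reader.readLine();
            }
        }
    }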
In this study, the unreleased resource ratio is defined as follows:

    Unreleased Resource Ratio = (#unreleased resource classes) / TC

where #unreleased resource classes is the number of classes that do not release resources, and TC is the total number of classes.

We developed a tool that automatically examines the code and checks whether every resource is closed after being opened and used. If a resource is closed, we further check whether the close statement is enclosed in a finally block. In the unreleased resource count, we include both resources that are never closed and resources that are not closed in a finally block.
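The checker itself is not published with the paper; the following deliberately crude, line-based heuristic only sketches the described logic. The set of resource types matched and the way a finally block is detected are illustrative assumptions, not the actual implementation.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustration-only approximation of the paper's checker: a class is
    // flagged when stream constructors outnumber close() calls, or when
    // the file contains no finally block at all.
    public class UnreleasedResourceHeuristic {

        private static final String OPEN_PATTERN =
            ".*new\\s+(FileInputStream|FileOutputStream|FileReader|FileWriter)\\b.*";

        public static boolean looksLeaky(Path javaFile) throws IOException {
            int opens = 0, closes = 0;
            boolean hasFinally = false;
            for (String line : Files.readAllLines(javaFile)) {
                if (line.matches(OPEN_PATTERN)) opens++;   // a resource is opened
                if (line.contains(".close()")) closes++;   // a resource is closed
                if (line.contains("finally")) hasFinally = true;
            }
            return opens > 0 && (closes < opens || !hasFinally);
        }
    }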
5.1.2 Code Clones
Code clones are separate fragments of code that are very similar. They are a common phenomenon in open source systems that have been under development for some time. Code clones make such systems hard to change and maintain, since developers have to locate and update many similar fragments. Fowler [7], for example, argues that code duplicates are bad smells of poor design. Figure 1 shows an example of clone code.

Figure 1: An example of duplicate code extracted from FailTest.java in Ant 1.9.4.

The clone code ratio is defined as follows:

    Clone Code Ratio = (#clone code lines) / LOC

where #clone code lines is the number of cloned lines of code, and LOC is the total lines of code. For example, in the Ant 1.9.4 test code, 6,863 of 43,774 lines are cloned, giving a ratio of 0.16 (Table 1).

To detect clone code, we use PMD's Copy/Paste Detector (CPD) [2] with the minimum tile size set to 10, meaning that CPD does not report clones in fragments shorter than 10 statements. As Figure 1 shows, the clone analysis tool can find fragments that differ in the names of variables and parameters, and in which some statements have been rearranged.
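Since Figure 1 is not reproduced here, the hypothetical pair below, written in the style of the earlier snippets with invented method and file names, illustrates the kind of near-miss clone CPD reports: the two methods differ only in identifiers and in the order of two statements.

    // Hypothetical near-miss clone pair, invented for illustration;
    // CPD reports such fragments even though identifiers differ and
    // two statements are swapped.
    public void testEchoToFile() {
        buildRule.executeTarget("echoToFile");
        File logFile = new File(buildRule.getProject().getBaseDir(), "echo.log");
        assertTrue(logFile.exists());
        assertTrue(logFile.length() > 0);
    }

    public void testAppendToFile() {
        buildRule.executeTarget("appendToFile");
        File outFile = new File(buildRule.getProject().getBaseDir(), "append.log");
        assertTrue(outFile.length() > 0);
        assertTrue(outFile.exists());
    }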
5.2 Insufficiency
Criteria in this category focus on measuring the loss detection ability of the code. To identify such losses, we choose code coverage and bug-fixed coverage as the main criteria.

5.2.1 Code Coverage
Code coverage is the most frequently used metric for test code quality assessment [3]. It describes the degree to which the source code of a project is exercised by a particular test suite. A project with low code coverage has usually been insufficiently tested and therefore has a higher chance of containing bugs. Many tools exist for dynamic code coverage estimation (e.g., Clover and Cobertura for Java, Testwell CTC++ for C++, and NCover for C#). We used Clover [1] to obtain the code coverage metric.

5.2.2 Bug-fixed Coverage
In general, after a bug is fixed, a new unit test should be created and added to the regression test suite to ensure that the bug is not re-introduced in later versions of the project. Bug-fixed coverage measures the degree to which fixed bugs are covered by test code. If a fixed bug is not covered by test cases, it is at high risk of being reopened. In this study, bug-fixed coverage is defined as follows:

    Bug-fixed Coverage = (#tested bugs) / (#fixed bugs)

where #tested bugs is the number of fixed bugs that are tested in the current version of the project, and #fixed bugs is the number of bugs fixed in the current version of the project.
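The metric is a direct ratio; the sketch below simply transcribes the formula as a percentage, using the Maven 3.2.5 numbers from Table 2 (1 tested fixed bug out of 6) as the sample input.

    public class BugFixedCoverage {
        // Bug-fixed Coverage = (#tested bugs) / (#fixed bugs), as a percentage.
        static double bugFixedCoverage(int testedBugs, int fixedBugs) {
            return fixedBugs == 0 ? 0.0 : 100.0 * testedBugs / fixedBugs;
        }

        public static void main(String[] args) {
            // Maven 3.2.5 from Table 2: 1 of 6 fixed bugs is tested.
            System.out.printf("%.2f%%%n", bugFixedCoverage(1, 6)); // prints 16.67%
        }
    }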
To calculate bug-fixed coverage, we first collected the fixed bugs for the corresponding versions of the four open source projects from their release notes; second, we manually identified the fixed bugs that are not tested in the current version. Specifically, the first author collected the untested fixed bugs, after which the second author independently re-collected the data. The results were then merged, and conflicts were resolved by a joint inspection of all three authors.

5.3 Bad Readability
This category consists of test criteria that focus on detecting unreadable code; the clone comment ratio is the main criterion. Comments make code readable for developers and generate code documentation in a predefined format. Clone comments are identical prologue comments attached to different methods; they make the code inconsistent with its documentation and hard to understand. This often happens when developers forget to change a copied comment.

Consider the following real test code from Ant 1.9.4:

    /** Test right use of cache names. */
    @Test
    public void testValidateWrongCache() {
        ...
    }

    /** Test right use of cache names. */
    @Test
    public void testValidateWrongAlgorithm() {
        ...
    }

The logic of the two test cases differs, yet the comments are identical, which may confuse developers trying to understand the test logic of these two cases. We developed a simple tool to detect cloned comments: it scans the code, locates the functions that share a comment, and collects the cloned comments automatically.
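This detector is likewise unpublished; the following minimal sketch shows one plausible way to find duplicated Javadoc comments in a file. The regex-based extraction and whitespace normalization are assumptions of this illustration, not the paper's actual tool.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustration-only clone-comment detector: collects every Javadoc
    // block in a file and keeps only comment texts that occur more than once.
    public class CloneCommentDetector {

        private static final Pattern JAVADOC =
            Pattern.compile("/\\*\\*(.*?)\\*/", Pattern.DOTALL);

        public static Map<String, Integer> findClonedComments(Path javaFile)
                throws IOException {
            String source = new String(Files.readAllBytes(javaFile));
            Map<String, Integer> counts = new HashMap<>();
            Matcher m = JAVADOC.matcher(source);
            while (m.find()) {
                // Normalize whitespace and leading '*' so reformatted copies still match.
                String text = m.group(1).replaceAll("[\\s*]+", " ").trim();
                counts.merge(text, 1, Integer::sum);
            }
            counts.values().removeIf(n -> n < 2); // keep only duplicated comments
            return counts;
        }
    }

Run on the Ant snippet above, both methods would map to the same normalized comment text "Test right use of cache names." and be reported as a clone.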
In this study, the clone comment ratio is defined as follows:

    Clone Comment Ratio = (#clone comment classes) / TC

where #clone comment classes is the number of classes that contain cloned comments, and TC is the total number of classes.

6. EMPIRICAL STUDY
In this section, we present the quality of the test code of the four open source projects according to each of the five criteria.

6.1 Dataset
We explore the quality of the test code of four large-scale and widely used open source projects, i.e., Ant, Maven, Log4j, and Commons Math, using the five criteria. For each project, we extracted the test code and the source code separately from the latest version. Details of these projects are shown in Table 1.

Table 1. Details of Studied Projects in This Work

    Project       Version  Code  TC*   Clone  Clone Com  Unreleased  Unreleased   Clone   LOC     Clone Code
                                       Com#   Ratio#     Res.#       Res. Ratio#  Codes#          Ratio#
    Ant           1.9.4    Test  329    35    0.11        67         0.20          6863    43774  0.16
                           Src   832    25    0.03       101         0.12         15918   183692  0.09
    Maven         3.2.5    Test  175    29    0.17        35         0.20          2816    18581  0.15
                           Src   659    18    0.03        27         0.04          5801    77983  0.07
    Log4j         1.2.17   Test   90    26    0.29        29         0.32          2993    13681  0.22
                           Src   272    28    0.10        33         0.12          4757    41991  0.11
    Commons Math  3.5      Test  570   114    0.20        41         0.07         38986    98626  0.40
                           Src   942   242    0.26        19         0.02         29295    95697  0.31

    * TC is the total number of classes in the test (or source) code; LOC is lines of code.
    # Clone Com is the number of classes with cloned comments, and Clone Com Ratio is that number divided by TC. Unreleased Res. is the number of classes that do not release resources, and Unreleased Res. Ratio is that number divided by TC. Clone Codes is the number of cloned lines of code, and Clone Code Ratio is that number divided by LOC.

6.2 Result Analysis

6.2.1 Unreleased Resource
As shown in Table 1, across the four projects the unreleased resource ratio for test code is up to 32% and 20% on average, while the average ratio for source code is less than 7%. Overall, the unreleased resource ratios in test code are much higher than those in the corresponding source code.

Potential Impact on Software Quality Assurance Techniques: The results indicate that the quality of the source code is much higher than that of the test code. Since test code is widely used for mining API usage examples [10, 16], this can yield bad API usage examples, making API learning confusing for developers.

6.2.2 Code Clone
As shown in Table 1, the clone code ratio for test code is up to 40% and about 23% on average, while the average ratio for source code is about 17%. Overall, the clone code ratios in test code are much higher than those in the corresponding source code.

Potential Impact on Software Quality Assurance Techniques: Cloned test code is harmful to software quality: it increases test maintenance overhead and propagates any pre-existing errors. Cloned test code also tends to cover similar source code, which reduces the discrimination among test cases and may hurt fault localization and test case prioritization.

Table 2. Details of Quality of Test Code

    Project       Version  Code Coverage  Tested FBs#  FBs#  Bug-fixed Coverage (%)#
    Ant           1.9.4    85%            0            15     0.00
    Maven         3.2.5    78%            1             6    16.67
    Log4j         1.2.17   75%            0             5     0.00
    Commons Math  3.5      81%            2             8    25.00

    # FBs is the number of bugs fixed in the current version, and Tested FBs is the number of fixed bugs that are tested in the current version. Bug-fixed Coverage is the ratio of Tested FBs to FBs.

6.2.3 Code Coverage
As shown in Table 2, the code coverage of the four projects varies from 75% to 85%, around 80% on average. Overall, code coverage is high enough to basically cover the source code under test.

6.2.4 Bug-fixed Coverage
As shown in Table 2, the bug-fixed coverage of the studied versions of Ant and Log4j is 0, meaning that none of the 20 bugs fixed in Ant 1.9.4 and Log4j 1.2.17 is tested after being fixed. In Maven and Commons Math, the values of bug-fixed coverage are also quite low: more than 70% of the fixed bugs are not addressed in test code. Overall, bug-fixed coverage is so low that the testing of fixed bugs is insufficient.

Potential Impact on Software Quality Assurance Techniques: Low bug-fixed coverage makes it hard to practice test case prioritization and also hurts the effectiveness of automatic bug repair. Test case prioritization arranges test cases based on code coverage to accelerate bug detection; with low bug-fixed coverage, when untested fixed bugs are re-introduced, the prioritized test cases may fail to reveal them. For automatic bug repair, low bug-fixed coverage makes the evaluation unreliable: automatic bug repair judges a generated patch by running all test cases, and if they all pass, the patch is treated as an effective repair. However, low bug-fixed coverage means that many fixed bugs are not covered by the existing test cases, so even when all test cases pass, these uncovered fixed bugs might be re-introduced.

6.2.5 Comment Clone
As shown in Table 1, the average clone comment ratio for test code is about 20%, while that for source code is less than 10%. Overall, the clone comment ratio in test code is much higher than that in the source code of the four projects. The results indicate that the quality of the source code is much higher than that of the test code in terms of comment clones.

Potential Impact on Software Quality Assurance Techniques: Program comments are important for developers to understand code, and comments that are inconsistent with the code can easily confuse developers and misguide them into introducing bugs in subsequent versions [12]. The high clone comment ratio in test code makes it harder for developers to understand the test logic.

7. THREATS TO VALIDITY

7.1 Internal Validity
In this paper, we study the quality of test code and present five criteria for its measurement. However, other criteria that we have overlooked may also measure the quality of test code.

7.2 External Validity
In this work, we investigate the quality of test code in terms of the five proposed criteria on open source projects. However, our approach may not work as well on closed-source software or on small-scale open source projects. Furthermore, the purpose of this work is to study the impact of low-quality test code on several software quality assurance techniques, but not all projects maintain valid test code; our approach is not applicable to projects without test code.

8. CONCLUSION AND FUTURE WORK
This paper found that it is common for test code in open source projects to be unregulated and of low quality. We studied 1,164 test classes and more than 140,000 LOC of test code from the current versions of four open source projects. The results indicate that the quality of the test code is much lower than that of the corresponding source code in terms of the proposed criteria, e.g., unreleased resources, code clones, and comment clones, and that the coverage of fixed bugs by test code is insufficient.

We further discussed the potential impact of low-quality test code on existing software quality assurance techniques. To the best of our knowledge, this is the first work to report that test code is low-quality and untrustworthy, which should be taken seriously.

Future work. Our pilot study is in its final stage: we have explored the quality of test code in terms of the five proposed criteria and the impact of test code on software quality assurance techniques. In the future, we plan to conduct quantitative and qualitative studies to explore how exactly low-quality test code impacts software quality assurance techniques, i.e., automatic bug repair, fault localization, test case prioritization, and mining API usage.

Acknowledgment
This research was supported in part by the National Natural Science Foundation of China under Grant Nos. 91218302, 91318301, 71101138, and 61303163.

References
[1] Clover. https://www.atlassian.com/software/clover/overview. Accessed April 20, 2015.
[2] PMD's CPD. http://pmd.sourceforge.net/pmd-4.3.0/cpd.html.
[3] D. Athanasiou, A. Nugroho, J. Visser, and A. Zaidman. Test code quality and its relation to issue handling performance. IEEE Transactions on Software Engineering, 40:1100-1125, November 2014.
[4] E. T. Barr, Y. Brun, P. Devanbu, M. Harman, and F. Sarro. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), Hong Kong, 2014.
[5] J. Campos, R. Abreu, G. Fraser, and M. d'Amorim. Entropy-based test generation for improved fault localization. In Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on, pages 257-267, 2013.
[6] S. Elbaum, A. Malishevsky, and G. Rothermel. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159-182, 2002.
[7] M. Fowler. Refactoring: Improving the design of existing code. In Proceedings of the Second XP Universe and First Agile Universe Conference on Extreme Programming and Agile Methods - XP/Agile Universe 2002, page 256, 2002.
[8] M. Ghafari, C. Ghezzi, A. Mocci, and G. Tamburrelli. Mining unit tests for code recommendation. In Proceedings of the 22nd International Conference on Program Comprehension, pages 142-145. ACM, 2014.
[9] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, pages 802-811. IEEE Press, 2013.
[10] S. M. Nasehi and F. Maurer. Unit tests as API usage examples. In Software Maintenance (ICSM), 2010 IEEE International Conference on, pages 1-10. IEEE, 2010.
[11] F. Steimann, M. Frenkel, and R. Abreu. Threats to the validity and value of empirical assessments of the accuracy of coverage-based fault locators. In Proceedings of the 2013 International Symposium on Software Testing and Analysis, pages 314-324. ACM, 2013.
[12] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /* iComment: Bugs or bad comments? */. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP '07), October 2007.
[13] S. H. Tan and A. Roychoudhury. relifix: Automated repair of software regressions. In Proceedings of the 2015 International Conference on Software Engineering. IEEE, 2015.
[14] W. Weimer, T. V. Nguyen, C. L. Goues, and S. Forrest. Automatically finding patches using genetic programming. In Proceedings of the 31st International Conference on Software Engineering, pages 364-374. IEEE Computer Society, 2009.
[15] J. Xuan and M. Monperrus. Test case purification for improving fault localization. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 52-63. ACM, 2014.
[16] Z. Zhu, Y. Zou, B. Xie, Y. Jin, Z. Lin, and L. Zhang. Mining API usage examples from test code. In Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on, pages 301-310. IEEE, 2014.