Test Refactoring: a Research Agenda

Brent van Bladel (brent.vanbladel@uantwerpen.be)
Serge Demeyer (serge.demeyer@uantwerpen.be)
University of Antwerp, Middelheimlaan 1, 2020 Antwerp, Belgium

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
Proceedings of the Seminar Series on Advanced Techniques and Tools for Software Evolution, SATToSE 2017 (sattose.org), 07-09 June 2017, Madrid, Spain.

Abstract

Research on software testing generally focusses on the effectiveness of test suites to detect bugs. The quality of the test code in terms of maintainability remains mostly ignored. However, just like production code, test code can suffer from code smells that imply refactoring opportunities. In this paper, we summarize the state of the art in the field of test refactoring. We show that there is a gap in the tool support, and we propose future work which aims to fill this gap.

1 Introduction

Refactoring is "the process of changing a software system in such a way that it does not alter the external behaviour of the code yet improves its internal structure" [Fow09]. If applied correctly, refactoring improves the design of software, makes software easier to understand, helps to find faults, and helps to develop a program faster [Fow09].

In most organizations, the test code is the final "quality gate" for an application, allowing or denying the move from development to release. With this role comes a large responsibility: the success of an application, and possibly the organization, rests on the quality of the software product [Dus02]. Therefore, it is critical that the test code itself is of high quality. Methods such as code coverage analysis and mutation testing help developers assess the effectiveness of the test suite. Yet, there is no metric or method to measure the quality of the test code in terms of readability and maintainability.

One indication of the quality of test code could be the presence of test smells. Similar to how production code can suffer from code smells, these test-specific smells can indicate problems with the test code in terms of maintainability [VDMvdBK01]. However, refactoring test smells can be tricky, as there is no reliable method to verify whether a refactored test suite preserves its external behaviour. Several studies point out the peculiarities of test code refactoring [VDMvdBK01, VDM02, Pip02, Fow09]. However, none of them provides an operational method to guarantee that such a refactoring preserves the behaviour of the test.

The rest of the paper is organized as follows. In Section 2 we summarize the related work on test smells and test refactoring, which shows test smells to be an important issue. In Section 3 we go over the existing test refactoring tools, showing that there is a gap in the current tool support. In Section 4 we propose future work which aims to fill this gap. In Section 5 we define a theoretical model of test behaviour, which forms the basis of our proposed future work. We conclude in Section 6.

2 Related Work

The term test smell was first introduced by van Deursen et al. in 2001 as a name for any symptom in the test code of a program that possibly indicates a deeper problem. In their paper, they defined a first set of eleven common test smells and a set of specific refactorings which solve those smells [VDMvdBK01]. Meszaros expanded the list of test smells in 2007, making a further distinction between test smells, behaviour smells, and project smells [Mes07]. Greiler et al. defined five new test smells specifically related to test fixtures in 2013 [GvDS13].
Several studies have investigated the impact test smells have on the quality of the code. Van Rompaey et al. performed a case study in 2006 in which they investigated two test smells (General Fixture and Eager Test). They concluded that tests which suffer from these smells have a negative effect on the maintainability of the system [VRDBD06]. In 2012, Bavota et al. performed an experiment with master students in which they studied eight test smells (Mystery Guest, General Fixture, Eager Test, Lazy Test, Assertion Roulette, Indirect Testing, Sensitive Equality, and Test Code Duplication). This study provided the first empirical evidence of the negative impact test smells have on maintainability [BQO+12]. In 2015, they continued their research and performed the experiment with a larger group, containing more students as well as developers from industry. They conclude that test smells represent a potential danger to the maintainability of production code and test suites [BQO+15].

In 2016, Tufano et al. investigated the nature of test smells. They conducted a large-scale empirical study over the commit history of 152 open source projects. They found that test smells affect the project from their creation onwards and that they have a very high survivability. This shows the importance of identifying test smells early, preferably in the IDE before the commit. They also performed a survey with 19 developers which looked into their perception of test smells and design issues. They showed that developers are not able to identify the presence of test smells in their code, nor do developers perceive them as actual design problems. This highlights the importance of investing effort in the development of tools to identify and refactor test smells [TPB+16].

3 Tool Support

Test Smell Detection

There are many tools that can automatically detect code smells, for example the JDeodorant Eclipse plugin and the inFusion tool [FMM+11]. Test smells, however, are very different from code smells, and these tools are not able to detect them. Tool support for handling test smells and refactoring test code is limited.

In 2008, Breugelmans et al. presented TestQ, a tool which can statically detect and visualize 12 test smells [BVR08]. TestQ enables developers to quickly identify test smell hot spots, indicating which tests need refactoring. However, the lack of integration in development environments and the overall slow performance make TestQ unlikely to be useful in rapid code-test-refactor cycles [BVR08].

In 2013, Greiler et al. presented a tool which can automatically detect test smells in fixtures [GvDS13]. Their tool, called TestHound, provides reports on test smells and recommendations for refactoring the smelly test code. They performed a case study in which developers were asked to use the tool and were interviewed afterwards. They show that developers find that the tool helps them to understand, reflect on, and adjust test code. However, their tool is limited to smells related to test fixtures. Furthermore, they only report the occurrences of the different fixture-related test smells in the code; they do not give one single metric that represents the overall quality of the test code. During the interviews, one developer said that the different smells should be integrated into one high-level metric: "This would give us an overall assessment, so that if you make some improvements you should see it in the metric." [GvDS13]

Defining Test Behaviour

Refactoring of the production code can be done with little risk using the test suite as a safeguard. Since there is no such safeguard when refactoring test code, there is a need for tool support that can verify whether a refactored test suite preserves its behaviour between the pre- and post-refactoring versions. Previous research on this topic was performed by Parsai et al. in 2015 [PMSD15]. They propose the use of mutation testing to verify the test behaviour. However, mutation testing requires the test suite to be run for each mutant, which can be hundreds of times, making it unlikely to be useful in practice. Furthermore, while mutation testing gives an indication of the test behaviour, it cannot fully guarantee that the behaviour is preserved.
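To illustrate the idea behind this mutation-testing safeguard, the sketch below compares the set of mutants killed by the test suite before and after a refactoring. This is only an illustration of the general idea, not the tool of Parsai et al.: the representation of a mutation-testing run as a set of killed-mutant identifiers and the function name are our own assumptions.

    #include <set>
    #include <string>

    // Hypothetical representation: each mutant is identified by a string id, and a
    // mutation-testing run yields the set of mutants killed by the test suite.
    using KilledMutants = std::set<std::string>;

    // The refactored suite is only *likely* behaviour preserving when it kills
    // exactly the same mutants as the original suite. Equal sets are a necessary
    // condition, not a sufficient one.
    bool sameMutantsKilled(const KilledMutants& before, const KilledMutants& after) {
        return before == after;
    }

Even when the two sets are identical, behaviour preservation is not guaranteed, which is exactly the limitation noted above; moreover, obtaining the two sets requires a full mutation analysis of both versions.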
4 Research Plan

As we have shown, there is a lack of tool support when it comes to test refactoring. We plan to create a tool that will help developers during this process. We present our future work in terms of a research agenda.

Test Smell Detection

• Objective - Create a tool that is able to detect test smells. More specifically, the tool should be able to detect all test smells defined by van Deursen, Meszaros, and Greiler [VDMvdBK01, Mes07, GvDS13]. This tool should also be able to compute a metric that represents the overall quality of the test code in terms of maintainability.

• Approach - Breugelmans et al. proposed methods for detecting all the original test smells (defined by van Deursen et al.) [BVR08]. We will use these methods in our tool. For the other test smells (defined by Meszaros and by Greiler et al.), we will use a similar approach in order to define detection methods ourselves. The metric that represents the overall quality of the test code can be calculated based on the number of test smells present in the test code.

• Validation - Verification of correctness will be done using a dataset consisting of a set of real open-source software projects. We can compare the tool with TestHound for fixture-related test smells and with TestQ for the other test smells. Smells not covered by either TestHound or TestQ will require manual verification.

Defining Test Behaviour

• Objective - Define test behaviour such that developers can verify whether the test code preserves its behaviour between the pre- and post-refactoring versions.

• Approach - The production code should be deterministic, and thus the same set of inputs should always result in the same set of outputs. We will analyse the code in order to map all entry and exit points from test code to production code and link them with the corresponding assertions. This will result in the construction of a Test Behaviour Tree (TBT), which defines the behaviour of the test. Comparison of TBTs will allow for validating behaviour preservation between the pre- and post-refactoring versions. Section 5 explains this concept in more detail.

• Validation - We will run the algorithm on the dataset of commits used for verifying the test quality metric. We can do an initial check using coverage metrics and mutation testing. When these metrics change between the pre- and post-refactoring versions, we know for certain that the test behaviour changed. When these metrics remain constant, we will have to manually verify whether the refactoring is behaviour preserving.

5 Theoretical Model for Defining Test Behaviour

In order to determine test behaviour, a Test Behaviour Tree (TBT) can be constructed from the Abstract Syntax Tree (AST). This can be done by traversing the AST once. During this pass over the AST, all variables and objects need to be stored with their value. All subsequent operations on variables are then performed on the stored value. If a variable is initialized with a function call to production code, it can be stored as that call. Operations on that variable will then be stored as a sequence of operations. When encountering an assert, a node which represents the assert is added to the TBT. All child nodes of the assert are also added, replacing variables with their stored value.
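To make this construction more tangible, the following is a minimal, self-contained sketch of what a TBT node and the assert-handling step of the single AST pass could look like. It is an illustration rather than our prototype: the names TBTNode, Environment, resolve, and addAssert are our own, and operands are simplified to plain strings where a real implementation would work on AST subtrees.

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // A TBT node either carries a concrete value (variables are replaced by their
    // stored value) or names a recorded call into production code ('black box'),
    // with operands, arguments, and receiver as children.
    struct TBTNode {
        std::string label;   // e.g. "assert", "==", "5", "setWidth"
        std::vector<std::shared_ptr<TBTNode>> children;
    };

    // The single pass keeps an environment: for every test variable, the TBT
    // subtree it currently evaluates to (a value or a recorded call sequence).
    using Environment = std::map<std::string, std::shared_ptr<TBTNode>>;

    // Variables are replaced by their stored subtree; literals are kept as-is.
    std::shared_ptr<TBTNode> resolve(const std::string& operand, const Environment& env) {
        auto it = env.find(operand);
        if (it != env.end()) return it->second;
        return std::make_shared<TBTNode>(TBTNode{operand, {}});
    }

    // Every assert(lhs == rhs) becomes an assert node under the TBT root, with
    // the comparison and its resolved operands as children.
    void addAssert(TBTNode& root, const std::string& lhs, const std::string& rhs,
                   const Environment& env) {
        auto cmp = std::make_shared<TBTNode>(
            TBTNode{"==", {resolve(lhs, env), resolve(rhs, env)}});
        root.children.push_back(std::make_shared<TBTNode>(TBTNode{"assert", {cmp}}));
    }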
Running Example

As an example to illustrate the approach, we use the following simple production code:

    class Rectangle {
    public:
        Rectangle();
        int getHeigth();
        int getWidth();
        void setHeigth(int h);
        void setWidth(int w);
    private:
        int heigth;
        int width;
    };

    Rectangle::Rectangle() {}
    int Rectangle::getHeigth() { return heigth; }
    int Rectangle::getWidth() { return width; }
    void Rectangle::setHeigth(int h) { heigth = h; }
    void Rectangle::setWidth(int w) { width = w; }

It defines a class Rectangle which has two private data members heigth and width, as well as getters and setters for these data members. Note that even though this is a toy example, there is no technical difference between simple getters and setters and large algorithmic functions, as the production code is considered a 'black box'. There would be no difference if the getters did some advanced mathematical calculations, read from a file, or contacted a networked database.

We will start with a simple test for this production code:

    Rectangle r = Rectangle();
    r.setWidth(5);
    r.setHeigth(10);
    assert(5 == r.getWidth());
    assert(10 == r.getHeigth());

This test will result in the Test Behaviour Tree shown in Figure 1. As shown, the TBT has one root node which has a child for every assert. Each assert node has the full comparison as a child, where variables are replaced with their value. Since the call on the rectangle object is considered a call to production code, the sequence of operations is appended as a child rather than a single value, because we consider production code as a 'black box'. We can safely assume this, since the production code should be deterministic (otherwise you could not write tests for it) and should not change when refactoring test code.

[Figure 1: The Test Behaviour Tree from the example tests.]
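Given two such trees, one built from the test before refactoring and one after, behaviour preservation can be checked by a structural comparison. The sketch below reuses the hypothetical TBTNode type from the previous listing and is again only an illustration of the idea:

    #include <cstddef>

    // Two TBTs describe the same test behaviour when they are structurally equal:
    // the same labels, the same number of children, and pairwise equal children.
    bool sameBehaviour(const TBTNode& a, const TBTNode& b) {
        if (a.label != b.label) return false;
        if (a.children.size() != b.children.size()) return false;
        for (std::size_t i = 0; i < a.children.size(); ++i) {
            if (!sameBehaviour(*a.children[i], *b.children[i])) return false;
        }
        return true;
    }

Whether the order of the assert children should matter (reordering independent asserts arguably preserves behaviour) is a design decision left open here.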
Variable Refactorings

One way to refactor this test would be to replace the 'magic numbers' in the test with variables. This would greatly increase maintainability, as consistency between input and expected output would be guaranteed. Because variables are replaced with their value in our approach, the following refactored test code will result in the exact same TBT:

    int x = 5;
    int y = 10;
    Rectangle r = Rectangle();
    r.setWidth(x);
    r.setHeigth(y);
    assert(x == r.getWidth());
    assert(y == r.getHeigth());

Similarly, the common refactoring where a variable is renamed can be performed without changing the TBT. The following code also generates the same TBT:

    int testWidth = 5;
    int testHeigth = 10;
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());

These refactorings did not change behaviour, which is why we get the same resulting TBT. If you were to change the value of testWidth or testHeigth, the behaviour of the test would change, as you would be testing different input-output pairs. This change in behaviour would be easily detected by our approach, as the values in the TBT would change accordingly, resulting in a different TBT.

Expression Refactorings

Detecting a change in input-output pairs is more important when the test code contains some arithmetic operations. Sometimes it is necessary to make a calculation in the test code to use as an oracle. When it comes to these kinds of expressions in the AST, it is possible to simply evaluate them during traversal of the AST. The values of all variables are stored up to that point in the program, and the result can be stored as the new value for the corresponding variable. Therefore, the following code still generates the same TBT, as the behaviour did not change: the values for testWidth and testHeigth still evaluate to 5 and 10 respectively (note that writing a test this way would be bad practice; we use it here simply to showcase the approach):

    int testWidth = 1;
    testWidth++;
    int testHeigth = (testWidth * 2) + (testWidth * 3);
    testWidth = testWidth + 3;
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());
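The expression handling described above amounts to constant folding over the stored variable values during the AST pass. The following minimal sketch folds the assignments of the listing above; the IntEnvironment map and the function name are illustrative assumptions, not part of our prototype:

    #include <map>
    #include <string>

    // Concrete values of the test variables at the current point of the AST pass.
    using IntEnvironment = std::map<std::string, int>;

    // Every assignment is evaluated against the stored values and the result
    // becomes the variable's new value, so the asserts still compare against
    // 5 and 10 and the resulting TBT is unchanged.
    void foldExample(IntEnvironment& env) {
        env["testWidth"] = 1;
        env["testWidth"] = env["testWidth"] + 1;                              // testWidth++
        env["testHeigth"] = (env["testWidth"] * 2) + (env["testWidth"] * 3);  // 10
        env["testWidth"] = env["testWidth"] + 3;                              // 5
    }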
Function Refactorings

Another common refactoring is to extract part of the test code to a function. As an example, we could define the following functions:

    int setupWidth(int x) {
        return x / 2;
    }

    int setupHeigth(int y) {
        return y * 2;
    }

and rewrite our test to:

    int testWidth = setupWidth(10);
    int testHeigth = setupHeigth(5);
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());

If these functions are marked as part of the production code, they will be treated as 'black box' functions. This is not desirable, since then the TBT would change while behaviour is preserved. Therefore, these functions need to be evaluated similarly to expressions. Again, this is perfectly possible since we have the values of all variables at each point in the program. Upon evaluation, the values for testWidth and testHeigth still result in 5 and 10 respectively, and thus the TBT would be unchanged.

Conditionals and Loops

Up to now, our examples did not contain any conditionals or loops, since they are not desirable in test code. However, sometimes they could appear in test code, in which case they can be evaluated similarly to expressions and function calls. For example, we could define the following function:

    int setupData(int i) {
        if (i == 1) {
            return 5;
        } else {
            if (i == 2) {
                return 5 + 5;
            }
        }
        return 0;
    }

and rewrite our test to:

    int testWidth = setupData(1);
    int testHeigth = setupData(2);
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());

Again, the values for testWidth and testHeigth still evaluate to 5 and 10 respectively, resulting in the same TBT. When conditionals or loops are used in combination with calls to production code, they would be handled similarly to how the testRectangle object is handled: the sequence of operations would be kept, including the conditional or loop, similarly to how it would be represented in AST form.
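The common thread of the last two subsections is a single decision the traversal makes for every call: helpers, conditionals, and loops that live in the test code are evaluated to a concrete value, while calls into production code are kept as 'black box' call nodes. The sketch below illustrates that decision for the running example; the function names and the set-based lookup are assumptions made for illustration only:

    #include <set>
    #include <string>

    // Helpers defined in the test code (setupWidth, setupHeigth, setupData in the
    // listings above) are evaluated during the pass; any other callee is treated
    // as production code and recorded as a 'black box' call node in the TBT.
    bool isTestHelper(const std::string& callee, const std::set<std::string>& testHelpers) {
        return testHelpers.count(callee) > 0;
    }

    // Evaluating the setupData helper on the stored argument values reduces the
    // two calls to the constants 5 and 10, so the resulting TBT is unchanged.
    int evaluateSetupData(int i) {
        if (i == 1) return 5;
        if (i == 2) return 5 + 5;
        return 0;
    }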
6 Conclusion

We have presented an overview of the research done in the field of test smells and test refactoring. Research has indicated that test smells have a negative impact on maintainability and therefore need to be refactored. We have shown that there is a lack of tool support to aid developers with test refactoring. We also provided a theoretical model that defines test behaviour, in the form of Test Behaviour Trees, which can be used to compare test behaviour pre- and post-refactoring. We plan to create a tool for test refactoring which can detect test code smells, evaluate the test quality, and ensure that behaviour is preserved after test refactoring using our theoretical model. We currently have a working prototype for the latter. Our final tool will help developers decide when and where to refactor the test code, as well as help them perform the refactorings correctly, allowing developers to improve their test suite quickly and with confidence.

References

[BQO+12] Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and David Binkley. An empirical analysis of the distribution of unit test smells and their impact on software maintenance. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on, pages 56–65. IEEE, 2012.

[BQO+15] Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and Dave Binkley. Are test smells really harmful? An empirical study. Empirical Software Engineering, 20(4):1052–1094, 2015.

[BVR08] Manuel Breugelmans and Bart Van Rompaey. TestQ: Exploring structural and maintenance characteristics of unit test suites. In WASDeTT-1: 1st International Workshop on Advanced Software Development Tools and Techniques, 2008.

[Dus02] Elfriede Dustin. Effective Software Testing: 50 Ways to Improve Your Software Testing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[FMM+11] Francesca Arcelli Fontana, Elia Mariani, Andrea Mornioli, Raul Sormani, and Alberto Tonello. An experience report on using code smells detection tools. In Software Testing, Verification and Validation Workshops (ICSTW), 2011 IEEE Fourth International Conference on, pages 450–457. IEEE, 2011.

[Fow09] Martin Fowler. Refactoring: Improving the Design of Existing Code. Pearson Education India, 2009.

[GvDS13] Michaela Greiler, Arie van Deursen, and Margaret-Anne Storey. Automated detection of test fixture strategies and smells. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pages 322–331. IEEE, 2013.

[Mes07] Gerard Meszaros. xUnit Test Patterns: Refactoring Test Code. Pearson Education, 2007.

[Pip02] Jens Uwe Pipka. Refactoring in a test first-world. In Proc. Third Intl Conf. eXtreme Programming and Flexible Processes in Software Engineering, 2002.

[PMSD15] Ali Parsai, Alessandro Murgia, Quinten David Soetens, and Serge Demeyer. Mutation testing as a safety net for test code refactoring. In Scientific Workshop Proceedings of the XP2015, page 8. ACM, 2015.

[TPB+16] Michele Tufano, Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrea De Lucia, and Denys Poshyvanyk. An empirical investigation into the nature of test smells. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pages 4–15. ACM, 2016.

[VDM02] Arie Van Deursen and Leon Moonen. The video store revisited: Thoughts on refactoring and testing. In Proc. 3rd Intl Conf. eXtreme Programming and Flexible Processes in Software Engineering, pages 71–76. Citeseer, 2002.

[VDMvdBK01] A. Van Deursen, L. Moonen, A. van den Bergh, and G. Kok. Refactoring test code. In 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2001), pages 92–95. University of Cagliari, 2001.

[VRDBD06] Bart Van Rompaey, Bart Du Bois, and Serge Demeyer. Characterizing the relative significance of a test smell. In 2006 22nd IEEE International Conference on Software Maintenance, pages 391–400. IEEE, 2006.