Test Refactoring: a Research Agenda

Brent van Bladel (brent.vanbladel@uantwerpen.be)
Serge Demeyer (serge.demeyer@uantwerpen.be)
University of Antwerp, Middelheimlaan 1, 2020 Antwerp, Belgium

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
Proceedings of the Seminar Series on Advanced Techniques and Tools for Software Evolution, SATToSE 2017 (sattose.org), 07-09 June 2017, Madrid, Spain.

Abstract

Research on software testing generally focusses on the effectiveness of test suites to detect bugs. The quality of the test code in terms of maintainability remains mostly ignored. However, just like production code, test code can suffer from code smells that imply refactoring opportunities. In this paper, we summarize the state of the art in the field of test refactoring. We show that there is a gap in the tool support, and we propose future work which aims to fill this gap.

1 Introduction

Refactoring is "the process of changing a software system in such a way that it does not alter the external behaviour of the code yet improves its internal structure" [Fow09]. If applied correctly, refactoring improves the design of software, makes software easier to understand, helps to find faults, and helps to develop a program faster [Fow09].

In most organizations, the test code is the final "quality gate" for an application, allowing or denying the move from development to release. With this role comes a large responsibility: the success of an application, and possibly the organization, rests on the quality of the software product [Dus02]. Therefore, it is critical that the test code itself is of high quality. Methods such as code coverage analysis and mutation testing help developers assess the effectiveness of the test suite. Yet, there is no metric or method to measure the quality of the test code in terms of readability and maintainability.

One indication of the quality of test code could be the presence of test smells. Similar to how production code can suffer from code smells, these test-specific smells can indicate problems with the test code in terms of maintainability [VDMvdBK01]. However, refactoring test smells can be tricky, as there is no reliable method to verify whether a refactored test suite preserves its external behaviour. Several studies point out the peculiarities of test code refactoring [VDMvdBK01, VDM02, Pip02, Fow09]. However, none of them provides an operational method to guarantee that such a refactoring preserves the behaviour of the test.

The rest of the paper is organized as follows. In Section 2 we summarize the related work on test smells and test refactoring, which shows test smells to be an important issue. In Section 3 we go over the existing test refactoring tools, showing that there is a gap in the current tool support. In Section 4 we propose future work which aims to fill this gap. In Section 5 we define a theoretical model of test behaviour, which forms the basis of our proposed future work. We conclude in Section 6.

2 Related Work

The term test smell was first introduced by van Deursen et al. in 2001 as a name for any symptom in the test code of a program that possibly indicates a deeper problem. In their paper, they defined a first set of eleven common test smells and a set of specific refactorings which solve those smells [VDMvdBK01]. Meszaros expanded the list of test smells in 2007, making a further distinction between test smells, behaviour smells, and project smells [Mes07]. Greiler et al. defined five new test smells specifically related to test fixtures in 2013 [GvDS13].
Several studies have investigated the impact test smells have on the quality of the code. Van Rompaey et al. performed a case study in 2006 in which they investigated two test smells (General Fixture and Eager Test). They concluded that tests which suffer from these smells have a negative effect on the maintainability of the system [VRDBD06]. In 2012, Bavota et al. performed an experiment with master students in which they studied eight test smells (Mystery Guest, General Fixture, Eager Test, Lazy Test, Assertion Roulette, Indirect Testing, Sensitive Equality, and Test Code Duplication). This study provided the first empirical evidence of the negative impact test smells have on maintainability [BQO+12]. In 2015, they continued their research and performed the experiment with a larger group, containing more students as well as developers from industry. They conclude that test smells represent a potential danger to the maintainability of production code and test suites [BQO+15].

In 2016, Tufano et al. investigated the nature of test smells. They conducted a large-scale empirical study over the commit history of 152 open source projects. They found that test smells affect the project from their creation onwards and that they have a very high survivability. This shows the importance of identifying test smells early, preferably in the IDE before the commit. They also performed a survey with 19 developers which looked into their perception of test smells and design issues. They showed that developers are not able to identify the presence of test smells in their code, nor do developers perceive them as actual design problems. This highlights the importance of investing effort in the development of tools to identify and refactor test smells [TPB+16].

3 Tool Support

Test Smell Detection

There are many tools that can automatically detect code smells, for example the JDeodorant Eclipse plugin and the inFusion tool [FMM+11]. Test smells, however, are very different from code smells, and these tools are not able to detect them. Tool support for handling test smells and refactoring test code is limited.

In 2008, Breugelmans et al. presented TestQ, a tool which can statically detect and visualize 12 test smells [BVR08]. TestQ enables developers to quickly identify test smell hot spots, indicating which tests need refactoring. However, the lack of integration in development environments and the overall slow performance make TestQ unlikely to be useful in rapid code-test-refactor cycles [BVR08].

In 2013, Greiler et al. presented a tool which can automatically detect test smells in fixtures [GvDS13]. Their tool, called TestHound, provides reports on test smells and recommendations for refactoring the smelly test code. They performed a case study in which developers were asked to use the tool and were interviewed afterwards. They show that developers find that the tool helps them to understand, reflect on, and adjust test code. However, their tool is limited to smells related to test fixtures. Furthermore, they only report the occurrences of the different fixture-related test smells in the code; they do not give one single metric that represents the overall quality of the test code. During the interviews, one developer said that the different smells should be integrated into one high-level metric: "This would give us an overall assessment, so that if you make some improvements you should see it in the metric." [GvDS13]

Defining Test Behaviour

Refactoring of the production code can be done with little risk using the test suite as a safeguard. Since there is no such safeguard when refactoring test code, there is a need for tool support that can verify whether a refactored test suite preserves its behaviour between the pre- and post-refactoring versions. Previous research on this topic was performed by Parsai et al. in 2015 [PMSD15]. They propose the use of mutation testing to verify the test behaviour. However, mutation testing requires the test suite to be run for each mutant, which can be hundreds of times, making it unlikely to be useful in practice. Furthermore, while mutation testing gives an indication of the test behaviour, it cannot fully guarantee that the behaviour is preserved.
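To illustrate the idea behind this mutation-testing safeguard, the sketch below compares the set of mutants killed by the test suite before and after a refactoring. This is only an illustration of the general idea, not the tool of Parsai et al.: the representation of a mutation-testing run as a set of killed-mutant identifiers and the function name are our own assumptions.

    #include <set>
    #include <string>

    // Hypothetical representation: each mutant is identified by a string id, and a
    // mutation-testing run yields the set of mutants killed by the test suite.
    using KilledMutants = std::set<std::string>;

    // The refactored suite is only *likely* behaviour preserving when it kills
    // exactly the same mutants as the original suite. Equal sets are a necessary
    // condition, not a sufficient one.
    bool sameMutantsKilled(const KilledMutants& before, const KilledMutants& after) {
        return before == after;
    }

Even when the two sets are identical, behaviour preservation is not guaranteed, which is exactly the limitation noted above; moreover, obtaining the two sets requires a full mutation analysis of both versions.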
4 Research Plan

As we have shown, there is a lack of tool support when it comes to test refactoring. We plan to create a tool that will help developers during this process. We present our future work in terms of a research agenda.

Test Smell Detection

• Objective - Create a tool that is able to detect test smells. More specifically, the tool should be able to detect all test smells defined by van Deursen, Meszaros, and Greiler [VDMvdBK01, Mes07, GvDS13]. This tool should also be able to compute a metric that represents the overall quality of the test code in terms of maintainability.

• Approach - Breugelmans et al. proposed methods for detecting all the original test smells (defined by van Deursen et al.) [BVR08]. We will use these methods in our tool. For the other test smells (defined by Meszaros and by Greiler et al.), we will use a similar approach in order to define detection methods ourselves. The metric that represents the overall quality of the test code can be calculated based on the number of test smells present in the test code.

• Validation - Verification of correctness will be done using a dataset consisting of a set of real open-source software projects. We can compare the tool with TestHound for fixture-related test smells and with TestQ for the other test smells. Smells not covered by either TestHound or TestQ will require manual verification.

Defining Test Behaviour

• Objective - Define test behaviour such that developers can verify whether the test code preserves its behaviour between the pre- and post-refactoring versions.

• Approach - The production code should be deterministic, and thus the same set of inputs should always result in the same set of outputs. We will analyse the code in order to map all entry and exit points from test code to production code and link them with the corresponding assertions. This will result in the construction of a Test Behaviour Tree (TBT), which defines the behaviour of the test. Comparison of TBTs will allow for validating behaviour preservation between the pre- and post-refactoring versions. Section 5 explains this concept in more detail.

• Validation - We will run the algorithm on the dataset of commits used for verifying the test quality metric. We can do an initial check using coverage metrics and mutation testing. When these metrics change between the pre- and post-refactoring versions, we know for certain that the test behaviour changed. When these metrics remain constant, we will have to manually verify whether the refactoring is behaviour preserving.

5 Theoretical Model for Defining Test Behaviour

In order to determine test behaviour, a Test Behaviour Tree (TBT) can be constructed from the Abstract Syntax Tree (AST). This can be done by traversing the AST once. During this pass over the AST, all variables and objects need to be stored with their value. All subsequent operations on variables are then performed on the stored value. If a variable is initialized with a function call to production code, it can be stored as that call. Operations on that variable will then be stored as a sequence of operations. When encountering an assert, a node which represents the assert is added to the TBT. All child nodes of the assert are also added, replacing variables with their stored value.
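To make this construction more tangible, the following is a minimal, self-contained sketch of what a TBT node and the assert-handling step of the single AST pass could look like. It is an illustration rather than our prototype: the names TBTNode, Environment, resolve, and addAssert are our own, and operands are simplified to plain strings where a real implementation would work on AST subtrees.

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // A TBT node either carries a concrete value (variables are replaced by their
    // stored value) or names a recorded call into production code ('black box'),
    // with operands, arguments, and receiver as children.
    struct TBTNode {
        std::string label;   // e.g. "assert", "==", "5", "setWidth"
        std::vector<std::shared_ptr<TBTNode>> children;
    };

    // The single pass keeps an environment: for every test variable, the TBT
    // subtree it currently evaluates to (a value or a recorded call sequence).
    using Environment = std::map<std::string, std::shared_ptr<TBTNode>>;

    // Variables are replaced by their stored subtree; literals are kept as-is.
    std::shared_ptr<TBTNode> resolve(const std::string& operand, const Environment& env) {
        auto it = env.find(operand);
        if (it != env.end()) return it->second;
        return std::make_shared<TBTNode>(TBTNode{operand, {}});
    }

    // Every assert(lhs == rhs) becomes an assert node under the TBT root, with
    // the comparison and its resolved operands as children.
    void addAssert(TBTNode& root, const std::string& lhs, const std::string& rhs,
                   const Environment& env) {
        auto cmp = std::make_shared<TBTNode>(
            TBTNode{"==", {resolve(lhs, env), resolve(rhs, env)}});
        root.children.push_back(std::make_shared<TBTNode>(TBTNode{"assert", {cmp}}));
    }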
Running Example

As an example to illustrate the approach, we use the following simple production code:

    class Rectangle {
    public:
        Rectangle();
        int getHeigth();
        int getWidth();
        void setHeigth(int h);
        void setWidth(int w);
    private:
        int heigth;
        int width;
    };

    Rectangle::Rectangle() {}
    int Rectangle::getHeigth() { return heigth; }
    int Rectangle::getWidth() { return width; }
    void Rectangle::setHeigth(int h) { heigth = h; }
    void Rectangle::setWidth(int w) { width = w; }

It defines a class Rectangle which has two private data members heigth and width, as well as getters and setters for these data members. Note that even though this is a toy example, there is no technical difference between simple getters and setters and large algorithmic functions, as the production code is considered a 'black box'. There would be no difference if the getters did some advanced mathematical calculations, read from a file, or contacted a networked database.

We will start with a simple test for this production code:

    Rectangle r = Rectangle();
    r.setWidth(5);
    r.setHeigth(10);
    assert(5 == r.getWidth());
    assert(10 == r.getHeigth());

This test will result in the Test Behaviour Tree shown in Figure 1. As shown, the TBT has one root node which has a child for every assert. Each assert node has the full comparison as a child, where variables are replaced with their value. Since the call on the rectangle object is considered a call to production code, the sequence of operations is appended as a child rather than a single value, because we consider production code as a 'black box'. We can safely assume this, since the production code should be deterministic (otherwise you could not write tests for it) and should not change when refactoring test code.

[Figure 1: The Test Behaviour Tree from the example tests.]
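Given two such trees, one built from the test before refactoring and one after, behaviour preservation can be checked by a structural comparison. The sketch below reuses the hypothetical TBTNode type from the previous listing and is again only an illustration of the idea:

    #include <cstddef>

    // Two TBTs describe the same test behaviour when they are structurally equal:
    // the same labels, the same number of children, and pairwise equal children.
    bool sameBehaviour(const TBTNode& a, const TBTNode& b) {
        if (a.label != b.label) return false;
        if (a.children.size() != b.children.size()) return false;
        for (std::size_t i = 0; i < a.children.size(); ++i) {
            if (!sameBehaviour(*a.children[i], *b.children[i])) return false;
        }
        return true;
    }

Whether the order of the assert children should matter (reordering independent asserts arguably preserves behaviour) is a design decision left open here.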
Variable Refactorings

One way to refactor this test would be to replace the 'magic numbers' in the test with variables. This would greatly increase maintainability, as consistency between input and expected output would be guaranteed. Because variables are replaced with their value in our approach, the following refactored test code will result in the exact same TBT:

    int x = 5;
    int y = 10;
    Rectangle r = Rectangle();
    r.setWidth(x);
    r.setHeigth(y);
    assert(x == r.getWidth());
    assert(y == r.getHeigth());

Similarly, the common refactoring where a variable is renamed can be performed without changing the TBT. The following code also generates the same TBT:

    int testWidth = 5;
    int testHeigth = 10;
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());

These refactorings did not change behaviour, which is why we get the same resulting TBT. If you were to change the value of testWidth or testHeigth, the behaviour of the test would change, as you would be testing different input-output pairs. This change in behaviour would be easily detected by our approach, as the values in the TBT would change accordingly, resulting in a different TBT.

Expression Refactorings

Detecting a change in input-output pairs is more important when the test code contains some arithmetic operations. Sometimes it is necessary to make a calculation in the test code to use as an oracle. When it comes to these kinds of expressions in the AST, it is possible to simply evaluate them during traversal of the AST. The values of all variables are stored up to that point in the program, and the result can be stored as the new value for the corresponding variable. Therefore, the following code still generates the same TBT, as the behaviour did not change: the values for testWidth and testHeigth still evaluate to 5 and 10 respectively (note that writing a test this way would be bad practice; we use it here simply to showcase the approach):

    int testWidth = 1;
    testWidth++;
    int testHeigth = (testWidth * 2) + (testWidth * 3);
    testWidth = testWidth + 3;
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());
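The expression handling described above amounts to constant folding over the stored variable values during the AST pass. The following minimal sketch folds the assignments of the listing above; the IntEnvironment map and the function name are illustrative assumptions, not part of our prototype:

    #include <map>
    #include <string>

    // Concrete values of the test variables at the current point of the AST pass.
    using IntEnvironment = std::map<std::string, int>;

    // Every assignment is evaluated against the stored values and the result
    // becomes the variable's new value, so the asserts still compare against
    // 5 and 10 and the resulting TBT is unchanged.
    void foldExample(IntEnvironment& env) {
        env["testWidth"] = 1;
        env["testWidth"] = env["testWidth"] + 1;                              // testWidth++
        env["testHeigth"] = (env["testWidth"] * 2) + (env["testWidth"] * 3);  // 10
        env["testWidth"] = env["testWidth"] + 3;                              // 5
    }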
Function Refactorings

Another common refactoring is to extract part of the test code to a function. As an example, we could define the following functions:

    int setupWidth(int x) {
        return x / 2;
    }

    int setupHeigth(int y) {
        return y * 2;
    }

and rewrite our test to:

    int testWidth = setupWidth(10);
    int testHeigth = setupHeigth(5);
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());

If these functions are marked as part of the production code, they will be treated as 'black box' functions. This is not desirable, since then the TBT would change while behaviour is preserved. Therefore, these functions need to be evaluated similarly to expressions. Again, this is perfectly possible since we have the values of all variables at each point in the program. Upon evaluation, the values for testWidth and testHeigth still result in 5 and 10 respectively, and thus the TBT would be unchanged.

Conditionals and Loops

Up to now, our examples did not contain any conditionals or loops, since they are not desirable in test code. However, sometimes they could appear in test code, in which case they can be evaluated similarly to expressions and function calls. For example, we could define the following function:

    int setupData(int i) {
        if (i == 1) {
            return 5;
        } else {
            if (i == 2) {
                return 5 + 5;
            }
        }
        return 0;
    }

and rewrite our test to:

    int testWidth = setupData(1);
    int testHeigth = setupData(2);
    Rectangle testRectangle = Rectangle();
    testRectangle.setWidth(testWidth);
    testRectangle.setHeigth(testHeigth);
    assert(testWidth == testRectangle.getWidth());
    assert(testHeigth == testRectangle.getHeigth());

Again, the values for testWidth and testHeigth still evaluate to 5 and 10 respectively, resulting in the same TBT. When conditionals or loops are used in combination with calls to production code, they would be handled similarly to how the testRectangle object is handled: the sequence of operations would be kept, including the conditional or loop, similarly to how it would be represented in AST form.
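The common thread of the last two subsections is a single decision the traversal makes for every call: helpers, conditionals, and loops that live in the test code are evaluated to a concrete value, while calls into production code are kept as 'black box' call nodes. The sketch below illustrates that decision for the running example; the function names and the set-based lookup are assumptions made for illustration only:

    #include <set>
    #include <string>

    // Helpers defined in the test code (setupWidth, setupHeigth, setupData in the
    // listings above) are evaluated during the pass; any other callee is treated
    // as production code and recorded as a 'black box' call node in the TBT.
    bool isTestHelper(const std::string& callee, const std::set<std::string>& testHelpers) {
        return testHelpers.count(callee) > 0;
    }

    // Evaluating the setupData helper on the stored argument values reduces the
    // two calls to the constants 5 and 10, so the resulting TBT is unchanged.
    int evaluateSetupData(int i) {
        if (i == 1) return 5;
        if (i == 2) return 5 + 5;
        return 0;
    }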
6 Conclusion

We have presented an overview of the research done in the field of test smells and test refactoring. Research has indicated that test smells have a negative impact on maintainability and therefore need to be refactored. We have shown that there is a lack of tool support to aid developers with test refactoring. We also provided a theoretical model that defines test behaviour, in the form of Test Behaviour Trees, which can be used to compare test behaviour pre- and post-refactoring. We plan to create a tool for test refactoring which can detect test code smells, evaluate the test quality, and ensure that behaviour is preserved after test refactoring using our theoretical model. We currently have a working prototype for the latter. Our final tool will help developers decide when and where to refactor the test code, as well as help them perform the refactorings correctly, allowing developers to improve their test suite quickly and with confidence.

References

[BQO+12] Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and David Binkley. An empirical analysis of the distribution of unit test smells and their impact on software maintenance. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on, pages 56–65. IEEE, 2012.

[BQO+15] Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and Dave Binkley. Are test smells really harmful? An empirical study. Empirical Software Engineering, 20(4):1052–1094, 2015.

[BVR08] Manuel Breugelmans and Bart Van Rompaey. TestQ: Exploring structural and maintenance characteristics of unit test suites. In WASDeTT-1: 1st International Workshop on Advanced Software Development Tools and Techniques, 2008.

[Dus02] Elfriede Dustin. Effective Software Testing: 50 Ways to Improve Your Software Testing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[FMM+11] Francesca Arcelli Fontana, Elia Mariani, Andrea Mornioli, Raul Sormani, and Alberto Tonello. An experience report on using code smells detection tools. In Software Testing, Verification and Validation Workshops (ICSTW), 2011 IEEE Fourth International Conference on, pages 450–457. IEEE, 2011.

[Fow09] Martin Fowler. Refactoring: Improving the Design of Existing Code. Pearson Education India, 2009.

[GvDS13] Michaela Greiler, Arie van Deursen, and Margaret-Anne Storey. Automated detection of test fixture strategies and smells. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pages 322–331. IEEE, 2013.

[Mes07] Gerard Meszaros. xUnit Test Patterns: Refactoring Test Code. Pearson Education, 2007.

[Pip02] Jens Uwe Pipka. Refactoring in a test first-world. In Proc. Third Intl Conf. eXtreme Programming and Flexible Processes in Software Engineering, 2002.

[PMSD15] Ali Parsai, Alessandro Murgia, Quinten David Soetens, and Serge Demeyer. Mutation testing as a safety net for test code refactoring. In Scientific Workshop Proceedings of the XP2015, page 8. ACM, 2015.

[TPB+16] Michele Tufano, Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrea De Lucia, and Denys Poshyvanyk. An empirical investigation into the nature of test smells. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pages 4–15. ACM, 2016.

[VDM02] Arie Van Deursen and Leon Moonen. The video store revisited: Thoughts on refactoring and testing. In Proc. 3rd Intl Conf. eXtreme Programming and Flexible Processes in Software Engineering, pages 71–76. Citeseer, 2002.

[VDMvdBK01] A. Van Deursen, L. Moonen, A. van den Bergh, and G. Kok. Refactoring test code. In 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2001), pages 92–95. University of Cagliari, 2001.

[VRDBD06] Bart Van Rompaey, Bart Du Bois, and Serge Demeyer. Characterizing the relative significance of a test smell. In 2006 22nd IEEE International Conference on Software Maintenance, pages 391–400. IEEE, 2006.