Clone Detection vs. Pattern Mining: The Battle Céline Deknop and Kim Mens UCLouvain Louvain-la-Neuve, Belgium {kim.mens, celine.deknop}@uclouvain.be Simon Baars and Ana Oprescu Johan Fabry University of Amsterdam Raincode Labs Amsterdam, The Netherlands Brussels, Belgium a.m.oprescu@uva.nl and simon.mailadres@gmail.com johan@raincode.com other occurrences to be changed as well [2]. Also, du- plicated code has been shown to add up to 25% of Abstract total system volume [3], which entails more code to be maintained. Unfortunately, due to the “tyranny of the In this paper we compare two approaches to dominant decomposition” [4], redundant crosscutting discover recurrent fragments in source code: code fragments cannot always be avoided. Neverthe- clone detection and frequent subtree min- less, it remains essential to discover them, in order to ing. We apply both approaches to a medium- better maintain, understand or evolve the software. sized Java case and compare qualitatively and Several techniques have been proposed to detect quantitatively their results in terms of what recurrent code fragments. This paper compares two types of code fragments are detected, as well such approaches: clone detection and pattern min- as their size, relevance, coverage, and level of ing. Clone detection entails that fragments of code detail. We conclude that both approaches are are compared to each other line by line or statement complementary, while existing overlap may be by statement, in order to find similar fragments, often used for cross-validation of the approaches. with certain groups of tokens being excluded from the Index terms— clone detection, pattern mining, match. Pattern mining, in particular frequent subtree frequent subtree mining, code clones, type 3 clones, mining, is the problem of finding all subtrees occurring duplicate code. frequently in other trees, in our case abstract syntax trees (ASTs), with a support that is above a given 1 The Battle threshold. Since clone detection is more lightweight than pat- Recurrent code fragments are often considered as tern mining, a relevant question is whether pattern symptoms of bad design [1]. They create implicit mining is worth the additional effort, by providing dependencies, thus increasing maintenance efforts or alternative, richer or larger patterns, or whether it causing bugs in evolving software. Changing one oc- is largely redundant with respect to clone detection. currence of such a duplicated fragment may require To perform this comparison, we conduct a case study where we compare the approaches from two angles: Copyright © by the paper’s authors. Use permitted under Cre- ative Commons License Attribution 4.0 International (CC BY to what extent do mined patterns correspond to de- 4.0). tected clones, and how well do clones match to mined In: D. Di Nucci, C. De Roover (eds.): Proceedings of the 18th patterns? Belgium-Netherlands Software Evolution Workshop, Brussels, For each direction we investigate what code frag- Belgium, 28-11-2019, published at http://ceur-ws.org Copyright 2019 for this paper by its authors. Use permitted ments are found by one approach that are not found under Creative Commons License Attribution 4.0 International by the other. For those fragments found by both ap- (CC BY 4.0). proaches we compare their coverage, size, and level of 1 detail. This analysis allows us to make interesting ob- clones are found in method bodies, other clones were servations regarding the similarity and differences of found in constructors or exceed the boundaries of a both approaches, and to draw conclusions regarding single method. Detecting these results in the analysed the strengths of either approach. JHotDraw system took 38.7 seconds on a Macbook Air (this is relatively fast compared to pattern mining). 2 The Arena 3.2 Pattern mining To compare both approaches, we apply them to a medium-sized Java project: JHotDraw [5]. We se- Our pattern mining tool is based on an extension of lected JHotDraw version 7.5.1, which is part of the the existing FREQT tree mining algorithm [13]. As Qualitas Corpus [6], a curated collection of open- input it takes an abstract syntax tree (AST) repre- source Java software systems meant to be used for sentation of the source code, meaning that a mined empirical studies of code artefacts [7]. JHotDraw is pattern is an AST fragment that occurs frequently in a two-dimensional graphics framework for structured the codebase. One of FREQT’s limitations is that it drawing editors. It consists of 428 files and is known to tends to find too many patterns to be practical, and make good use of design patterns [8]. Previous stud- that many of them remain quite small. In order to be ies have used earlier versions of JHotDraw to look for useful for mining code fragments in large codebases, recurrent code regularities [9, 10]. All this makes us we adapted the original FREQT algorithm with some confident that it is an interesting case on which to con- dedicated constraints and with an additional step to duct our comparison. try to grow the patterns found as large as possible [14]. More specifically, we use maximal frequent subtree mining to ensure that a condensed representation of 3 The fighters large patterns is found, and we add the following ad- We now explain both approaches, their corresponding ditional constraints to the mining process: tools and used configuration, and provide some raw C0 minimum support: for the experiment presented data and statistics on their results when applying them here, we used a value of 5; to JHotDraw. C1 maximum size of the pattern: 4; C2 minimum size of the pattern: 2; 3.1 Clone Detection C3 limit the set of labels allowed to occur in the root Code cloning is an active field of study: many detec- of patterns: Type Declaration and Blocks; tion techniques and tools were proposed [11]. Different C4 forbid some labels to occur in patterns: Javadoc, clone types allow a different granularity of variance be- annotations; tween cloned fragments [12]. C5 limit the number of siblings in a pattern that can Type 1 clones allow variance in structure and whites- have the same label: 10; pace only. C6 all leaf nodes in a pattern must have a label that Type 2 clones allow variance in identifier names. can occur as a leaf node in the AST; Type 3 clones allow complete statements to vary. C7 discovered patterns may not miss any mandatory We identify two useful concepts regarding code labels. clones [12]: The threshold values of the different constraints Clone instance: A single cloned fragment. above were determined experimentally by applying the Clone class: A set of similar clone instances, referred tool on several Java cases and manually analysing the to hereafter as a clone. quality of the patterns mined. Since mining is quite We detect clones using our CloneRefactor tool1 . This computationally intensive, in order for our mining al- tool supports several clone type definitions, but for gorithm to finish within reasonable memory and time this study we only considered type 3 clones. We used bounds, we have to split the codebase on which we the following detection settings for our comparison: work into separate folds, and run the miner on one Minimum number of lines cloned: 3 fold at a time.2 For JHotDraw, we created 4 folds, Minimum clone class size: 5 and found 156 patterns in total for the configuration above. The size of the AST fragments of the mined With these settings we detected 136 clone classes patterns varied between 15 and 207 nodes, with an each having 9.2 instances on average. Each of these average size of 36. The mining took 33 minutes on a instances span 6.8 lines on average. This makes up middle grade tower PC. We noticed, not surprisingly, for a total of 8.559 out of 39.403 lines cloned (i.e., 2 This does imply that some patterns that occur across folds 21.7% of the system is cloned). About half of these may not be found, if their frequency within a single fold does 1 Available on GitHub: https://github.com/SimonBaars/ not surpass the minimum support threshold. The less and larger CloneRefactor the folds, the less this problem occurs. 2 that the bigger the pattern size, the more interesting node in the pattern and the end line that of the last it tends to be. AST node in the pattern. To compare the patterns and clones at these loca- 4 The Fight tions we define a similarity metric for clones and pat- terns. For each pattern, we search for the clone with We now describe how we compared both approaches the highest similarity and vice versa by comparing each and then report on the results of our comparison. clone instance of each clone class to each pattern in- stance of each pattern. We then use the following for- 4.1 Methodology mula to determine the similarity percentage between We compared the results of both approaches through a clone instance and pattern instance: (1) manual comparison: the pattern mining team ex- 2 ∗ match(pi , ci ) haustively went over all clones to look for possible similarity(pi , ci ) = (1) matching patterns and (2) automated comparison: the size(pi ) + size(ci ) clone detection team formalized an automated method where pi is a pattern instance, ci is a clone instance, to compare clones and patterns. match is the function that yields the number of match- ing lines between two instances and size is the function 4.1.1 Manual comparison that yields the number of lines in a code fragment. If Per clone class we manually investigated 3 a pattern and a clone do not intersect, the result of match will be 0, meaning no similarity at all. Given a • the overlap and coverage (in #classes, #LOC, clone instance, we compute the instance similarity by #occurrences/ class) when (at least) one pattern seeking the pattern instance that maximizes the sim- matches it; ilarity function (so we find the pattern instance most similar to a clone instance): • if there was a matching pattern, we also looked for similar patterns occurring multiple times; instanceSim(p, ci ) = max({similarity(pi , ci ) | pi ∈ p}) (2) • the relevance/usefulness/richness of the pattern where p is a set of pattern instances. Next, we cal- for a software developer (this is a subjective cri- culate the similarity at the clone class level by sum- terion), in terms of a rating -/0/+. ming the instance similarities of all instances in a clone class:   instanceSim(p, ci ) 4.1.2 Automated comparison classSim(p, c) = sum( ∗ size(ci ) ci ∈ c ) size(p) + size(c) Using a script4 , we find for each pattern the clone class (3) that is the most similar and vice versa. These results where c is a clone class. We compute which of the help analysing to what extent one approach is redun- C=136 clone classes is most similar to a pattern p dant over the other: if a large percentage of clones as the maximum similarity across all clone classes: closely resembles most patterns (or vice versa) this would be the case. collectionSim(p, C) = max({classSim(p, c) | c ∈ C}) In the automated comparison we first determine the (4) locations of all 156 patterns and 136 clones. These lo- Finally, we collect per pattern the most similar clone cations consist of the file and range of each instance class: in a pattern/clone. For clones, this range is simply overallSim(P, C) = {collectionSim(p, C) | p ∈ P } the begin and end line of a code fragment that exists (5) elsewhere. To simplify the comparison, we chose to where P is the set of all 156 identified patterns. Using compare on a line-level and not take into account the this method, we can find which pattern is most simi- begin and end column of ranges. For patterns, each lar to a clone and vice versa (by using the respective separate AST node that belongs to the pattern (high- symmetric relations). lighted in orange in Fig. 3) has its own range (line and columns). For simplicity, we consider the range of a 4.2 Results of the manual analysis pattern instance to be the begin line of the first AST 3 You 4.2.1 Clone instances with no matching pat- can find the full manual analysis here: https: //github.com/CelineDknp/CloneVSPatternAnalysis/blob/ tern master/CloneVSPatternAnalysis.txt 4 Source code available at https://github.com/SimonBaars/ The first thing the pattern team noticed when explor- CloneRefactor/tree/master/src/main/java/com/simonbaars/ ing the clones was that out of 90 clones for which clonerefactor/scripts/intimals there was no matching pattern, 42 occurred in less 3 50% Clone → Pattern 43% 40% 40% Pattern → Clone 30% 20% 10% 10% 10% 11% 9% 10% 5% 6% 7% 6% 7% 5% 6% 5% 3% 4% 4% 3% 2% 1% 3% 0% 0% 0% 0 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100 100 Figure 1: Percentage of clones/patterns that have a certain percentage of instances intersecting with each other. The x-axis shows the percentage of clones or patterns. The y-axis shows the percentage of instances that intersect with each other. Clone → Pattern 50% 43% Pattern → Clone 40% 40% 30% 20% 14% 16% 17% 13% 13% 10% 8% 7% 8% 5% 4% 4% 1% 1% 1% 1% 1% 1% 1% 0% 0% 0% 0 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100 Figure 2: Distribution of similarity percentages for all clones and patterns. The x-axis shows the percentage of clones or patterns. The y-axis shows the similarity as defined in Sec. 4.1.2. than five different code files. For example, the fol- of their approach and is investigating how to improve lowing code snippet occurs exactly 6 times in the their algorithm to work on larger data sets to avoid this JMDIDesktopPane file, but was not found as a pattern problem.) The following code snippet is an example by the miner. of such a code clone, repeated exactly 9 times in 6 different files, of which 1 file in fold 1, 2 files in fold try { 2 and 3 files in fold 4. So it was not discovered as a ((JInternalFrame)allFrames[i]) pattern in either of those folds. .setMaximum(false); } catch (PropertyVetoException e) { protected void generateLookupTables() { e.printStackTrace(); radials = new float[w * h]; } angulars = new float[w * h]; alphas = new int[w * h]; float radius = getRadius(); We quickly realised that this behavior is caused by float blend = (radius + 2f) / radius - 1f; the C0 constraint (minimum support), which considers //clone ends here patterns occurring in at least five different code files rather than having five different occurrences. This is a result of the way the miner was configured to have a manageable amount of patterns, and concluded that it might not have been the best fit for this comparison. The 35 other clones for which there was no matching For future comparisons with code clone detection tools pattern are either too small to fit the constraints of that often find multiple clone instances in a given file, the miner, or cannot really be considered as a pattern. we will need to change the configuration to avoid this For example, some clones consist of the end of one issue. function along with the start of another one, like in the following snippet. A second reason why some clones are not matched by patterns, results from the fact that for the mining //Start of function not in clone process to finish correctly, we need to divide the code isClampRGB = b; base into smaller folds. But if there is a clone that } spans over 5 files but that are not all in the same fold, the pattern may still be discarded because it would public boolean isClampRGBValues() { not have a high enough support value for either fold. return isClampRGB; We found 13 clones that did not match a pattern for //End of function not in clone this reason. (The mining team knew this limitation 4 4.2.2 Matching clone instances and patterns 4.3 Results of the automated analysis Using the automated analysis method described in When analysing clone instances that did match mined Section 4.1.2 we collected interesting statistical data patterns, the pattern mining team observed that we on the similarity of the clones and patterns found by tend to have two types of matches. both tools. A first type is where the pattern is broader, so that When looking into the percentage of clones and pat- it includes multiple clones. For example, Fig. 3 shows terns that intersect each other (see Fig. 1), we observe pattern 46 of fold 4, which actually matches 4 different that 40% of clones do not match any patterns.5 For clone classes, one for each case block. 11% of the clones, all clone instances intersect with the pattern instances. 43% of the patterns do not intersect with any clones. For 4% of the patterns, all pattern instances intersect with the clone instances. When manually inspecting the 11% of clone classes of which all instances intersect with patterns, we find that most of the clone instances in the clone classes are contained within a pattern. Patterns are often larger because they allow more variance in a fragment of code. Because of that, patterns often capture ad- ditional information surrounding a clone. This is why there is a large difference between clone to pattern and pattern to clone at 100% intersecting instances (see Fig. 1). Fig. 2 shows how many clones and patterns are sim- ilar to each other. The same percentages of clones/- patterns that do not intersect can be found here as 0% similar. However, here we see the categories gradually decreasing towards 100% (no clone/pattern combina- tion is actually 100% similar). By far, most clone/- pattern combinations are less than 50% similar: 92% Figure 3: Pattern 46 of fold 4 of clones and 96% of patterns. This leaves only 8% of clones and 4% of patterns with a similarity higher A second type of match is when we have multiple than 50%. patterns for a single clone class. In such cases, the clone usually corresponds to a specific part of multiple 5 The outcome patterns that are similar, or to related patterns across Before presenting our analysis of these results, we em- folds. For example, the following code snippet occurs phasize that this experiment was only an initial com- in our patterns 60, 78, 111 and 115 of fold 4. parison between the two approaches. Further valida- tion on other case studies, with improved settings, and @Override by researchers other than the original tool creators, are public void transform(AffineTransform tx) { Point2D.Double anchor = getStartPoint(); required. Nevertheless, the results obtained already Point2D.Double lead = getEndPoint(); allowed us to reach some interesting insights. setBounds( (Point2D.Double) tx.transform(anchor, 5.1 Manual inspection anchor), In cases where both approaches found similar code (Point2D.Double) tx.transform(lead, fragments, the overlap was not always complete. lead)); Sometimes one approach (often code cloning) found } more fragments than the other, or (typically pattern mining) found larger or richer code fragments. Com- Finally, using our rating of how relevant a devel- bining both approaches to complement each other’s oper considers a pattern, we observed that out of a results would be an interesting research direction. total of 46 clones that where matched by a pattern, 5 This percentage would very likely become significantly 41 were considered interesting, leading us to conclude larger if we would resolve the issue we encountered with the that many of the results found by both methods can C0 constraint; requiring the patterns to occur in different code be considered as useful to developers. files. 5 5.2 Automated comparison 2014 Joint Conference of the International Work- shop on Software Measurement and the Interna- The results of the automated comparison show that, tional Conference on Software Process and Prod- although most clones share some relation with some uct Measurement, pages 32–37. IEEE Computer patterns, there are only very few clones where this re- Society, 2014. lation is particularly strong. Often, patterns are found in completely different parts than where clones are [3] Magiel Bruntink, Arie Van Deursen, Remco found, or they briefly intersect instead of matching Van Engelen, and Tom Tourwé. On the use of completely. clone detection for identifying crosscutting con- This shows that clone detection does, for the largest cern code. IEEE Transactions on Software Engi- part, find different results than pattern mining. This neering, 31(10):804–818, 2005. indicates that both approaches are not redundant over one another, instead complementing each other. [4] Peri Tarr, Harold Ossher, William Harrison, and Stanley M. Sutton. N degrees of separation: 6 Conclusion Multi-dimensional separation of concerns. In Proceedings of the 1999 International Conference As recurrent code fragments are often considered on Software Engineering, pages 107–119. IEEE, symptoms of bad design, several detection techniques 1999. have been proposed. Comparative studies across emerging techniques could shed further light into the [5] Erich Gamma and Thomas Eggenschwiler. Jhot- trade-offs between time complexity and quality of re- draw, 2004. sults. We set out to compare two such approaches: clone detection and pattern mining, since clone detec- [6] Ewan Tempero. Qualitas Corpus, 2013. tion seems more lightweight and a relevant question is [http://qualitascorpus.com/; accessed 31- whether pattern mining is worth the additional effort. October-2019]. Our automated comparison method involves several [7] Jolita Savolskyte. Review of the jhotdraw frame- levels of data aggregation and is symmetric with re- work, 2004. Technical University Hamburg- spect to the comparison direction: patterns to clone Harburg. classes and vice versa. Our findings indicate that the two approaches are [8] Henrik Bærbak Christensen. Frameworks: rather complementary. About half of the clones/pat- Putting design patterns into perspective. In terns share no relation with each other. In the cases Proceedings of the 9th Annual SIGCSE Confer- that clone detection and pattern mining are not 100% ence on Innovation and Technology in Computer similar but do intersect, often the pattern is larger Science Education, ITiCSE ’04, pages 142–145. than the clone. This is because patterns allow for more ACM, 2004. structural variance in recurrent fragments than clone detection. The main reason that sometimes clones are [9] Andy Kellens, Kim Mens, and Paolo Tonella. larger than patterns is that patterns seem constrained A survey of automated code-level aspect mining to a single subtree, whereas clones can span several techniques. In Awais Rashid and Mehmet Aksit, methods and even exceed class boundaries if they are editors, Transactions on Aspect-Oriented Soft- all similar. ware Development IV, pages 143–162. Springer, 2007. Acknowledgments [10] Angela Lozano, Andy Kellens, Kim Mens, and Part of this work was conducted in the context of Gabriela Arevalo. Mining source code for struc- an industry-university research project, funded by the tural regularities. In Proceedings of the 2010 Belgian Innoviris TeamUp project INTiMALS (2017- 17th Working Conference on Reverse Engineer- TEAM-UP-7). ing, pages 22–31. IEEE Computer Society, 2010. [11] Jeffrey Svajlenko and Chanchal K Roy. Evaluat- References ing modern clone detection tools. In 2014 IEEE [1] Martin Fowler. Refactoring: Improving the design International Conference on Software Mainte- of existing code. Addison-Wesley, second edition, nance and Evolution, pages 321–330. IEEE, 2014. 2018. [12] Chanchal Kumar Roy and James R Cordy. A sur- [2] Jan-Peter Ostberg and Stefan Wagner. On auto- vey on software clone detection research. Queen’s matically collectable metrics for software main- School of Computing TR, 541(115):64–68, 2007. tainability evaluation. In Proceedings of the 6 [13] Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Johan Fabry, and Vadim Zaytsev. Mining pat- Arimura, Hiroshi Sakamoto, and Arikawa Setsuo. terns in source code using tree mining algorithms. Efficient substructure discovery from large semi- In Petra Kralj Novak, Tomislav Šmuc, and Sašo structured data. IEICE Transactions on Infor- Džeroski, editors, Discovery Science, pages 471– mation and Systems, 04 2002. 480, Cham, 2019. Springer International Publish- ing. [14] Hoang Son Pham, Siegfried Nijssen, Kim Mens, Dario Di Nucci, Tim Molderez, Coen De Roover, 7