Clone Detection vs. Pattern Mining: The Battle

                                          Céline Deknop and Kim Mens
                                                    UCLouvain
                                           Louvain-la-Neuve, Belgium
                                     {kim.mens, celine.deknop}@uclouvain.be

                    Simon Baars and Ana Oprescu                                                Johan Fabry
                      University of Amsterdam                                                 Raincode Labs
                     Amsterdam, The Netherlands                                              Brussels, Belgium
          a.m.oprescu@uva.nl and simon.mailadres@gmail.com                                 johan@raincode.com


                                                                      other occurrences to be changed as well [2]. Also, du-
                                                                      plicated code has been shown to add up to 25% of
                        Abstract                                      total system volume [3], which entails more code to be
                                                                      maintained. Unfortunately, due to the “tyranny of the
    In this paper we compare two approaches to                        dominant decomposition” [4], redundant crosscutting
    discover recurrent fragments in source code:                      code fragments cannot always be avoided. Neverthe-
    clone detection and frequent subtree min-                         less, it remains essential to discover them, in order to
    ing. We apply both approaches to a medium-                        better maintain, understand or evolve the software.
    sized Java case and compare qualitatively and
                                                                         Several techniques have been proposed to detect
    quantitatively their results in terms of what
                                                                      recurrent code fragments. This paper compares two
    types of code fragments are detected, as well
                                                                      such approaches: clone detection and pattern min-
    as their size, relevance, coverage, and level of
                                                                      ing. Clone detection entails that fragments of code
    detail. We conclude that both approaches are
                                                                      are compared to each other line by line or statement
    complementary, while existing overlap may be
                                                                      by statement, in order to find similar fragments, often
    used for cross-validation of the approaches.
                                                                      with certain groups of tokens being excluded from the
   Index terms— clone detection, pattern mining,                      match. Pattern mining, in particular frequent subtree
frequent subtree mining, code clones, type 3 clones,                  mining, is the problem of finding all subtrees occurring
duplicate code.                                                       frequently in other trees, in our case abstract syntax
                                                                      trees (ASTs), with a support that is above a given
1    The Battle                                                       threshold.
                                                                         Since clone detection is more lightweight than pat-
Recurrent code fragments are often considered as                      tern mining, a relevant question is whether pattern
symptoms of bad design [1]. They create implicit                      mining is worth the additional effort, by providing
dependencies, thus increasing maintenance efforts or                  alternative, richer or larger patterns, or whether it
causing bugs in evolving software. Changing one oc-                   is largely redundant with respect to clone detection.
currence of such a duplicated fragment may require                    To perform this comparison, we conduct a case study
                                                                      where we compare the approaches from two angles:
Copyright © by the paper’s authors. Use permitted under Cre-
ative Commons License Attribution 4.0 International (CC BY            to what extent do mined patterns correspond to de-
4.0).                                                                 tected clones, and how well do clones match to mined
In: D. Di Nucci, C. De Roover (eds.): Proceedings of the 18th         patterns?
Belgium-Netherlands Software Evolution Workshop, Brussels,               For each direction we investigate what code frag-
Belgium, 28-11-2019, published at http://ceur-ws.org
    Copyright 2019 for this paper by its authors. Use permitted
                                                                      ments are found by one approach that are not found
under Creative Commons License Attribution 4.0 International          by the other. For those fragments found by both ap-
(CC BY 4.0).                                                          proaches we compare their coverage, size, and level of


                                                                  1
detail. This analysis allows us to make interesting ob-         clones are found in method bodies, other clones were
servations regarding the similarity and differences of          found in constructors or exceed the boundaries of a
both approaches, and to draw conclusions regarding              single method. Detecting these results in the analysed
the strengths of either approach.                               JHotDraw system took 38.7 seconds on a Macbook Air
                                                                (this is relatively fast compared to pattern mining).
2     The Arena
                                                                3.2   Pattern mining
To compare both approaches, we apply them to a
medium-sized Java project: JHotDraw [5]. We se-                 Our pattern mining tool is based on an extension of
lected JHotDraw version 7.5.1, which is part of the             the existing FREQT tree mining algorithm [13]. As
Qualitas Corpus [6], a curated collection of open-              input it takes an abstract syntax tree (AST) repre-
source Java software systems meant to be used for               sentation of the source code, meaning that a mined
empirical studies of code artefacts [7]. JHotDraw is            pattern is an AST fragment that occurs frequently in
a two-dimensional graphics framework for structured             the codebase. One of FREQT’s limitations is that it
drawing editors. It consists of 428 files and is known to       tends to find too many patterns to be practical, and
make good use of design patterns [8]. Previous stud-            that many of them remain quite small. In order to be
ies have used earlier versions of JHotDraw to look for          useful for mining code fragments in large codebases,
recurrent code regularities [9, 10]. All this makes us          we adapted the original FREQT algorithm with some
confident that it is an interesting case on which to con-       dedicated constraints and with an additional step to
duct our comparison.                                            try to grow the patterns found as large as possible [14].
                                                                More specifically, we use maximal frequent subtree
                                                                mining to ensure that a condensed representation of
3     The fighters
                                                                large patterns is found, and we add the following ad-
We now explain both approaches, their corresponding             ditional constraints to the mining process:
tools and used configuration, and provide some raw              C0 minimum support: for the experiment presented
data and statistics on their results when applying them         here, we used a value of 5;
to JHotDraw.                                                    C1 maximum size of the pattern: 4;
                                                                C2 minimum size of the pattern: 2;
3.1     Clone Detection                                         C3 limit the set of labels allowed to occur in the root
Code cloning is an active field of study: many detec-           of patterns: Type Declaration and Blocks;
tion techniques and tools were proposed [11]. Different         C4 forbid some labels to occur in patterns: Javadoc,
clone types allow a different granularity of variance be-       annotations;
tween cloned fragments [12].                                    C5 limit the number of siblings in a pattern that can
Type 1 clones allow variance in structure and whites-           have the same label: 10;
pace only.                                                      C6 all leaf nodes in a pattern must have a label that
Type 2 clones allow variance in identifier names.               can occur as a leaf node in the AST;
Type 3 clones allow complete statements to vary.                C7 discovered patterns may not miss any mandatory
We identify two useful concepts regarding code                  labels.
clones [12]:                                                       The threshold values of the different constraints
Clone instance: A single cloned fragment.                       above were determined experimentally by applying the
Clone class: A set of similar clone instances, referred         tool on several Java cases and manually analysing the
to hereafter as a clone.                                        quality of the patterns mined. Since mining is quite
We detect clones using our CloneRefactor tool1 . This           computationally intensive, in order for our mining al-
tool supports several clone type definitions, but for           gorithm to finish within reasonable memory and time
this study we only considered type 3 clones. We used            bounds, we have to split the codebase on which we
the following detection settings for our comparison:            work into separate folds, and run the miner on one
Minimum number of lines cloned: 3                               fold at a time.2 For JHotDraw, we created 4 folds,
Minimum clone class size: 5                                     and found 156 patterns in total for the configuration
                                                                above. The size of the AST fragments of the mined
   With these settings we detected 136 clone classes
                                                                patterns varied between 15 and 207 nodes, with an
each having 9.2 instances on average. Each of these
                                                                average size of 36. The mining took 33 minutes on a
instances span 6.8 lines on average. This makes up
                                                                middle grade tower PC. We noticed, not surprisingly,
for a total of 8.559 out of 39.403 lines cloned (i.e.,
                                                                  2 This does imply that some patterns that occur across folds
21.7% of the system is cloned). About half of these
                                                                may not be found, if their frequency within a single fold does
    1 Available
             on GitHub:   https://github.com/SimonBaars/        not surpass the minimum support threshold. The less and larger
CloneRefactor                                                   the folds, the less this problem occurs.


                                                            2
that the bigger the pattern size, the more interesting             node in the pattern and the end line that of the last
it tends to be.                                                    AST node in the pattern.
                                                                      To compare the patterns and clones at these loca-
4     The Fight                                                    tions we define a similarity metric for clones and pat-
                                                                   terns. For each pattern, we search for the clone with
We now describe how we compared both approaches                    the highest similarity and vice versa by comparing each
and then report on the results of our comparison.                  clone instance of each clone class to each pattern in-
                                                                   stance of each pattern. We then use the following for-
4.1     Methodology                                                mula to determine the similarity percentage between
We compared the results of both approaches through                 a clone instance and pattern instance:
(1) manual comparison: the pattern mining team ex-
                                                                                                      2 ∗ match(pi , ci )
haustively went over all clones to look for possible                        similarity(pi , ci ) =                            (1)
matching patterns and (2) automated comparison: the                                                  size(pi ) + size(ci )
clone detection team formalized an automated method                where pi is a pattern instance, ci is a clone instance,
to compare clones and patterns.                                    match is the function that yields the number of match-
                                                                   ing lines between two instances and size is the function
4.1.1       Manual comparison                                      that yields the number of lines in a code fragment. If
Per clone class we manually investigated 3                         a pattern and a clone do not intersect, the result of
                                                                   match will be 0, meaning no similarity at all. Given a
    • the overlap and coverage (in #classes, #LOC,                 clone instance, we compute the instance similarity by
      #occurrences/ class) when (at least) one pattern             seeking the pattern instance that maximizes the sim-
      matches it;                                                  ilarity function (so we find the pattern instance most
                                                                   similar to a clone instance):
    • if there was a matching pattern, we also looked
      for similar patterns occurring multiple times;               instanceSim(p, ci ) = max({similarity(pi , ci ) | pi ∈ p})
                                                                                                                          (2)
    • the relevance/usefulness/richness of the pattern             where p is a set of pattern instances. Next, we cal-
      for a software developer (this is a subjective cri-          culate the similarity at the clone class level by sum-
      terion), in terms of a rating -/0/+.                         ming the instance similarities of all instances in a clone
                                                                   class:
                                                                                                                               
                                                                                          instanceSim(p, ci )
4.1.2       Automated comparison                                   classSim(p, c) = sum(                      ∗ size(ci ) ci ∈ c )
                                                                                           size(p) + size(c)
Using a script4 , we find for each pattern the clone class                                                                     (3)
that is the most similar and vice versa. These results              where c is a clone class. We compute which of the
help analysing to what extent one approach is redun-               C=136 clone classes is most similar to a pattern p
dant over the other: if a large percentage of clones               as the maximum similarity across all clone classes:
closely resembles most patterns (or vice versa) this
would be the case.                                                 collectionSim(p, C) = max({classSim(p, c) | c ∈ C})
   In the automated comparison we first determine the                                                                 (4)
locations of all 156 patterns and 136 clones. These lo-            Finally, we collect per pattern the most similar clone
cations consist of the file and range of each instance             class:
in a pattern/clone. For clones, this range is simply
                                                                     overallSim(P, C) = {collectionSim(p, C) | p ∈ P }
the begin and end line of a code fragment that exists
                                                                                                                        (5)
elsewhere. To simplify the comparison, we chose to
                                                                   where P is the set of all 156 identified patterns. Using
compare on a line-level and not take into account the
                                                                   this method, we can find which pattern is most simi-
begin and end column of ranges. For patterns, each
                                                                   lar to a clone and vice versa (by using the respective
separate AST node that belongs to the pattern (high-
                                                                   symmetric relations).
lighted in orange in Fig. 3) has its own range (line and
columns). For simplicity, we consider the range of a
                                                                   4.2     Results of the manual analysis
pattern instance to be the begin line of the first AST
    3 You
                                                                   4.2.1    Clone instances with no matching pat-
          can find the full manual analysis here: https:
//github.com/CelineDknp/CloneVSPatternAnalysis/blob/                        tern
master/CloneVSPatternAnalysis.txt
   4 Source code available at https://github.com/SimonBaars/       The first thing the pattern team noticed when explor-
CloneRefactor/tree/master/src/main/java/com/simonbaars/            ing the clones was that out of 90 clones for which
clonerefactor/scripts/intimals                                     there was no matching pattern, 42 occurred in less


                                                               3
50%                                                                                                                                                       Clone → Pattern
             43%
       40%
40%                                                                                                                                                       Pattern → Clone

30%

20%
                                    10%                      10%                           10%                                                                      11%
                                                                                                                                                9%
10%                5%                        6%   7%                              6%              7%        5%   6%                       5%
                        3%     4%                                                                                                                                         4%
                                                                   3%        2%                                            1%   3%
                                                                                                                                                      0%     0%
 0%
         0         0-10        10-20         20-30           30-40           40-50             50-60        60-70          70-80          80-90       90-100          100


Figure 1: Percentage of clones/patterns that have a certain percentage of instances intersecting with each other.
The x-axis shows the percentage of clones or patterns. The y-axis shows the percentage of instances that intersect
with each other.

                                                                                                                                                            Clone → Pattern
50%
             43%                                                                                                                                            Pattern → Clone
       40%
40%

30%

20%                 14% 16%
                                       17%
                                13%               13%
10%                                                     8%         7%   8%
                                                                                     5%   4%           4%
                                                                                                            1%        1%   1%        1%    1%        1%                   1%
                                                                                                                                                           0%       0%
 0%
             0          0-10        10-20         20-30             30-40              40-50           50-60          60-70          70-80           80-90          90-100

Figure 2: Distribution of similarity percentages for all clones and patterns. The x-axis shows the percentage of
clones or patterns. The y-axis shows the similarity as defined in Sec. 4.1.2.
than five different code files. For example, the fol-       of their approach and is investigating how to improve
lowing code snippet occurs exactly 6 times in the           their algorithm to work on larger data sets to avoid this
JMDIDesktopPane file, but was not found as a pattern        problem.) The following code snippet is an example
by the miner.                                               of such a code clone, repeated exactly 9 times in 6
                                                            different files, of which 1 file in fold 1, 2 files in fold
try {                                                       2 and 3 files in fold 4. So it was not discovered as a
    ((JInternalFrame)allFrames[i])                          pattern in either of those folds.
                                .setMaximum(false);
} catch (PropertyVetoException e) {                                                       protected void generateLookupTables() {
    e.printStackTrace();                                                                      radials = new float[w * h];
}                                                                                             angulars = new float[w * h];
                                                                                              alphas = new int[w * h];
                                                                                              float radius = getRadius();
   We quickly realised that this behavior is caused by
                                                                                              float blend = (radius + 2f) / radius - 1f;
the C0 constraint (minimum support), which considers
                                                                                              //clone ends here
patterns occurring in at least five different code files
rather than having five different occurrences. This is
a result of the way the miner was configured to have a
manageable amount of patterns, and concluded that it
might not have been the best fit for this comparison.                                        The 35 other clones for which there was no matching
For future comparisons with code clone detection tools                                    pattern are either too small to fit the constraints of
that often find multiple clone instances in a given file,                                 the miner, or cannot really be considered as a pattern.
we will need to change the configuration to avoid this                                    For example, some clones consist of the end of one
issue.                                                                                    function along with the start of another one, like in
                                                                                          the following snippet.
   A second reason why some clones are not matched
by patterns, results from the fact that for the mining                                    //Start of function not in clone
process to finish correctly, we need to divide the code                                    isClampRGB = b;
base into smaller folds. But if there is a clone that                                     }
spans over 5 files but that are not all in the same fold,
the pattern may still be discarded because it would                                       public boolean isClampRGBValues() {
not have a high enough support value for either fold.                                         return isClampRGB;
We found 13 clones that did not match a pattern for                                       //End of function not in clone
this reason. (The mining team knew this limitation


                                                                                     4
4.2.2   Matching clone instances and patterns                   4.3     Results of the automated analysis
                                                                Using the automated analysis method described in
When analysing clone instances that did match mined             Section 4.1.2 we collected interesting statistical data
patterns, the pattern mining team observed that we              on the similarity of the clones and patterns found by
tend to have two types of matches.                              both tools.
    A first type is where the pattern is broader, so that          When looking into the percentage of clones and pat-
it includes multiple clones. For example, Fig. 3 shows          terns that intersect each other (see Fig. 1), we observe
pattern 46 of fold 4, which actually matches 4 different        that 40% of clones do not match any patterns.5 For
clone classes, one for each case block.                         11% of the clones, all clone instances intersect with the
                                                                pattern instances. 43% of the patterns do not intersect
                                                                with any clones. For 4% of the patterns, all pattern
                                                                instances intersect with the clone instances.
                                                                   When manually inspecting the 11% of clone classes
                                                                of which all instances intersect with patterns, we find
                                                                that most of the clone instances in the clone classes
                                                                are contained within a pattern. Patterns are often
                                                                larger because they allow more variance in a fragment
                                                                of code. Because of that, patterns often capture ad-
                                                                ditional information surrounding a clone. This is why
                                                                there is a large difference between clone to pattern and
                                                                pattern to clone at 100% intersecting instances (see
                                                                Fig. 1).
                                                                   Fig. 2 shows how many clones and patterns are sim-
                                                                ilar to each other. The same percentages of clones/-
                                                                patterns that do not intersect can be found here as 0%
                                                                similar. However, here we see the categories gradually
                                                                decreasing towards 100% (no clone/pattern combina-
                                                                tion is actually 100% similar). By far, most clone/-
                                                                pattern combinations are less than 50% similar: 92%
            Figure 3: Pattern 46 of fold 4
                                                                of clones and 96% of patterns. This leaves only 8%
                                                                of clones and 4% of patterns with a similarity higher
   A second type of match is when we have multiple              than 50%.
patterns for a single clone class. In such cases, the
clone usually corresponds to a specific part of multiple        5     The outcome
patterns that are similar, or to related patterns across        Before presenting our analysis of these results, we em-
folds. For example, the following code snippet occurs           phasize that this experiment was only an initial com-
in our patterns 60, 78, 111 and 115 of fold 4.                  parison between the two approaches. Further valida-
                                                                tion on other case studies, with improved settings, and
@Override
                                                                by researchers other than the original tool creators, are
public void transform(AffineTransform tx) {
    Point2D.Double anchor = getStartPoint();                    required. Nevertheless, the results obtained already
    Point2D.Double lead = getEndPoint();                        allowed us to reach some interesting insights.
    setBounds(
           (Point2D.Double) tx.transform(anchor,                5.1     Manual inspection
               anchor),                                         In cases where both approaches found similar code
           (Point2D.Double) tx.transform(lead,
                                                                fragments, the overlap was not always complete.
               lead));
                                                                Sometimes one approach (often code cloning) found
}
                                                                more fragments than the other, or (typically pattern
                                                                mining) found larger or richer code fragments. Com-
   Finally, using our rating of how relevant a devel-           bining both approaches to complement each other’s
oper considers a pattern, we observed that out of a             results would be an interesting research direction.
total of 46 clones that where matched by a pattern,                 5 This percentage would very likely become significantly
41 were considered interesting, leading us to conclude          larger if we would resolve the issue we encountered with the
that many of the results found by both methods can              C0 constraint; requiring the patterns to occur in different code
be considered as useful to developers.                          files.


                                                            5
5.2   Automated comparison                                         2014 Joint Conference of the International Work-
                                                                   shop on Software Measurement and the Interna-
The results of the automated comparison show that,
                                                                   tional Conference on Software Process and Prod-
although most clones share some relation with some
                                                                   uct Measurement, pages 32–37. IEEE Computer
patterns, there are only very few clones where this re-
                                                                   Society, 2014.
lation is particularly strong. Often, patterns are found
in completely different parts than where clones are             [3] Magiel Bruntink, Arie Van Deursen, Remco
found, or they briefly intersect instead of matching                Van Engelen, and Tom Tourwé. On the use of
completely.                                                         clone detection for identifying crosscutting con-
   This shows that clone detection does, for the largest            cern code. IEEE Transactions on Software Engi-
part, find different results than pattern mining. This              neering, 31(10):804–818, 2005.
indicates that both approaches are not redundant over
one another, instead complementing each other.                  [4] Peri Tarr, Harold Ossher, William Harrison, and
                                                                    Stanley M. Sutton. N degrees of separation:
6     Conclusion                                                    Multi-dimensional separation of concerns. In
                                                                    Proceedings of the 1999 International Conference
As recurrent code fragments are often considered                    on Software Engineering, pages 107–119. IEEE,
symptoms of bad design, several detection techniques                1999.
have been proposed. Comparative studies across
emerging techniques could shed further light into the           [5] Erich Gamma and Thomas Eggenschwiler. Jhot-
trade-offs between time complexity and quality of re-               draw, 2004.
sults. We set out to compare two such approaches:
clone detection and pattern mining, since clone detec-          [6] Ewan Tempero.          Qualitas Corpus, 2013.
tion seems more lightweight and a relevant question is              [http://qualitascorpus.com/;    accessed  31-
whether pattern mining is worth the additional effort.              October-2019].
Our automated comparison method involves several                [7] Jolita Savolskyte. Review of the jhotdraw frame-
levels of data aggregation and is symmetric with re-                work, 2004.     Technical University Hamburg-
spect to the comparison direction: patterns to clone                Harburg.
classes and vice versa.
   Our findings indicate that the two approaches are            [8] Henrik Bærbak Christensen.        Frameworks:
rather complementary. About half of the clones/pat-                 Putting design patterns into perspective. In
terns share no relation with each other. In the cases               Proceedings of the 9th Annual SIGCSE Confer-
that clone detection and pattern mining are not 100%                ence on Innovation and Technology in Computer
similar but do intersect, often the pattern is larger               Science Education, ITiCSE ’04, pages 142–145.
than the clone. This is because patterns allow for more             ACM, 2004.
structural variance in recurrent fragments than clone
detection. The main reason that sometimes clones are            [9] Andy Kellens, Kim Mens, and Paolo Tonella.
larger than patterns is that patterns seem constrained              A survey of automated code-level aspect mining
to a single subtree, whereas clones can span several                techniques. In Awais Rashid and Mehmet Aksit,
methods and even exceed class boundaries if they are                editors, Transactions on Aspect-Oriented Soft-
all similar.                                                        ware Development IV, pages 143–162. Springer,
                                                                    2007.
Acknowledgments                                                [10] Angela Lozano, Andy Kellens, Kim Mens, and
Part of this work was conducted in the context of                   Gabriela Arevalo. Mining source code for struc-
an industry-university research project, funded by the              tural regularities. In Proceedings of the 2010
Belgian Innoviris TeamUp project INTiMALS (2017-                    17th Working Conference on Reverse Engineer-
TEAM-UP-7).                                                         ing, pages 22–31. IEEE Computer Society, 2010.

                                                               [11] Jeffrey Svajlenko and Chanchal K Roy. Evaluat-
References                                                          ing modern clone detection tools. In 2014 IEEE
 [1] Martin Fowler. Refactoring: Improving the design               International Conference on Software Mainte-
     of existing code. Addison-Wesley, second edition,              nance and Evolution, pages 321–330. IEEE, 2014.
     2018.
                                                               [12] Chanchal Kumar Roy and James R Cordy. A sur-
 [2] Jan-Peter Ostberg and Stefan Wagner. On auto-                  vey on software clone detection research. Queen’s
     matically collectable metrics for software main-               School of Computing TR, 541(115):64–68, 2007.
     tainability evaluation. In Proceedings of the


                                                           6
[13] Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki         Johan Fabry, and Vadim Zaytsev. Mining pat-
     Arimura, Hiroshi Sakamoto, and Arikawa Setsuo.          terns in source code using tree mining algorithms.
     Efficient substructure discovery from large semi-       In Petra Kralj Novak, Tomislav Šmuc, and Sašo
     structured data. IEICE Transactions on Infor-           Džeroski, editors, Discovery Science, pages 471–
     mation and Systems, 04 2002.                            480, Cham, 2019. Springer International Publish-
                                                             ing.
[14] Hoang Son Pham, Siegfried Nijssen, Kim Mens,
     Dario Di Nucci, Tim Molderez, Coen De Roover,


                                                         7