


                                        A Code Clone Curation
                — Towards Scalable and Incremental Clone Detection —
                                       Masayuki Doi, Yoshiki Higo, and Shinji Kusumoto
                         Graduate School of Information Science and Technology, Osaka University, Japan
                                           {m-doi, higo, kusumoto}@ist.osaka-u.ac.jp


Abstract—Code clones have a negative impact on software maintenance. Clone detection on large-scale source code takes a long time and, even worse, such detections occasionally abort because the target is too large. Herein, we consider detecting cross-project code clones from a set of many projects. In such a situation, if we simply use an existing clone detection tool, we need to detect cross-project code clones again from the whole of the projects whenever even a single project is updated. We therefore need a new clone detection scheme that rapidly detects code clones from a set of many projects and provides incremental detection functionality. In this paper, we propose an approach that rapidly obtains code clones from many projects. Unlike other clone detection techniques, our approach adopts a multi-stage detection strategy: in the first-stage detection, code clones are detected from each of the projects separately; in the second-stage detection, the code clones detected in the first stage are unified by our clone curation technique. This multi-stage strategy enables incremental clone detection: if the source code of a project is updated, the first-stage detection is applied only to that project, and then the second-stage curation is performed. We constructed a software system based on the proposed approach. At present, the system utilizes an existing detection tool, CCFinder. We applied the system to the large-scale source code (12 million LoC) of 128 projects. We also detected clones from the same target by simply using CCFinder and compared the detection times. As a result, clone detection with our system was 17 times faster than CCFinder's detection. We also compared detection times under the assumption that each of the target projects is updated once. The experimental results showed that our incremental clone detection shortened the detection time by 91% compared to non-incremental clone detection.

Index Terms—code clone detection, large-scale, incrementability

I. INTRODUCTION

A code clone (hereafter, clone) is a code fragment that is identical or similar to another code fragment in source code. Clones may lead to software maintenance problems [1], [2] and bug propagation [3], [4]. Thus, researchers have been actively studying techniques for clone detection; for example, a number of tools for automatic clone detection from source code have been developed and released [5].

The amount of source code is steadily increasing, and large-scale clone detection has become a necessity. Large-scale clone detection can be used for detecting similar mobile applications [6], finding the provenance of components [7], and code search [8]. Thus, scalable clone detectors have been developed [9]. However, such clone detectors seek to detect all clones present in the target software, so when large-scale clone detection is conducted with such a detector, it outputs a huge number of clones as its result. It is impractical to check manually whether each detected clone is worth more than a glance. Furthermore, if a new project is added to the detection targets or a part of the targets is updated, the detector needs to detect clones from all the target projects again, which takes too long. To solve those problems, a new technique for large-scale clone detection is required.

The results of clone detection may include many negligible clones, which are worthless when dealing with clone information in software development and maintenance. For example, language-dependent clones are negligible [10]: in a given programming language, a programmer cannot help writing some similar code fragments that cannot be merged into one fragment due to language limitations, and a language-dependent clone consists of such fragments, so it is negligible. Many techniques have been proposed to remove negligible clones from detection results. They classify clones according to whether code fragments have high cohesion and low coupling [11], according to the ratio of repetitive structures included in code fragments [10], or by machine learning on some metrics [12]. Using those techniques, negligible clones can be filtered out of the detection results. We expect that incorporating such a filtering technique into the detection process makes it possible to detect clones on a larger scale more efficiently.

In this paper, we propose a clone curation approach that rapidly detects clones from many projects. Our approach detects clones in three stages: (1) detecting intra-project clones, (2) filtering out negligible clones from the detection results, and (3) consolidating different sets of similar intra-project clones into a set of cross-project clones. This process realizes a scalable clone detection because
• traditional clone detection, which has a high computational complexity, is performed on small-size source code (the source code of a single project), and
• negligible clones are filtered out before cross-project clones are generated.

We implemented the proposed technique and evaluated its scalability. When the proposed technique and CCFinder [13] detected clones from 128 pseudo projects of 100 KLoC each, the proposed technique was able to detect clones up to 17 times faster than CCFinder.





In addition, to evaluate whether the proposed technique can detect clones rapidly when a part of the target projects is updated, we measured the execution time required to detect clones again. In that situation, our technique was able to detect clones 11 times faster than CCFinder. Finally, we demonstrated that adjusting the parameters of our technique makes clone detection even faster.

The remainder of this paper is structured as follows. Section II defines the terminology used in this paper. Section III presents our approach, and Section IV describes its implementation. Section V describes the design of the evaluation experiments, and Section VI the datasets used in them. Section VII describes the results of the experiments and the discussion. Section VIII describes the threats to validity. Section IX concludes the paper and discusses future work.
II. PRELIMINARIES

A. Code clone

Given two identical or similar code fragments, a clone relation holds between them. A clone relation is defined as an equivalence relation (i.e., a reflexive, transitive, and symmetric relation) on code fragments. If a clone relation holds between two code fragments, the pair is called a clone pair. An equivalence class of a clone relation is called a clone set; that is, a clone set is a maximal set of code fragments in which a clone relation holds between any pair of them. A code fragment within a clone set is called a code clone or simply a clone.

B. RNR

Many negligible clones are included in clone detection results. In our previous research, we proposed a metric, RNR, to automatically filter out such negligible clones [10]. RNR(S) is the ratio of non-repeated code sequences in a clone set S. Assume that a clone set S includes a code fragment f, that LOS_whole(f) is the length of the whole token sequence of f, and that LOS_repeated(f) is the length of the repeated sequence of f. The metric RNR(S) is then calculated by the following formula:

    RNR(S) = 1 − (Σ_{f ∈ S} LOS_repeated(f)) / (Σ_{f ∈ S} LOS_whole(f))

For example, assume that we detect clones from the following two source files (F1, F2), each consisting of five tokens:

    F1: a b c a b
    F2: c c* c* a b

where the superscript "*" indicates that the token is in a repeated code sequence. We use the label C(Fi, j, k) to represent the fragment that starts at the j-th token and ends at the k-th token of source file Fi (j must be less than k). In this case, the following two clone sets are detected from the source files:

    S1: C(F1, 1, 2), C(F1, 4, 5), C(F2, 4, 5)
    S2: C(F2, 1, 2), C(F2, 2, 3)

The RNR value of each clone set is the following:

    RNR(S1) = 1 − (0 + 0 + 0) / (2 + 2 + 2) = 1.0
    RNR(S2) = 1 − (1 + 2) / (2 + 2) = 0.25

The value 0.25 of RNR(S2) indicates that most of the tokens in S2 are in repeated code sequences. Our previous research reported that low-RNR clone sets are typically negligible clones such as consecutive variable declarations, consecutive method invocations, and case entries of switch statements.

In our previous research, we also examined how well the RNR filtering works by calculating its recall, precision, and F-value, defined as follows, where S_filtered is the set of clone sets filtered out by RNR and S_negligible is the set of all truly negligible clone sets:

    Recall(%)    = 100 × |S_negligible ∩ S_filtered| / |S_negligible|
    Precision(%) = 100 × |S_negligible ∩ S_filtered| / |S_filtered|
    F-value      = 2 × Recall × Precision / (Recall + Precision)

[Fig. 1. Transitions of Recall, Precision, and F-value by RNR]

Figure 1 illustrates the transitions of recall, precision, and F-value as the RNR threshold varies between 0 and 1. It shows that 65% of the negligible clone sets are filtered out with no false positives when 0.5 is used as the RNR threshold.
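To make the metric concrete, the following Python sketch (not the authors' implementation; fragments are simplified to lists of (token, is_repeated) pairs) computes RNR for the running example:

```python
# A minimal sketch of the RNR metric. A fragment is a list of
# (token, is_repeated) pairs; is_repeated marks tokens that belong to a
# repeated code sequence (the "*" tokens in the running example).

def rnr(clone_set):
    """RNR(S) = 1 - (sum of repeated lengths) / (sum of whole lengths)."""
    whole = sum(len(fragment) for fragment in clone_set)
    repeated = sum(is_rep for fragment in clone_set for _, is_rep in fragment)
    return 1.0 - repeated / whole

# F1 = a b c a b, F2 = c c* c* a b.
# S1 = {C(F1,1,2), C(F1,4,5), C(F2,4,5)}, S2 = {C(F2,1,2), C(F2,2,3)}.
s1 = [[("a", 0), ("b", 0)], [("a", 0), ("b", 0)], [("a", 0), ("b", 0)]]
s2 = [[("c", 0), ("c", 1)], [("c", 1), ("c", 1)]]

assert rnr(s1) == 1.0    # no repeated tokens
assert rnr(s2) == 0.25   # 1 - (1 + 2) / (2 + 2)
```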
III. PROPOSED TECHNIQUE

The entire process of our approach is summarized in Figure 2. It is composed of three steps: clone detection on each project, negligible clones exclusion, and clone curation across projects. The following subsections describe the design of each step.





[Fig. 2. The Overview of Our Approach: target projects → Step 1 (clone detection on each project) → clone sets → Step 2 (negligible clones exclusion) → non-negligible clone sets → Step 3 (clone curation across projects) → detected clone sets]
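As a reading aid, the three steps can be summarized in the following Python sketch. It is a minimal outline under assumed interfaces, not the actual system: `run_detector` stands for a wrapper around an existing detector such as CCFinder, and `curate` for the Step-3 procedure of Section III-C; both are hypothetical placeholders.

```python
# A minimal outline of the three-step process. `run_detector` and
# `curate` are hypothetical placeholders for a detector wrapper
# (e.g., around CCFinder) and the Step-3 curation of Section III-C.
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass

RNR_THRESHOLD = 0.5  # clone sets below this RNR are negligible [10]

@dataclass
class CloneSet:
    fragments: list  # e.g., (file, begin_token, end_token) triples
    rnr: float       # from the detector, or computed as a post-process

def detect_and_filter(project_path):
    """Step 1: intra-project detection; Step 2: negligible clones exclusion."""
    clone_sets = run_detector(project_path)
    return [s for s in clone_sets if s.rnr >= RNR_THRESHOLD]

def detect_cross_project_clones(project_paths):
    # Step 1 runs independently per project, so it parallelizes naturally.
    with ProcessPoolExecutor() as pool:
        per_project = list(pool.map(detect_and_filter, project_paths))
    # Step 3: curate the surviving clone sets across projects.
    return curate([s for sets in per_project for s in sets])
```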



A. Clone detection on each project

In this step, clones are detected from each of the target projects, and the metric RNR is calculated for each of the detected clone sets. This process realizes a scalable clone detection because
• the size of the source code handled in a single detection is reduced, and
• clone detection can be executed in parallel on the target projects.

When detecting clones from a large project, our approach can detect clones faster by dividing the project into a set of smaller pseudo projects. However, over-division may lead to fewer detected clones because our approach cannot detect clone pairs that exist only between the projects.

We utilize an existing detection tool. Many detectors output a text file as their detection results. Consequently, our approach can be applied to many detectors, such as token-based or PDG-based ones, because the clone sets that are the output of this step can be obtained by parsing the text file produced by detecting clones in each target project, as sketched below. Moreover, if a clone detector can calculate RNR as well as detect clone sets, this step can be performed efficiently; for example, CCFinder [13] has a function to calculate RNR. If a clone detector does not have such a function, we need to calculate RNR as a post-process of clone detection.
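For instance, if a detector reported each clone instance on one line, the clone sets could be recovered as in the following sketch; the line format here is an assumption made purely for illustration, not CCFinder's actual output format.

```python
# A sketch of recovering clone sets from a detector's text report.
# ASSUMED line format (for illustration only, not CCFinder's format):
#   set_id <TAB> file <TAB> begin_token <TAB> end_token
from collections import defaultdict

def parse_clone_sets(report_path):
    clone_sets = defaultdict(list)
    with open(report_path, encoding="utf-8") as report:
        for line in report:
            set_id, path, begin, end = line.rstrip("\n").split("\t")
            clone_sets[set_id].append((path, int(begin), int(end)))
    return list(clone_sets.values())
```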
                                                                         approach determinates whether the similarity between clone
B. Negligible clones exclusion

It is inevitable that negligible clones are included in clone detection results. Their presence has negative impacts from two viewpoints: it makes clone detection results harder to analyze, and it makes clone detection take longer to finish. Hence, our approach excludes negligible clones from the candidates for cross-project clones. Based on our previous research [10], the clone sets to be excluded are those satisfying RNR < 0.5. There are two reasons to exclude low-RNR clones: (1) users can avoid spending their time analyzing negligible clones; (2) the execution time of clone detection is shortened, because the number of clone sets that are targets of the similarity calculation is reduced, so the similarity calculation can be performed more efficiently.

C. Clone curation across projects

Cross-project clone sets are curated based on whether the similarity between clone sets is equal to or greater than a threshold.

Our technique omits the similarity calculation for a pair of clone sets if the pair satisfies both of the following conditions:
• the RNR value of one clone set in the pair is largely different from that of the other, and
• the RNR value of either clone set in the pair is sufficiently large.

RNR is based on the code fragment structure, so similar clone sets tend to have almost the same RNR values. Thereby, clone detection should become faster by omitting the similarity calculation for clone set pairs whose RNR values differ greatly while one of them is sufficiently large. More formally, given the RNR values RNR_m and RNR_n of two clone sets (expressed here on a 0-100 scale) and two parameters θ_diff and θ_max, our approach omits the similarity calculation if the following formula is satisfied:

    Omit(θ_diff, θ_max) = (|RNR_m − RNR_n| > θ_diff) ∧ (max(RNR_m, RNR_n) > 100 − θ_max)

where max is a function that returns the maximum of its arguments. Figure 3 shows the area in which the similarity calculation is omitted by this formula.




[Fig. 3. Area to Calculate Similarity Between A Clone Set Pair: the axes are the smaller and the larger RNR of the clone sets in a pair (× 1/100); θ_diff and θ_max delimit the area in which the similarity is not calculated]

The first expression represents that the RNR value of one clone set in the pair is largely different from that of the other; the θ_diff parameter is the threshold of this difference. The second expression represents that the RNR value of either clone set in the pair is sufficiently large; the θ_max parameter is the threshold of that RNR value. The evaluation experiment on the relationship between the values of these parameters, the detection accuracy, and the resulting speed is described in Section V-C.
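As a sketch, the predicate can be transcribed directly (RNR values and both thresholds on the 0-100 scale of Figure 3):

```python
# A direct transcription of the omission predicate (0-100 scale).

def omit_similarity_calculation(rnr_m, rnr_n, theta_diff, theta_max):
    """True if the similarity between the two clone sets may be skipped:
    their RNR values differ by more than theta_diff AND the larger one
    exceeds 100 - theta_max."""
    return abs(rnr_m - rnr_n) > theta_diff and max(rnr_m, rnr_n) > 100 - theta_max

# With theta_diff = 0 and theta_max = 0 (the RQ1 setting), the second
# conjunct requires an RNR above 100, so no calculation is ever omitted.
assert not omit_similarity_calculation(80, 30, theta_diff=0, theta_max=0)
```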
IV. IMPLEMENTATION

A. Clone detection tool

Our proposed approach detects clones using an existing clone detection tool. Our implementation utilizes CCFinder [13].

B. Coefficient of similarity

We define the similarity between a pair of clone sets (hereafter, a clone set pair) as:

    Sim(S_m, S_n) = |N(S_m) ∩ N(S_n)| / max(|N(S_m)|, |N(S_n)|)

where S_m and S_n are clone sets and N is a function that returns the set of n-grams created from the tokens included in a given clone set. The target tokens are extracted by lexical analysis of the text representation of the clone set, with the identifiers and literals in the text normalized. The n-gram size of our implementation is 5, following Myles et al. [14]; max is a function that returns the maximum of its arguments. The reason we use n-grams is that we want to calculate text similarity without considering whether each instruction is repeated or not. If two clone sets have a similarity higher than a threshold θ, they are regarded as a single clone set; otherwise, they remain separate clone sets. At this moment, we use 0.9 as the default value; this value is evaluated in Section VII.
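A minimal sketch of this coefficient, assuming the normalized token sequences of the clone sets are already available, might look as follows:

```python
# A sketch of the similarity coefficient of Section IV-B. `tokens_m`
# and `tokens_n` are assumed to be the normalized token sequences of
# the two clone sets (identifiers and literals normalized).

def ngrams(tokens, n=5):
    """Set of n-grams; n = 5 following Myles et al. [14]."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sim(tokens_m, tokens_n, n=5):
    """Sim(Sm, Sn) = |N(Sm) ∩ N(Sn)| / max(|N(Sm)|, |N(Sn)|)."""
    grams_m, grams_n = ngrams(tokens_m, n), ngrams(tokens_n, n)
    if not grams_m or not grams_n:
        return 0.0
    return len(grams_m & grams_n) / max(len(grams_m), len(grams_n))

# Two clone sets are merged into one when sim(...) exceeds the
# threshold θ (0.9 by default, evaluated in Section VII).
```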
V. EXPERIMENTAL DESIGN

This research evaluates our proposed approach by answering the following research questions.

A. RQ1: How fast does our approach detect cross-project clones?

One primary goal is to detect cross-project clones as fast as possible. In this evaluation, we create same-size pseudo projects from the BigCloneBench dataset [15]. Then, we compare the execution time of two clone detections: our proposed approach and CCFinder. In the former detection, intra-project clones are detected by CCFinder from each pseudo project, and then cross-project clones are generated by the curation. In the latter detection, cross-project clones are directly detected from the whole of the pseudo projects by CCFinder.

B. RQ2: How fast does our approach follow updates of the target projects?

Our proposed approach utilizes the results of the clone detection per project. Herein, we assume that the source code of a project is updated and we need to obtain cross-project clones again. With the proposed approach, we only need to detect clones again from the updated project; the detection results of the updated project and the detection results of the other projects are then input to the curation. Without the proposed approach, we need to run CCFinder on the whole of the target projects. In this evaluation, we compare the two kinds of detections: our proposed approach and CCFinder.
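Under the assumed interfaces of the Section III sketch, the incremental scenario reduces to re-running Steps 1 and 2 for the updated project only:

```python
# A sketch of the incremental scenario: `cache` maps each project to
# its filtered intra-project clone sets; `detect_and_filter` and
# `curate` are the hypothetical helpers from the Section III sketch.

def update_and_curate(cache, updated_project):
    cache[updated_project] = detect_and_filter(updated_project)  # Steps 1-2
    return curate([s for sets in cache.values() for s in sets])  # Step 3
```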
C. RQ3: How effective is parameter adjustment in improving clone detection performance?

As described in Section III-C, our proposed approach has two parameters, θ_diff and θ_max, that specify the area in which the similarity is not calculated. We seek appropriate parameters by executing the proposed approach with different parameter values.

VI. DATASETS

In this research, we utilized two datasets: the Apache [16] and BigCloneBench [15] datasets.

The Apache dataset is composed of 84 Java projects of the Apache Software Foundation (http://www.apache.org). Unfortunately, CCFinder reported encoding errors on nine projects, because each of them included at least one file encoded with a character encoding that CCFinder cannot handle. Consequently, we excluded those nine projects from our targets.

The BigCloneBench dataset is one of the largest clone benchmarks available to date. It was created from IJaDataset 2.0 (https://sites.google.com/site/asegsecold/projects/seclone), which is composed of 25,000 Java systems. The benchmark includes 2.9 million source files with 8 million manually-validated clone pairs. The BigCloneBench dataset has been used for clone evaluations and scalability testing in several large-scale clone detection and clone search studies [8].


In this research, we constructed 256 pseudo projects from the BigCloneBench dataset. They are composed of two groups: 128 projects of 10 KLoC and 128 projects of 100 KLoC. Each pseudo project consists of source files that were randomly selected from the BigCloneBench dataset.

[Fig. 4. The Overview of How to Construct The Dataset: from the BigCloneBench dataset, 128 pseudo projects of 10 KLoC and 128 pseudo projects of 100 KLoC are built; random extraction then yields sets of 2^1 (= 2) up to 2^7 (= 128) pseudo projects for each group]

VII. EVALUATION

We conducted an evaluation of our proposed approach on a single workstation with the following specification:
• Windows 10
• 1.6 GHz processor (6 cores)
• 32 GB memory
• 512 GB SSD

The proposed approach curates clone sets if the similarity between different clone sets exceeds a threshold; hence, the choice of the similarity threshold affects the curation performance. To find the optimal threshold for curation, we measured the execution time and the number of curated clone sets when curating clone sets from the Apache dataset described in Section VI with different thresholds. Table I shows the results.

    TABLE I
    EXECUTION TIME AND CURATED CLONE SETS WITH DIFFERENT PARAMETERIZATION

    Threshold                     0.85        0.9       0.95
    Execution Time             42m 28s    26m 55s    21m 40s
    #Curated Clone Set Pairs    80,104     52,785     29,762

As the threshold value was reduced, the execution time increased significantly. On the other hand, as the threshold value was increased, the execution time was shortened, but the number of clone set pairs whose similarity exceeds the threshold decreased. Hence, we regarded 0.9 as the optimal value and utilized it as the threshold for the subsequent evaluations.
A. RQ1: How fast does our approach detect cross-project clones?

To answer RQ1, we compared the execution time of two clone detections: our proposed approach and CCFinder. We used the following thresholds in the application of our proposed technique:
• θ_diff: 0
• θ_max: 0 (no similarity calculation is omitted)

We utilized the dataset constructed from the BigCloneBench dataset described in Section VI; Figure 4 shows the overview of how it was constructed. In order to detect clones while changing the number of target projects, we randomly extracted seven sets of pseudo projects from each of the 10 KLoC group and the 100 KLoC group; the number of pseudo projects in set i, i ∈ [1, 7], is 2^i.

The execution time of each clone detection is shown in Figure 5. Our approach detects clones rapidly in both the 10 KLoC and the 100 KLoC cases. For 10 KLoC, the speed-up over CCFinder ranged from a minimum of 1.24 times to a maximum of 6.24 times; the greater the number of target projects, the higher the ratio between CCFinder's execution time and that of our proposed approach. We consider there are two reasons for these results: our proposed approach detects clones in parallel, and it detects within-project clones from each project instead of cross-project clones from the whole dataset.

For 100 KLoC, the speed-up ranged from a minimum of 11.4 times to a maximum of 17.1 times. Moreover, our proposed approach was able to detect clones from the 128-project dataset of 100 KLoC, whereas CCFinder was not able to finish detecting clones from the same dataset due to an out-of-memory error. Detecting clones from vast source code requires huge memory, which is why CCFinder failed on this dataset. In our approach, CCFinder takes only a small amount of source code in a single clone detection and is executed many times in parallel; consequently, no out-of-memory error occurred. Our approach was thus able to detect clones from a large dataset on which CCFinder could not finish due to insufficient memory.

We also evaluated the ratio of clones that our proposed approach does not detect. The approach cannot detect clone pairs that exist only between different projects, because it curates clones that are detected from each project separately. Hence, we measured the ratio of clones that the proposed approach could detect: we utilized the 64 pseudo projects of 100 KLoC as the target projects and compared the number of clones detectable by the proposed approach with the number detectable by CCFinder. The results are shown in Figure 6.
                                                                         be seen that the ratio of clones that can be detected decreases




[Fig. 5. Execution Time for Each Project Group (log-scale execution time vs. number of projects, CCFinder vs. our approach): (a) execution time for projects of 10 KLoC; (b) execution time for projects of 100 KLoC]

[Fig. 6. The Ratio of Detected Clones [%] for 2 to 64 projects]
                                                                                          clone detection performance?
It can be seen that the ratio of detectable clones decreases as the number of projects increases. That is because the clones that exist only between different projects increase as the number of projects increases. In the case of 64 projects of 100 KLoC, 38 percent of the clones were not detected, but the remaining 62 percent could be detected 17 times faster than with CCFinder.

Our answer to RQ1 is that our proposed approach can detect clones up to 17 times faster than CCFinder, and it can detect clones from large targets on which CCFinder cannot finish due to insufficient memory.

B. RQ2: How fast does our approach follow updates of the target projects?

In this evaluation, we assumed the following situation: the source code of each target project is updated once, on different dates, and a user wants to detect cross-project clones as rapidly as possible just after any project is updated. Under this assumption, we utilized the 64 pseudo projects of 100 KLoC that had been constructed for RQ1; we did not use the dataset of 128 projects because CCFinder cannot detect clones from it. Thus, when we directly detect cross-project clones with CCFinder, a clone detection over the whole of the 64 projects is executed 64 times. When we apply the proposed approach, a clone detection on the updated project plus a clone curation over the whole of the 64 projects is executed 64 times. We measured the execution time of both cases.

The measurement results are shown in Table II. On average, CCFinder took 11,207 seconds, while our approach took only 1,006 seconds. Therefore, our approach was able to detect clones 11 times faster than CCFinder. We also investigated how the execution time of our approach is split between clone detection and clone curation, and found that 97% of it was spent on clone curation.

    TABLE II
    EXECUTION TIME OF UPDATING A TARGET PROJECT INCREMENTALLY

                    CCFinder    Our Approach
    Average Time     11,207 s        1,006 s

Our answer to RQ2 is that the proposed approach took only 9% of the execution time of detecting cross-project clones from the whole of the 64 pseudo projects with CCFinder.

C. RQ3: How effective is parameter adjustment in improving clone detection performance?

In the process of answering RQ1 and RQ2, we found that the clone curation step accounts for the majority of the execution time. This is because our approach calculates the similarities between all clone sets. To curate clones more rapidly, our approach needs to reduce the number of similarity calculations. We considered that omitting the similarity calculation between clone sets that have little effect on the curation results would make curation faster, and that such clone set pairs could be identified using RNR, as described in Section III-C. Thus, to seek such clone set pairs, we investigated the distribution of similar clone set pairs in the 75 projects of the Apache dataset described in Section VI.




                                                                                Percentage of unsimilar clone set pairs [%]                       any calculations. The while cells indicate that the percentage
                                                                    0                                                       99.99     100         of similarity calculations is 60%. The green cell shows that
  50                                                                                                                                              the percentages are higher than 60% while the red cell means
                                                                                                                                                  that the percentage is lower than 60%. We can see that there
                                                                                                                                                  are several value pairs of two parameters where their cells in
   Larger RNR of the clone set in a clone set pair [x 1/100]




                                                                                                                                                  both Figure 8(a) and Figure 8(b) are white. Namely, there
                                                                                                                                                  are parameters which can reduce the number of similarity
                                                                                                                                                  calculations to 60 percents with keeping the 90 percents of
                                                                                                                                                  curated clone set pairs. Thus, we sought for the optimal
                                                                                                                                                  pair of parameters using the heatmaps and we measured the
                                                                                                                                                  number of similarity calculations that could be omitted when
                                                                                                                                                  the optimized parameters were utilized.
                                                                                                                                                     Figure 9 shows the relationship between the ratio of curated
                                                                                                                                                  clone set pairs and the ratio of similarity calculation reduction.
                                                                                                                                                  According to Figure 7, we can see that there are multiple
                                                                                                                                                  pairs of the two thresholds that yield the same ratio of curated
                                                                                                                                                  clone set pairs. Herein, we focus on the pair of the two
                                                                                                                                                  parameters that reduces similarity calculation at a maximum,
                                                                                                                                                  and we investigated the relationship between the two ratios.
                                                                                                                                                  As a result, we found that 41% of similarity calculations can
 100
                                                                                                                                                  be omitted by sacrificing 5% of the curations.
                                                               50        Smaller RNR of the clone set in a clone set pair [x 1/100]         100      Our answer to RQ3 is that using appropriate values of
                                                                                                                                                  parameters can reduce 41% of similarity calculations only by
                                                                        Fig. 7. The Percentage of Similar Clone Set Pairs                         sacrificing 5% of the clone curations.

                                                                                                                                                                  VIII. T HREATS OF VALIDITY
 indicates a percentage of the clone set pairs whose X-axis and
 Y-axis clone sets are not similar than 0.9. The while cells are                                                                                  A. Clone detector
 the percentage is 99.99%. The red cell shows the percentages
 are higher than 99.99% while the green cell means that the                                                                                         In the evaluations, we utilized CCFinder as a clone detector.
 percentage is lower than 99.99%. We considered that clone                                                                                        However, there are many clone detectors besides CCFinder.
 sets in a clone set pair are not similar when the RN R values                                                                                    Applying other clone detectors in our approach does not
 of the clone sets are largely different from each other and the                                                                                  necessarily get more rapid than conventional clone detections
 RN R value of either clone set in the pair is close to large.                                                                                    unlike CCFinder. We are planning on evaluating whether or
 Therefore, our approach omits the similarity calculations of                                                                                     not other clone detectors work well in our approach.
 such clone set pair using two parameters, θdif f and θmax .
   In this research question, we seek appropriate values of the two parameters, θdiff and θmax, to curate clones rapidly. We attempted to compare the execution time and the precision of clone curation for each combination of the parameters. However, it takes too long to curate clones for all combinations of the parameters. Hence, we measured the number of similarity calculations instead of the execution time. Namely, by comparing the number of similarity calculations with the precision of clone curation, we evaluated the effectiveness of our approach in omitting similarity calculations.
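As an illustration of this measurement, the sketch below sweeps the parameter space and counts the similarity calculations that survive the filter, relative to a baseline run without any omission; the class ParameterSweep, the record Pair, and the step width of 5 are assumptions made for this example, and the ranges 0–50 merely mirror the axes of Fig. 8. Computing the precision of the curation would additionally require comparing the curated results against the baseline curation; it is omitted here for brevity.

    import java.util.List;

    class ParameterSweep {
        // A clone set pair reduced to the two RNR values needed by the filter.
        record Pair(double rnrA, double rnrB) {}

        // The RNR-based filter sketched above (same assumed condition).
        static boolean skip(Pair p, double thetaDiff, double thetaMax) {
            return Math.abs(p.rnrA() - p.rnrB()) > thetaDiff
                    && Math.max(p.rnrA(), p.rnrB()) > thetaMax;
        }

        // For each (θdiff, θmax) combination, report the percentage of
        // similarity calculations still performed, relative to comparing
        // every pair. This corresponds to the data behind Fig. 8(b).
        static void sweep(List<Pair> pairs) {
            long baseline = pairs.size(); // baseline: every pair is compared
            for (int thetaDiff = 0; thetaDiff <= 50; thetaDiff += 5) {
                for (int thetaMax = 0; thetaMax <= 50; thetaMax += 5) {
                    final int d = thetaDiff, m = thetaMax;
                    long performed = pairs.stream()
                            .filter(p -> !skip(p, d, m))
                            .count();
                    System.out.printf("θdiff=%d θmax=%d -> %.1f%% of calculations%n",
                            d, m, 100.0 * performed / baseline);
                }
            }
        }
    }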
   Figure 8(a) is a heatmap that shows the percentage of curated clone set pairs for each parameter combination. The white cells indicate that the percentage of curated clone set pairs is 90%. The green cells show that the percentages are higher than 90%, while the red cells mean that the percentage is lower than 90%. Figure 8(b) is another heatmap that shows how many similarity calculations are omitted for each pair of the parameters, θdiff and θmax. Each cell shows the percentage of the similarity calculations performed with the given two parameters against the similarity calculations performed without any omission.
   Our answer to RQ3 is that using appropriate values of the parameters can reduce 41% of the similarity calculations while sacrificing only 5% of the clone curations.

                    VIII. THREATS TO VALIDITY

A. Clone detector
   In the evaluations, we utilized CCFinder as the clone detector. However, there are many clone detectors besides CCFinder. With other clone detectors, our approach does not necessarily become faster than conventional clone detection, as it does with CCFinder. We plan to evaluate whether or not other clone detectors work well in our approach.

B. Experimental target
   In the evaluations, we succeeded in reducing the similarity calculations by 41% on the Apache projects dataset. However, if we use other datasets, we may obtain different reduction ratios of similarity calculations.

                        IX. CONCLUSION

   In this research, we proposed a technique to detect clones rapidly. The proposed technique curates the results of clone detections performed for each of the projects. As a result of evaluating the performance of our technique, it was possible to detect clones up to 17 times faster with CCFinder. In addition, the execution time of an incremental update was reduced to nine percent of that of a detection from scratch. Furthermore, we showed that using appropriate values of the parameters can reduce 41% of the similarity calculations while sacrificing only 5% of the clone curations. In the future, we plan to improve our technique so that it can be applied to clone detectors other than CCFinder, such as SourcererCC [9] or NiCad [17].




[Figure: two heatmaps over θdiff (0–50) and θmax (0–50); (a) percentage of curated clone set pairs [%] (scale 0–90–100), (b) percentage of similarity calculations [%] (scale 0–60–100).]
Fig. 8. Heatmap of Curated Clone Set Pairs and Similarity Calculations for Pairs of Each Parameter

[Figure: line plot of the percentage of similarity calculations [%] (0–100) against the percentage of curated clone set pairs [%] (50–100).]
Fig. 9. Transitions of Similarity Calculations and Curated Clone Set Pairs

                            REFERENCES

 [1] A. Lozano and M. Wermelinger, "Assessing the effect of clones on changeability," in 2008 IEEE International Conference on Software Maintenance, 2008, pp. 227–236.
 [2] M. Mondal, M. S. Rahman, R. K. Saha, C. K. Roy, J. Krinke, and K. A. Schneider, "An empirical study of the impacts of clones in software maintenance," in 2011 IEEE 19th International Conference on Program Comprehension, 2011, pp. 242–245.
 [3] F. Rahman, C. Bird, and P. Devanbu, "Clones: What is that smell?" in Conference on Mining Software Repositories (MSR 2010), May 2010, pp. 72–81.
 [4] T. Zhang and M. Kim, "Automated transplantation and differential testing for clones," in Proceedings of the 39th International Conference on Software Engineering, ser. ICSE '17, 2017, pp. 665–676.
 [5] A. Sheneamer and J. Kalita, "A survey of software clone detection techniques," International Journal of Computer Applications, vol. 137, no. 10, pp. 1–21, 2016.
 [6] K. Chen, P. Liu, and Y. Zhang, "Achieving accuracy and scalability simultaneously in detecting application clones on Android markets," in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014, 2014, pp. 175–186.
 [7] J. Davies, D. M. German, M. W. Godfrey, and A. Hindle, "Software bertillonage: finding the provenance of an entity," in Proceedings of the 8th Working Conference on Mining Software Repositories (MSR), 2011, pp. 183–192.
 [8] C. Ragkhitwetsagul and J. Krinke, "Siamese: scalable and incremental code clone search via multiple code representations," Empirical Software Engineering, pp. 1–49, 2019.
 [9] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, "SourcererCC: scaling code clone detection to big-code," in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 2016, pp. 1157–1168.
[10] Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue, "Method and implementation for investigating code clones in a software system," Information and Software Technology, vol. 49, no. 9, pp. 985–998, 2007.
[11] A. Goto, N. Yoshida, M. Ioka, E. Choi, and K. Inoue, "How to extract differences from similar programs? a cohesion metric approach," in 2013 7th International Workshop on Software Clones (IWSC), May 2013, pp. 23–29.
[12] R. Yue, Z. Gao, N. Meng, Y. Xiong, X. Wang, and J. D. Morgenthaler, "Automatic clone recommendation for refactoring based on the present and the past," in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 115–126.
[13] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: a multilinguistic token-based code clone detection system for large scale source code," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654–670, 2002.
[14] G. Myles and C. Collberg, "K-gram based software birthmarks," in Proceedings of the 2005 ACM Symposium on Applied Computing, ser. SAC '05, 2005, pp. 314–318.
[15] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, "Towards a big data curated benchmark of inter-project code clones," in Proceedings of the Early Research Achievements Track of the 30th International Conference on Software Maintenance and Evolution (ICSME 2014), 2014, pp. 476–480.
[16] Y. Higo and S. Kusumoto, "How should we measure functional sameness from program source code? an exploratory study on Java methods," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014, 2014, pp. 294–305.
[17] J. Cordy and C. Roy, "The NiCad clone detector," in IEEE International Conference on Program Comprehension, June 2011, pp. 219–220.



