<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Information and Software Technology</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Code Clone Curation - Towards Scalable and Incremental Clone Detection -</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masayuki Doi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoshiki Higo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shinji Kusumoto</string-name>
          <email>kusumoto@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Information Science and Technology, Osaka University</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>49</volume>
      <issue>9</issue>
      <fpage>72</fpage>
      <lpage>81</lpage>
      <abstract>
        <p>Code clones have a negative impact on software maintenance. Code clone detection on large-scale source code takes a long time and, even worse, is occasionally aborted because the target is too large. Herein, we consider detecting cross-project code clones from a set of many projects. In this setting, if we simply use a clone detection tool, we must detect cross-project code clones from all of the projects again whenever even one project is updated. We therefore need a new clone detection scheme that rapidly detects code clones from a set of many projects and supports incremental detection. In this paper, we propose an approach that rapidly obtains code clones from many projects. Unlike other clone detection techniques, our approach uses a multi-stage detection strategy: in the first stage, code clones are detected from each project separately; in the second stage, the code clones detected in the first stage are unified by our clone curation technique. This multi-stage strategy enables incremental clone detection: if the source code of a project is updated, the first-stage detection is applied only to that project, and then the second-stage curation is performed. We constructed a software system based on the proposed approach. At present, the system uses an existing detection tool, CCFinder. We applied the system to the large-scale source code (12 million LoC) of 128 projects. We also detected clones from the same source code using CCFinder alone and compared the detection times. As a result, clone detection with our system was 17 times faster than detection with CCFinder alone. We also compared detection times under the assumption that each target project is updated once. The experimental results showed that our incremental clone detection shortened the detection time by 91% compared to non-incremental clone detection. 
Index Terms: code clone detection, large-scale, incrementability</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>A code clone (hereafter, clone) is a code fragment which is
identical or similar to another code fragment in source code.
Clones may lead to software maintenance problems [1], [2]
and bug propagation [3], [4]. Thus, researchers have been
actively studying techniques for clone detection. For example,
a number of tools for automatic clone detection from source
code have been developed and released [5].</p>
      <p>The amount of source code is steadily increasing, and
large-scale clone detection has become a necessity.
Large-scale clone detection can be used for detecting similar mobile
applications [6], finding the provenance of components [7],
and code search [8]. Thus, scalable clone detectors have been
developed [9]. However, such clone detectors seek to detect all
clones present in the target software, so large-scale clone
detection with such a detector outputs a huge number of clones.
It is impractical to check manually whether each detected clone
is worth more than a glance.
Furthermore, if a new project is added to the target projects
or some of the target projects are updated,
the clone detector needs to detect clones from all the target
projects again, which takes too long. To solve these problems,
a new technique for large-scale
clone detection is required.</p>
      <p>The results of clone detection may include many
negligible clones. Negligible clones are worthless when dealing
with clone information in software development and
maintenance. For example, language-dependent clones are negligible
clones [10]. When a specific programming language is used, a
programmer cannot help writing some similar code fragments
that cannot be merged into a single code fragment due to language
limitations. A language-dependent clone consists of such code
fragments, and is therefore negligible.
Many techniques have been proposed to remove negligible
clones from clone detection results. These techniques classify
clones according to whether the code fragments have high cohesion
and low coupling [11], according to the ratio of repetitive
structures included in the code fragments [10], or by
machine learning over several metrics [12]. By using those
techniques, negligible clones can be filtered out from the clone
detection results. We consider that incorporating such a filtering
technique into the clone detection process should make it possible
to detect clones on a larger scale more efficiently.</p>
      <p>
        In this paper, we propose a clone curation approach which
rapidly detects clones from many projects. Our approach
detects clones in three stages: (1) detecting intra-project clones,
(2) filtering out negligible clones from the detection results,
and (3) consolidating different sets of similar intra-project
clones into a set of cross-project clones. The above process
realizes scalable clone detection for two reasons:
(a) traditional clone detection, which has a high
computational complexity, is performed only on small-size source
code (the source code of a single project), and
(b) negligible clones are filtered out before generating
cross-project clones.
      </p>
      <p>We implemented the proposed technique and evaluated its
scalability. When the
proposed technique and CCFinder [13] detected clones from
128 pseudo projects, each of which includes 100 KLoC, the
proposed technique was able to detect clones up to 17 times
faster than CCFinder. In addition, to evaluate whether
the proposed technique can detect clones rapidly when
a part of the target projects is updated, we measured the
execution time required to detect clones again. As a result,
our technique was able to detect clones 11 times faster than
CCFinder in such a situation. Finally, we demonstrated that
adjusting the parameters of our technique makes clone detection
even faster.</p>
      <p>The remainder of this paper is structured as follows.
Section II describes the definitions of the terminology used in this
paper. Section III presents our approach. Section IV describes
the implementation of our approach. Section V describes the
design of the evaluation experiments. Section VI describes
the datasets used in the evaluation experiments. Section VII
describes the results of the experiments and the discussions.
Section VIII describes the threats to validity. Section IX
presents the conclusion and future work.</p>
      <p>II. PRELIMINARIES</p>
    </sec>
    <sec id="sec-2">
      <title>A. Code clone</title>
      <p>Given two identical or similar code fragments, a clone
relation holds between the code fragments. A clone relation
is defined as an equivalence relation (i.e., reflexive, transitive,
and symmetric relation) on code fragments.</p>
      <p>If a clone relation holds between two given code
fragments, the code fragment pair is called a clone pair. An
equivalence class of a clone relation is called a clone set. That
is, a clone set is a maximal set of code fragments where a clone
relation exists between any pair of them. A code fragment
within a clone set is called a code clone or just a clone.</p>
      <p>B. RNR</p>
      <p>Many negligible clones are included in the detection results.
In our previous research, we proposed a metric, RNR, to
automatically filter out such negligible clones [10]. RNR(S)
is the ratio of non-repeated code sequences in a clone set
S. Assume that a clone set S includes a code fragment
f. LOS<sub>whole</sub>(f) represents the length of the whole sequence
of fragment f, and LOS<sub>repeated</sub>(f) represents the length of
the repeated sequence of f. The metric RNR(S) is then calculated
by the following formula:</p>
      <p><disp-formula><tex-math>RNR(S) = 1 - \frac{\sum_{f \in S} LOS_{repeated}(f)}{\sum_{f \in S} LOS_{whole}(f)}</tex-math></disp-formula></p>
      <p>For example, assume that we detect clones from the
following two source files (F1, F2). Each source file consists of the
following five tokens.</p>
      <p>F1 : a b c a b
F2 : c c<sup>*</sup> c<sup>*</sup> a b
where the superscript “*” indicates that the token is in a
repeated code sequence.</p>
      <p>Figure 1 (horizontal axis: RNR [x 1/100]).</p>
      <p>We use the label C(F<sub>i</sub>, j, k) to represent a fragment. The
fragment C(F<sub>i</sub>, j, k) starts at the j-th token and ends at the
k-th token in source file F<sub>i</sub> (j must be less than k). In this case,
the following two clone sets are detected from the source files.</p>
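      <p>S1 : C(F1, 1, 2), C(F1, 4, 5), C(F2, 4, 5)
S2 : C(F2, 1, 2), C(F2, 2, 3)</p>
      <p>The RNR value of each clone set is as follows:</p>
      <p><disp-formula><tex-math>RNR(S_1) = 1 - \frac{0 + 0 + 0}{2 + 2 + 2} = 1.0</tex-math></disp-formula>
<disp-formula><tex-math>RNR(S_2) = 1 - \frac{1 + 2}{2 + 2} = 0.25</tex-math></disp-formula></p>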
      <p>The value 0.25 of RNR(S2) indicates that most of the
tokens in S2 are in repeated code sequences. Our previous
research reported that low-RNR clone sets are typically
negligible clones such as consecutive variable declarations,
consecutive method invocations, and case entries of switch
statements.</p>
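      <p>As a minimal sketch of the RNR calculation, the following Python fragment recomputes the two example values above. Representing each code fragment as a pair (whole length, repeated length) and the function name rnr are assumptions made only for illustration; they are not taken from CCFinder or from the implementation described in this paper.</p>
      <preformat><![CDATA[
```python
from typing import List, Tuple

# Hypothetical representation: (LOS_whole, LOS_repeated) for one code fragment.
Fragment = Tuple[int, int]


def rnr(clone_set: List[Fragment]) -> float:
    """RNR(S) = 1 - sum(LOS_repeated(f)) / sum(LOS_whole(f)) over all f in S."""
    whole = sum(w for w, _ in clone_set)
    repeated = sum(r for _, r in clone_set)
    if whole == 0:
        raise ValueError("clone set contains no tokens")
    return 1.0 - repeated / whole


# S1: three 2-token fragments, none of whose tokens are repeated.
s1 = [(2, 0), (2, 0), (2, 0)]
# S2: two 2-token fragments with 1 and 2 repeated tokens, respectively.
s2 = [(2, 1), (2, 2)]

print(rnr(s1))  # 1.0
print(rnr(s2))  # 0.25
```
]]></preformat>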
      <p>In our previous research, we examined how well the RNR
filtering worked. We calculated the precision, recall, and F-value of
the filtering, which are defined as follows:</p>
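      <p><disp-formula><tex-math>Recall(\%) = 100 \times \frac{|S_{negligible} \cap S_{filtered}|}{|S_{negligible}|}</tex-math></disp-formula>
<disp-formula><tex-math>Precision(\%) = 100 \times \frac{|S_{negligible} \cap S_{filtered}|}{|S_{filtered}|}</tex-math></disp-formula>
<disp-formula><tex-math>F\text{-}value = \frac{2 \times Recall \times Precision}{Recall + Precision}</tex-math></disp-formula>
where S<sub>filtered</sub> is the set of clone sets filtered out by RNR and
S<sub>negligible</sub> is the set of all truly negligible clone sets.</p>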
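      <p>The three measures can be sketched in Python as follows, assuming the standard definitions in which recall is taken against the negligible clone sets and precision against the filtered clone sets. The function name filtering_scores and the clone set identifiers are hypothetical and serve only as an illustration.</p>
      <preformat><![CDATA[
```python
from typing import Set, Tuple


def filtering_scores(filtered: Set[str], negligible: Set[str]) -> Tuple[float, float, float]:
    """Return (recall %, precision %, F-value) of a negligible-clone filter."""
    hits = len(filtered & negligible)  # negligible clone sets that were filtered out
    recall = 100.0 * hits / len(negligible) if negligible else 0.0
    precision = 100.0 * hits / len(filtered) if filtered else 0.0
    total = recall + precision
    f_value = 2.0 * recall * precision / total if total else 0.0
    return recall, precision, f_value


# Hypothetical identifiers: four truly negligible clone sets, two of them filtered out.
negligible_sets = {"s1", "s2", "s3", "s4"}
filtered_sets = {"s1", "s2"}

recall, precision, f_value = filtering_scores(filtered_sets, negligible_sets)
print(recall, precision)   # 50.0 100.0
print(round(f_value, 1))
```
]]></preformat>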
      <p>Figure 1 illustrates how recall, precision, and
F-value change as the RNR threshold varies between 0 and 1.
Figure 1 shows that 65% of negligible clone sets are filtered out
with no false positives when 0.5 is used as the RNR threshold.</p>
      <p>III. PROPOSED TECHNIQUE</p>
      <p>The entire process of our approach is summarized in
Figure 2. It is composed of three steps: clone detection on each
project, negligible clones exclusion, and clone curation cross
projects. The following subsections describe the design of each
step.</p>
      <sec id="sec-2-1">
        <title>Figure 2</title>
        <p>Input: target projects. Step 1: clone detection on each project (yielding clone sets). Step 2: negligible clones exclusion (yielding non-negligible clone sets). Step 3: clone curation cross projects. Output: detected clone sets.</p>
      </sec>
      <sec id="sec-2-2">
        <title>A. Clone detection on each project</title>
        <p>In this step, clone sets are detected from each of the
target projects, and the metric RNR is calculated for each
detected clone set. This process realizes scalable clone detection
because (1) the size of the source code given to a single detection
is reduced, and (2) clone detection can be executed in parallel
across the target projects.</p>
        <p>When detecting clones from a large project, our
approach can detect clones faster by dividing the project
into a set of smaller pseudo projects. However, over-division
may reduce the number of detected clones, because our
approach cannot detect clone pairs that exist only between
different projects.</p>
        <p>We utilize an existing detection tool. Many detectors output
their detection results as a text file. Consequently, our approach
can be applied to many detectors, such as token-based or
PDG-based ones, because the clone sets that form the output of
this step can be obtained by parsing the text file produced when
detecting clones from each of the target projects. Moreover, if
a clone detector can calculate RNR while
detecting clone sets, this step can be performed efficiently. For
example, CCFinder [13] has a function to calculate RNR. If
a clone detector does not have such a function, we
need to calculate RNR as a post-process of clone detection.</p>
        <p>B. Negligible clones exclusion</p>
        <p>It is inevitable that negligible clones are included in
clone detection results. The presence of negligible clones has
negative impacts from two viewpoints: it makes clone detection
results more difficult to analyze, and it makes clone detection
take longer. Hence, our approach excludes negligible clones from
the candidates of cross-project clones. Clone sets satisfying
RNR &lt; 50 (RNR expressed in units of 1/100, i.e., below 0.5) are
excluded, based on our previous research [10]. There are two reasons to
exclude low-RNR clones: (1) users can avoid spending their
time analyzing negligible clones; (2) the execution time of
clone detection is shortened, because the number of clone sets
that are targets of similarity calculation is reduced and the
similarity calculation can be performed more efficiently.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>C. Clone curation cross projects</title>
      <p>The curation of cross-project clone sets is based
on judging whether the similarity between clone sets is
equal to or greater than a threshold.</p>
      <p>Our technique omits the similarity calculation
between a pair of clone sets if the pair satisfies both of the
following conditions.</p>
      <p>The RNR value of one clone set in the pair is largely
different from that of the other clone set in the pair.</p>
      <p>The RNR value of either clone set in the pair is
sufficiently large.</p>
      <p>RNR is based on the code fragment structure, so similar clone
sets are likely to have almost the same RNR values. Clone
detection should therefore become faster by omitting the similarity
calculation between clone sets whose RNR values differ greatly
and one of whose RNR values is
sufficiently large. More precisely, given the RNR values
RNR<sub>m</sub> and RNR<sub>n</sub> of two clone sets and two parameters diff and max, our
approach determines whether the clone set pair
satisfies the following formula.</p>
      <p><disp-formula><tex-math>Omit(\mathit{diff}, \mathit{max}) = (|RNR_m - RNR_n| &gt; \mathit{diff}) \wedge (\max(RNR_m, RNR_n) &gt; 100 - \mathit{max})</tex-math></disp-formula>
where the function max on the left-hand side of the second inequality returns the maximum of
its arguments and is distinct from the parameter max.</p>
      <p>Fig. 3. Area to Calculate Similarity Between A Clone Set Pair (horizontal axis: smaller RNR of clone set in a clone set pair [x 1/100]; vertical axis: larger RNR of clone set in a clone set pair [x 1/100]; regions: area to calculate similarity, area not to calculate similarity).</p>
      <p>Figure 3 shows the area in which the similarity
calculation is omitted by the formula. The first term expresses
that the RNR value of one clone set in the pair is largely different
from that of the other clone set in the pair. The diff
parameter is the threshold of the difference. The second
term expresses that the RNR value of either clone
set in the pair is sufficiently large. The max parameter
determines the threshold on the RNR value. The relationship
between the parameter values and the resulting detection
accuracy and speed is evaluated in the experiment described
in Section V-C.</p>
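      <p>The omission rule can be sketched in Python as follows. This is an illustration under the assumption that RNR values and both parameters are expressed on the 0 to 100 scale used in Figure 3; the names omit_similarity, pairs_to_compare, and max_param (the max parameter, renamed to avoid shadowing the built-in function) are hypothetical.</p>
      <preformat><![CDATA[
```python
from itertools import combinations
from typing import List, Tuple


def omit_similarity(rnr_m: float, rnr_n: float, diff: float, max_param: float) -> bool:
    """Return True when the similarity calculation for the pair may be skipped."""
    differs_enough = abs(rnr_m - rnr_n) > diff          # first term of the formula
    one_is_large = max(rnr_m, rnr_n) > 100 - max_param  # second term of the formula
    return differs_enough and one_is_large


def pairs_to_compare(rnr_values: List[float], diff: float, max_param: float) -> List[Tuple[int, int]]:
    """Index pairs of clone sets whose similarity still has to be calculated."""
    return [
        (i, j)
        for i, j in combinations(range(len(rnr_values)), 2)
        if not omit_similarity(rnr_values[i], rnr_values[j], diff, max_param)
    ]


rnr_values = [95, 60, 90, 55]  # made-up RNR values of four clone sets

# diff = 0 and max = 0: the second term can never hold, so nothing is omitted.
print(pairs_to_compare(rnr_values, diff=0, max_param=0))
# diff = 20 and max = 10: pairs (0, 1) and (0, 3) are omitted.
print(pairs_to_compare(rnr_values, diff=20, max_param=10))
```
]]></preformat>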
      <p>IV. IMPLEMENTATION</p>
    </sec>
    <sec id="sec-4">
      <title>A. Clone detection tool</title>
      <p>Our proposed approach detects clones using a clone
detection tool. Our implementation utilizes CCFinder [13].</p>
    </sec>
    <sec id="sec-5">
      <title>B. Coefficient of similarity</title>
      <p>We define the similarity between a pair of clone sets.
Hereafter, we call a pair of clone sets a clone set pair.</p>
      <p><disp-formula><tex-math>Sim(S_m, S_n) = \frac{|N(S_m) \cap N(S_n)|}{\max(|N(S_m)|, |N(S_n)|)}</tex-math></disp-formula>
where S<sub>m</sub> and S<sub>n</sub> are clone sets. N is a
function that returns the set of n-grams created from the tokens
included in a given clone set. The target tokens are extracted
by lexical analysis of the text representation of the clone set,
with the identifiers and literals included in
the text normalized. The n-gram size in our implementation is 5, following
Myles et al. [14]. max is a function that returns the maximum
of its arguments. We use n-grams because
we want to calculate text similarity without considering
whether each instruction is repeated or not. If two clone sets
have a similarity higher than a threshold, they are regarded
as a single clone set. Otherwise, they remain different
clone sets. At this moment, we use 0.9 as the default
threshold. This value is evaluated in Section VII.</p>
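      <p>A minimal Python sketch of the similarity coefficient and of a curation step built on it is shown below. Several points are assumptions made for illustration rather than a description of our implementation: each clone set is represented by a single normalized token sequence, pairs are merged when the similarity is equal to or greater than the threshold, merging is transitive (implemented with a small union-find), and the names ngrams, similarity, and curate are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Set, Tuple


def ngrams(tokens: List[str], n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of n-grams over a normalized token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def similarity(tokens_m: List[str], tokens_n: List[str], n: int = 5) -> float:
    """Sim = |N(Sm) & N(Sn)| / max(|N(Sm)|, |N(Sn)|)."""
    grams_m, grams_n = ngrams(tokens_m, n), ngrams(tokens_n, n)
    denominator = max(len(grams_m), len(grams_n))
    if denominator == 0:  # both sequences are shorter than n tokens
        return 0.0
    return len(grams_m & grams_n) / denominator


def curate(clone_sets: List[List[str]], threshold: float = 0.9, n: int = 5) -> List[List[int]]:
    """Group indices of clone sets whose pairwise similarity reaches the threshold."""
    parent = list(range(len(clone_sets)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(clone_sets)):
        for j in range(i + 1, len(clone_sets)):
            if similarity(clone_sets[i], clone_sets[j], n) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(clone_sets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# "$" stands for a normalized identifier or literal.
a = "$ = $ + $ ; $ ( $ )".split()
b = "$ = $ + $ ; $ ( $ .".split()  # differs from a only in the last token

print(similarity(a, a))   # 1.0
print(curate([a, a, b]))  # [[0, 1], [2]]
```
]]></preformat>
      <p>In this example, a and b share five of their six 5-grams, so their similarity is below the default threshold of 0.9 and they remain separate clone sets.</p>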
      <p>V. EXPERIMENTAL DESIGN</p>
      <p>This research evaluates our proposed approach by answering
the following research questions.</p>
      <p>A. RQ1: How fast does our approach detect cross-project
clones?</p>
      <p>One primary goal is to detect cross-project clones as fast
as possible. In this evaluation, we create same-size pseudo
projects from the BigCloneBench dataset [15]. Then, we compare
the execution time of two clone detections: our proposed
approach and CCFinder. In the former detection, within-project
clones are detected by CCFinder from each pseudo project,
and then cross-project clones are generated by the curation.
In the latter detection, cross-project clones are detected directly
from the whole set of pseudo projects by using CCFinder.</p>
      <p>B. RQ2: How fast does our approach follow update of target
projects?</p>
      <p>Our proposed approach utilizes the results of the clone
detection per project. Herein, we assume that the source code
of a project is updated and we need to obtain cross-project clones
again. If we use the proposed approach, we only need to detect
clones again from the updated project. The detection
results of the updated project and the existing detection results of the
other projects are then input to the curation. If we do not use the
proposed approach, we need to run CCFinder on the whole
set of target projects. In this evaluation, we compare the two
kinds of detection: our proposed approach and CCFinder.</p>
      <p>C. RQ3: How effective is parameters adjustment to improve
clone detection performance?</p>
      <p>As described in Section III-C, our proposed approach has
two parameters, diff and max, that specify the area in which
similarity is not calculated. We seek appropriate parameter values by
executing the proposed approach with different parameters.</p>
      <p>VI. DATA SETS</p>
      <p>In this research, we utilized two datasets, Apache [16] and
BigCloneBench [15] datasets.</p>
      <p>The Apache dataset is composed of 84 Java projects of the
Apache Software Foundation (http://www.apache.org). Unfortunately, CCFinder
reported encoding errors on nine projects, because each of them
included at least one file in a character encoding that CCFinder
cannot handle. Consequently, we
excluded the nine projects, leaving 75 target projects.</p>
      <p>The BigCloneBench dataset is one of the largest clone
benchmarks available to date. It was created from IJaDataset
2.0 (https://sites.google.com/site/asegsecold/projects/seclone), which is composed of 25,000 Java systems. The
benchmark includes 2.9 million source files with 8 million
manually validated clone pairs. The BigCloneBench dataset has been used for
clone evaluations and scalability testing in several large-scale
clone detection and clone search studies [8]. In this research,
we constructed 256 pseudo projects from the BigCloneBench
dataset. The 256 pseudo projects are composed of two groups:
128 projects of 10 KLoC and 128 projects of 100 KLoC. Each
pseudo project consists of source files that were randomly
selected from the BigCloneBench dataset.</p>
      <p>VII. EVALUATION</p>
      <p>We conducted an evaluation of our proposed approach on a
single workstation with the following specification:
Windows 10, a 1.6 GHz processor (6 cores), 32 GB of memory,
and a 512 GB SSD.</p>
      <p>The proposed approach curates clone sets if the similarity
between different clone sets exceeds a threshold. Hence,
the choice of the similarity threshold affects the curation
performance. To find the optimal threshold for curation, we
measured the execution time and the number of curated
clone sets when curating clone sets from the Apache dataset
described in Section VI with different thresholds. Table I shows
the results.</p>
      <p>As the threshold was reduced, the execution time
increased significantly. On the other hand, as the threshold
was increased, the execution time was shortened, but
the number of clone set pairs whose similarity exceeds the
threshold decreased. Considering this trade-off, we judged 0.9
to be the optimal value and used it as the threshold in the
subsequent evaluations.</p>
      <p>A. RQ1: How fast does our approach detect cross-project
clones?</p>
      <p>To answer RQ1, we compared the execution time of two
clone detections: our proposed approach and CCFinder. We
used the following parameter values when applying our
proposed technique.</p>
      <p>diff: 0
max: 0 (similarity calculation is not omitted)</p>
      <p>We utilized the dataset that had been constructed from the
BigCloneBench dataset as described in Section VI. Figure 4 shows
an overview of how the dataset was constructed. To detect
clones while changing the number of target projects, we
randomly extracted seven sets of pseudo projects from each
of the 10 KLoC group and the 100 KLoC group.
The number of pseudo projects in set i, i ∈ [1, 7], is 2<sup>i</sup>.</p>
      <p>The execution time of each clone detection is shown in
Figure 5. Our approach detected clones faster than CCFinder for both
the 10 KLoC and the 100 KLoC projects. For the 10 KLoC projects,
the speed-up over CCFinder ranged from 1.24 times (minimum) to
6.24 times (maximum). The greater the number of target projects, the
higher the ratio of CCFinder's execution time to that of our proposed
approach. We consider that there are two
reasons for these results: our proposed
approach detects clones in parallel, and it detects within-project
clones from each project instead of cross-project clones from
the whole dataset.</p>
      <p>For the 100 KLoC projects, the speed-up ranged from 11.4 times
(minimum) to 17.1 times (maximum). Our proposed approach
was able to detect clones from the 128-project dataset of 100
KLoC, whereas CCFinder was not able to finish
detecting clones from the same dataset because of an out-of-memory
error. Detecting clones from a vast amount of source code requires a huge
amount of memory, which is why CCFinder failed on this dataset.
In our approach, CCFinder takes only
a small amount of source code in a single clone detection and
is executed many times in parallel. Consequently,
no out-of-memory error occurred in our approach.</p>
      <p>We also evaluated the ratio of clones that were not detected
by our proposed approach. The approach cannot detect clone
pairs that exist only between different projects because it
curates clones that are detected from each of the
projects. Hence, we measured the ratio of clones that the
proposed approach could detect. We utilized the 64 pseudo
projects of 100 KLoC as the target projects, and we
compared the number of clones detected by the
proposed approach with the number of clones detected
by CCFinder. The results are shown in Fig. 6. The ratio of clones
that can be detected decreases
as the number of projects increases, because clones
that exist only between different projects increase as the
number of projects increases. As a result, in the case of 64
projects of 100 KLoC, 38 percent of the clones were not
detected, but the remaining 62 percent were
detected 17 times faster than with CCFinder.</p>
      <p>Figure 5 panels: (a) Execution Time for Projects of 10 KLoC; (b) Execution Time for Projects of 100 KLoC (horizontal axis: #Projects, from 2 to 128; series: CCFinder, Our Approach).</p>
      <p>Our answer to RQ1 is that our proposed approach can detect
clones up to 17 times faster than CCFinder. Our approach can
also detect clones from a large target on which CCFinder
fails due to insufficient memory.</p>
      <p>B. RQ2: How fast does our approach follow update of target
projects?</p>
      <p>In this evaluation, we assumed the following situation: the source code
of each target project is updated once, on different dates, and a user
wants to detect cross-project clones as rapidly as possible just
after any project is updated. Under this assumption, we utilized
the 64 pseudo projects of 100 KLoC that had been
constructed in RQ1. We did not use the dataset of 128 projects
because CCFinder cannot detect clones from it.
When we directly detect cross-project clones with
CCFinder, a clone detection from the whole of the 64 projects
is executed 64 times. When
we apply the proposed approach, a clone detection from the
updated project and a clone curation for the whole of the 64
projects are executed 64 times. We measured the execution
time in both cases.</p>
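      <p>The incremental procedure compared here can be sketched in Python as follows. This is only an illustration: the class IncrementalCuration and the detect and curate callables are hypothetical placeholders for the first-stage detector (CCFinder in our implementation) and the second-stage curation, and the stand-in functions in the usage part do not perform real clone detection.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Dict, List

CloneSet = List[str]  # hypothetical: one normalized token sequence per clone set


class IncrementalCuration:
    """Caches first-stage results per project and re-curates after each update."""

    def __init__(self,
                 detect: Callable[[str], List[CloneSet]],
                 curate: Callable[[List[CloneSet]], List[List[int]]]) -> None:
        self._detect = detect
        self._curate = curate
        self._cache: Dict[str, List[CloneSet]] = {}

    def update(self, project: str) -> List[List[int]]:
        # First stage: run detection on the updated project only.
        self._cache[project] = self._detect(project)
        # Second stage: curate the cached clone sets of all projects.
        merged = [cs for name in sorted(self._cache) for cs in self._cache[name]]
        return self._curate(merged)


# Usage with stand-in functions that only record what was called.
calls: List[str] = []


def fake_detect(project: str) -> List[CloneSet]:
    calls.append(project)
    return [[project, "clone"]]


def fake_curate(clone_sets: List[CloneSet]) -> List[List[int]]:
    return [[i] for i in range(len(clone_sets))]


pipeline = IncrementalCuration(fake_detect, fake_curate)
pipeline.update("A")
pipeline.update("B")
result = pipeline.update("A")  # only project A is detected again

print(calls)        # ['A', 'B', 'A']
print(len(result))  # 2
```
]]></preformat>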
      <p>The measurement results are shown in Table II. On average, CCFinder
took 11,207 seconds to detect clones, while our approach took only 1,006
seconds. Therefore, our approach
was able to detect clones 11 times faster than CCFinder. We
investigated the breakdown between clone detection and clone curation in
our approach and found that 97% of the execution time was spent
on clone curation.</p>
      <p>Our answer to RQ2 is that the proposed approach took only
9% of the execution time required to detect cross-project clones from the
whole of the 64 pseudo projects with CCFinder.</p>
      <p>C. RQ3: How effective is parameters adjustment to improve
clone detection performance?</p>
      <p>In the process of answering RQ1 and RQ2, we found
that the clone curation step accounts for the majority of the
execution time. This is because our approach calculates the
similarities between all clone sets. To curate clones more
rapidly, our approach needs to reduce the number of similarity
calculations. We considered that omitting the similarity calculation
between clone sets that have little effect on the
curation results would make curation faster. We also
considered that we could identify such clone set pairs by
utilizing RNR, as described in Section III-C. Thus, to
find such clone set pairs, we investigated the distribution
of similar clone set pairs in the 75 projects of the Apache dataset
described in Section VI.</p>
      <p>Fig. 7. The Percentage of Similar Clone Set Pairs (color scale: percentage of unsimilar clone set pairs [%]; horizontal axis: smaller RNR of the clone set in a clone set pair [x 1/100]).</p>
      <p>In Figure 7, the color of each cell
indicates the percentage of clone set pairs, at the corresponding X-axis and
Y-axis RNR values, whose similarity is below 0.9. White cells mean that
the percentage is 99.99%. Red cells mean that the percentage
is higher than 99.99%, while green cells mean that the
percentage is lower than 99.99%. We considered that the clone
sets in a clone set pair are not similar when the RNR values
of the clone sets are largely different from each other and the
RNR value of either clone set in the pair is large.
Therefore, our approach omits the similarity calculations of
such clone set pairs by using two parameters, diff and max.</p>
      <p>In this research question, we seek appropriate values of
the two parameters, diff and max, to curate clones rapidly. We
attempted to compare the execution time and the precision
of clone curation for each of the parameter combinations.
However, it takes too long to curate clones for all
combinations of parameters. Hence, we measured the number of
similarity calculations instead of the execution time. Namely,
by comparing the number of similarity calculations and the
precision of clone curation, we evaluated how effectively
our approach omits similarity calculations.</p>
      <p>Figure 8(a) is a heatmap that shows the percentage of
clone set pairs curated for each pair of parameters. White cells
indicate that the percentage of curated clone set pairs is
90%. Green cells mean that the percentage is higher
than 90%, while red cells mean that the percentage is
lower than 90%. Figure 8(b) is another heatmap that shows
how many similarity calculations are omitted for each pair
of parameters, diff and max. Each cell shows the
percentage of similarity calculations performed with the given two
parameters relative to the similarity calculations performed without omitting
any calculations. White cells indicate that the percentage
of similarity calculations is 60%. Green cells mean that
the percentage is higher than 60%, while red cells mean
that the percentage is lower than 60%. We can see that there
are several value pairs of the two parameters whose cells in
both Figure 8(a) and Figure 8(b) are white. Namely, there
are parameters that can reduce the number of similarity
calculations to 60 percent while keeping 90 percent of the
curated clone set pairs. Thus, we sought the optimal
pair of parameters using the heatmaps, and we measured the
number of similarity calculations that could be omitted when
the optimized parameters were utilized.</p>
      <p>Figure 9 shows the relationship between the ratio of curated
clone set pairs and the ratio of similarity calculation reduction.
According to Figure 7, we can see that there are multiple
pairs of the two thresholds that yield the same ratio of curated
clone set pairs. Herein, we focus on the pair of the two
parameters that reduces similarity calculations the most,
and we investigated the relationship between the two ratios.
As a result, we found that 41% of similarity calculations can
be omitted by sacrificing 5% of the curations.</p>
      <p>Our answer to RQ3 is that using appropriate parameter
values can eliminate 41% of similarity calculations while
sacrificing only 5% of the clone curations.</p>
      <p>VIII. THREATS TO VALIDITY</p>
    </sec>
    <sec id="sec-6">
      <title>A. Clone detector</title>
      <p>In the evaluations, we utilized CCFinder as the clone detector.
However, there are many clone detectors besides CCFinder.
With other clone detectors, our approach is not necessarily
faster than conventional clone detection, as it is with
CCFinder. We plan to evaluate whether other clone detectors
work well in our approach.</p>
    </sec>
    <sec id="sec-7">
      <title>B. Experimental target</title>
      <p>In the evaluations, we succeeded in reducing similarity
calculations by 41% on the Apache projects dataset. However,
with other datasets, we may obtain a different reduction
ratio.</p>
      <p>IX. CONCLUSION</p>
      <p>In this research, we proposed a technique to detect clones
rapidly. The proposed technique curates the results of clone
detection performed on each of the projects. In our evaluation,
the technique detected clones up to 17 times faster than simply
running CCFinder on the whole target. In addition, the execution
time of incremental updates was reduced to 9% of that of
non-incremental detection. Furthermore, we showed that using
appropriate parameter values can omit 41% of similarity
calculations while sacrificing only 5% of the clone curations.
In the future, we plan to improve our technique so that it can
be applied to clone detectors other than CCFinder, such as
SourcererCC [9] or NiCAD [17].
</p>
      <p>[Figure axis labels: Percentage of curated clone set pairs [%]; Percentage of similarity calculations [%]]</p>
      <p>
[1] A. Lozano and M. Wermelinger, “Assessing the effect of clones on</p>
      <p>changeability,” in 2008 IEEE International Conference on Software</p>
      <p>Schneider, “An empirical study of the impacts of clones in software</p>
      <p>Conference on Mining Software Repositories (MSR 2010), May 2010,
[4] T. Zhang and M. Kim, “Automated transplantation and differential
testing for clones,” in Proceedings of the 39th International Conference
on Software Engineering, ser. ICSE ’17, 2017, pp. 665–676.
[5] A. Sheneamer and J. Kalita, “A survey of software clone detection</p>
      <p>simultaneously in detecting application clones on android markets,” in
Proceedings of the 36th International Conference on Software
Engineer</p>
      <p>bertillonage: finding the provenance of an entity,” in Proceedings of the
8th working conference on mining software repositories (MSR), 2011,</p>
      <p>Copyright © 2019 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>code clone search via multiple code representations,” Empirical Software
ercc: scaling code clone detection to big-code,” in 2016 IEEE/ACM 38th
International Conference on Software Engineering (ICSE), 2016, pp.</p>
      <p>1157–1168.
implementation for investigating code clones in a software system,”
differences from similar programs? a cohesion metric approach,” in 2013
7th International Workshop on Software Clones (IWSC), May 2013, pp.
23–29.</p>
      <p>2002.
“Automatic clone recommendation for refactoring based on the present</p>
      <p>Maintenance and Evolution (ICSME), 2018, pp. 115–126.</p>
      <p>Proceedings of the 2005 ACM Symposium on Applied Computing, ser.</p>
      <p>SAC ’05, 2005, pp. 314–318.
“Towards a big data curated benchmark of inter-project code clones,”
in Proceedings of the Early Research Achievements track of the
(ICSME 2014), 2014, pp. 476–480.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krinke</surname>
          </string-name>
          ,
          and K. A. Schneider, “
          <article-title>An empirical study of the impacts of clones in software maintenance</article-title>
          ,” in
          <source>2011 IEEE 19th International Conference on Program</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>