=Paper=
{{Paper
|id=Vol-2511/QuASoQ-03
|storemode=property
|title=A Code Clone Curation - Towards Scalable and Incremental Clone Detection -
|pdfUrl=https://ceur-ws.org/Vol-2511/QuASoQ-03.pdf
|volume=Vol-2511
|authors=Masayuki Doi,Yoshiki Higo,Shinji Kusumoto
|dblpUrl=https://dblp.org/rec/conf/apsec/DoiHK19
}}
==A Code Clone Curation - Towards Scalable and Incremental Clone Detection -==
7th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2019)
A Code Clone Curation
— Towards Scalable and Incremental Clone Detection —
Masayuki Doi, Yoshiki Higo, and Shinji Kusumoto
Graduate School of Information Science and Technology, Osaka University, Japan
{m-doi, higo, kusumoto}@ist.osaka-u.ac.jp
Abstract—Code clones have a negative impact on software maintenance. Clone detection on large-scale source code takes a long time and, even worse, such detections are occasionally aborted because the target is too large. Herein, we consider detecting cross-project code clones from a set of many projects. In such a situation, if we simply use a clone detection tool, we need to detect cross-project code clones again from the whole set of projects even when only a single project is updated. We therefore need a new clone detection scheme that rapidly detects code clones from a set of many projects and supports incremental detection. In this paper, we propose an approach that rapidly obtains code clones from many projects. Unlike other clone detection techniques, our approach adopts a multi-stage detection strategy: in the first stage, code clones are detected from each project separately; in the second stage, the code clones detected in the first stage are unified by our clone curation technique. This multi-stage strategy enables incremental clone detection: if the source code of a project is updated, the first-stage detection is applied only to that project, and then the second-stage curation is performed. We constructed a software system based on the proposed approach. At present, the system utilizes an existing detection tool, CCFinder. We applied the system to the large-scale source code (12 million LoC) of 128 projects. We also detected clones from the same target by simply using CCFinder and compared the detection times. As a result, clone detection with our system was 17 times faster than CCFinder's detection. We also compared detection times under the assumption that each target project is updated once. The experimental results showed that our incremental clone detection shortened the detection time by 91% compared to non-incremental clone detection.

Index Terms—code clone detection, large-scale, incrementability

I. INTRODUCTION

A code clone (hereafter, clone) is a code fragment that is identical or similar to another code fragment in source code. Clones may lead to software maintenance problems [1], [2] and bug propagation [3], [4]. Thus, researchers have been actively studying techniques for clone detection. For example, a number of tools for automatic clone detection from source code have been developed and released [5].

The amount of source code is steadily increasing, and large-scale clone detection has become a necessity. Large-scale clone detection can be used for detecting similar mobile applications [6], finding the provenance of components [7], and code search [8]. Thus, scalable clone detectors have been developed [9]. However, such clone detectors seek to detect all clones present in the target software, so when large-scale clone detection is conducted with such a detector, it outputs a huge number of clones. It is impractical to check manually whether each detected clone is worth more than a glance. Furthermore, if a new project is added to the detection targets, or a part of the target projects is updated, the clone detector needs to detect clones from all the target projects again, which takes too long. To solve those problems, a new technique for large-scale clone detection is required.

Clone detection results may include many negligible clones. Negligible clones are worthless when dealing with clone information in software development and maintenance. For example, language-dependent clones are negligible clones [10]. When a specific programming language is used, a programmer cannot help writing similar code fragments that cannot be merged into a single fragment due to language limitations. A language-dependent clone consists of such code fragments, so language-dependent clones are negligible. Many techniques have been proposed to remove negligible clones from clone detection results. These techniques classify clones according to whether code fragments have high cohesion and low coupling [11], according to the ratio of repetitive structures included in code fragments [10], or by machine learning with some metrics [12]. By using those techniques, negligible clones can be filtered out from the detection results. We expected that incorporating such a filtering technique into the clone detection process would make it possible to detect clones on a larger scale more efficiently.

In this paper, we propose a clone curation approach that rapidly detects clones from many projects. Our approach detects clones in three stages: (1) detecting intra-project clones, (2) filtering out negligible clones from the detection results, and (3) consolidating different sets of similar intra-project clones into a set of cross-project clones. This process realizes scalable clone detection because

• traditional clone detection, which has high computational complexity, is performed only on small-size source code (the source code of a single project), and
• negligible clones are filtered out before cross-project clones are generated.

We implemented the proposed technique and evaluated its scalability. When the proposed technique and CCFinder [13] detected clones from 128 pseudo projects, each of which includes 100 KLoC, the proposed technique was able to detect clones up to 17 times
faster than CCFinder. In addition, to evaluate whether the proposed technique can detect clones rapidly when a part of the target projects is updated, we measured the execution time required to detect clones again. In that situation, our technique was able to detect clones 11 times faster than CCFinder. Finally, we demonstrated that adjusting the parameters of our technique makes clone detection even faster.

The remainder of this paper is structured as follows. Section II defines the terminology used in this paper. Section III presents our approach. Section IV describes its implementation. Section V describes the design of the evaluation experiments. Section VI describes the datasets used in the experiments. Section VII describes the experimental results and discussions. Section VIII describes the threats to validity. Section IX concludes the paper and discusses future work.
II. PRELIMINARIES

A. Code clone

Given two identical or similar code fragments, a clone relation holds between them. A clone relation is defined as an equivalence relation (i.e., a reflexive, transitive, and symmetric relation) on code fragments.

If a clone relation is established between two code fragments, the pair is called a clone pair. An equivalence class of a clone relation is called a clone set. That is, a clone set is a maximal set of code fragments in which a clone relation exists between any pair of them. A code fragment within a clone set is called a code clone or just a clone.

B. RNR

Many negligible clones are included in detection results. In our previous research, we proposed a metric RNR to automatically filter out such negligible clones [10]. RNR(S) is the ratio of non-repeated code sequences in a clone set S. Assume that a clone set S includes a code fragment f, that LOS_whole(f) represents the length of the whole token sequence of f, and that LOS_repeated(f) represents the length of the repeated sequence of f. Then RNR(S) is calculated by the following formula:

$$RNR(S) = 1 - \frac{\sum_{f \in S} LOS_{repeated}(f)}{\sum_{f \in S} LOS_{whole}(f)}$$

For example, assume that we detect clones from the following two source files (F1, F2), each consisting of five tokens:

F1: a b c a b
F2: c c* c* a b

where the superscript "*" indicates that the token is in a repeated code sequence. We use the label C(Fi, j, k) to represent the fragment that starts at the j-th token and ends at the k-th token of source file Fi (j must be less than k). In this case, the following two clone sets are detected from the source files:

S1: C(F1, 1, 2), C(F1, 4, 5), C(F2, 4, 5)
S2: C(F2, 1, 2), C(F2, 2, 3)

The RNR value of each clone set is:

$$RNR(S_1) = 1 - \frac{0+0+0}{2+2+2} = 1.0, \qquad RNR(S_2) = 1 - \frac{1+2}{2+2} = 0.25$$

The value 0.25 of RNR(S2) indicates that most of the tokens in S2 are in repeated code sequences. Our previous research reported that low-RNR clone sets are typically negligible clones such as consecutive variable declarations, consecutive method invocations, and case entries of switch statements.
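To make the metric concrete, the following minimal Python sketch (our illustration, not the authors' implementation) computes RNR for the example above; a fragment is modeled as a list of (token, is_repeated) pairs, so LOS_whole is the fragment length and LOS_repeated counts the flagged tokens.

```python
# A minimal sketch (our illustration, not the authors' implementation) of RNR.
def rnr(clone_set):
    whole = sum(len(fragment) for fragment in clone_set)
    repeated = sum(flag for fragment in clone_set for _, flag in fragment)
    return 1 - repeated / whole

# The example above: in F2, the second and third tokens are repeated ("*").
F1 = [("a", 0), ("b", 0), ("c", 0), ("a", 0), ("b", 0)]
F2 = [("c", 0), ("c", 1), ("c", 1), ("a", 0), ("b", 0)]

def c(source, j, k):
    """C(F, j, k): the fragment from the j-th to the k-th token (1-based)."""
    return source[j - 1:k]

S1 = [c(F1, 1, 2), c(F1, 4, 5), c(F2, 4, 5)]
S2 = [c(F2, 1, 2), c(F2, 2, 3)]
print(rnr(S1), rnr(S2))  # 1.0 0.25
```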
In our previous research, we also examined how well RNR-based filtering worked by calculating its recall, precision, and F-value, defined as follows, where S_filtered denotes the clone sets filtered out by RNR and S_negligible denotes all truly negligible clone sets:

$$Recall(\%) = 100 \times \frac{|S_{negligible} \cap S_{filtered}|}{|S_{negligible}|}$$

$$Precision(\%) = 100 \times \frac{|S_{negligible} \cap S_{filtered}|}{|S_{filtered}|}$$

$$F\text{-}value = \frac{2 \times Recall \times Precision}{Recall + Precision}$$

[Fig. 1. Transitions of Recall, Precision, and F-value by RNR: recall, precision, and F-value (%) plotted against the RNR threshold (RNR × 1/100, from 0 to 100)]

Figure 1 illustrates the transitions of recall, precision, and F-value as the RNR threshold varies between 0 and 1. It shows that 65% of negligible clone sets are filtered out with no false positives when 0.5 is used as the RNR threshold.
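As a worked instance of these measures (a sketch assuming the standard set-based definitions reconstructed above, with clone sets represented by ids), the following snippet reproduces the arithmetic:

```python
# A sketch of the filtering measures, assuming the set-based definitions above.
def prf(filtered: set[int], negligible: set[int]):
    hit = len(filtered & negligible)          # truly negligible sets filtered out
    recall = 100 * hit / len(negligible)
    precision = 100 * hit / len(filtered)
    f_value = 2 * recall * precision / (recall + precision)
    return recall, precision, f_value

# Filtering 3 of 5 negligible sets with no false positives:
print(prf({1, 2, 3}, {1, 2, 3, 4, 5}))  # (60.0, 100.0, 75.0)
```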
III. PROPOSED TECHNIQUE

The entire process of our approach is summarized in Figure 2. It is composed of three steps: clone detection on each project, negligible clone exclusion, and clone curation across projects. The following subsections describe the design of each step.
[Fig. 2. The Overview of Our Approach: Input (target projects) → Step 1: clone detection on each project (clone sets) → Step 2: negligible clones exclusion (non-negligible clone sets) → Step 3: clone curation cross projects → Output (detected clone sets)]
A. Clone detection on each project

In this step, clones are detected from each of the target projects, and the metric RNR is calculated for each detected clone set. This process realizes scalable clone detection because

• the size of the source code handled in a single detection is reduced, and
• clone detection can be executed in parallel on each of the target projects.

In the case of detecting clones from a large project, our approach can detect clones faster by dividing the project into a set of smaller pseudo projects. However, over-division may lead to fewer detected clones because our approach cannot detect clone pairs that exist only between the divided projects.

We utilize an existing detection tool. Many detectors output a text file as the detection result. Consequently, our approach can be applied to many detectors, such as token-based or PDG-based ones, because the clone sets output by this step can be obtained by parsing the text file produced when detecting clones from each target project. Moreover, if a clone detector can calculate RNR as well as detect clone sets, this step can be performed efficiently; for example, CCFinder [13] has a function to calculate RNR. If a clone detector does not have such a function, we need to calculate RNR as a post-process of clone detection.
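The sketch below illustrates one way this step could be driven (our assumption, not the paper's tooling; the "clone-detector" command line is a hypothetical stand-in, not CCFinder's actual interface):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

# A sketch of Step 1: run an off-the-shelf detector on each project
# independently, in parallel. "clone-detector" is a hypothetical stand-in.
def detect_project(project_dir: Path, out_dir: Path) -> Path:
    """Run the detector on one project and return its text report."""
    report = out_dir / f"{project_dir.name}.clones.txt"
    subprocess.run(
        ["clone-detector", "--source", str(project_dir), "--out", str(report)],
        check=True,
    )
    return report

def detect_all(projects: list[Path], out_dir: Path) -> list[Path]:
    # Projects are independent, so per-project detection parallelizes freely;
    # the reports are later parsed into clone sets and RNR values.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(detect_project, projects, [out_dir] * len(projects)))
```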
B. Negligible clones exclusion

It is inevitable that negligible clones are included in clone detection results. Their presence has negative impacts from two viewpoints: it makes clone detection results harder to analyze, and it makes clone detection take longer. Hence, our approach excludes negligible clones from the candidates for cross-project clones. Following our previous research [10], the clone sets to be excluded are those satisfying RNR < 50. There are two reasons to exclude low-RNR clones: (1) users can avoid spending time analyzing negligible clones, and (2) the number of clone sets that are targets of similarity calculation is reduced, so the similarity calculation, and thus the whole detection, finishes faster.

C. Clone curation cross projects

Cross-project clone sets are curated based on whether the similarity between clone sets is equal to or greater than a threshold.

To save time, our technique omits the similarity calculation for a pair of clone sets if the pair satisfies both of the following conditions:

• the RNR value of one clone set in the pair is largely different from that of the other, and
• the RNR value of either clone set in the pair is sufficiently large.

RNR is based on the structure of code fragments, so similar clone sets tend to have almost the same RNR values. Thereby, clone detection should get faster by omitting the similarity calculation between clone sets whose RNR values differ greatly while one of them is sufficiently large. More precisely, given the values RNR_m and RNR_n of two clone sets and two parameters θ_diff and θ_max, our approach omits the similarity calculation when the following predicate holds:

$$Omit(\theta_{diff}, \theta_{max}) = (|RNR_m - RNR_n| > \theta_{diff}) \wedge (\max(RNR_m, RNR_n) > 100 - \theta_{max})$$

where max is a function that returns the maximum value of its arguments.
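A direct transcription of this predicate (a sketch; RNR values are assumed to be on the 0-100 scale used above):

```python
# True means the similarity calculation for this clone set pair is skipped.
def omit(rnr_m: float, rnr_n: float, theta_diff: float, theta_max: float) -> bool:
    return abs(rnr_m - rnr_n) > theta_diff and max(rnr_m, rnr_n) > 100 - theta_max
```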
[Fig. 3. Area to Calculate Similarity Between A Clone Set Pair: the axes are the smaller and larger RNR values of the clone sets in a pair (× 1/100, from 50 to 100); θ_diff and θ_max bound the area where the similarity is not calculated]

Figure 3 shows the area in which the similarity calculation is omitted by this predicate. The first term expresses that the RNR value of one clone set in the pair is largely different from that of the other; the θ_diff parameter is the threshold for this difference. The second term expresses that the RNR value of either clone set in the pair is sufficiently large; the θ_max parameter is the threshold for this RNR value. The relationship between the parameter values, the detection accuracy, and the resulting speed is evaluated in the experiment described in Section V-C.
IV. IMPLEMENTATION

A. Clone detection tool

Our proposed approach detects clones using an existing clone detection tool. Our implementation utilizes CCFinder [13].

B. Coefficient of similarity

We define the similarity between a pair of clone sets; hereafter, we call such a pair a clone set pair.

$$Sim(S_m, S_n) = \frac{|N(S_m) \cap N(S_n)|}{\max(|N(S_m)|, |N(S_n)|)}$$

where S_m and S_n are clone sets and N is a function that returns the set of n-grams created from the tokens included in a given clone set. The target tokens are extracted by lexical analysis of the text representation of the clone set, followed by normalization of the identifiers and literals in the text. The n-gram size of our implementation is 5, following Myles et al. [14]. max is a function that returns the maximum value of its arguments. We use n-grams because we want to calculate text similarity without considering whether each instruction is repeated or not.
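The following sketch (ours; the lexical analysis and identifier/literal normalization are detector-specific and omitted) shows the coefficient on already-normalized token sequences:

```python
# A sketch of the similarity coefficient between two clone sets, each
# represented by its normalized token sequence.
def ngrams(tokens: list[str], n: int = 5) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sim(tokens_m: list[str], tokens_n: list[str]) -> float:
    """Sim(Sm, Sn) = |N(Sm) ∩ N(Sn)| / max(|N(Sm)|, |N(Sn)|), with n = 5."""
    nm, nn = ngrams(tokens_m), ngrams(tokens_n)
    return len(nm & nn) / max(len(nm), len(nn))
```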
If two clone sets have a similarity higher than the threshold θ, they are regarded as a single clone set; otherwise, they remain separate clone sets. At present, we use 0.9 as the default value of θ; this value is evaluated in Section VII.
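A minimal sketch of this merging step, under our reading that merging is transitive (hence the union-find); `similarity` is, e.g., the `sim` function sketched above, and in the full pipeline pairs satisfying the Omit predicate of Section III-C would be skipped before this comparison:

```python
from typing import Callable

# A sketch (our reading of the paper, not the authors' code) of curation:
# clone sets whose pairwise similarity meets theta are unified transitively.
def curate(clone_sets: list[list[str]],
           similarity: Callable[[list[str], list[str]], float],
           theta: float = 0.9) -> list[list[int]]:
    parent = list(range(len(clone_sets)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(clone_sets)):
        for j in range(i + 1, len(clone_sets)):
            # In the full pipeline, omit(...) would skip some pairs here.
            if similarity(clone_sets[i], clone_sets[j]) >= theta:
                parent[find(i)] = find(j)

    merged: dict[int, list[int]] = {}
    for i in range(len(clone_sets)):
        merged.setdefault(find(i), []).append(i)
    return list(merged.values())  # groups of indices curated together
```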
V. EXPERIMENTAL DESIGN

This research evaluates our proposed approach by answering the following research questions.

A. RQ1: How fast does our approach detect cross-project clones?

One primary goal is to detect cross-project clones as fast as possible. In this evaluation, we create same-size pseudo projects from the BigCloneBench dataset [15]. Then, we compare the execution time of two clone detections: our proposed approach and CCFinder. In the former detection, intra-project clones are detected by CCFinder from each pseudo project, and then cross-project clones are generated by the curation. In the latter detection, cross-project clones are directly detected from the whole set of pseudo projects by using CCFinder.

B. RQ2: How fast does our approach follow updates of target projects?

Our proposed approach utilizes the results of clone detection per project. Herein, we assume that the source code of one project is updated and we need to obtain cross-project clones again. With the proposed approach, we only need to detect clones again from the updated project; the detection results of the updated project and the existing results of the other projects are then input to the curation. Without the proposed approach, we need to run CCFinder on the whole set of target projects. In this evaluation, we compare the two kinds of detections: our proposed approach and CCFinder.
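To make the assumed incremental workflow concrete (a sketch; the function names are ours, not the authors' code): cached per-project results are reused, and only the updated project is re-detected before re-curation.

```python
from pathlib import Path
from typing import Callable

# A sketch of the incremental mode compared in RQ2. `redetect` runs Step 1
# (detection plus RNR filtering) on one project; `curate` runs Step 3 over
# the pooled per-project clone sets.
def on_project_updated(
    updated: Path,
    all_projects: list[Path],
    cache: dict[Path, list[list[str]]],
    redetect: Callable[[Path], list[list[str]]],
    curate: Callable[[list[list[str]]], list[list[int]]],
):
    cache[updated] = redetect(updated)       # only the changed project
    pooled = [s for p in all_projects for s in cache[p]]
    return curate(pooled)                    # re-curate across all projects
```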
C. RQ3: How effective is parameter adjustment in improving clone detection performance?

As described in Section III-C, our proposed approach has two parameters, θ_diff and θ_max, that specify the area where similarity is not calculated. We seek appropriate parameters by executing the proposed approach with different parameter values.

VI. DATASETS

In this research, we utilized two datasets: the Apache [16] and BigCloneBench [15] datasets.

The Apache dataset is composed of 84 Java projects of the Apache Software Foundation¹. Unfortunately, CCFinder reported encoding errors on nine projects because each of them included at least one file encoded with a character encoding that CCFinder cannot handle. Consequently, we excluded those nine projects from our targets.

The BigCloneBench dataset is one of the largest clone benchmarks available to date. It was created from IJaDataset 2.0², which is composed of 25,000 Java systems. The benchmark includes 2.9 million source files with 8 million manually validated clone pairs. The BigCloneBench dataset has been used for clone evaluations and scalability testing in several large-scale clone detection and clone search studies [8].

¹ http://www.apache.org
² https://sites.google.com/site/asegsecold/projects/seclone
In this research, we constructed 256 pseudo projects from the BigCloneBench dataset. They are composed of two groups: 128 projects of 10 KLoC and 128 projects of 100 KLoC. Each pseudo project consists of source files randomly selected from the BigCloneBench dataset.

VII. EVALUATION

We conducted the evaluation of our proposed approach on a single workstation with the following specification:

• Windows 10
• 1.6 GHz processor (6 cores)
• 32 GB memory
• 512 GB SSD

The proposed approach curates clone sets when the similarity between different clone sets exceeds a threshold; hence, the choice of the similarity threshold affects the curation performance. To find an optimal threshold, we measured the execution time and the number of curated clone sets when curating clone sets from the Apache dataset described in Section VI with different thresholds. Table I shows the results.

TABLE I
EXECUTION TIME AND CURATED CLONE SETS WITH DIFFERENT PARAMETERIZATION

Threshold                   0.85      0.9       0.95
Execution Time              42m 28s   26m 55s   21m 40s
#Curated Clone Set Pairs    80,104    52,785    29,762

As the threshold was reduced, the execution time increased significantly. On the other hand, as the threshold was increased, the execution time shortened, but the number of clone set pairs whose similarity exceeds the threshold decreased. We therefore regard 0.9 as the optimal value and utilized it as the threshold in the subsequent evaluations.

A. RQ1: How fast does our approach detect cross-project clones?

To answer RQ1, we compared the execution time of two clone detections: our proposed approach and CCFinder. We used the following thresholds in the application of our proposed technique:

• θ_diff: 0
• θ_max: 0 (the similarity calculation is not omitted)

We utilized the dataset constructed from the BigCloneBench dataset described in Section VI. Figure 4 shows an overview of how the dataset was constructed. In order to detect clones while changing the number of target projects, we randomly extracted seven sets of pseudo projects from the 10 KLoC group and from the 100 KLoC group, respectively. The number of pseudo projects in set i, i ∈ [1, 7], is 2^i.

[Fig. 4. The Overview of How to Construct The Dataset: from the 128 pseudo projects of 10 KLoC and the 128 pseudo projects of 100 KLoC built from the BigCloneBench dataset, random extraction yields sets of 2^1 (= 2) through 2^7 (= 128) pseudo projects per group]

The execution time of each clone detection is shown in Figure 5. Our approach detects clones rapidly in both the 10 KLoC and the 100 KLoC cases. In the 10 KLoC case, the speed-up over CCFinder ranged from 1.24 times (minimum) to 6.24 times (maximum). The greater the number of target projects, the higher the ratio between CCFinder's execution time and that of our proposed approach. We consider there are two reasons for these results: our proposed approach detects clones in parallel, and it detects within-project clones from each project instead of cross-project clones from the whole dataset.

In the 100 KLoC case, the speed-up ranged from 11.4 times (minimum) to 17.1 times (maximum). Our proposed approach was able to detect clones from the 128-project dataset of 100 KLoC, whereas CCFinder was not able to finish detecting clones from the same dataset due to an out-of-memory error. Detecting clones from vast source code requires huge memory, which is why CCFinder failed on this dataset. In our approach, CCFinder takes only a small amount of source code in a single clone detection, and CCFinder is executed many times in parallel; consequently, no out-of-memory error occurred. Our approach was thus able to detect clones from a large dataset on which CCFinder could not finish due to insufficient memory.

We also evaluated the ratio of clones that were not detected by our proposed approach. The approach cannot detect clone pairs that exist only between different projects because it curates clones that are detected within each project. Hence, we measured the ratio of clones that the proposed approach could detect. We utilized the 64 pseudo projects of 100 KLoC as the target projects and compared the number of clones detected by the proposed approach with the number detected by CCFinder. The results are shown in Figure 6.
[Fig. 5. Execution Time for Each Project Group: (a) Execution Time for Projects of 10 KLoC; (b) Execution Time for Projects of 100 KLoC; execution time in seconds (log scale) of CCFinder and our approach against the number of projects (2 to 128)]

[Fig. 6. The Ratio of Detected Clones: the ratio of detected clones (%) against the number of projects (2 to 64)]

The ratio of clones that can be detected decreases as the number of projects increases. That is because clones that exist only between different projects increase as the number of projects increases. As a result, in the case of 64 projects of 100 KLoC, 38 percent of the clones were not detected, but the remaining 62 percent of the clones were detected 17 times faster than with CCFinder.

Our answer to RQ1 is that our proposed approach can detect clones up to 17 times faster than CCFinder. Our approach can also detect clones from large targets on which CCFinder fails due to insufficient memory.

B. RQ2: How fast does our approach follow updates of target projects?

In this evaluation, we assumed the following situation: the source code of each target project is updated once, on different dates, and a user wants to detect cross-project clones as rapidly as possible just after any project is updated. Under this assumption, we utilized the 64 pseudo projects of 100 KLoC that had been constructed for RQ1. We did not use the dataset of 128 projects because CCFinder cannot detect clones from it. Thus, in the case of directly detecting cross-project clones with CCFinder, a clone detection from the whole of the 64 projects is executed 64 times. In the case of applying the proposed approach, a clone detection from the updated project and a clone curation over the whole of the 64 projects are executed 64 times. We measured the execution time of both cases.

The measurement results are shown in Table II. On average, CCFinder took 11,207 seconds while our approach took only 1,006 seconds to detect clones. Therefore, our approach was able to detect clones 11 times faster than CCFinder. We investigated the breakdown between clone detection and clone curation in our approach and found that 97% of the execution time was spent on clone curation.

TABLE II
EXECUTION TIME OF UPDATING TARGET PROJECT INCREMENTALLY

              CCFinder   Our Approach
Average Time  11,207s    1,006s

Our answer to RQ2 is that the proposed approach took only 9% of the execution time of detecting cross-project clones from the whole of the 64 pseudo projects with CCFinder.

C. RQ3: How effective is parameter adjustment in improving clone detection performance?

In the process of answering RQ1 and RQ2, we found that the clone curation step accounts for the majority of the execution time. This is because our approach calculates the similarities between all clone sets. To curate clones more rapidly, our approach needs to reduce the number of similarity calculations. We considered that omitting the similarity calculation between clone set pairs that have little effect on the curation results would make curation faster, and that such pairs could be identified by utilizing RNR, as described in Section III-C. Thus, to find such clone set pairs, we investigated the distribution of similar clone set pairs in the 75 projects of the Apache dataset described in Section VI.
[Fig. 7. The Percentage of Similar Clone Set Pairs: a heatmap over the smaller and larger RNR values of the clone sets in a pair (× 1/100, from 50 to 100); each cell shows the percentage of dissimilar clone set pairs, with the color scale anchored at 0, 99.99, and 100%]

In Figure 7, the color of each cell indicates the percentage of clone set pairs, among those whose clone sets have the RNR values on the X-axis and Y-axis, whose similarity is below 0.9. The white cells indicate that the percentage is 99.99%; the red cells show percentages higher than 99.99%, while the green cells show percentages lower than 99.99%. We considered the clone sets in a pair to be dissimilar when their RNR values are largely different from each other and the RNR value of either clone set is sufficiently large. Therefore, our approach omits the similarity calculations of such clone set pairs using the two parameters θ_diff and θ_max.

In this research question, we seek appropriate values of the two parameters θ_diff and θ_max to curate clones rapidly. We attempted to compare the execution time and the precision of clone curation for each parameter combination. However, it takes too long to curate clones for all combinations of parameters. Hence, we measured the number of similarity calculations instead of the execution time. Namely, by comparing the number of similarity calculations and the precision of clone curation, we evaluated the effectiveness of omitting similarity calculations.

Figure 8(a) is a heatmap showing the percentage of clone set pairs curated for each pair of parameters. The white cells indicate that the percentage of curated clone set pairs is 90%; the green cells show percentages higher than 90%, while the red cells show percentages lower than 90%. Figure 8(b) is another heatmap showing how many similarity calculations are performed for each pair of parameters θ_diff and θ_max: each cell shows the percentage of similarity calculations with the given parameters relative to the similarity calculations without omitting any calculations. The white cells indicate that the percentage of similarity calculations is 60%; the green cells show percentages higher than 60%, while the red cells show percentages lower than 60%. We can see several value pairs of the two parameters whose cells are white in both Figure 8(a) and Figure 8(b). Namely, there are parameter values that reduce the number of similarity calculations to 60 percent while keeping 90 percent of the curated clone set pairs. Thus, we sought the optimal pair of parameters using the heatmaps and measured the number of similarity calculations that could be omitted when the optimized parameters were used.

Figure 9 shows the relationship between the ratio of curated clone set pairs and the ratio of similarity calculation reduction. As Figure 7 suggests, there are multiple pairs of the two thresholds that yield the same ratio of curated clone set pairs. Herein, we focus on the pair of parameters that reduces the similarity calculations the most, and we investigated the relationship between the two ratios. As a result, we found that 41% of the similarity calculations can be omitted by sacrificing 5% of the curations.

Our answer to RQ3 is that using appropriate parameter values can eliminate 41% of the similarity calculations while sacrificing only 5% of the clone curations.

VIII. THREATS TO VALIDITY

A. Clone detector

In the evaluations, we utilized CCFinder as the clone detector. However, there are many clone detectors besides CCFinder, and applying another detector in our approach will not necessarily be faster than a conventional clone detection, unlike with CCFinder. We are planning to evaluate whether other clone detectors work well in our approach.

B. Experimental target

In the evaluations, we succeeded in reducing the similarity calculations by 41% on the Apache dataset. However, other datasets may yield different reduction ratios.

IX. CONCLUSION

In this research, we proposed a technique to detect clones rapidly. The proposed technique curates the results of clone detections performed on each of the projects. In our performance evaluation, it was possible to detect clones up to 17 times faster than with CCFinder. In addition, the execution time of incremental updates was reduced to nine percent of the non-incremental one. Furthermore, we showed that using appropriate parameter values can eliminate 41% of the similarity calculations while sacrificing only 5% of the clone curations. In the future, we plan to improve our technique so that it can be applied to clone detectors other than CCFinder, such as SourcererCC [9] or NiCad [17].
[Fig. 8. Heatmap of Curated Clone Set Pairs and Similarity Calculations for Pairs of Each Parameter: (a) The Percentage of Curated Clone Set Pairs (white at 90%); (b) The Number of Similarity Calculations (white at 60%); both heatmaps plotted over θ_diff (0 to 50) and θ_max (0 to 50)]
[Fig. 9. Transitions of Similarity Calculations and Curated Clone Set Pairs: percentage of similarity calculations against percentage of curated clone set pairs (50 to 100%)]

REFERENCES

[1] A. Lozano and M. Wermelinger, "Assessing the effect of clones on changeability," in 2008 IEEE International Conference on Software Maintenance, 2008, pp. 227–236.
[2] M. Mondal, M. S. Rahman, R. K. Saha, C. K. Roy, J. Krinke, and K. A. Schneider, "An empirical study of the impacts of clones in software maintenance," in 2011 IEEE 19th International Conference on Program Comprehension, 2011, pp. 242–245.
[3] F. Rahman, C. Bird, and P. Devanbu, "Clones: What is that smell?" in Conference on Mining Software Repositories (MSR 2010), May 2010, pp. 72–81.
[4] T. Zhang and M. Kim, "Automated transplantation and differential testing for clones," in Proceedings of the 39th International Conference on Software Engineering, ser. ICSE '17, 2017, pp. 665–676.
[5] A. Sheneamer and J. Kalita, "A survey of software clone detection techniques," International Journal of Computer Applications, vol. 137, no. 10, pp. 1–21, 2016.
[6] K. Chen, P. Liu, and Y. Zhang, "Achieving accuracy and scalability simultaneously in detecting application clones on android markets," in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014, 2014, pp. 175–186.
[7] J. Davies, D. M. German, M. W. Godfrey, and A. Hindle, "Software bertillonage: finding the provenance of an entity," in Proceedings of the 8th Working Conference on Mining Software Repositories (MSR), 2011, pp. 183–192.
[8] C. Ragkhitwetsagul and J. Krinke, "Siamese: scalable and incremental code clone search via multiple code representations," Empirical Software Engineering, pp. 1–49, 2019.
[9] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, "SourcererCC: scaling code clone detection to big-code," in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 2016, pp. 1157–1168.
[10] Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue, "Method and implementation for investigating code clones in a software system," Information and Software Technology, vol. 49, no. 9, pp. 985–998, 2007.
[11] A. Goto, N. Yoshida, M. Ioka, E. Choi, and K. Inoue, "How to extract differences from similar programs? a cohesion metric approach," in 2013 7th International Workshop on Software Clones (IWSC), May 2013, pp. 23–29.
[12] R. Yue, Z. Gao, N. Meng, Y. Xiong, X. Wang, and J. D. Morgenthaler, "Automatic clone recommendation for refactoring based on the present and the past," in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 115–126.
[13] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: a multilinguistic token-based code clone detection system for large scale source code," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654–670, 2002.
[14] G. Myles and C. Collberg, "K-gram based software birthmarks," in Proceedings of the 2005 ACM Symposium on Applied Computing, ser. SAC '05, 2005, pp. 314–318.
[15] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, "Towards a big data curated benchmark of inter-project code clones," in Proceedings of the Early Research Achievements track of the 30th International Conference on Software Maintenance and Evolution (ICSME 2014), 2014, pp. 476–480.
[16] Y. Higo and S. Kusumoto, "How should we measure functional sameness from program source code? an exploratory study on java methods," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014, 2014, pp. 294–305.
[17] J. Cordy and C. Roy, "The NiCad clone detector," in IEEE International Conference on Program Comprehension, June 2011, pp. 219–220.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).