<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Proceedings of the SQAMIA</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Code Clone Benchmarks Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>TIJANA VISLAVSKI</string-name>
          <email>tijana.vislavski@dmi.uns.ac.rs</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>GORDANA RAKIĆ</string-name>
        </contrib>
        <aff>University of Novi Sad</aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>7</volume>
      <fpage>27</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>Traditionally, when a new code clone detection tool is developed, a few well-known and popular benchmarks are used to evaluate the results it achieves. These benchmarks have typically been created by cross-running several state-of-the-art clone detection tools, in order to overcome the bias of using just one tool, and combining their result sets in some fashion. These candidate clones, or more specifically their subsets, have then been manually examined by clone experts or other participants, who judged whether a candidate is a true clone or not. Many authors have dealt with the problem of creating the most objective benchmarks: how the candidate sets should be created, who should judge them, and whether the judgment of these participants can be trusted. One of the main pitfalls, as with the development of a clone detection tool, is the inherent lack of formal definitions and standards when it comes to clones and their classification. Recently, some new approaches were presented which do not depend on any clone tool, but utilize search heuristics in order to find specific functionalities; these candidates are also manually examined by judges who classify them as true or false clones. The goal of this paper is to examine state-of-the-art code clone benchmarks, as well as studies regarding the reliability of clone judges (and consequently the reliability of the benchmarks themselves) and their possible usage in a cross-language clone detection context.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Precision represents the ratio of reported real clones among all the reported clone candidates:</p>
      <p>precision = reported real clones / all reported clone candidates,</p>
      <p>while recall represents the ratio of reported real clones among all clones in the analyzed source code:</p>
      <p>recall = reported real clones / all clones.</p>
      <p>Clone benchmark creation usually has the following workflow:
(1) Source code is analyzed by one or more clone detection tools and candidate clones are detected;
(2) Candidate clones (or a fraction of them) are evaluated by one or more human judges who decide whether they are true clones or not.</p>
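      <p>As a minimal sketch (the set representation and variable names are illustrative assumptions, not taken from any particular benchmark), the two measures can be computed from a tool’s reported candidate set and a reference corpus of true clones:</p>
      <preformat>
```python
# Minimal sketch: precision and recall of a clone detection tool,
# given its reported candidates and a reference corpus of true clones.
# The set representation and names are illustrative assumptions.

def precision(reported, real_clones):
    # ratio of reported real clones among all reported candidates
    hits = len(reported.intersection(real_clones))
    return hits / len(reported) if reported else 0.0

def recall(reported, real_clones):
    # ratio of reported real clones among all real clones in the code
    hits = len(reported.intersection(real_clones))
    return hits / len(real_clones) if real_clones else 0.0

reported = {"A", "B", "C", "D"}   # candidates emitted by the tool
real_clones = {"A", "B", "E"}     # reference corpus of true clones

print(precision(reported, real_clones))  # 2 of 4 reported are real -> 0.5
print(recall(reported, real_clones))     # 2 of 3 real clones found
```
      </preformat>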
      <p>Many methods have been applied to try to overcome the bias introduced by the subjectivity of the judges and tools used for the creation of a benchmark, but the objectivity of such benchmarks still remains an open topic. This paper aims to provide an overview of current state-of-the-art code clone benchmarks and of papers that investigate the validity and objectivity of these benchmarks and their described creation process, among them the works presented in [Charpentier et al. 2017] [Kapser et al. 2007] [Walenstein et al. 2003]. Other processes of creating clone benchmarks have also been proposed, such as artificial mutation of arbitrary code fragments [Svajlenko and Roy 2014], and these will also be described in subsequent sections.</p>
      <p>The final goal of this overview is to investigate to what extent, if any, current benchmarks could be used (with some adjustments) for evaluating the results of clone detection in a cross-language context. LICCA [Vislavski et al. 2018] is a language-independent tool for code clone detection with a main focus on detecting clones in such a context. Even though finding similar code fragments between languages is a challenging task, it is very important to have the means to evaluate even the slightest progress.</p>
      <p>The rest of the paper is organized as follows: In section 2 an industry-standard benchmark is described, followed by a critique of this benchmark. Section 3 introduces other proposed benchmarks, including some new and different approaches. Section 4 gives an overview of works that deal with the reliability of the judges included in benchmark creation. Finally, section 5 concludes this paper with the authors’ findings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BELLON’S CLONE BENCHMARK</title>
      <p>
        The work presented by Bellon et al. in [Bellon et al. 2007] was the first benchmark for code clones and has
since become an industry standard for evaluating code clone detection tools [Charpentier et al. 2015].
In this paper, the authors presented the results of an experiment in which they evaluated six clone detection
tools
        <xref ref-type="bibr" rid="ref1 ref10 ref12 ref2 ref6 ref7 ref9">(Dup [Baker 2007], CloneDR [Baxter et al. 1998], CCFinder [Kamiya et al. 2002], Duplix [Krinke 2001], CLAN [Merlo 2007], Duploc [Ducasse et al. 1999])</xref>
        using eight C and Java projects. These tools use a variety of approaches, including metric comparison [Merlo 2007], comparison of abstract syntax trees [Baxter et al. 1998], and comparison of program dependency graphs [Krinke 2001]. Although the tools support the detection of clones in different languages, none of them supports detection between different languages.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Benchmark’s creation process</title>
      <p>During the experiment that produced Bellon’s benchmark, the authors of the clone detection tools that participated applied their tools to the following C and Java projects and reported clone candidates: weltab (no link for this project was found), cook (http://miller.emu.id.au/pmiller/software/cook/), snns (http://www.ra.cs.uni-tuebingen.de/SNNS/), postgresql (https://www.postgresql.org/), netbeans-javadoc (https://netbeans.org/), eclipse-ant (http://www.eclipse.org/eclipse/ant/), eclipse-jdtcore (https://www.eclipse.org/jdt/core/) and j2sdk1.4.0-javax-swing (http://www.oracle.com/technetwork/java/index.html), all links accessed October 2017.</p>
      <p>Candidate clones were then evaluated by the first author of the paper, Stefan Bellon, as an independent third party. LOC (lines of code) for these projects varied from 11K to 235K. Each project was analyzed with each tool’s default settings (mandatory part) and with tuned settings (voluntary part). Only clones of six lines or longer were reported.</p>
      <p>The reference corpus of ’real’ clones was created manually by Stefan Bellon. He looked at 2% of all submitted candidates (approx. 6K) and chose the real clones among them (sometimes modifying them slightly). He did not know which candidates were found by which tool, and the 2% of candidates was distributed equally among the six tools. Additionally, some clones were injected back into each of the analyzed programs in order to get better information about recall.</p>
      <p>When evaluating whether a candidate matches a reference, the authors did not insist on complete overlap of the candidate and the reference clone, but rather on a sufficiently large overlap. In order to measure this overlap precisely and consistently, the authors introduced the following definitions and metrics. For two code fragments CF1 and CF2 they define:
(1) overlap(CF1, CF2) = |lines(CF1) ∩ lines(CF2)| / |lines(CF1) ∪ lines(CF2)|,
(2) contained(CF1, CF2) = |lines(CF1) ∩ lines(CF2)| / |lines(CF1)|.</p>
      <p>A clone pair CP is defined as a tuple CP = (CF1, CF2, t), i.e. a tuple containing two code fragments and a clone type. For two clone pairs CP1 and CP2 the authors define:
(1) good-value: good(CP1, CP2) = min(overlap(CP1.CF1, CP2.CF1), overlap(CP1.CF2, CP2.CF2)),
(2) ok-value: ok(CP1, CP2) = min(max(contained(CP1.CF1, CP2.CF1), contained(CP2.CF1, CP1.CF1)), max(contained(CP1.CF2, CP2.CF2), contained(CP2.CF2, CP1.CF2))).</p>
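      <p>These metrics can be sketched in a few lines, modeling a code fragment as the set of line numbers it covers and a clone pair as a pair of fragments (an assumed representation, consistent with the definitions above):</p>
      <preformat>
```python
# Sketch of Bellon's overlap metrics; a code fragment is modeled as the
# set of line numbers it covers, and a clone pair as a pair of fragments.

def overlap(cf1, cf2):
    # intersection over union of the fragments' line sets
    return len(cf1.intersection(cf2)) / len(cf1.union(cf2))

def contained(cf1, cf2):
    # share of cf1's lines that also appear in cf2
    return len(cf1.intersection(cf2)) / len(cf1)

def good(cp1, cp2):
    return min(overlap(cp1[0], cp2[0]), overlap(cp1[1], cp2[1]))

def ok(cp1, cp2):
    return min(max(contained(cp1[0], cp2[0]), contained(cp2[0], cp1[0])),
               max(contained(cp1[1], cp2[1]), contained(cp2[1], cp1[1])))

# Two six-line clone pairs shifted by one line, the boundary case that
# the 0.7 threshold is meant to allow:
a = (set(range(10, 16)), set(range(40, 46)))
b = (set(range(11, 17)), set(range(41, 47)))
print(good(a, b))  # 5 shared lines / 7 distinct lines = 0.714..., above 0.7
print(ok(a, b))    # 5 of 6 lines contained = 0.833...
```
      </preformat>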
      <p>Obviously, the ok-value is less strict than the good-value: the ok-value only requires one clone pair to be contained in the other to at least p·100% (where p is the threshold), which can lead to one clone pair being much longer than the other, while with the good-value this cannot happen, since it looks at the minimum of the overlaps of the clone pair fragments. For both metrics, a threshold of p = 0.7 was used. Since the minimum number of lines in a clone was set to 6, this value allows two six-line-long fragments to be shifted by one line. The same threshold value was chosen for both metrics for uniformity: both are measures of overlap; the difference is in the perspective (the good-value from the perspective of both fragments, and the ok-value from the perspective of the smaller fragment).</p>
      <p>From the approximately 6 thousand investigated clone candidates, Bellon accepted around 4 thousand (66%) as true clones. However, this is the percentage across all analyzed programs; for individual programs, the percentage of accepted true clones varied from 35% to 90%. While analyzing specific tools, it was found that tools which report a higher number of candidates have a higher recall but also a higher number of rejected candidates. Additionally, the number of CLAN’s candidates that hit the reference corpus was the highest, although it reported the smallest number of candidates. Although the authors report recall and precision values for all tools, these have to be taken with caution, since only a portion of all candidates was examined. Nevertheless, since the authors compared these numbers with the results obtained after examination of the first 1% of candidates, they concluded that the numbers are fairly stable.</p>
      <p>The authors also found that different tools yield different sizes of clone candidates, which can be explained by the fact that they use different techniques. Only 24%-46% of the injected clones were found.</p>
      <p>The overall conclusions of the experiment were:
(1) Text- and token-based tools have higher recall, while metric- and tree-based tools have higher precision.
(2) The PDG (Program Dependency Graph) tool does not perform well (except for type-3 clones).
(3) There was a large number of false positives. Type-1 and type-2 clones can be found reliably, while type-3 clones are more problematic.
(4) Although Stefan Bellon was an independent party, the judgment was still biased by his opinion and experience.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Critique of Bellon’s benchmark</title>
      <p>Several years later, the authors of [Charpentier et al. 2015] empirically examined the validity of Bellon’s benchmark. Their main argument was that the benchmark is likely to be biased, since it was created by Bellon alone, who was not an expert on the systems being analyzed.</p>
      <p>They sought the opinion of 18 participants on a subset of the reference clones determined by Bellon. The conclusion is that some clones are debatable, and that whether a candidate is considered a clone or not is subjective: it largely depends on the perspective and intent of a user, since the definition of clones itself is open to interpretation (a similar conclusion was reached by Yang et al. in their paper [Yang et al. 2015]).</p>
      <p>In more detail, the experimental setup of this empirical assessment consisted of nine groups of two participants, where each group analyzed a randomly chosen subset of 120 reference clones defined by Bellon (a total of 1080 reference clones). All participants were last-year undergraduate or graduate students with an IT background (both for Java and C). Participants judged whether the references were true clones or not, in order to analyze whether there are some reference clones for which participants had a different opinion than Bellon.</p>
      <p>After collecting the participants’ answers, there were in total 3 answers for each examined reference clone (one positive from Bellon and two given by the students). Each clone was assigned a trust level based on these answers:
(1) good for clones with 3 positive answers,
(2) fair for clones with 2 positive and 1 negative answer,
(3) small for clones with 1 positive and 2 negative answers.</p>
      <p>Using this data, the authors calculated, with 95% confidence intervals, the proportions of clones at each of the trust levels. According to this calculation, about half of the clones have a trust level less than good, and about 10-15% of the clones have only the trust level small, which are not negligible percentages.</p>
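      <p>The confidence-interval calculation can be sketched as follows (the normal approximation is a standard choice; the counts below are illustrative, not the paper’s exact data):</p>
      <preformat>
```python
# Sketch: 95% confidence interval for the proportion of clones at a given
# trust level, using the normal approximation. Counts are illustrative.
import math

def proportion_ci(successes, n, z=1.96):
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

# e.g. if 130 of the 1080 examined references were rated "small":
low, high = proportion_ci(130, 1080)
print(round(low, 3), round(high, 3))
```
      </preformat>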
      <p>Next, the authors examined how the trust levels affect the recall and precision values calculated by the benchmark. Two scenarios were considered: one that took into account only clones with trust level good as true clones, and another that took into account clones with trust levels good and fair. Precision and recall were then calculated by Bellon’s formula with the new reference set. The conclusion was that both recall and precision decrease significantly (by up to 0.49 for the first scenario, and by up to 0.12 for the second scenario).</p>
      <p>In the end, the authors investigated whether clones with trust level good have some common characteristics that separate them from other clones. Three features were taken into account: type, size and language. They statistically tested hypotheses about the correlation between these features and a clone’s trust level. For size, only a loose correlation was found, in favor of bigger clones. For clone type, a moderate negative correlation was found, which means that only type-1 clones can be considered as having a good trust level after one opinion, while other types need more opinions. This makes sense when we consider the definition of each type: type-1 clones are identical code fragments, while higher types are defined more loosely and less strictly. Regarding the impact of the programming language on the trust level, a negligible correlation was found.</p>
      <p>The main threats to validity that the authors report are the fact that students were used for the assessment (although they all had an IT background, none of them was an expert on the analyzed systems, which is the very shortcoming the authors attributed to Bellon, and only half of them knew about the concept of a code clone before the experiment), as well as the fact that only a small portion of Bellon’s benchmark was assessed. Furthermore, the authors only considered clones that were reported as true clones by Bellon. A large portion of the clone candidates investigated by Bellon were graded as false clones; in order to better investigate the effect of different opinions on the calculated precision and recall, these false clones should also be graded by multiple people.</p>
    </sec>
    <sec id="sec-5">
      <title>3. OTHER BENCHMARKS</title>
      <p>In the years after Bellon et al., there were more efforts to create an unbiased and objective code clone benchmark.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Intra-project, method-level clones</title>
      <p>Krutz et al. [Krutz and Le 2014] constructed a set of intra-project, method-level clones from 3 open-source projects: PostgreSQL, Apache (https://httpd.apache.org/) and Python (https://www.python.org/), which are all C-based projects (links accessed October 2017). The authors randomly sampled 3-6 classes from each project, created all possible function pairs from these classes, evaluated them with several clone detection tools and then gave them to 7 raters for manual examination (4 clone experts and 3 students). The experts examined clone pairs in collaboration, and clones were confirmed only after consensus was achieved, while the students examined clone pairs independently. The idea behind this setup was to improve the confidence in the data, and also to check the difference between the students’ and the experts’ answers. In total, 66 clone pairs were found by the expert group, none of which were type-1 clones; 43 were type-2, 14 were type-3 and 9 were type-4 clones. The authors reported only the clones found by the expert group, since they concluded that the students’ answers differed largely from one another, and while there was a certain number of clones they agreed on, the overlap with the clones reported by the experts was not significant.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Mutation and Injection Framework</title>
      <p>In 2014, Svajlenko and Roy [Svajlenko and Roy 2014] evaluated the recall of 11 clone detection tools using Bellon’s benchmark [Bellon et al. 2007] and their own Mutation and Injection Framework. Their main goal was to evaluate recall, since it is an inherently more difficult metric to calculate than precision. Their proposed framework uses a corpus of artificially created clones based on a taxonomy suggested in a previous study [Roy et al. 2009]. The framework extracts random code fragments from the system and makes several copies, which are then changed by mutation operators following the taxonomy. There were 15 operators in total, which create clones of the first 3 clone types. The original and mutated fragments are inserted back into the system and the process is repeated thousands of times, which creates a large corpus of mutant systems. When a clone detection tool is tested with the framework, its recall is measured specifically on the injected clones.</p>
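      <p>As an illustration only (this is not the framework’s actual code; the operator and names are assumed), a single mutation operator in the spirit described above could copy a fragment and systematically rename its identifiers, yielding a type-2 clone:</p>
      <preformat>
```python
# Illustrative sketch of one mutation operator: copy a fragment and
# systematically rename its identifiers, producing a type-2 clone.
# Not the framework's actual implementation.
import re

def rename_identifiers(fragment, mapping):
    # systematic identifier renaming, a typical type-2 mutation
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], fragment)

original = "total = total + price * count"
mutant = rename_identifiers(original, {"total": "sum_", "price": "cost"})
print(mutant)  # sum_ = sum_ + cost * count

# Original and mutant would be injected back into the subject system;
# recall is then measured only on such known, injected clone pairs.
injected_reference = (original, mutant)
```
      </preformat>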
      <p>The authors compared their calculated recall values for several tools with the ones reported by Bellon and saw that the values rarely agree (with a threshold for agreement of a 15% difference), with their framework generally giving higher recall values. The validity of their measurement over Bellon’s was backed up by prior expectations: before running the experiment, the authors made well-thought-out assessments of the recall of the analyzed tools based on documentation, publications and literature discussions. The Mutation Framework agreed with these expectations in 90% of cases for Java and 74% of cases for C, while Bellon’s benchmark agreed with the expectations in only around 30% of cases. The authors’ conclusion was therefore that Bellon’s benchmark, while still a very popular benchmark for evaluating clone detection tools, may not be accurate for modern tools, and that there is a need for an updated corpus.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3 BigCloneBench</title>
      <p>In [Svajlenko et al. 2014] the authors state that the common approach to creating a benchmark, i.e. using clone detection tools to find clone candidates and manually evaluating them, gives an unfair advantage to the tools that participated in creating the benchmark. As they showed in the paper described earlier, the Mutation Framework, which is a synthetic way to create clones and inject them back into the system, is a better approach for modern clone detection tools. However, there is still a need for a benchmark of real clones. The contribution of the paper is one such benchmark, which the authors call BigCloneBench and which was built by mining IJaDataset 2.0 (https://sites.google.com/site/asegsecold/projects/seclone, accessed October 2017).</p>
      <p>IJaDataset 2.0 is a big inter-project repository consisting of 25,000 projects and more than 365 MLOC. The construction of the benchmark did not include any clone detection tools, which makes it possible for the benchmark to be unbiased; instead, search heuristics were used to identify code snippets that might implement a target functionality (such as Bubble Sort, Copy a File, SQL Update and Rollback, etc.). These candidate snippets were then manually judged as true or false clones and (if true) labeled with their clone type. The benchmark includes 10 functionalities with 6 million true clone pairs and 260 thousand false clone pairs. Since the clones were mined by functionality, this benchmark is also appropriate for the detection of semantic clones, and it was the first benchmark for semantic clones at that time. One more contribution of this work is the scale of the benchmark: previous benchmarks were all significantly smaller in size, while BigCloneBench is appropriate for big-data clone detection and scalability evaluation.</p>
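      <p>The general idea of such a search heuristic can be sketched as follows; the keywords and scoring below are assumptions for illustration only, not the actual heuristics of [Svajlenko et al. 2014]:</p>
      <preformat>
```python
# Illustrative sketch: flag candidate snippets whose text mentions keywords
# associated with a target functionality (here "Copy a File"); flagged
# candidates are then judged manually. Keywords and scoring are assumed.

def candidate_score(snippet, keywords):
    text = snippet.lower()
    return sum(1 for kw in keywords if kw in text)

copy_file_keywords = ["fileinputstream", "fileoutputstream", "copy"]

snippet = """
void copyFile(String src, String dst) throws IOException {
    FileInputStream input = new FileInputStream(src);
    FileOutputStream output = new FileOutputStream(dst);
}
"""
print(candidate_score(snippet, copy_file_keywords))  # 3 -> likely candidate
```
      </preformat>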
    </sec>
    <sec id="sec-9">
      <title>3.4 Benchmark from Google Code Jam submissions</title>
      <p>Another benchmark for functionally similar clones, as opposed to copy&amp;paste ones, was recently suggested in [Wagner et al. 2016]. The authors intentionally use this term, instead of the well-known type-4 clones, because they are interested in similar functionalities, not only identical ones. Since traditional clone detection tools rely on syntactic similarity, they are hardly able to detect these clones; this was evaluated and confirmed as part of the study, and the proposed benchmark is a useful aid for further research on this type of clones. The benchmark was constructed from a set of coding contest submissions (Google Code Jam, https://code.google.com/codejam/, accessed October 2017) in two languages, Java and C. Since the submissions aim to solve the same problems, it is expected that they contain a large number of functionally similar code snippets. The benchmark consists of 58 functionally similar clone pairs. Once again, only clones within the same language (either Java or C) were reported, without consideration of inter-language clones.</p>
    </sec>
    <sec id="sec-10">
      <title>4. RATERS RELIABILITY</title>
      <p>When constructing a clone benchmark, authors traditionally involve clone oracles: people (code clone experts or others) who judge whether clone candidates are true clones or not [Bellon et al. 2007] [Charpentier et al. 2015] [Yang et al. 2015] [Krutz and Le 2014] [Svajlenko et al. 2014]. However, as already mentioned in previous sections, code clone definitions lack precision and are open to interpretation; whether a clone candidate is a clone is highly subjective, depending on a person’s intent, on their expertise in the system under examination, etc. Many authors have investigated this problem and tried to gain some insight into raters’ agreement in classifying clones, as well as the reliability of these judges [Charpentier et al. 2017] [Kapser et al. 2007] [Walenstein et al. 2003].</p>
      <p>One of the first of these works was [Walenstein et al. 2003]. The authors themselves acted as judges and classified a selection of function clone candidates (clone candidates that span whole function bodies) as true or false clones. Their main hypothesis was that, as educated and informed researchers in the domain of code clone detection, their decisions would be more or less the same. However, they found that their answers differed greatly. As an example, the authors report a case where, for a set of 317 clone candidates, the 3 judges had the same opinion in only 5 cases. Obviously, consensus on what is or is not a clone is not automatic, even among researchers who are clone experts. This is, among other things, a consequence of the vagueness of the definitions of code clones. After these initial results, the authors defined the classification criteria more precisely, agreed upon them, and repeated the process. Although the level of agreement increased, it was still considerably low for some systems (the authors report an agreement level of .489 for one system). After some in-depth analysis, they realized that their perspectives on some design criteria differed (for example, whether constructors should be refactored or not). In the next iteration they introduced one more judge, unfamiliar with all the consensuses and criteria agreed upon until then (hours of debating and discussion), to test his responses. Surprisingly, the level of agreement improved, which was an indication that the choice of individual judges may be more important than group experience. In the final run, the authors decided to classify clones based specifically on information from the source code itself, and to disregard all context-based knowledge. This, as expected, increased the level of agreement. In the end, the authors conclude that inter-rater reliability is potentially a serious issue, and one judge is definitely not enough. Furthermore, their experiment does not support the idea that practice is any more important than a set of well-defined guidelines. Finally, the specifics of each system can have a great impact on judge reliability.</p>
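      <p>Agreement levels like those quoted above are commonly computed with a standard inter-rater statistic; the paper’s exact measure is not reproduced here, so this Cohen’s-kappa sketch for two raters with yes/no judgments is an assumption for illustration:</p>
      <preformat>
```python
# Sketch: Cohen's kappa for two raters giving yes/no clone judgments,
# a common inter-rater agreement statistic (illustrative data).

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    p1 = rater1.count("yes") / n
    p2 = rater2.count("yes") / n
    expected = p1 * p2 + (1 - p1) * (1 - p2)  # agreement expected by chance
    return (observed - expected) / (1 - expected)

r1 = ["yes", "yes", "no", "no", "yes", "no"]
r2 = ["yes", "no", "no", "yes", "yes", "no"]
print(round(cohens_kappa(r1, r2), 3))  # 0.333: well below full agreement
```
      </preformat>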
      <p>
        Another similar experiment was performed at an international workshop
        <xref ref-type="bibr" rid="ref3 ref9">(DRASIS [Koschke et al.
2007])</xref>
        and the results are presented in [Kapser et al. 2007]. After a thorough discussion about the fundamental theory behind code clones and the lack of formal definitions, a question arose: will researchers in this area ever agree among themselves on what is a code clone and what is not? Such a question has serious consequences for the validity and reproducibility of previous results across the literature. During a working session of the mentioned workshop, 20 clone candidates reported by the clone detection tool CCFinder [Kamiya et al. 2002] for the PostgreSQL system were chosen by the session leader, and 8 clone experts judged them privately. Only 50% of the candidates were classified with an agreement level of at least 80% (7 or more persons). This seriously puts into question the reproducibility of previous studies. After the judgment, the results were discussed. It was concluded that the attendants interpreted the same attributes of the given candidates, their significance and meaning, differently: for example, the size of a candidate, the level of change in types, the possibility to refactor the code, etc. This led the attendants to the following questions: Was the formulation of the question a problem? Should the question be changed to something more appropriate, using terms such as duplicated code, redundant code, or similar code? Would that affect the results of the evaluation? One additional point was that forcing a binary answer, a simple yes or no, may be increasing the disagreement; perhaps if a multiple-point scale were used, the answers would agree more. The session brought up many still unanswered questions, but one of the most important conclusions was made: there is little chance that a reviewer can be sure of the true meaning of the results without clear documentation of the criteria that were used.
      </p>
      <p>One quite recent work [Charpentier et al. 2017] dealt with the question of the reliability of non-experts judging whether a clone candidate is a true or false clone. The hypothesis is that one clone candidate could be rated differently in different projects and contexts. The experiment was performed on a set of 600 clones from 2 Java projects, with both experts and non-experts used for judgment (the authors define the term expert as a person with expertise in a particular project; all of the raters were familiar with the notion of clones). Formally speaking, 3 research questions were asked: are the answers consistent over time; do external raters agree with each other and with the experts of a project; and what are the characteristics that influence the agreement between experts and external raters. Each randomly selected set of clones was presented to 4 raters, including one expert for the underlying system. The raters were asked to rate the clone pairs with yes/no/unknown answers, and statistical analysis was used to answer the research questions. Based on the collected data, the authors derived the following conclusions: regarding the first question (consistency of raters’ answers), the authors noticed that experts’ answers are very consistent, while inconsistencies in non-expert judgments vary from 5% to 20% of the rated duplicates. For the second question (inter-rater reliability), the authors conclude that, for both investigated projects, there is no agreement between external raters, while the strongest agreement is achieved when comparing the majority of non-expert answers with the expert ones, and not when comparing any particular non-expert with the expert. The final question (factors influencing agreement between external raters and experts) led the authors to the conclusion that some characteristics have an impact on the agreement between the majority and the expert, such as size or project (the latter suggesting that some projects are easier for non-experts to judge).</p>
    </sec>
    <sec id="sec-11">
      <title>5. CONCLUSION</title>
      <p>Although code clone detection is a well-researched area, with years of work dedicated to it and a few extraordinary, production-quality tools, it is still not mature enough in terms of formal definitions and standards. One of the main consequences of this is the lack of universal and reliable benchmarks. Many attempts exist, and work in this area continues, but the lack of definitions is causing much confusion and disagreement even among code clone experts.</p>
      <p>Benchmarks are partially created manually, involving human judges who have the final word on whether a clone candidate is a true or a false clone. However, as several experiments have shown, some of which were presented in this paper, there is a high level of disagreement between judges, non-experts and experts alike. This seriously calls into question the validity of current benchmarks. Furthermore, it has been shown that different clones are considered interesting in different contexts and with different objectives, which also stands in the way of a universal benchmark.</p>
      <p>It is clear that code clone benchmarking still requires a lot of effort, primarily in overcoming the problems of manual judgment and bias. However, current benchmarks can provide valuable information when comparing an arbitrary tool’s results with those of state-of-the-art tools, which may lack a formal assessment but have proven themselves in practice.</p>
      <p>Regarding the possibility of using any of these benchmarks in a cross-language context, we can conclude that only those which contain semantic, or functionally similar, clones can be considered. Other benchmarks concentrate on copy&amp;paste clones, which are not to be found in fragments written in different languages. Since BigCloneBench contains Java projects exclusively, the benchmark which could provide a starting point for evaluating cross-language clones is the one presented in [Wagner et al. 2016]. Although it contains examples in two syntactically similar languages (Java and C), differences in their respective paradigms mean significant differences in code organization and structure.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Brenda S</given-names>
            <surname>Baker</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Finding clones with dup: Analysis of an experiment</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          <volume>33</volume>
          ,
          <issue>9</issue>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Ira D</given-names>
            <surname>Baxter</surname>
          </string-name>
          , Andrew Yahin, Leonardo Moura, Marcelo Sant'Anna, and
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Bier</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Clone detection using abstract syntax trees</article-title>
          .
          <source>In Proceedings of the International Conference on Software Maintenance (ICSM 1998)</source>
          . IEEE,
          <fpage>368</fpage>
          -
          <lpage>377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Bellon</surname>
          </string-name>
          , Rainer Koschke, Giulio Antoniol, Jens Krinke, and
          <string-name>
            <given-names>Ettore</given-names>
            <surname>Merlo</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Comparison and evaluation of clone detection tools</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          <volume>33</volume>
          ,
          <issue>9</issue>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Alan</given-names>
            <surname>Charpentier</surname>
          </string-name>
          , Jean-Rémy Falleri, David Lo, and Laurent Réveillère.
          <year>2015</year>
          .
          <article-title>An empirical assessment of Bellon's clone benchmark</article-title>
          .
          <source>In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering. ACM</source>
          ,
          <volume>20</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Alan</given-names>
            <surname>Charpentier</surname>
          </string-name>
          , Jean-Rémy Falleri, Floréal Morandat, Elyas Ben Hadj Yahia, and Laurent Réveillère.
          <year>2017</year>
          .
          <article-title>Raters reliability in clone benchmarks construction</article-title>
          .
          <source>Empirical Software Engineering</source>
          <volume>22</volume>
          ,
          <issue>1</issue>
          (
          <year>2017</year>
          ),
          <fpage>235</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Stéphane</given-names>
            <surname>Ducasse</surname>
          </string-name>
          , Matthias Rieger, and
          <string-name>
            <given-names>Serge</given-names>
            <surname>Demeyer</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>A language independent approach for detecting duplicated code</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Software Maintenance (ICSM'99)</source>
          . IEEE,
          <fpage>109</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Toshihiro</given-names>
            <surname>Kamiya</surname>
          </string-name>
          , Shinji Kusumoto, and
          <string-name>
            <given-names>Katsuro</given-names>
            <surname>Inoue</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>CCFinder: a multilinguistic token-based code clone detection system for large scale source code</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          <volume>28</volume>
          ,
          <issue>7</issue>
          (
          <year>2002</year>
          ),
          <fpage>654</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Cory</given-names>
            <surname>Kapser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul</given-names>
            <surname>Anderson</surname>
          </string-name>
          , Michael Godfrey, Rainer Koschke, Matthias Rieger, Filip Van Rysselberghe, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Weißgerber</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Subjectivity in clone judgment: Can we ever agree?</article-title>
          .
          <source>In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Koschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Walenstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ettore</given-names>
            <surname>Merlo</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>06301 Abstracts Collection – Duplication, Redundancy, and Similarity in Software</article-title>
          .
          <source>In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jens</given-names>
            <surname>Krinke</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Identifying similar code with program dependence graphs</article-title>
          .
          <source>In Proceedings of the Eighth Working Conference on Reverse Engineering (WCRE 2001)</source>
          . IEEE,
          <fpage>301</fpage>
          -
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Daniel E</given-names>
            <surname>Krutz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A code clone oracle</article-title>
          .
          <source>In Proceedings of the 11th Working Conference on Mining Software Repositories. ACM</source>
          ,
          <fpage>388</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Ettore</given-names>
            <surname>Merlo</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Detection of plagiarism in university projects using metrics-based spectral similarity</article-title>
          .
          <source>In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Chanchal K</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James R</given-names>
            <surname>Cordy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Koschke</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Comparison and evaluation of code clone detection techniques and tools: A qualitative approach</article-title>
          .
          <source>Science of Computer Programming</source>
          <volume>74</volume>
          ,
          <issue>7</issue>
          (
          <year>2009</year>
          ),
          <fpage>470</fpage>
          -
          <lpage>495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Svajlenko</surname>
          </string-name>
          , Judith F Islam,
          <string-name>
            <given-names>Iman</given-names>
            <surname>Keivanloo</surname>
          </string-name>
          , Chanchal K Roy, and Mohammad Mamun Mia.
          <year>2014</year>
          .
          <article-title>Towards a big data curated benchmark of inter-project code clones</article-title>
          .
          <source>In 2014 IEEE International Conference on Software Maintenance and Evolution (ICSME)</source>
          . IEEE,
          <fpage>476</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Svajlenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chanchal K</given-names>
            <surname>Roy</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Evaluating modern clone detection tools</article-title>
          .
          <source>In 2014 IEEE International Conference on Software Maintenance and Evolution (ICSME)</source>
          . IEEE,
          <fpage>321</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Tijana</given-names>
            <surname>Vislavski</surname>
          </string-name>
          , Gordana Rakić, Nicolás Cardozo, and
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>LICCA: A tool for cross-language clone detection</article-title>
          .
          <source>In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering</source>
          (SANER). IEEE,
          <fpage>512</fpage>
          -
          <lpage>516</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Wagner</surname>
          </string-name>
          , Asim Abdulkhaleq, Ivan Bogicevic,
          <string-name>
            <given-names>Jan-Peter</given-names>
            <surname>Ostberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jasmin</given-names>
            <surname>Ramadani</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>How are functionally similar code clones syntactically different? An empirical study and a benchmark</article-title>
          .
          <source>PeerJ Computer Science</source>
          <volume>2</volume>
          (
          <year>2016</year>
          ),
          <fpage>e49</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Walenstein</surname>
          </string-name>
          , Nitin Jyoti,
          <string-name>
            <given-names>Junwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yun</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arun</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Problems Creating Task-relevant Clone Detection Reference Data</article-title>
          .
          <source>In WCRE</source>
          , Vol.
          <volume>3</volume>
          ,
          <fpage>285</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Jiachen</given-names>
            <surname>Yang</surname>
          </string-name>
          , Keisuke Hotta, Yoshiki Higo, Hiroshi Igaki, and Shinji Kusumoto.
          <year>2015</year>
          .
          <article-title>Classification model for code clones based on machine learning</article-title>
          .
          <source>Empirical Software Engineering</source>
          <volume>20</volume>
          ,
          <issue>4</issue>
          (
          <year>2015</year>
          ),
          <fpage>1095</fpage>
          -
          <lpage>1125</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>