<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Structural Approach to Program Similarity Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miller Trujillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Takahashi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolás Cardozo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Systems and Computing Engineering Department, Universidad de los Andes - Bogotá</institution>
          ,
          <country country="CO">Colombia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Analyzing program similarity is useful in automated assessment grading, plagiarism detection, and proving refactoring equivalence. Existing approaches to program similarity presuppose that the programs to compare are already similar, as for example in the case of program evolution. We note that existing similarity approaches are not appropriate for analyzing programs that may use different implementations to solve a problem. Moreover, existing approaches are still unable to measure the actual differences between program implementations. We propose an improvement to existing similarity techniques using the control flow graph representation of programs. In our approach, we exploit the input graphs' structure and avoid costly subgraph isomorphism comparisons to reach a metric for the similarity between programs. Furthermore, we obtain better similarity detection when comparing similar and different programs, with respect to the state of the art. To validate our approach, we use a new corpus of competitive programming problems to discover similarity between solutions submitted by contestants. Our results show that our approach correctly detects similarity between different program implementations, with a precision improvement of up to 77% in some cases, and marks implementations of unrelated programs as different, with a performance comparable to existing approaches.</p>
      </abstract>
      <kwd-group>
<kwd>Algorithmic diversity</kwd>
        <kwd>Program similarity</kwd>
        <kwd>Code clones</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In software engineering, program similarity refers to how closely related the behavior and structure of two programs are. Analyzing similarity between programs is useful in application domains like automated assessment grading in programming courses, plagiarism detection, and refactoring equivalence. Current approaches to analyzing similarity between programs focus on semantic similarity (i.e., behavior), comparing programs that are known, or expected, to be similar in advance (e.g., different versions of a program). That is, when analyzing the similarity between different versions of a program, most of the structure of the program is expected to remain the same, with only a few lines of code changing. For example, automated grading systems expect students' programs to be similar to one of the model programs provided by course instructors. This, however, raises questions about similarity between programs describing the same behavioral purpose but using completely different algorithmic ideas and structures. This question is of particular interest for algorithmic diversity in evolutionary environments [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. When comparing programs that are actually similar, we answer questions like: is program A similar to program B? In some cases, this question is more restrictive and only concerned with checking semantic equivalence between the two programs, not similarity.
      </p>
      <p>In this paper, we study program similarity between any two programs. We are interested in programs that exhibit the same functional behavior –that is, programs that produce the same output for the same input, but that are built with a different structure. As programs with the same response to the same input can still differ, similarity requires a deeper perspective: evaluating programs' structure. Take, for example, the case of sorting algorithms: is QuickSort more similar to MergeSort or to HeapSort? Even though the three sorting algorithms produce the same output for the same input, their time complexity and abstract idea differ. The structure of the programs is also different. Therefore, we propose an algorithm that takes programs' structure into account to assess similarity. We define two programs to be similar if they define similar control structures, have a similar usage of variables and data structures, and follow the same order of program statements. It is not the same to sort an array at the very beginning of a program as to sort it at the end of the program. If we sort at the beginning, the sorted array becomes a precondition for the rest of the program; we could write completely different programs from that point on.</p>
      <p>
        The motivation for measuring program similarity comes from algorithmic diversity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is concerned with improving systems' performance by using multiple different algorithm implementations for a program feature at the same time. Example domains in which algorithmic diversity can bring improvements are security, N-version programming, collective systems like smart camera networks, and load balancers. While there are areas in which using different algorithmic solutions for a given problem can be beneficial, there is still a lack of measuring tools to dictate how different the actual implementations are. This is precisely one of the major drawbacks of techniques like N-version programming, and a contribution of our work.
      </p>
      <p>
        Our proposal (Section 2) tackles the problem of program similarity by comparing the structure of programs, within a same domain, that are meant to be different beforehand [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Specifically, we use the Control Flow Graph (CFG) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] program representation and add information about the graph structure as input for our comparison [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, we avoid complicated subgraph isomorphism detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] by normalizing the similarity value of the CFG nodes to the total number of nodes in the graph of each program. To validate our approach (Section 3), we created a corpus of 566 C++ programs extracted from the competitive programming platform Codeforces. Our corpus consists of multiple implementations of five different programming problems. Our results (Section 4) show a clear clustering of the different solutions to each of the problems, demonstrating the effectiveness of our approach in detecting similar and dissimilar implementations across different multi-function programs. The results present a performance similar to existing approaches, showing a precision improvement of up to 77%. Additionally, our approach is effective in detecting dissimilar implementations as such, providing a balanced metric for program similarity.
      </p>
      <p>Concretely, the main contributions of our paper are:
• Definition of a new similarity metric and algorithm for comparing different programs.
• Creation of a corpus to evaluate similarity between any two programs, rather than comparing program versions that differ only in small deltas.
• Evaluation of the proposed metric with respect to the state of the art in node-based similarity analysis.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Similarity Among Programs</title>
      <sec id="sec-2-1">
        <title>2.1. Motivating Example</title>
        <p>To motivate the concept of similarity between programs, and to assess the perspectives of functional
behavior and structural similarity, we consider as example single function algorithms in three concrete
cases: linear search (Figure 1 left), max (Figure 2 left), and binary search (Figure 3 left).
int main() {
    int a[] = {1, 2, 3, 4, 5};
    int n = 5, k = 4;
    int ans = -1;
    for (int i = 0; i &lt; n; ++i)
        if (a[i] == k) { ans = i; }
    return 0;
}</p>
        <p>All programs receive a number array as a parameter and iterate over it to compute the corresponding operation. The main functionality of the programs is characterized by a loop with an internal conditional statement, as seen in the code snippets. To ensure that the similarity analysis is agnostic to specific syntax elements, the loops and conditionals use different syntax (e.g., for vs. while loops). The CFGs corresponding to each of the algorithms are on the right-hand side of Figures 1 to 3.</p>
        <p>These examples raise the question of which of these algorithms (if any) are more closely related to the others. Linear search should be more similar to binary search, as these two algorithms belong to the same domain, and they produce the same result for the same given input. However, linear search is structurally closer to max, as their CFGs are closer to each other than to binary search's. We can conclude that focusing on just one dimension of similarity analysis can lead to erroneous assessments. As a consequence, with our proposal we measure the similarity of programs that are functionally equivalent but that have different structures, to identify diversity or different versions, as in the case of linear and binary search.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Program similarity</title>
        <p>We now turn our attention to the problem of assessing the similarity between two programs, A and B. As discussed previously, different factors, beyond functional equivalence, dictate how programs compare with each other. These are: (1) the programs' structure, (2) the order of statements, (3) the control structures used, and (4) the data structures used.</p>
        <p>
          We build on the approach taken by the LAV program similarity evaluation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for our similarity analysis. The LAV algorithm analyzes similarity between programs using their CFG representation, according to the information stored in the CFG's nodes (i.e., sequences of LLVM instructions). To calculate the similarity between two given CFGs, the algorithm uses the neighbor matching method [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The neighbor matching method defines similarity for a CFG, G = ⟨V, E⟩, based on the similarity scores of its individual nodes. Two nodes, i ∈ A and j ∈ B, abstracted from programs A and B respectively, are defined as similar if the neighbors of i can be matched to neighbors of j. Therefore, the neighbor matching method relies on the assignment problem. The goal of the assignment problem is to find a matching of the elements of CFGs A and B with the highest weight. A matching of nodes for programs A and B is a set of pairs M = {⟨a, b⟩ | a ∈ A, b ∈ B} such that no element of one set (say A) is paired with more than one element of the other set (B). For the matching M, we have two enumeration functions f : {1, 2, ..., n} → A and g : {1, 2, ..., n} → B, such that M = {⟨f(l), g(l)⟩ | l = 1, 2, ..., n}, where n = |M|. The similarity of nodes i and j is defined as the average of the similarity of the matched in-nodes (i.e., in-neighbors) and the matched out-nodes (i.e., out-neighbors) of i and j, as in Equations (1)-(3):
        </p>
        <p>
          x_{k+1}(i, j) ← (s_in^{k+1}(i, j) + s_out^{k+1}(i, j)) / 2   (1)
        </p>
        <p>
          s_in^{k+1}(i, j) ← (1 / m_in) Σ_{l=1}^{n_in} x_k(f_in(l), g_in(l))   (2)
        </p>
        <p>
          s_out^{k+1}(i, j) ← (1 / m_out) Σ_{l=1}^{n_out} x_k(f_out(l), g_out(l))   (3)
        </p>
        <p>where m_in = max(id(i), id(j)), n_in = min(id(i), id(j)), m_out = max(od(i), od(j)), and n_out = min(od(i), od(j)), with id(v) the in-degree of node v, and od(v) the out-degree of node v. The functions f_in and g_in are the enumeration functions of the optimal matching of in-neighbors of nodes i and j with weight function w(a, b) = x_k(a, b). Analogously, f_out and g_out are the enumeration functions of the optimal matching of out-neighbors of nodes i and j. The algorithm terminates the comparison between the CFGs when max_{i,j} |x_k(i, j) − x_{k−1}(i, j)| &lt; ε, or after reaching a given number of iterations.</p>
        <p>
          The similarity of the CFGs (i.e., Sim(A, B) and Sim(B, A)) is computed as the weight of the optimal matching of nodes divided by the number of matched nodes [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          The original neighbor matching method of LAV [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] analyzes the graphs based only on their topology. Our approach extends this by annotating nodes with valuable information related to the program (LLVM instructions) extracted from the CFG. Furthermore, we use the edit distance between the sequences of instructions of every pair of nodes i ∈ A and j ∈ B [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to calculate their similarity. The edit distance d(i, j) is the minimal number of insertions, deletions, and substitutions needed to transform one sequence of characters into another. In LAV, the cost of insertion and deletion of an instruction is defined to be 1. The substitution cost depends on whether the instruction is a function call. If the instructions are calls to different functions, then the substitution cost is 1. If the instructions are calls to the same function, then the cost is 0. If the instruction is not a function call, then the cost is 1, unless the instructions in both nodes are exactly the same. Finally, the similarity of the sequences of instructions of the nodes i ∈ A and j ∈ B is defined as S_ij = 1 − d(i, j) / max(|i|, |j|), where |i| is the number of instructions in the sequence of node i [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The update rule is modified to calculate the similarity of nodes taking into account the nodes' sequences of instructions, as in Equation (4).
        </p>
        <p>
          x_{k+1}(i, j) ← √( S_ij · (s_in^{k+1}(i, j) + s_out^{k+1}(i, j)) / 2 )   (4)
        </p>
        <p>We observe that the similarity evaluation described by LAV presents problems when we compare two programs, A and B, in the case that program A is contained in program B. In such a case, the similarity value of the programs will be unusually high, disregarding the additional instructions of the containing program B. This anomaly is exacerbated whenever the size of program B is much larger than the size of program A. The reason for this anomaly is that the neighbor matching method is designed to determine subgraph isomorphism. However, this property is not necessarily desired when analyzing programs. For instance, an automated grading system using this approach could be fooled by an empty program, as it is contained in every other program. Therefore, to strengthen the neighbor matching method for similarity analysis, we improve the algorithm by taking into account the topological context of the graphs. Additionally, we normalize the output of the node similarity algorithm by the maximum number of nodes between the CFGs of the two programs under analysis.</p>
        <p>
          To counter the aforementioned problem and to take into account information about the graphs' topology, we modify the similarity metric between nodes so that it also contains information about the structure of the nodes' neighborhood (a level beyond the current approach). Therefore, given two CFGs A and B, we compute a similarity matrix L, where position L_ij indicates the similarity between the nodes i ∈ A and j ∈ B. The similarity index L_ij is calculated based on the Local Relative Entropy (LRE) index [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In contrast to the neighbor matching method described previously, our approach considers similarity based only on the neighborhood structure –that is, comparing the degrees of a node's neighborhood without considering the value of the nodes themselves.
        </p>
        <p>
          To calculate the LRE we need to: (1) find the local networks (V_i, D_i) and their probability sets P(i); (2) calculate the relative entropy between all node pairs based on the probability sets of each node; (3) calculate the similarity of each pair of nodes based on their relative entropy [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The local network of a node i is defined as (V_i, D_i), where V_i is the set of nodes in the local network, composed of the node i and its neighbors, and D_i is the set of degrees of V_i. The total degree of a local network is defined by Equation (5), where d_j(i) is the degree of the j-th node of the local network of i.
        </p>
        <p>
          d(V_i) = Σ_{j=1}^{|V_i|} d_j(i)   (5)
        </p>
        <p>
          The probability set P(i) for node i describes the ratio between the degree of each node in the local network and the total degree of the local network of node i. The probability sets must have the same magnitude for all nodes: every probability set has a fixed size equal to the largest degree in the graph (Δ) plus the node itself, |P(i)| = Δ + 1, as in Equation (6),
        </p>
        <p>
          P(i) = {p(i, 1), p(i, 2), ..., p(i, j), ..., p(i, Δ + 1)}   (6)
        </p>
        <p>
          where the probability for node j relative to node i is
        </p>
        <p>
          p(i, j) = d_j(i) / d(V_i) if j ≤ d(i) + 1, and p(i, j) = 0 if j &gt; d(i) + 1.   (7)
        </p>
        <p>
          To compute the LRE, the probability sets are sorted in decreasing order (P′(i) = sort(P(i))). Now, the relative entropy between two nodes is defined as in Equation (8).
        </p>
        <p>
          E(P′(i) ∥ P′(j)) = Σ_{k=1}^{min(|V_i|, |V_j|)+1} p′(i, k) log( p′(i, k) / p′(j, k) )   (8)
        </p>
        <p>Based on the relative entropy for each pair of nodes, the relevance matrix R is created, such that R_ij = E(P′(i) ∥ P′(j)) + E(P′(j) ∥ P′(i)). Finally, the LRE similarity score is defined in Equation (9).</p>
        <p>
          L_ij = 1 − R_ij / max(R)   (9)
        </p>
        <p>Even though LRE is described for nodes within the same graph, we use it to calculate the similarity between nodes across different graphs. This is part of our contribution: extending the algorithm to calculate the LRE similarity for pairs of nodes i ∈ A and j ∈ B.</p>
        <p>Once we compute the L matrix, we combine it with the edit distance stored in the S matrix, such that M_ij := (S_ij + L_ij) / 2, the average of the edit distance score and the LRE score. Finally, the similarity of the graphs corresponds to the weight of the optimal matching of nodes divided by the number of nodes of the CFG with the largest number of nodes.</p>
        <p>Note that using our approach, the similarity of a program compared with itself is not 1, but rather the highest value among its comparisons with all the other programs in the analysis. Values closer to the score of a program compared with itself mean that the programs are more similar.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>We evaluate our program similarity analysis in two stages. First, we validate the effectiveness of our approach using algorithms from known domains. These algorithms are well understood, so their semantic behavior is known. Additionally, the programs used are complex enough that all language features are exercised. Second, we use a new corpus for program similarity analysis to evaluate our algorithm on larger multi-function codebases. The programs in the corpus are divided into 5 domains, including implementations that are alike and implementations that are structurally different –that is, they differ by more than small deltas, but still provide the same result to a problem.</p>
      <sec id="sec-3-1">
        <title>3.1. Evaluation Environment</title>
        <p>To execute all the evaluation scenarios we use an Intel Core i5-5257U processor with 8GB RAM running Ubuntu 18.04.2 LTS. Our evaluation uses version 3.3 of LLVM and C++11. All data used and the full evaluation results are available in our online appendix at https://flaglab.github.io/SimCorp/web/sqamia2024.html; here we focus on the evaluation of our corpus.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Corpus</title>
        <p>Our corpus consists of 566 different C++ programs extracted from the Codeforces competitive programming online judge. The extracted programs are solution submissions to five problems (domains). The problems present different implementation characteristics, ranging from simple, straightforward implementations (the difficulty level of the problems is given by their accompanying letter, starting with A as the simplest problem) to implementations using multiple functions and requiring the management of complex data structures (e.g., DSU) or advanced algorithms (e.g., dynamic programming (dp), or computational geometry). For each of the problems we extracted up to 50 submissions from the categories: (1) (OK) complete all test cases, (2) (RUNTIME_ERROR) yield a runtime error, and (3) (WRONG_ANSWER) do not solve the problem properly. Table 1 shows the distribution of the data set classified by solution category, each containing the average Lines of Code (LOC) per submission.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experiment Design</title>
        <p>
          The similarity evaluation of the programs in our corpus first computes a similarity matrix containing the similarity score for every pair of programs compared, as explained in Section 2.2. We use Principal Component Analysis (PCA) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to reduce the dimension of the matrix to two principal components, and plot these components using the PCA index. Furthermore, we use the silhouette coefficient [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] on the similarity matrix to evaluate the cohesion and separation of the clusters per problem. The silhouette score is bounded between −1 and 1: similar programs have a score close to 1, overlapping clusters have a score close to 0, and dissimilar programs have a negative score.
        </p>
        <p>
          We evaluate our algorithm using the LAV similarity analysis [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] as a baseline. We take this baseline for our experiments as it constitutes the state-of-the-art analysis for node-based similarity, which is closest to our approach. Here we focus on the comparison of functionally equivalent programs that satisfy the problems' conditions (OK), which we use to identify different implementation techniques for a specific problem.
        </p>
        <p>Note that all programs in the corpus have some common behavior (e.g., I/O instructions), and therefore have a positive similarity score. Furthermore, as all the submissions that solve a problem (i.e., the OK category) have the same black-box behavior, we expect all solutions to a problem to be clustered. However, while the solutions to a problem behave the same, the algorithms used can have different implementations; therefore, we also expect to find sub-clusters for each problem.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>Our evaluation computes the similarity of the different problems from three perspectives. First, we generate the similarity matrix using the LAV method (labeled ORIGINAL in the figures). Second, we use our node similarity method to compare the submissions (labeled ORIGINAL-NODE-SYM). Third, we normalize the node similarity as described in Section 2.2 (labeled NORMALIZED-NODE-SYM).</p>
        <p>When analyzing correct solutions (OK) to the problems, Figure 4a shows some clustering for each problem (identified by shape and marker color in the figure). The silhouette score is 0.234, but there is still overlapping and scattering between problems 1579A, 922E, and 1553G. Using our algorithm, Figure 4b shows a better clustering, with a silhouette score of 0.204 and a 3.57× improvement over the evaluation of all problems. This suggests that, as the problems have behavioral equivalence, our approach significantly improves the similarity metric of the evaluated programs.</p>
        <p>Note that program scattering can be attributed to specific algorithms. There may be different solutions for a given problem; such solutions should not be as similar to each other as solutions that share the same algorithmic principle.</p>
        <p>Figure 5a shows the OK submissions for problem 558B using the ORIGINAL method, with two clusters, revealing two distinctive solutions to this problem. From the figure we see a tight cluster represented by the solutions using a circle, and a more scattered cluster represented by the solutions using a ×. This explains the low silhouette score of 0.276. Figure 5b shows our NORMALIZED method evaluating the same problem. Here, too, we obtain two distinctive clusters. In this figure we observe that the clusters are less scattered, explained by a silhouette score of 0.483. As there are two common solution patterns for the problem, we confirm that our algorithm is effective in finding similar and dissimilar programs for particular problems with common black-box behavior.</p>
        <p>We use our approach to evaluate submission types for all the problems in the corpus. Table 2 presents the silhouette scores for all the programs in our corpus, with the best scores in bold. It is important to note that the silhouette score for a specific problem represents the ability of the algorithm to detect different implementations of the same problem. Not all problems contain different implementations within the same type of submission. Identifying such a property is valuable for application domains like diversity or N-version programming.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Discussion</title>
        <p>From the results we conclude that our approach is appropriate to: (1) Identify features common to different problems; for example, in the case of SimCorp, we identify similarities across all analyzed submissions in the way the input and output of the problem are processed (independent of the specific instructions used). (2) Detect differences in the algorithms behind different programs, even when they are functionally equivalent. This is shown in the clusters for the correct (OK) solutions to a problem in our corpus.</p>
        <p>Figure 4: (a) ORIGINAL algorithm for OK submissions; (b) NORMALIZED-NODE-SYM algorithm for OK submissions.</p>
        <p>The performance of our algorithm, measured using the silhouette score, is similar to the performance of the baseline, i.e., the LAV algorithm. Our evaluation shows that the use of our node similarity definition yields a slight improvement over the baseline in most cases. Moreover, when analyzing specific algorithm pairs identified as similar/dissimilar, we respectively note a great resemblance/disparity in the specific implementations. This is beneficial as a notion of diversity between algorithms. However, the validation shows that in some cases node normalization is detrimental to performance. We note that, as programs differ but have an important feature in common (e.g., input/output processing), the algorithm will detect them as similar, decreasing the silhouette score.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Related Work</title>
      <p>Existing approaches to semantic program analysis often focus on behavioral equivalence between programs (i.e., the programs' outputs are equivalent for the same inputs). Other approaches strive to detect semantic similarities between programs that differ by small deltas. We put our proposed similarity algorithm in perspective with these existing approaches, divided into two categories: analysis for semantic equivalence, and semantic code clone detection.</p>
      <p>Figure 5: (a) ORIGINAL algorithm for problem 558B OK; (b) NORMALIZED-NODE-SYM algorithm for problem 558B OK.</p>
      <sec id="sec-4-1">
        <title>4.1. Semantic Equivalence of Programs</title>
        <p>
          Control Flow Graphs (CFGs) are among the most used structures to analyze the structure and behavior of programs, as they allow developers to explore all the execution paths of a program. CFGs are normally analyzed using subgraph isomorphisms to detect similar code fragments as related graph sections. Such techniques are problematic for similarity comparison for two main reasons [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ]. First, subgraph isomorphism algorithms are time- and resource-heavy, restricting the complexity of the problems that can be analyzed. Second, this technique is computable only for specific cases, restricting the types of programs to analyze. We use a different method, where nodes' local information determines their similarity; this enables us to assess similarity between any type of programs.
        </p>
        <p>
          LAV [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] uses CFG neighbor and node similarity to assess student assignments for introductory programming courses, comparing them with the solutions provided by the course's instructors. The neighbor matching method used in LAV is the base for the implementation of our approach. This method calculates the similarity between two nodes taking into account their in- and out-nodes, following Equations (1)-(4) in Section 2.2. While LAV enhances code similarity using nodes' content, which corresponds to a linear code sequence, to calculate graph node similarity, we use the LRE index and normalize node similarity between programs to weight the relevance of each neighbor of the nodes under comparison, with better results in many cases and a defined metric to differentiate algorithms.
Symbolic execution Symbolic execution methods can be used to focus the similarity analysis between programs on their behavior, rather than their structure [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This method performs analysis on the programs' structure, represented as a CFG. The result of the symbolic step is a set of variable stores and path conditions. While symbolic execution methods can detect similarity between programs with similar behavior, in general they are unable to prove equivalence [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Moreover, if the programs obtain the same behavior using different techniques, the symbolic method will not report close similarity. Our approach addresses this shortcoming.
        </p>
        <p>
          Abstract interpretation In response to the problems presented in symbolic execution methods,
abstract interpretations methods aim to increase the precision of existing proposals that sufer of
under-approximation. Partush and Yahav [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] present speculative correlating semantics to allow the
bounded representation of program diference. The technique uses SCORE, that given an abstracted
pair of programs, finds the state of minimal diference between them. Similar to our approach, the use of
program abstractions in SCORE allows the comparison of programs with more significant diferences.
Diferential symbolic execution (DSE) Finally, DSE methods encompass the analysis of programs
based on a preliminary diferential program analysis [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], exploring only the parts of the programs that
have changed. DSE performs a method level comparison of programs, taking the symbolic summaries of
each method to check functional or path equivalence [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. DSE presents an improvement over symbolic
execution methods using path equivalence, which takes into account both the final behavior of
programs and the way programs are structured, for a more faithful comparison between programs.
ARDiff [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] improves the assessment of program equivalence using iterative abstraction and refinement
to optimize the symbolic summaries needed to prove equivalence. As with SCORE, this
approach is effective in evaluating small differences between two programs.
        </p>
        <p>
          Trostanetski et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] extend DSE with a modular analysis that reuses the results of procedures
that have already been explored (an alternative method to refinement abstraction). The contribution
of this approach is the possibility to both prove and disprove equivalence between programs. This
characteristic is similar to our proposed metric: we can state that programs with a lower PSA metric
are different.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Semantic Code Clone Detection</title>
        <p>
          A vast body of work exists on the topic of code clone detection [
          <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
          ]. However, of the 54 tools in the
most recent survey [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], only 9 report the detection of Type-4 or semantic clones. As the datasets
used to evaluate these tools are not available, we discuss the algorithms and results of the main existing
tools in light of our approach.
        </p>
        <p>
          CCSharp [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] specializes in detecting Type-4 clones using Program Dependence Graphs (PDGs).
CCSharp is based on subgraph isomorphism of synthesized PDGs to improve the algorithm’s performance.
Program similarity in CCSharp is based on string and numerical similarity. String similarity evaluates
the distance between input and output parameters, and then evaluates the similarity of function names.
Numerical similarity takes the characteristic vectors extracted from the PDG (i.e., the meaning of nodes)
and calculates their Euclidean distance. Being based on graph isomorphism, CCSharp may not be able to
cope with programs with strong syntactic differences, such as the ones we are targeting. Similar to our approach,
Nasirloo and Azimzadeh [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] use normalization to refine the clone analysis over PDGs.
        </p>
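<p>
  A minimal sketch of this style of numerical similarity (our illustration, not CCSharp's implementation): each PDG is summarized as a characteristic vector, and vectors are compared by Euclidean distance. The feature names and the distance-to-similarity mapping below are our assumptions.
</p>

```python
import math

# Hypothetical sketch of numerical similarity over characteristic vectors
# extracted from PDGs. Feature choice and the distance-to-similarity
# mapping are illustrative assumptions, not CCSharp's actual algorithm.

def euclidean(u, v):
    """Euclidean distance between two equally sized feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def numerical_similarity(u, v):
    """Map a distance into (0, 1]; 1.0 means identical vectors."""
    return 1.0 / (1.0 + euclidean(u, v))

# Characteristic vectors: [assignments, branches, calls, loops]
pdg_a = [4, 2, 1, 1]
pdg_b = [4, 2, 1, 1]
pdg_c = [9, 0, 5, 0]

print(numerical_similarity(pdg_a, pdg_b))  # 1.0
print(numerical_similarity(pdg_a, pdg_c))  # well below 1.0
```

<p>
  Note that such a vector comparison is only as discriminating as the chosen features: two structurally different programs with the same feature counts would still compare as identical.
</p>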
        <p>
          Sheneamer et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] present a machine-learning-based method to detect Type-4 clones on obfuscated
code. This approach is based on features extracted from a Bytecode Dependency Graph (BDG),
and features extracted from an AST or PDG. This method is similar to the other CFG-based methods discussed
before, in that CFGs are the base structure to extract information about the programs. In this case,
example programs are also required as the training set of the machine learning algorithm, to later
classify observed features as semantic clones during testing. The amount of data required to use machine
learning techniques makes such approaches inappropriate for the comparison of diverse programs.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This paper presents a new approach to measure similarity between different programs. Unlike existing
approaches that focus on the equivalence of programs based on their behavior, our approach analyzes
similarity from the perspective of programs’ structure and complexity. A particular contribution of our
approach is that it is agnostic to the specific programs under analysis. That is, we do not require the
programs to be similar beforehand (i.e., to be delta variations of each other). In fact, we strive to analyze
programs that are different in kind, which is of use to evaluate how diverse a code base is, or how
diverse different implementations of a system functionality are.</p>
      <p>Together with our approach, we present a new corpus for evaluating similarity between different
programs. Our corpus contains 566 different C++ programs extracted from submitted solutions to
competitive programming problems from the Codeforces online judge. In the evaluation, we validate
that our approach is effective in detecting similarity between programs, on par with existing approaches.
Additionally, our approach is effective in identifying diverse families of programs for the same problem
with different algorithmic solutions, outperforming the precision of existing solutions by up to 77%.</p>
      <p>We identify two avenues of future work. First, we will continue to improve the precision of our
approach. One possible improvement would be to enrich the CFG nodes with more information from the LLVM
instructions. Second, we want to apply our approach as a new metric for technical debt in development
companies, and as a recommender for existing functionality to implement in new products.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Nallur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>O'Toole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cardozo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , Algorithm Diversity:
          <article-title>A Mechanism for Distributive Justice in a Socio-Technical MAS</article-title>
          , in:
          <source>Proceedings of the International Conference on Autonomous Agents &amp; Multiagent Systems</source>
          , AAMAS'16, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC,
          <year>2016</year>
          , pp.
          <fpage>420</fpage>
          -
          <lpage>428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Badihi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Akinotcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <article-title>ARDiff: scaling program equivalence checking via iterative abstraction and refinement of common code</article-title>
          ,
          <source>in: Proceedings of the Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          , ESEC/FSE'20,
          <year>2020</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>24</lpage>
          . doi:10.1145/3368089.3409757.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <article-title>Control flow analysis</article-title>
          ,
          <source>ACM Sigplan Notices</source>
          <volume>5</volume>
          (
          <year>1970</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Krinke</surname>
          </string-name>
          ,
          <article-title>Identifying similar code with program dependence graphs</article-title>
          , in: Proceedings Eighth Working Conference on Reverse Engineering,
          <year>2001</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>309</lpage>
          . doi:10.1109/WCRE.2001.957835.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>CCSharp: An efficient three-phase code clone detector using modified PDGs</article-title>
          , in: Asia-Pacific Software Engineering Conference, APSEC'17, IEEE,
          <year>2017</year>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vujošević-Janičić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nikolić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tošić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuncak</surname>
          </string-name>
          ,
          <article-title>Software verification and graph similarity for automated evaluation of students' assignments</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>55</volume>
          (
          <year>2013</year>
          )
          <fpage>1004</fpage>
          -
          <lpage>1016</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nikolic</surname>
          </string-name>
          ,
          <article-title>Measuring similarity of graphs and their nodes by neighbor matching</article-title>
          ,
          <source>CoRR abs/1009.5290</source>
          (
          <year>2010</year>
          ). URL: http://arxiv.org/abs/1009.5290. arXiv:1009.5290.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahadevan</surname>
          </string-name>
          ,
          <article-title>Measure the similarity of nodes in the complex networks</article-title>
          ,
          <source>CoRR abs/1502.00780</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1502.00780. arXiv:1502.00780.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I. T.</given-names>
            <surname>Jolliffe</surname>
          </string-name>
          ,
          <source>Principal Component Analysis</source>
          , Springer-Verlag,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          ,
          <article-title>Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</article-title>
          ,
          <source>Journal of Computational and Applied Mathematics</source>
          <volume>20</volume>
          (
          <year>1987</year>
          )
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/0377042787901257. doi:10.1016/0377-0427(87)90125-7.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Arifi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Abbou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zahi</surname>
          </string-name>
          ,
          <article-title>A New Similarity-based Method for Assessing Programming Assignments using Symbolic Execution</article-title>
          ,
          <source>International Journal of Applied Engineering Research</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <fpage>1963</fpage>
          -
          <lpage>1981</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Partush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yahav</surname>
          </string-name>
          ,
          <article-title>Abstract semantic differencing via speculative correlation</article-title>
          ,
          <source>in: Proceedings of the International Conference on Object Oriented Programming Systems Languages &amp; Applications</source>
          , OOPSLA'14,
          <year>2014</year>
          , pp.
          <fpage>811</fpage>
          -
          <lpage>828</lpage>
          . doi:10.1145/2660193.2660245.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Winstead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>Towards differential program analysis</article-title>
          ,
          <source>in: ICSE Workshop on Dynamic Analysis</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Person</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Dwyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Păsăreanu</surname>
          </string-name>
          ,
          <article-title>Differential symbolic execution</article-title>
          ,
          <source>in: Proceedings of the International Symposium on Foundations of Software Engineering, FSE'08</source>
          , ACM,
          <year>2008</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>237</lpage>
          . doi:10.1145/1453101.1453131.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trostanetski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grumberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kroening</surname>
          </string-name>
          ,
          <article-title>Modular demand-driven analysis of semantic difference for program versions</article-title>
          ,
          <source>in: International Static Analysis Symposium</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>405</fpage>
          -
          <lpage>427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Q. U.</given-names>
            <surname>Ain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Anwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maqbool</surname>
          </string-name>
          ,
          <article-title>A Systematic Review on Code Clone Detection</article-title>
          ,
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>86121</fpage>
          -
          <lpage>86144</lpage>
          . doi:10.1109/ACCESS.2019.2918202.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bellon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Koschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Antoniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krinke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Merlo</surname>
          </string-name>
          ,
          <article-title>Comparison and evaluation of clone detection tools</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>33</volume>
          (
          <year>2007</year>
          )
          <fpage>577</fpage>
          -
          <lpage>591</lpage>
          . doi:10.1109/TSE.2007.70725.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nasirloo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azimzadeh</surname>
          </string-name>
          ,
          <article-title>Semantic code clone detection using abstract memory states and program dependency graphs</article-title>
          , in: International Conference on Web Research, ICWR'18, IEEE,
          <year>2018</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheneamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          ,
          <article-title>A detection framework for semantic code clones and obfuscated code</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>97</volume>
          (
          <year>2018</year>
          )
          <fpage>405</fpage>
          -
          <lpage>420</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>