<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SATToSE</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Robust and Automatic Approach for Matching Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Midolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emiliano Tramontana</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Matematica e Informatica, University of Catania</institution>,
          <addr-line>Catania</addr-line>,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>15</volume>
      <fpage>12</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Modern software applications can have millions of lines of code. Moreover, when several developers collaborate to develop code, each of them could have their own coding style, e.g. follow a personal code convention, prefer some data structures, etc. Having a large amount of code and several coding styles can negatively affect the readability of the code, especially when it has not been documented properly. Furthermore, a somewhat incorrect version of an algorithm would lead, besides the presence of bugs, to convoluted, complex and inefficient code.</p>
      </abstract>
      <kwd-group>
        <kwd>Static analysis</kwd>
        <kwd>coding suggestions</kwd>
        <kwd>code reuse</kwd>
        <kwd>refactoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The bigger the repositories, the greater the demanding effort for developers when trying to understand code implemented by their colleagues. Visually inspecting source code to understand its structure and functionalities can be time consuming and error-prone, since developers could misidentify some algorithms. Generally, this occurs when the source code documentation is poor or missing, and when many developers contribute to the same project.</p>
      <p>Automatic Program Comprehension (PC) tries to solve these issues by proposing several approaches that automatically assist developers in understanding source code [1, 2]. Several applications of PC have been presented: source code classification according to specific categories [3, 4, 5], code clone detection, and algorithm recognition. Automatic approaches can help in analysing, comprehending and improving the source code, by means of tools that support the work of developers [6, 7, 8].</p>
      <p>The state of the art shows several techniques to automatically identify algorithms. Machine learning approaches have been presented [9, 10], where different classifiers are used to label code fragments. Other proposals use a hybrid approach, mixing static analysis and classifiers to identify the algorithm [11, 12, 13]; their classifier is based on a set of structural and truth value characteristics that are strictly related to sorting algorithms, therefore new beacons should be defined for different categories of algorithms. Furthermore, these approaches require a new training of the dataset if new labels are included in the classification, which can be time consuming and complex.</p>
      <p>This paper focuses on algorithm recognition and presents an innovative approach that automatically matches algorithms by inspecting the source code. Our approach uses static analysis to collect data from the source code and compute a similarity score with templates of known algorithms to identify the correct one. The use of templates guarantees that: new algorithms can be easily added for the recognition step; multiple versions of the same algorithm can be used to improve the accuracy of the identification; and many, if not all, categories of algorithms can be recognised (sorting, searching, traversing, etc.). The proposed approach consists in four main phases: firstly, a code parsing tool collects all the statements of the algorithm analysed; secondly, a statement transformation is performed to extract the data required for the similarity match; thirdly, the Levenshtein distance is computed to determine the similarity score; finally, the matching degree of the algorithms is evaluated.</p>
      <p>The Levenshtein distance has been widely used in program analysis, especially for code clone detection when evaluating the similarity between code fragments [14, 15, 16]. However, such approaches are strictly related to detecting clone fragments. They focus on detecting type-3 clones, that are portions of code differing in terms of whitespaces, comments, layouts and identifiers, and possibly having some modifications like the addition or removal of statements [17]. Conversely, we propose another approach in which the different statements of the algorithms under analysis are reflected in the matching score, hence having a varying degree of matching; moreover, the above said approaches refer to types and names (e.g. method and field names) to detect clones, whereas our approach focuses on the statements freed from the developer-chosen names, providing a more generalised identification.</p>
      <p>The rest of the paper is organised as follows. Section 2 describes the proposed approach with all the steps followed by the analysis. Section 3 shows three examples of the use of the approach and how the similarity score is computed on real scenarios. Section 4 displays the metrics obtained by the analysis of the examples previously shown and a comparison with a text similarity approach. Section 6 reports meaningful related works and compares them with our approach. Finally, conclusions are drawn in Section 7.</p>
      <sec id="sec-2-1">
        <title>2. Proposed Approach</title>
        <p>Algorithm 1 The algorithm of the proposed approach
procedure MatchingAlgorithm(SC)
    compilationUnits ← parse(SC)
    for cu ∈ compilationUnits do
        methods ← gatherMethods(cu)
    end for
    for mDecl ∈ methods do
        stmts ← mDecl.getStatements()
        for tmp ∈ templates do
            tmpStmts ← tmp.getStatements()
            score ← levenshtein(stmts, tmpStmts)
            scores.add(tmp, score)
        end for
    end for
end procedure</p>
        <p>We propose an approach that gathers data by parsing the source code and then evaluates the similarity score of an algorithm and a set of known algorithms. The proposed approach makes a proper generalisation of the algorithms to avoid depending on naming conventions or on statements that are not contributing to the main goal of the analysed algorithm. We make use of static analysis of the source code, hence executable files are not needed for the analysis. Algorithm 1 shows the main steps, as pseudo-instructions, followed by the proposed approach to match algorithms. The procedure takes as input the source code, variable SC, and parses it, extracting the compilation units. Then, method declarations are extracted and, for each, all the statement types are collected and compared to the algorithm templates to compute the similarity score. The approach can be structured in four steps: (i) parsing the source code by means of a tailored Visitor to gather all the data needed for the following analysis; (ii) selecting and transforming the most relevant statements; (iii) computing an adapted Levenshtein distance to determine the similarity score; (iv) evaluating the matching degree of the algorithms.</p>
        <p>2.1. Code Parsing</p>
        <p>We perform code parsing, by means of the Javaparser library, to extract all the data required to evaluate the similarity score. Javaparser is an automatic parser that generates an abstract syntax tree (AST) from source code and provides a set of APIs to perform operations on it [18]. The root of the AST is the CompilationUnit (representing a Java file), to which all code elements are connected, e.g. package declaration, class and method declarations, etc. Code inspection has been performed by using the VoidVisitorAdapter class, which lets us define a Visitor class to search for a specific property. In the Visitor class, the method visit() was implemented, which takes as parameters the type of object being searched (e.g. method declaration, statement, field) and the container in which data are stored; the body of the method contains all the instructions that are executed for every object found of the type specified as a parameter.</p>
        <p>We have defined a Visitor which looks for MDs (method declarations); once a MD is found, the visit() method extracts from its body all the statements, and stores them in a List preserving the order. The list of statements provided by Javaparser is further expanded, as it would otherwise miss: (i) nested statements (e.g. all the statements present in the body of a for loop, defined as ForStmt), and (ii) expressions, such as assignments, method calls, variable declarations, etc. (defined as ExpressionStmt).</p>
        <p>2.2. Statements Transformation</p>
        <p>The statements initially gathered by Javaparser are transformed to better serve the following analysis. Javaparser provides a function, getStatements(), to get the statements contained in the body of a method declaration; nested statements, e.g. statements contained in a for loop, are omitted by the said getStatements(), and just the parent statement is inserted in the list. To collect all the statements inside the method, we further extract nested statements and we place them in the list of statements right after their parent, preserving the order of the block.</p>
        <p>Thereafter, for each statement in the list, we extract the class type, avoiding collecting other parts such as names, types, comments and expressions, to better generalise the approach, so as to recognise different versions of the same algorithm; e.g. the statement for(int i=0; i&lt;size; i++) is represented in Javaparser by the ForStmt type, whereas all the other parts, such as the variable declaration (int i=0), the binary expression (i&lt;size) and the unary expression (i++), are omitted. For all the statements that can contain nested statements in their body, e.g. for, if, while, a custom statement is inserted at the end of the nested statements; its name is given by the concatenation of the "End" prefix with the type of the statement containing the nested ones, e.g. EndIf, EndFor, EndWhile. This custom statement allows us to identify which statements are nested, thus improving the precision of the approach when comparing algorithms.</p>
        <p>Listing 1: The upper part shows an iterative version of the bubblesort algorithm implemented in Java, whereas the bottom part displays the list of statements extracted by our approach, using the class types defined in Javaparser.
Statements Type:
ForStmt,
ForStmt,
IfStmt,
VariableDeclarationExpr,
AssignExpr,
AssignExpr,
EndIf,
EndFor,
EndFor</p>
        <p>The extracted types are defined by the Javaparser library1. Table 1 shows some examples of the types defined by the library: the column Statement Type represents the type, and the column Code shows an example of the code associated with the type; further types can be found in the documentation. The ExpressionStmt type does not have a code example because it can represent any type of expression: AssignExpr (a = b + c;), MethodCallExpr (method();), VariableDeclarationExpr (int a = 0;), etc. This class type is too generic, hence we select and insert in the statement list the type of the expression contained in the statement. If there are nested expressions, e.g. a method call with an ObjectCreationExpr argument (e.g. method(a, new Object())), we select the parent expression, in this example the MethodCallExpr.</p>
        <p>2.3. Adapting the Levenshtein Distance</p>
        <p>The Levenshtein distance is a string metric for measuring the difference between two sequences [19]. It is defined as the minimum number of operations (replace, insert and delete) required to change a sequence into the other. A string can be seen as a list of single characters; the Levenshtein algorithm iteratively compares all the characters and finds the minimum number of operations (insertion, deletion or substitution) required to make the two sequences equal.</p>
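        <p>The statements transformation described above can be sketched in plain Java. The following is a minimal, self-contained sketch: the Stmt class, the flatten() method and the hard-coded type tree are simplified stand-ins of our own devising, not the actual Javaparser API.</p>
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for an AST statement node: a type name plus the
// statements nested in its body (not the actual Javaparser classes).
class Stmt {
    final String type;
    final List children;
    Stmt(String type, Stmt... children) {
        this.type = type;
        this.children = Arrays.asList(children);
    }
}

public class StatementFlattener {
    // Flattens a statement tree into the ordered list of type names,
    // placing nested statements right after their parent and appending
    // a custom "End" marker after the children of a block statement.
    public static List flatten(List stmts) {
        List out = new ArrayList();
        for (Object o : stmts) {
            Stmt s = (Stmt) o;
            out.add(s.type);
            if (!s.children.isEmpty()) {
                out.addAll(flatten(s.children));
                out.add("End" + s.type.replace("Stmt", ""));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // a for loop containing an if, which contains an assignment,
        // modelled as a type tree
        Stmt loop = new Stmt("ForStmt", new Stmt("IfStmt", new Stmt("AssignExpr")));
        // prints [ForStmt, IfStmt, AssignExpr, EndIf, EndFor]
        System.out.println(flatten(Arrays.asList(loop)));
    }
}
```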
        <p>Listing 1 shows an example of data extracted from a method: the code on top shows the method bubbleSort(), while the bottom part displays the list of statements extracted by our visitor.
1https://www.javadoc.io/doc/com.github.javaparser/javaparsercore/latest/index.html</p>
        <p>We have implemented a custom version of the algorithm where two lists of statements are the compared sequences, and every character represents a single statement. Once defined the minimum number of operations, the similarity score is computed as [20]:
sim(s1, s2) = 1 - levDist(s1, s2) / max(size(s1), size(s2))
where s1 and s2 are the two sequences of statements, levDist() gives as output the Levenshtein distance, and size() gives the number of elements in a sequence.</p>
        <p>2.4. Algorithm Recognition</p>
        <p>Once all the data needed for the analysis have been obtained, we can compute the similarity scores according to the extracted list of statements, i.e. the analysed method is compared with a set of known algorithms that have been gathered and parsed beforehand. We have created a set of Java files containing the source code of several known algorithms, and for each one at least two versions are stored: iterative and recursive. For some algorithms, more versions have been implemented, as the aim is to improve the accuracy of our analysis. E.g., the bubblesort algorithm has two different versions besides the iterative and recursive ones: the one shown in Listing 1, and an optimised version where a boolean flag breaks the execution if no elements are swapped in the inner loop.</p>
        <p>void iterativeBubbleSortTemplate(int list[]) {
    int length = list.length;
    for(int i=0; i &lt; length; i++) {
        for(int j=1; j &lt; length-i; j++) {
            if(list[j-1] &gt; list[j]) {
                int swap = list[j-1];
                list[j-1] = list[j];
                list[j] = swap;
            }
        }
    }
}
Statements Type:
VariableDeclarationExpr,
ForStmt,
ForStmt,
IfStmt,
VariableDeclarationExpr,
AssignExpr,
AssignExpr,
EndIf,
EndFor,
EndFor</p>
        <p>Listing 2: The upper part shows one iterative version of the bubblesort algorithm stored in our template db, whereas the bottom part displays the list of statements extracted by our approach, according to the types defined in Javaparser.</p>
        <p>Listing 2 shows one of the templates of the iterative version of the bubblesort algorithm used in our analysis: the code on top displays the implementation of the algorithm, while the list below represents the statements extracted by our approach. We show this template because, according to our approach, it is the most similar to the code shown in Listing 1. However, the two methods have several textual differences: firstly, the names of some variables are different, e.g. the variable representing the number of elements (size and length), the array containing the elements (sort_arr and list), and the variable used for the swap (tmp and swap); secondly, the condition in the IfStmt is inverted; finally, the template has an additional statement compared to the example, the first statement VariableDeclarationExpr.</p>
        <p>Despite these differences, our approach correctly identifies the algorithm implemented. According to the Levenshtein distance, the number of operations needed to match the two sequences is one (an insertion, because the lists of statements differ in only one element). Indeed, the similarity score between these two sequences is computed as sim = 1 - (1/7) = 0.857, where 7 is the size of the longest sequence.</p>
        <p>Listing 3 shows a method implementing the factorial algorithm for an integer value in a recursive form. The list of statements is as follows: IfStmt, ReturnStmt, ReturnStmt. The analysis carried out by our approach has identified the method as the recursive version of the factorial algorithm with a similarity of 1.0. In such a case, the similarity is the maximum possible value since, given the simple structure of the algorithm, the types of statements used by such a method match 100% the factorial algorithm template.</p>
        <p>2https://github.com/AleMidolo/MatchingAlgorithms</p>
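        <p>The adapted distance and the score defined in Section 2.3 can be sketched as follows. This is a minimal self-contained sketch: the method names levDist() and similarity() mirror the formula above, while the raw lists and the example sequences are illustrative assumptions of ours, not the tool's actual code.</p>
```java
import java.util.Arrays;
import java.util.List;

public class StatementSimilarity {
    // Levenshtein distance where each element of the sequence is a whole
    // statement type rather than a single character.
    public static int levDist(List a, List b) {
        int n = a.size();
        int m = b.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i != n + 1; i++) d[i][0] = i;
        for (int j = 0; j != m + 1; j++) d[0][j] = j;
        for (int i = 1; i != n + 1; i++) {
            for (int j = 1; j != m + 1; j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    // sim(s1, s2) = 1 - levDist(s1, s2) / max(size(s1), size(s2))
    public static double similarity(List a, List b) {
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) return 1.0;
        return 1.0 - (double) levDist(a, b) / longest;
    }

    public static void main(String[] args) {
        List method = Arrays.asList("ForStmt", "IfStmt", "AssignExpr",
                "EndIf", "EndFor");
        List template = Arrays.asList("VariableDeclarationExpr", "ForStmt",
                "IfStmt", "AssignExpr", "EndIf", "EndFor");
        // one insertion separates the two lists, hence sim = 1 - 1/6
        System.out.println(similarity(method, template));
    }
}
```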
      </sec>
      <sec id="sec-2-2">
        <title>Listing 3: An example of the recursive factorial algorithm implemented in Java.</title>
        <p>public static int factorial(int n) {
    if (n == 0 || n == 1)
        return 1;
    return n * factorial(n - 1);
}</p>
      </sec>
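      <p>The recognition step of Section 2.4 can be sketched as a search for the template with the highest similarity score. The recognise() method, the template names and the hard-coded statement lists below are illustrative assumptions of ours, not the actual tool code.</p>
```java
import java.util.Arrays;
import java.util.List;

public class AlgorithmRecognizer {
    // Levenshtein distance over lists of statement types.
    public static int levDist(List a, List b) {
        int n = a.size();
        int m = b.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i != n + 1; i++) d[i][0] = i;
        for (int j = 0; j != m + 1; j++) d[0][j] = j;
        for (int i = 1; i != n + 1; i++) {
            for (int j = 1; j != m + 1; j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    public static double similarity(List a, List b) {
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) return 1.0;
        return 1.0 - (double) levDist(a, b) / longest;
    }

    // Returns the name of the template whose statement list is most
    // similar to the analysed method.
    public static String recognise(List method, String[] names, List[] templates) {
        int best = 0;
        double bestScore = -1.0;
        for (int k = 0; k != names.length; k++) {
            double s = similarity(method, templates[k]);
            if (s > bestScore) {
                bestScore = s;
                best = k;
            }
        }
        return names[best];
    }

    public static void main(String[] args) {
        // a recursive factorial with one extra leading declaration
        List method = Arrays.asList("VariableDeclarationExpr", "IfStmt",
                "ReturnStmt", "ReturnStmt");
        String[] names = { "ItBubblesort", "RecFactorial" };
        List[] templates = {
            Arrays.asList("VariableDeclarationExpr", "ForStmt", "ForStmt",
                "IfStmt", "VariableDeclarationExpr", "AssignExpr",
                "AssignExpr", "EndIf", "EndFor", "EndFor"),
            Arrays.asList("IfStmt", "ReturnStmt", "ReturnStmt")
        };
        // prints RecFactorial
        System.out.println(recognise(method, names, templates));
    }
}
```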
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>We tested our approach on four different algorithms, each implemented as a method. All the templates used by our approach can be found on a public repository2. The first example is shown in Listing 1, previously discussed; here, we discuss three other algorithms: a version of factorial and two versions of quicksort. Moreover, two versions of the quicksort algorithm are considered to test the approach on different versions of the same algorithm; both versions propose an iterative solution.</p>
      <p>Firstly, Listing 4 shows an iterative version of the quicksort algorithm that uses a stack as a support to sort the elements contained in the array passed as argument. The method consists in twelve statements, in order: two VariableDeclaration, MethodCall, WhileStmt, VariableDeclaration, AssignExpr, MethodCall, VariableDeclaration, and two IfStmt with a MethodCall in their body. Therefore, the method is characterised by such statements; indeed comments, names and types will be ignored for the purpose of the identification. Our template storage includes an implementation of the quicksort using a stack to sort the elements3; the differences between the template and this method are: the template has an additional VariableDeclarationExpr before the first push() call; in the example, the first instruction after the WhileStmt is a VariableDeclarationExpr, while in the template it is an AssignExpr; the partition() method call takes one more argument in the template; types and names of the variables are different. Considering the said differences, our analysis has computed a similarity score of 0.85, correctly identifying the algorithm.</p>
      <p>public void quickSortStack(short[] array) {
    // create a stack for storing
    // subarray start and end index
    Stack&lt;Pair&gt; st = new Stack&lt;&gt;();
    short finish = array.length - 1;
    // push the start and end index
    // of the array into the stack
    st.push(new Pair(0, finish));
    // loop till stack is empty
    while (!st.empty()) {
        // remove top pair from the stack and get
        // subarray starting and ending indices
        short begin = st.peek().getX();
        finish = st.peek().getY();
        st.pop();
        // rearrange elements across pivot
        short pv = partition(array, begin);
        // push subarray indices with elements
        // less than the current pivot to stack
        if (pv - 1 &gt; begin) {
            st.push(new Pair(begin, pv - 1));
        }
        // push subarray indices with elements
        // more than the current pivot to stack
        if (pv + 1 &lt; begin) {
            st.push(new Pair(pv + 1, begin));
        }
    }
}</p>
      <p>Listing 4: An iterative version of the quicksort algorithm using a stack of objects to sort the elements.</p>
      <p>Secondly, Listing 5 displays an iterative version of the quicksort algorithm that uses a supporting array to sort the elements of the array passed as argument. The method consists in fifteen statements, in order: three VariableDeclaration, two AssignExpr, WhileStmt, three AssignExpr, and two IfStmt with two AssignExpr in their body. There are two main differences in the structure between the two versions of the same algorithm: the total number of statements, twelve against fifteen, and the absence of MethodCall statements in the second version. An iterative version of the quicksort algorithm is stored in our template storage, and it uses an array to sort the elements like the method given as an example. The main differences between the template and the method are the following: the method has three VariableDeclarationExpr before the WhileStmt, whereas the template has only two; types and names of variables are different. The analysis has computed a similarity score of 0.86.</p>
      <p>Our approach correctly identified both versions because the analysis uses templates for different versions of the same algorithm, making the recognition more accurate. Still, the storage containing all templates can be updated with more versions of algorithms to make the approach more sensitive to differences and up-to-date.</p>
      <p>3The templates in the db can be found in the github repository given above.</p>
      <p>4. Results</p>
      <p>We have compared our approach to a text-based search approach between the methods and the templates of the algorithms. Table 2 shows the metrics obtained for the four methods previously described: column method displays the method considered for the analysis, respectively bubblesortV1 (Listing 1), factorialV1 (Listing 3), quicksortStack (Listing 4) and quicksortArray (Listing 5). The other five main columns are the templates used by our approach to match the algorithms: ItBubblesort is the bubblesort iterative version (the one displayed in Listing 2); RecFactorial is the factorial recursive version; ItQuicksortSt is the quicksort iterative version using a stack to sort the elements; ItQuicksortAr is the quicksort iterative version using an array to sort the elements; ItMergesort is the mergesort iterative version. We have also considered the mergesort to show how the analysis is able to distinguish different algorithms. Each of these columns has two subcolumns: sim and text are respectively the similarity score of our approach and the similarity score of the text comparison approach.</p>
      <p>We have highlighted in bold the highest score corresponding to the correct identification of the algorithm in subcolumn sim, whereas in subcolumn text the maximum score has been highlighted for the text match. We can see that the score assigned by our approach is much higher in each case, indicating a higher precision in the identification.</p>
      <p>public static void quickSortArray(
        long arr[], long low, long high) {
    // Create an auxiliary stack
    long[] list = new long[high - low + 1];
    long max = -1;
    long pivot = 0;
    // push initial values to stack
    list[++max] = low;
    list[++max] = high;
    // Keep popping from stack while not empty
    while (max &gt;= 0) {
        h = list[max--];
        l = list[max--];
        // Set pivot element at its correct
        // position in sorted array
        pivot = partition(arr, low, high);
        // If there are elements on left side
        // of pivot, push left side to stack
        if (pivot - 1 &gt; low) {
            list[++max] = low;
            list[++max] = pivot - 1;
        }
        // If there are elements on right side
        // of pivot, push right side to stack
        if (pivot + 1 &lt; high) {
            list[++max] = pivot + 1;
            list[++max] = high;
        }
    }
}</p>
      <p>Listing 5: An iterative version of the quicksort algorithm using an array to handle the sort of the elements.</p>
      <p>5. Discussion</p>
      <p>For all the five methods shown in Table 2, our approach shows a higher identification score compared to the text similarity approach. Indeed, we have a similarity score that is more than double for the quickSortStack (0.84 compared to 0.37) and quickSortArray (0.86 and 0.34) methods, and values about 40% greater for the bubblesortV1 (0.85 and 0.52) and factorialV1 (1.0 and 0.7) methods. Our approach performs better because it can generalise the matching, without considering names of variables, comments and names of types.</p>
      <p>Moreover, the matching scores given by our analysis are clearly greater than the scores of the other algorithms, whereas, with a text similarity approach, we can see that the quicksortArray method has a similarity score of 0.34 for ItQuickSortAr, 0.33 for ItMergesort and 0.28 for ItQuicksortSt, hence the closeness of such scores can bring ambiguity in the correct identification of the algorithm.</p>
      <p>Finally, our approach can distinctly recognise two different versions of the same algorithm: the quicksortStack method has a 0.84 score as ItQuicksortSt and 0.42 as ItQuickSortAr, while the quicksortArray method has respectively 0.46 and 0.86. The accurate identification of the version used is crucial when suggesting improvements or proposing different versions.</p>
      <p>The ability to recognise an algorithm is related to the set of templates, which is not easy to maintain and grow. An algorithm can be matched if a similar template of the same algorithm is already part of the templates. To handle this, the database can be populated with the most popular algorithms, and, if an algorithm is not included, it could be added by a developer using our approach and tool.</p>
      <p>The approach extracts the statement's type to evaluate the similarity between algorithms. On the one hand, two algorithms can have similar statements despite having a different behaviour, thus showing low accuracy in identifying the code's behaviour. On the other hand, the approach can give a degree of generalisation, since it is not dependent on the names and types encountered; therefore it is not based on the comprehension of how the algorithm was implemented, but on its structure. This property is the main difference between our approach and type-3 clone approaches.</p>
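      <p>The gap between the structural and the textual scores discussed above can be illustrated by computing the same edit-distance-based score once over statement types and once over raw characters. The helper names and the toy snippets below are our own illustrative assumptions, not the experiment's actual data.</p>
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StructuralVsTextual {
    // Levenshtein distance over lists of arbitrary elements.
    public static int levDist(List a, List b) {
        int n = a.size();
        int m = b.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i != n + 1; i++) d[i][0] = i;
        for (int j = 0; j != m + 1; j++) d[0][j] = j;
        for (int i = 1; i != n + 1; i++) {
            for (int j = 1; j != m + 1; j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    public static double similarity(List a, List b) {
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) return 1.0;
        return 1.0 - (double) levDist(a, b) / longest;
    }

    // Splits a string into a list of one-character strings, so that the
    // same distance can act as a plain text comparison.
    public static List chars(String s) {
        List out = new ArrayList();
        for (int i = 0; i != s.length(); i++) out.add(s.substring(i, i + 1));
        return out;
    }

    public static void main(String[] args) {
        // Two renamings of the same loop: identical structure, different text.
        String m1 = "for(int i=0; i!=size; i++) total = total + sort_arr[i];";
        String m2 = "for(int k=0; k!=length; k++) sum = sum + list[k];";
        List t1 = Arrays.asList("ForStmt", "AssignExpr", "EndFor");
        List t2 = Arrays.asList("ForStmt", "AssignExpr", "EndFor");
        // structural score is 1.0; textual score is penalised by renaming
        System.out.println("structural: " + similarity(t1, t2));
        System.out.println("textual: " + similarity(chars(m1), chars(m2)));
    }
}
```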
      <p>6. Related Works</p>
      <p>Algorithm recognition has been tackled by several approaches in the state of the art, both for the software engineering industry and for academic settings. In [9], the authors present a solution for automatic algorithm recognition using machine-learning techniques: they extract a dataset of source code containing algorithms, a feature extraction is carried out to collect all the characteristic data (e.g. count-vars, operators, constructs, etc.), and finally a tag updating is done to remove redundant tags. They have trained the dataset and built a group of classifiers to identify the algorithm. In order to add a new algorithm or category of algorithms to the classification, all the previous steps have to be executed again in order to properly train the new dataset, which could be time consuming; conversely, since our approach uses static analysis to match templates, a new template can be added to the collection without the need for further operations.</p>
      <p>In [13], the authors propose an algorithm recognition method that detects sorting algorithmic schemas; these schemas consist in a set of loops, features, operations, etc. Another approach discussing sorting algorithms is presented in [12, 11], where numerical (number of blocks, number of loops, etc.) and descriptive (iterative, recursive) characteristics are extracted from the source code and a C4.5 decision tree classifier is built to detect sorting algorithms. These approaches focus on sorting algorithms under some assumptions, such as that algorithms are expected to be implemented in a well-established way, e.g. the quicksort algorithm should be implemented in a recursive way since it is more common. Moreover, they define a set of characteristics that are mostly related to sorting algorithms. In contrast, our approach can identify a wider spectrum of algorithms, since the static analysis can be performed on any Java source code, and we consider different versions of the same algorithm to increase the identification ability.</p>
      <p>Many tools have been presented to measure source code similarity. Most of these approaches address problems such as code clone detection, software licensing violation and software plagiarism [21]. The Levenshtein distance is often used for clone detection. In [16], a hybrid technique is presented where source code is lexically analysed to detect and extract sub-blocks, then similar blocks are grouped and hashed, and finally the Levenshtein similarity and the Cosine similarity are used to compute the similarity between blocks and find type-3 clones. In [14], the authors propose a cross-language clone detector for C, C++ and Java: the input code is tokenized to obtain the keywords of the corresponding language, then these keywords are compared with the Levenshtein distance, and finally the clone types are classified based on the similarity of the keywords match. In [15], the authors present a tool to detect clones of a faulty code fragment; a Normalised Compression Distance is defined to detect duplicate code fragments.</p>
      <p>The above said approaches use the Levenshtein distance to detect code clone fragments, whereas our proposal defines a tailored distance to match methods with algorithm templates in order to achieve algorithm recognition. Furthermore, these tools are sensitive to differences in terms of statements between code fragments, because the presence or absence of multiple statements can lead to a misidentification of a code similarity; whereas our approach proposes a similarity score which, despite the statement differences, is able to define a matching grade for all templates, where the highest one is the most similar to the implemented algorithm.</p>
      <p>7. Conclusion</p>
      <p>This paper presented an automatic approach to recognise algorithms using static analysis. By parsing the source code it is possible to identify all the statements composing a method, transform them according to a specified format, and then compute the Levenshtein distance to obtain a similarity score between the method and several templates of known algorithms. The template having the highest score is suggested as the algorithm matching the analysed method. We performed an experiment on four methods to test our approach; the results obtained highlight a high accuracy when recognising the algorithm compared to a textual similarity.</p>
      <p>The versatility of the approach allows us to add more templates to widen the spectrum of recognisable algorithms and to increase the number of different versions of algorithms. This approach can be employed both for program comprehension purposes in software development, supporting developers in understanding and implementing source code by proposing alternative versions of the same algorithm, and for academic purposes to automatically assess students' assignments.</p>
      <p>Acknowledgments</p>
      <p>The authors acknowledge project TEAMS funded by University of Catania PIACERI 2020/22.</p>
      <p>References</p>
      <p>[1] J. Siegmund, Program comprehension: Past, present, and future, in: Proceedings of 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), volume 5, 2016, pp. 13–20. doi:10.1109/SANER.2016.35.</p>
      <p>[2] W. Maalej, R. Tiarks, T. Roehm, R. Koschke, On the comprehension of program comprehension, ACM Transactions on Software Engineering and Methodology 23 (2014). doi:10.1145/2622669.</p>
      <p>[3] S. Ugurel, R. Krovetz, C. L. Giles, What's the code? automatic classification of source code archives, in: Proceedings of 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2002, pp. 632–638. doi:10.1145/775047.775141.</p>
      <p>[4] K. Tian, M. Revelle, D. Poshyvanyk, Using latent dirichlet allocation for automatic categorization of software, in: Proceedings of 6th IEEE International Working Conference on Mining Software Repositories (MSR), 2009, pp. 163–166. doi:10.1109/MSR.2009.5069496.</p>
      <p>[5] S. Srikant, V. Aggarwal, Automatic grading of computer programs: A machine learning approach, in: Proceedings of 12th International Conference on Machine Learning and Applications (ICMLA), volume 1, 2013, pp. 85–92. doi:10.1109/ICMLA.2013.22.</p>
      <p>[6] A. Midolo, E. Tramontana, Refactoring java loops to streams automatically, in: Proceedings of 4th International Conference on Computer Science and Software Engineering (CSSE), 2021, pp. 135–139. doi:10.1145/3494885.3494910.</p>
      <p>[7] A. Midolo, E. Tramontana, An api for analysing and classifying data dependence in view of parallelism, in: Proceedings of 10th International Conference on Computer and Communications Management (ICCCM), 2022, pp. 61–67. doi:10.1145/3556223.3556232.</p>
      <p>[8] A. Midolo, E. Tramontana, Automatic generation of accurate test templates based on junit asserts, in: Proceedings of the 7th International Conference on Algorithms, Computing and Systems (ICACS), 2023.</p>
      <p>[9] M. Shalaby, T. Mehrez, A. El Mougy, K. Abdulnasser, A. Al-Safty, Automatic algorithm recognition of source-code using machine learning, in: Proceedings of 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 170–177. doi:10.1109/ICMLA.2017.00033.</p>
      <p>[10] W. Crichton, G. G. Sampaio, P. Hanrahan, Automating program structure classification, in: Proceedings of 52nd ACM Technical Symposium on Computer Science Education (SIGCSE), 2021, pp. 1177–1183. doi:10.1145/3408877.3432358.</p>
      <p>[11] A. Taherkhani, L. Malmi, A. Korhonen, Algorithm recognition by static analysis and its application in students' submissions assessment, in: Proceedings of 8th ACM International Conference on Computing Education Research (ICER), 2008, pp. 88–91. doi:10.1145/1595356.1595372.</p>
      <p>[12] A. Taherkhani, Recognizing sorting algorithms with the c4.5 decision tree classifier, in: Proceedings of 18th IEEE International Conference on Program Comprehension (ICPC), 2010, pp. 72–75. doi:10.1109/ICPC.2010.11.</p>
      <p>[13] A. Taherkhani, L. Malmi, Beacon- and schema-based method for recognizing algorithms from students' source code, Journal of Educational Data Mining 5 (2013) 69–101. doi:10.5281/zenodo.3554635.</p>
      <p>[14] S. B. Ankali, L. Parthiban, Development of porting analyzer to search cross-language code clones using levenshtein distance, in: Proceedings of Fourth International Conference on Smart Computing and Informatics (SCI), Springer, 2021, pp. 623–632. doi:10.1007/978-981-16-0878-0_60.</p>
      <p>[15] T. Ishio, N. Maeda, K. Shibuya, K. Inoue, Cloned buggy code detection in practice using normalized compression distance, in: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 591–594. doi:10.1109/ICSME.2018.00022.</p>
      <p>[16] A. Sheneamer, J. Kalita, Code clone detection using coarse and fine-grained hybrid approaches, in: Proceedings of 7th IEEE International Conference on Intelligent Computing and Information Systems (ICICIS), 2015, pp. 472–480. doi:10.1109/IntelCIS.2015.7397263.</p>
      <p>[17] Q. U. Ain, W. H. Butt, M. W. Anwar, F. Azam, B. Maqbool, A systematic review on code clone detection, IEEE Access 7 (2019) 86121–86144. doi:10.1109/ACCESS.2019.2918202.</p>
      <p>[18] N. Smith, D. Van Bruggen, F. Tomassetti, Javaparser: visited, Leanpub (2017).</p>
      <p>[19] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet physics doklady, volume 10, Soviet Union, 1966, pp. 707–710.</p>
      <p>[20] A. Niewiarowski, et al., Short text similarity algorithm based on the edit distance and thesaurus, Czasopismo Techniczne 2016 (2016) 159–173. doi:10.4467/2353737XCT.16.149.5760.</p>
      <p>[21] C. Ragkhitwetsagul, J. Krinke, D. Clark, A comparison of code similarity analysers, Empirical Software Engineering 23 (2018) 2464–2519. doi:10.1007/s10664-017-9564-7.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>