<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SATToSE</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Robust and Automatic Approach for Matching Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Midolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emiliano Tramontana</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Matematica e Informatica, University of Catania</institution>,
          <addr-line>Catania</addr-line>,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>15</volume>
      <fpage>12</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Modern software applications can have millions of lines of code. Moreover, when several developers collaborate to develop code, each of them could have their own coding style, e.g. follow a personal code convention, prefer some data structures, etc. Having a large amount of code and several coding styles can negatively affect the readability of the code, especially when it has not been documented properly. Furthermore, a somewhat incorrect version of an algorithm would lead, besides the presence of bugs, to convoluted, complex and inefficient code.</p>
      </abstract>
      <kwd-group>
        <kwd>Static analysis</kwd>
        <kwd>coding suggestions</kwd>
        <kwd>code reuse</kwd>
        <kwd>refactoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The bigger the repositories, the greater the demanding effort for developers when trying to understand code implemented by their colleagues. Visually inspecting source code to understand its structure and functionalities can be time consuming and error-prone, since developers could misidentify some algorithms. Generally, this occurs when the source code documentation is poor or missing, and when many developers contribute to the same project.</p>
      <p>Automatic Program Comprehension (PC) tries to solve these issues by proposing several approaches that automatically assist developers in understanding source code [1, 2]. Several applications of PC have been presented: source code classification according to specific categories [3, 4, 5], code clone detection, and algorithm recognition. Automatic approaches can help in analysing, comprehending and improving the source code, by means of tools that support the work of developers [6, 7, 8].</p>
      <p>The state of the art shows several techniques to automatically identify algorithms. Machine learning approaches have been presented [9, 10], where different classifiers are used to label code fragments. Other proposals use a hybrid approach, mixing static analysis and classifiers to identify the algorithm [11, 12, 13]; their classifier is based on a set of structural and truth value characteristics that are strictly related to sorting algorithms, therefore new beacons should be defined for different categories of algorithms. Furthermore, these approaches require a new training of the dataset if new labels are included in the classification, which can be time consuming and complex.</p>
      <p>This paper focuses on algorithm recognition and presents an innovative approach that automatically matches algorithms by inspecting the source code. Our approach uses static analysis to collect data from the source code and compute a similarity score with templates of known algorithms to identify the correct one. The use of templates guarantees that: new algorithms can be easily added for the recognition step; multiple versions of the same algorithm can be used to improve the accuracy of the identification; and many, if not all, categories of algorithms can be recognised (sorting, searching, traversing, etc.). The proposed approach consists in four main phases: firstly, a code parsing tool collects all the statements of the algorithm analysed; secondly, a statement transformation is performed to extract the data required for the similarity match; thirdly, the Levenshtein distance is computed to determine the similarity score; finally, the matching degree of the algorithms is evaluated.</p>
      <p>The Levenshtein distance has been widely used in program analysis, especially for code clone detection when evaluating the similarity between code fragments [14, 15, 16]. However, such approaches are strictly related to detecting clone fragments. They focus on detecting type-3 clones, that are portions of code differing in terms of whitespaces, comments, layouts and identifiers, and possibly having some modifications like the addition or removal of statements [17]. Conversely, we propose another approach in which the different statements of the algorithms under analysis are reflected in the matching score, hence having a varying degree of matching; moreover, the above said approaches refer to types and names (e.g. method and field names) to detect clones, whereas our approach focuses on the statements freed from the developer-chosen names, providing a more generalised identification.</p>
      <p>The rest of the paper is organised as follows. Section 2 describes the proposed approach with all the steps followed by the analysis. Section 3 shows three examples of the use of the approach and how the similarity score is computed on real scenarios. Section 4 displays the metrics obtained by the analysis of the examples previously shown and a comparison with a text similarity approach. Section 6 reports meaningful related works and compares them with our approach. Finally, conclusions are drawn in Section 7.</p>
      <sec id="sec-2-1">
        <title>2. Proposed Approach</title>
        <p>Algorithm 1 The algorithm of the proposed approach
procedure MatchingAlgorithm(SC)
    compilationUnits ← parse(SC)
    for cu ∈ compilationUnits do
        methods ← gatherMethods(cu)
    end for
    for mDecl ∈ methods do
        stmts ← mDecl.getStatements()
        for tmp ∈ templates do
            tmpStmts ← tmp.getStatements()
            score ← levenshtein(stmts, tmpStmts)
            scores.add(tmp, score)
        end for
    end for
end procedure</p>
        <p>We propose an approach that gathers data by parsing the source code and then evaluates the similarity score of an algorithm and a set of known algorithms. The proposed approach makes a proper generalisation of the algorithms to avoid depending on naming conventions or on statements that are not contributing to the main goal of the analysed algorithm. We make use of static analysis of the source code, hence executable files are not needed for the analysis. Algorithm 1 shows the main steps, as pseudo-instructions, followed by the proposed approach to match algorithms. The procedure takes as input the source code, variable SC, and parses it, extracting the compilation units. Then, method declarations are extracted and, for each, all the statement types are collected and compared to the algorithm templates to compute the similarity score. The approach can be structured in four steps: (i) parsing the source code by means of a tailored Visitor to gather all the data needed for the following analysis; (ii) selecting and transforming the most relevant statements; (iii) computing an adapted Levenshtein distance to determine the similarity score; (iv) evaluating the matching degree of the algorithms.</p>
        <p>2.1. Code Parsing</p>
        <p>We perform code parsing, by means of the Javaparser library, to extract all the data required to evaluate the similarity score. Javaparser is an automatic parser that generates an abstract syntax tree (AST) from source code and provides a set of APIs to perform operations on it [18]. The root of the AST is the CompilationUnit (representing a Java file), to which all code elements are connected, e.g. package declaration, class and method declarations, etc. Code inspection has been performed by using the VoidVisitorAdapter class, which lets us define a Visitor class to search for a specific property. In the Visitor class, the method visit() was implemented, which takes as parameters the type of object being searched (e.g. method declaration, statement, field) and the container in which data are stored; the body of the method contains all the instructions that are executed for every object found of the type specified as a parameter.</p>
        <p>We have defined a Visitor which looks for MDs (method declarations); once a MD is found, the visit() method extracts from its body all the statements, and stores them in a List preserving the order. The list of statements provided by Javaparser is further expanded, as it would otherwise miss: (i) nested statements (e.g. all the statements present in the body of a for loop, defined as ForStmt), and (ii) expressions, such as assignments, method calls, variable declarations, etc. (defined as ExpressionStmt).</p>
        <p>2.2. Statements Transformation</p>
        <p>The statements initially gathered by Javaparser are transformed to better serve the following analysis. Javaparser provides a function, getStatements(), to get the statements contained in the body of a method declaration; nested statements, e.g. statements contained in a for loop, are omitted by the said getStatements(), and just the parent statement is inserted in the list. To collect all the statements inside the method, we further extract nested statements and we place them in the list of statements right after their parent, preserving the order of the block.</p>
        <p>Thereafter, for each statement in the list, we extract the class type, avoiding collecting other parts such as names, types, comments and expressions, to better generalise the approach, so as to recognise different versions of the same algorithm; e.g. the statement for(int i=0; i&lt;size; i++) is represented in Javaparser by the ForStmt type, whereas all the other parts, such as the variable declaration (int i=0), the binary expression (i&lt;size) and the unary expression (i++), are omitted. For all the statements that can contain nested statements in their body, e.g. for, if, while, a custom statement is inserted at the end of the nested statements; its name is given by the concatenation of the "End" prefix with the type of the statement containing the nested ones, e.g. EndIf, EndFor, EndWhile. This custom statement allows us to identify which statements are nested, thus improving the precision of the approach when comparing algorithms.</p>
        <p>Listing 1: The upper part shows an iterative version of the bubblesort algorithm implemented in Java, whereas the bottom part displays the list of statements extracted by our approach, using the class types defined in Javaparser.
Statements Type:
ForStmt,
ForStmt,
IfStmt,
VariableDeclarationExpr,
AssignExpr,
AssignExpr,
EndIf,
EndFor,
EndFor</p>
        <p>The extracted types are defined by the Javaparser library1. Table 1 shows some examples of the types defined by the library: the column Statement Type represents the type, and the column Code shows an example of the code associated with the type; further types can be found in the documentation. The ExpressionStmt type does not have a code example because it can represent any type of expression: AssignExpr (a = b + c;), MethodCallExpr (method();), VariableDeclarationExpr (int a = 0;), etc. This class type is too generic, hence we select and insert in the statement list the type of the expression contained in the statement. If there are nested expressions, e.g. a method call with an ObjectCreationExpr argument (e.g. method(a, new Object())), we select the parent expression, in this example the MethodCallExpr.</p>
        <p>2.3. Adapting the Levenshtein Distance</p>
        <p>The Levenshtein distance is a string metric for measuring the difference between two sequences [19]. It is defined as the minimum number of operations (replace, insert and delete) required to change a sequence into the other. A string can be seen as a list of single characters; the Levenshtein algorithm iteratively compares all the characters and finds the minimum number of operations (insertion, deletion or substitution) required to make the two sequences equal.</p>
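        <p>The statements transformation described above can be sketched in plain Java. The following is a minimal, self-contained sketch: the Stmt class, the flatten() method and the hard-coded type tree are simplified stand-ins of our own devising, not the actual Javaparser API.</p>
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for an AST statement node: a type name plus the
// statements nested in its body (not the actual Javaparser classes).
class Stmt {
    final String type;
    final List children;
    Stmt(String type, Stmt... children) {
        this.type = type;
        this.children = Arrays.asList(children);
    }
}

public class StatementFlattener {
    // Flattens a statement tree into the ordered list of type names,
    // placing nested statements right after their parent and appending
    // a custom "End" marker after the children of a block statement.
    public static List flatten(List stmts) {
        List out = new ArrayList();
        for (Object o : stmts) {
            Stmt s = (Stmt) o;
            out.add(s.type);
            if (!s.children.isEmpty()) {
                out.addAll(flatten(s.children));
                out.add("End" + s.type.replace("Stmt", ""));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // a for loop containing an if, which contains an assignment,
        // modelled as a type tree
        Stmt loop = new Stmt("ForStmt", new Stmt("IfStmt", new Stmt("AssignExpr")));
        // prints [ForStmt, IfStmt, AssignExpr, EndIf, EndFor]
        System.out.println(flatten(Arrays.asList(loop)));
    }
}
```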
        <p>Listing 1 shows an example of data extracted from a method: the code on top shows the method bubbleSort(), while the bottom part displays the list of statements extracted by our visitor.
1https://www.javadoc.io/doc/com.github.javaparser/javaparsercore/latest/index.html</p>
        <p>We have implemented a custom version of the algorithm where two lists of statements are the compared sequences, and every character represents a single statement. Once defined the minimum number of operations, the similarity score is computed as [20]:
sim(s1, s2) = 1 - levDist(s1, s2) / max(size(s1), size(s2))
where s1 and s2 are the two sequences of statements, levDist() gives as output the Levenshtein distance, and size() gives the number of elements in a sequence.</p>
        <p>2.4. Algorithm Recognition</p>
        <p>Once all the data needed for the analysis have been obtained, we can compute the similarity scores according to the extracted list of statements, i.e. the analysed method is compared with a set of known algorithms that have been gathered and parsed beforehand. We have created a set of Java files containing the source code of several known algorithms, and for each one at least two versions are stored: iterative and recursive. For some algorithms, more versions have been implemented, as the aim is to improve the accuracy of our analysis. E.g., the bubblesort algorithm has two different versions besides the iterative and recursive ones: the one shown in Listing 1, and an optimised version where a boolean flag breaks the execution if no elements are swapped in the inner loop.</p>
        <p>void iterativeBubbleSortTemplate(int list[]) {
    int length = list.length;
    for(int i=0; i &lt; length; i++) {
        for(int j=1; j &lt; length-i; j++) {
            if(list[j-1] &gt; list[j]) {
                int swap = list[j-1];
                list[j-1] = list[j];
                list[j] = swap;
            }
        }
    }
}
Statements Type:
VariableDeclarationExpr,
ForStmt,
ForStmt,
IfStmt,
VariableDeclarationExpr,
AssignExpr,
AssignExpr,
EndIf,
EndFor,
EndFor</p>
        <p>Listing 2: The upper part shows one iterative version of the bubblesort algorithm stored in our template db, whereas the bottom part displays the list of statements extracted by our approach, according to the types defined in Javaparser.</p>
        <p>Listing 2 shows one of the templates of the iterative version of the bubblesort algorithm used in our analysis: the code on top displays the implementation of the algorithm, while the list below represents the statements extracted by our approach. We show this template because, according to our approach, it is the most similar to the code shown in Listing 1. However, the two methods have several textual differences: firstly, the names of some variables are different, e.g. the variable representing the number of elements (size and length), the array containing the elements (sort_arr and list), and the variable used for the swap (tmp and swap); secondly, the condition in the IfStmt is inverted; finally, the template has an additional statement compared to the example, the first statement VariableDeclarationExpr.</p>
        <p>Despite these differences, our approach correctly identifies the algorithm implemented. According to the Levenshtein distance, the number of operations needed to match the two sequences is one (an insertion, because the lists of statements differ in only one element). Indeed, the similarity score between these two sequences is computed as sim = 1 - (1/7) = 0.857, where 7 is the size of the longest sequence.</p>
        <p>Listing 3 shows a method implementing the factorial algorithm for an integer value in a recursive form. The list of statements is as follows: IfStmt, ReturnStmt, ReturnStmt. The analysis carried out by our approach has identified the method as the recursive version of the factorial algorithm with a similarity of 1.0. In such a case, the similarity is the maximum possible value since, given the simple structure of the algorithm, the types of statements used by such a method match 100% the factorial algorithm template.</p>
        <p>2https://github.com/AleMidolo/MatchingAlgorithms</p>
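        <p>The adapted distance and the score defined in Section 2.3 can be sketched as follows. This is a minimal self-contained sketch: the method names levDist() and similarity() mirror the formula above, while the raw lists and the example sequences are illustrative assumptions of ours, not the tool's actual code.</p>
```java
import java.util.Arrays;
import java.util.List;

public class StatementSimilarity {
    // Levenshtein distance where each element of the sequence is a whole
    // statement type rather than a single character.
    public static int levDist(List a, List b) {
        int n = a.size();
        int m = b.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i != n + 1; i++) d[i][0] = i;
        for (int j = 0; j != m + 1; j++) d[0][j] = j;
        for (int i = 1; i != n + 1; i++) {
            for (int j = 1; j != m + 1; j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    // sim(s1, s2) = 1 - levDist(s1, s2) / max(size(s1), size(s2))
    public static double similarity(List a, List b) {
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) return 1.0;
        return 1.0 - (double) levDist(a, b) / longest;
    }

    public static void main(String[] args) {
        List method = Arrays.asList("ForStmt", "IfStmt", "AssignExpr",
                "EndIf", "EndFor");
        List template = Arrays.asList("VariableDeclarationExpr", "ForStmt",
                "IfStmt", "AssignExpr", "EndIf", "EndFor");
        // one insertion separates the two lists, hence sim = 1 - 1/6
        System.out.println(similarity(method, template));
    }
}
```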
      </sec>
      <sec id="sec-2-2">
        <title>Listing 3: An example of the recursive factorial algorithm implemented in Java.</title>
        <p>public static int factorial(int n) {
    if (n == 0 || n == 1)
        return 1;
    return n * factorial(n - 1);
}</p>
      </sec>
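      <p>The recognition step of Section 2.4 can be sketched as a search for the template with the highest similarity score. The recognise() method, the template names and the hard-coded statement lists below are illustrative assumptions of ours, not the actual tool code.</p>
```java
import java.util.Arrays;
import java.util.List;

public class AlgorithmRecognizer {
    // Levenshtein distance over lists of statement types.
    public static int levDist(List a, List b) {
        int n = a.size();
        int m = b.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i != n + 1; i++) d[i][0] = i;
        for (int j = 0; j != m + 1; j++) d[0][j] = j;
        for (int i = 1; i != n + 1; i++) {
            for (int j = 1; j != m + 1; j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    public static double similarity(List a, List b) {
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) return 1.0;
        return 1.0 - (double) levDist(a, b) / longest;
    }

    // Returns the name of the template whose statement list is most
    // similar to the analysed method.
    public static String recognise(List method, String[] names, List[] templates) {
        int best = 0;
        double bestScore = -1.0;
        for (int k = 0; k != names.length; k++) {
            double s = similarity(method, templates[k]);
            if (s > bestScore) {
                bestScore = s;
                best = k;
            }
        }
        return names[best];
    }

    public static void main(String[] args) {
        // a recursive factorial with one extra leading declaration
        List method = Arrays.asList("VariableDeclarationExpr", "IfStmt",
                "ReturnStmt", "ReturnStmt");
        String[] names = { "ItBubblesort", "RecFactorial" };
        List[] templates = {
            Arrays.asList("VariableDeclarationExpr", "ForStmt", "ForStmt",
                "IfStmt", "VariableDeclarationExpr", "AssignExpr",
                "AssignExpr", "EndIf", "EndFor", "EndFor"),
            Arrays.asList("IfStmt", "ReturnStmt", "ReturnStmt")
        };
        // prints RecFactorial
        System.out.println(recognise(method, names, templates));
    }
}
```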
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>We tested our approach on four different algorithms, each implemented as a method. All the templates used by our approach can be found on a public repository2. The first example is shown in Listing 1, previously discussed; here, we discuss three other algorithms: a version of factorial and two versions of quicksort. Moreover, two versions of the quicksort algorithm are considered to test the approach on different versions of the same algorithm; both versions propose an iterative solution.</p>
      <p>Firstly, Listing 4 shows an iterative version of the quicksort algorithm that uses a stack as a support to sort the elements contained in the array passed as argument. The method consists in twelve statements, in order: two VariableDeclaration, MethodCall, WhileStmt, VariableDeclaration, AssignExpr, MethodCall, VariableDeclaration, and two IfStmt with a MethodCall in their body. Therefore, the method is characterised by such statements; indeed comments, names and types will be ignored for the purpose of the identification. Our template storage includes an implementation of the quicksort using a stack to sort the elements3; the differences between the template and this method are: the template has an additional VariableDeclarationExpr before the first push() call; in the example, the first instruction after the WhileStmt is a VariableDeclarationExpr, while in the template it is an AssignExpr; the partition() method call takes one more argument in the template; types and names of the variables are different. Considering the said differences, our analysis has computed a similarity score of 0.85, correctly identifying the algorithm.</p>
      <p>public void quickSortStack(short[] array) {
    // create a stack for storing
    // subarray start and end index
    Stack&lt;Pair&gt; st = new Stack&lt;&gt;();
    short finish = array.length - 1;
    // push the start and end index
    // of the array into the stack
    st.push(new Pair(0, finish));
    // loop till stack is empty
    while (!st.empty()) {
        // remove top pair from the stack and get
        // subarray starting and ending indices
        short begin = st.peek().getX();
        finish = st.peek().getY();
        st.pop();
        // rearrange elements across pivot
        short pv = partition(array, begin);
        // push subarray indices with elements
        // less than the current pivot to stack
        if (pv - 1 &gt; begin) {
            st.push(new Pair(begin, pv - 1));
        }
        // push subarray indices with elements
        // more than the current pivot to stack
        if (pv + 1 &lt; begin) {
            st.push(new Pair(pv + 1, begin));
        }
    }
}</p>
      <p>Listing 4: An iterative version of the quicksort algorithm using a stack of objects to sort the elements.</p>
      <p>Secondly, Listing 5 displays an iterative version of the quicksort algorithm that uses a supporting array to sort the elements of the array passed as argument. The method consists in fifteen statements, in order: three VariableDeclaration, two AssignExpr, WhileStmt, three AssignExpr, and two IfStmt with two AssignExpr in their body. There are two main differences in the structure between the two versions of the same algorithm: the total number of statements, twelve against fifteen, and the absence of MethodCall statements in the second version. An iterative version of the quicksort algorithm is stored in our template storage, and it uses an array to sort the elements like the method given as an example. The main differences between the template and the method are the following: the method has three VariableDeclarationExpr before the WhileStmt, whereas the template has only two; types and names of variables are different. The analysis has computed a similarity score of 0.86.</p>
      <p>Our approach correctly identified both versions because the analysis uses templates for different versions of the same algorithm, making the recognition more accurate. Still, the storage containing all templates can be updated with more versions of algorithms to make the approach more sensitive to differences and up-to-date.</p>
      <p>3The templates in the db can be found in the github repository given above.</p>
      <p>4. Results</p>
      <p>We have compared our approach to a text-based search approach between the methods and the templates of the algorithms. Table 2 shows the metrics obtained for the four methods previously described: column method displays the method considered for the analysis, respectively bubblesortV1 (Listing 1), factorialV1 (Listing 3), quicksortStack (Listing 4) and quicksortArray (Listing 5). The other five main columns are the templates used by our approach to match the algorithms: ItBubblesort is the bubblesort iterative version (the one displayed in Listing 2); RecFactorial is the factorial recursive version; ItQuicksortSt is the quicksort iterative version using a stack to sort the elements; ItQuicksortAr is the quicksort iterative version using an array to sort the elements; ItMergesort is the mergesort iterative version. We have also considered the mergesort to show how the analysis is able to distinguish different algorithms. Each of these columns has two subcolumns: sim and text are respectively the similarity score of our approach and the similarity score of the text comparison approach.</p>
      <p>We have highlighted in bold the highest score corresponding to the correct identification of the algorithm in subcolumn sim, whereas in subcolumn text the maximum score has been highlighted for the text match. We can see that the score assigned by our approach is much higher in each case, indicating a higher precision in the identification.</p>
      <p>public static void quickSortArray(
        long arr[], long low, long high) {
    // Create an auxiliary stack
    long[] list = new long[high - low + 1];
    long max = -1;
    long pivot = 0;
    // push initial values to stack
    list[++max] = low;
    list[++max] = high;
    // Keep popping from stack while not empty
    while (max &gt;= 0) {
        h = list[max--];
        l = list[max--];
        // Set pivot element at its correct
        // position in sorted array
        pivot = partition(arr, low, high);
        // If there are elements on left side
        // of pivot, push left side to stack
        if (pivot - 1 &gt; low) {
            list[++max] = low;
            list[++max] = pivot - 1;
        }
        // If there are elements on right side
        // of pivot, push right side to stack
        if (pivot + 1 &lt; high) {
            list[++max] = pivot + 1;
            list[++max] = high;
        }
    }
}</p>
      <p>Listing 5: An iterative version of the quicksort algorithm using an array to handle the sort of the elements.</p>
      <p>5. Discussion</p>
      <p>For all the five methods shown in Table 2, our approach shows a higher identification score compared to the text similarity approach. Indeed, we have a similarity score that is more than double for the quickSortStack (0.84 compared to 0.37) and quickSortArray (0.86 and 0.34) methods, and values about 40% greater for the bubblesortV1 (0.85 and 0.52) and factorialV1 (1.0 and 0.7) methods. Our approach performs better because it can generalise the matching, without considering names of variables, comments and names of types.</p>
      <p>Moreover, the matching scores given by our analysis are clearly greater than the scores of the other algorithms, whereas, with a text similarity approach, we can see that the quicksortArray method has a similarity score of 0.34 for ItQuickSortAr, 0.33 for ItMergesort and 0.28 for ItQuicksortSt, hence the closeness of such scores can bring ambiguity in the correct identification of the algorithm.</p>
      <p>Finally, our approach can distinctly recognise two different versions of the same algorithm: the quicksortStack method has a 0.84 score as ItQuicksortSt and 0.42 as ItQuickSortAr, while the quicksortArray method has respectively 0.46 and 0.86. The accurate identification of the version used is crucial when suggesting improvements or proposing different versions.</p>
      <p>The ability to recognise an algorithm is related to the set of templates, which is not easy to maintain and grow. An algorithm can be matched if a similar template of the same algorithm is already part of the templates. To handle this, the database can be populated with the most popular algorithms, and, if an algorithm is not included, it could be added by a developer using our approach and tool.</p>
      <p>The approach extracts the statement's type to evaluate the similarity between algorithms. On the one hand, two algorithms can have similar statements despite having a different behaviour, thus showing low accuracy in identifying the code's behaviour. On the other hand, the approach can give a degree of generalisation, since it is not dependent on the names and types encountered; therefore it is not based on the comprehension of how the algorithm was implemented, but on its structure. This property is the main difference between our approach and type-3 clone approaches.</p>
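      <p>The gap between the structural and the textual scores discussed above can be illustrated by computing the same edit-distance-based score once over statement types and once over raw characters. The helper names and the toy snippets below are our own illustrative assumptions, not the experiment's actual data.</p>
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StructuralVsTextual {
    // Levenshtein distance over lists of arbitrary elements.
    public static int levDist(List a, List b) {
        int n = a.size();
        int m = b.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i != n + 1; i++) d[i][0] = i;
        for (int j = 0; j != m + 1; j++) d[0][j] = j;
        for (int i = 1; i != n + 1; i++) {
            for (int j = 1; j != m + 1; j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    public static double similarity(List a, List b) {
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) return 1.0;
        return 1.0 - (double) levDist(a, b) / longest;
    }

    // Splits a string into a list of one-character strings, so that the
    // same distance can act as a plain text comparison.
    public static List chars(String s) {
        List out = new ArrayList();
        for (int i = 0; i != s.length(); i++) out.add(s.substring(i, i + 1));
        return out;
    }

    public static void main(String[] args) {
        // Two renamings of the same loop: identical structure, different text.
        String m1 = "for(int i=0; i!=size; i++) total = total + sort_arr[i];";
        String m2 = "for(int k=0; k!=length; k++) sum = sum + list[k];";
        List t1 = Arrays.asList("ForStmt", "AssignExpr", "EndFor");
        List t2 = Arrays.asList("ForStmt", "AssignExpr", "EndFor");
        // structural score is 1.0; textual score is penalised by renaming
        System.out.println("structural: " + similarity(t1, t2));
        System.out.println("textual: " + similarity(chars(m1), chars(m2)));
    }
}
```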
      <p>6. Related Works</p>
      <p>Algorithm recognition has been tackled by several approaches in the state of the art, both for the software engineering industry and for academic settings. In [9], the authors present a solution for automatic algorithm recognition using machine-learning techniques: they extract a dataset of source code containing algorithms, a feature extraction is carried out to collect all the characteristic data (e.g. count-vars, operators, constructs, etc.), and finally a tag updating is done to remove redundant tags. They have trained the dataset and built a group of classifiers to identify the algorithm. In order to add a new algorithm or category of algorithms to the classification, all the previous steps have to be executed again in order to properly train the new dataset, which could be time consuming; conversely, since our approach uses static analysis to match templates, a new template can be added to the collection without the need for further operations.</p>
      <p>In [13], the authors propose an algorithm recognition method that detects sorting algorithmic schemas; these schemas consist in a set of loops, features, operations, etc. Another approach discussing sorting algorithms is presented in [12, 11], where numerical (number of blocks, number of loops, etc.) and descriptive (iterative, recursive) characteristics are extracted from the source code and a C4.5 decision tree classifier is built to detect sorting algorithms. These approaches focus on sorting algorithms under some assumptions, such as that algorithms are expected to be implemented in a well-established way, e.g. the quicksort algorithm should be implemented in a recursive way since it is more common. Moreover, they define a set of characteristics that are mostly related to sorting algorithms. In contrast, our approach can identify a wider spectrum of algorithms, since the static analysis can be performed on any Java source code, and we consider different versions of the same algorithm to increase the identification ability.</p>
      <p>Many tools have been presented to measure source code similarity. Most of these approaches address problems such as code clone detection, software licensing violation and software plagiarism [21]. The Levenshtein distance is often used for clone detection. In [16], a hybrid technique is presented where source code is lexically analysed to detect and extract sub-blocks, then similar blocks are grouped and hashed, and finally the Levenshtein similarity and the Cosine similarity are used to compute the similarity between blocks and find type-3 clones. In [14], the authors propose a cross-language clone detector for C, C++ and Java: the input code is tokenized to obtain the keywords of the corresponding language, then these keywords are compared with the Levenshtein distance, and finally the clone types are classified based on the similarity of the keywords match. In [15], the authors present a tool to detect clones of a faulty code fragment; a Normalised Compression Distance is defined to detect duplicate code fragments.</p>
      <p>The above said approaches use the Levenshtein distance to detect code clone fragments, whereas our proposal defines a tailored distance to match methods with algorithm templates in order to achieve algorithm recognition. Furthermore, these tools are sensitive to differences in terms of statements between code fragments, because the presence or absence of multiple statements can lead to a misidentification of a code similarity; whereas our approach proposes a similarity score which, despite the statement differences, is able to define a matching grade for all templates, where the highest one is the most similar to the implemented algorithm.</p>
      <p>7. Conclusion</p>
      <p>This paper presented an automatic approach to recognise algorithms using static analysis. By parsing the source code it is possible to identify all the statements composing a method, transform them according to a specified format, and then compute the Levenshtein distance to obtain a similarity score between the method and several templates of known algorithms. The template having the highest score is suggested as the algorithm matching the analysed method. We performed an experiment on four methods to test our approach; the results obtained highlight a high accuracy when recognising the algorithm compared to a textual similarity.</p>
      <p>The versatility of the approach allows us to add more templates to widen the spectrum of recognisable algorithms and to increase the number of different versions of algorithms. This approach can be employed both for program comprehension purposes in software development, supporting developers in understanding and implementing source code by proposing alternative versions of the same algorithm, and for academic purposes to automatically assess students' assignments.</p>
      <p>Acknowledgments</p>
      <p>The authors acknowledge project TEAMS funded by University of Catania PIACERI 2020/22.</p>
      <p>References</p>
      <p>[1] J. Siegmund, Program comprehension: Past, present, and future, in: Proceedings of 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), volume 5, 2016, pp. 13–20. doi:10.1109/SANER.2016.35.</p>
      <p>[2] W. Maalej, R. Tiarks, T. Roehm, R. Koschke, On the comprehension of program comprehension, ACM Transactions on Software Engineering and Methodology 23 (2014). doi:10.1145/2622669.</p>
      <p>[3] S. Ugurel, R. Krovetz, C. L. Giles, What's the code? automatic classification of source code archives, in: Proceedings of 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2002, pp. 632–638. doi:10.1145/775047.775141.</p>
      <p>[4] K. Tian, M. Revelle, D. Poshyvanyk, Using latent dirichlet allocation for automatic categorization of software, in: Proceedings of 6th IEEE International Working Conference on Mining Software Repositories (MSR), 2009, pp. 163–166. doi:10.1109/MSR.2009.5069496.</p>
      <p>[5] S. Srikant, V. Aggarwal, Automatic grading of computer programs: A machine learning approach, in: Proceedings of 12th International Conference on Machine Learning and Applications (ICMLA), volume 1, 2013, pp. 85–92. doi:10.1109/ICMLA.2013.22.</p>
      <p>[6] A. Midolo, E. Tramontana, Refactoring java loops to streams automatically, in: Proceedings of 4th International Conference on Computer Science and Software Engineering (CSSE), 2021, pp. 135–139. doi:10.1145/3494885.3494910.</p>
      <p>[7] A. Midolo, E. Tramontana, An api for analysing and classifying data dependence in view of parallelism, in: Proceedings of 10th International Conference on Computer and Communications Management (ICCCM), 2022, pp. 61–67. doi:10.1145/3556223.3556232.</p>
      <p>[8] A. Midolo, E. Tramontana, Automatic generation of accurate test templates based on junit asserts, in: Proceedings of the 7th International Conference on Algorithms, Computing and Systems (ICACS), 2023.</p>
      <p>[9] M. Shalaby, T. Mehrez, A. El Mougy, K. Abdulnasser, A. Al-Safty, Automatic algorithm recognition of source-code using machine learning, in: Proceedings of 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 170–177. doi:10.1109/ICMLA.2017.00033.</p>
      <p>[10] W. Crichton, G. G. Sampaio, P. Hanrahan, Automating program structure classification, in: Proceedings of 52nd ACM Technical Symposium on Computer Science Education (SIGCSE), 2021, pp. 1177–1183. doi:10.1145/3408877.3432358.</p>
      <p>[11] A. Taherkhani, L. Malmi, A. Korhonen, Algorithm recognition by static analysis and its application in students' submissions assessment, in: Proceedings of 8th ACM International Conference on Computing Education Research (ICER), 2008, pp. 88–91. doi:10.1145/1595356.1595372.</p>
      <p>[12] A. Taherkhani, Recognizing sorting algorithms with the c4.5 decision tree classifier, in: Proceedings of 18th IEEE International Conference on Program Comprehension (ICPC), 2010, pp. 72–75. doi:10.1109/ICPC.2010.11.</p>
      <p>[13] A. Taherkhani, L. Malmi, Beacon- and schema-based method for recognizing algorithms from students' source code, Journal of Educational Data Mining 5 (2013) 69–101. doi:10.5281/zenodo.3554635.</p>
      <p>[14] S. B. Ankali, L. Parthiban, Development of porting analyzer to search cross-language code clones using levenshtein distance, in: Proceedings of Fourth International Conference on Smart Computing and Informatics (SCI), Springer, 2021, pp. 623–632. doi:10.1007/978-981-16-0878-0_60.</p>
      <p>[15] T. Ishio, N. Maeda, K. Shibuya, K. Inoue, Cloned buggy code detection in practice using normalized compression distance, in: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 591–594. doi:10.1109/ICSME.2018.00022.</p>
      <p>[16] A. Sheneamer, J. Kalita, Code clone detection using coarse and fine-grained hybrid approaches, in: Proceedings of 7th IEEE International Conference on Intelligent Computing and Information Systems (ICICIS), 2015, pp. 472–480. doi:10.1109/IntelCIS.2015.7397263.</p>
      <p>[17] Q. U. Ain, W. H. Butt, M. W. Anwar, F. Azam, B. Maqbool, A systematic review on code clone detection, IEEE Access 7 (2019) 86121–86144. doi:10.1109/ACCESS.2019.2918202.</p>
      <p>[18] N. Smith, D. Van Bruggen, F. Tomassetti, Javaparser: visited, Leanpub (2017).</p>
      <p>[19] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet physics doklady, volume 10, Soviet Union, 1966, pp. 707–710.</p>
      <p>[20] A. Niewiarowski, et al., Short text similarity algorithm based on the edit distance and thesaurus, Czasopismo Techniczne 2016 (2016) 159–173. doi:10.4467/2353737XCT.16.149.5760.</p>
      <p>[21] C. Ragkhitwetsagul, J. Krinke, D. Clark, A comparison of code similarity analysers, Empirical Software Engineering 23 (2018) 2464–2519. doi:10.1007/s10664-017-9564-7.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>