Detecting Source Code Similarity Using Compression

IVAN PRIBELA, GORDANA RAKIĆ and ZORAN BUDIMAC, University of Novi Sad

Different forms of plagiarism make fair assessment of student assignments more difficult. Source code plagiarism poses a significant challenge, especially for automated assessment systems aimed at students’ programming solutions. Different automated assessment systems employ different text or source code similarity detection tools, and all of these tools have their advantages and disadvantages. In this paper we revitalize the idea of similarity detection based on string complexity and compression. We slightly adapt an existing third-party approach, implement it, and evaluate its potential on synthetically generated cases and on a small set of real student solutions. On the synthetic cases we show that the average deviation (in absolute values) from the expected similarity is less than 1% (0.94%). On the real-life examples of student programming solutions we compare our results with those of two established tools. The average difference is around 18.1% and 11.6%, while the average difference between those two tools is 10.8%. However, the results of all three tools follow the same trend. Finally, some deviation is expected, as the observed tools apply different approaches that are sensitive to different factors of similarity. The obtained results additionally demonstrate open challenges in the field.

This work was partially supported by the Ministry of Education, Science, and Technological Development, Republic of Serbia, through project no. OI 174023: ”Intelligent techniques and their integration into wide-spectrum decision support”.

Author’s address: Ivan Pribela, Gordana Rakić, Zoran Budimac, University of Novi Sad, Faculty of Sciences, Department of Mathematics and Informatics, Trg Dositeja Obradovića 4, 21000 Novi Sad, Serbia; email: ivan.pribela@dmi.uns.ac.rs, gordana.rakic@dmi.uns.ac.rs, zoran.budimac@dmi.uns.ac.rs

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In: Z. Budimac and B. Koteska (eds.): Proceedings of the SQAMIA 2019: 8th Workshop on Software Quality, Analysis, Monitoring, Improvement, and Applications, Ohrid, North Macedonia, 22–25 September 2019. Also published online by CEUR Workshop Proceedings (http://ceur-ws.org, ISSN 1613-0073).

1. INTRODUCTION

Automatic assessment systems encounter significant problems when assessing students’ solutions to programming assignments. The main difficulties arise from a lack of academic discipline among the students. While the students are expected to submit their original solutions, sometimes this is not the case. The assignments that they are expected to solve are usually very similar, and there is a restricted set of varieties among the correct solutions. Therefore, plagiarism detection is always an open question in the assessment of students’ work.

Multiple tools and approaches are available for plagiarism detection. They are based on source code similarity detection algorithms or code clone detection tools. Available approaches rely on lexical, syntactic, structural, or semantic information about the programming solution. Thus, algorithms analyze character streams, token streams, control structures, or dependencies in the observed code. Usually, they are applied to the source code or some of its intermediate representations in the form of an intermediate language, a tree, or a graph. All of these similarity analysis approaches have their advantages and disadvantages. Their sensitivity to different categories of similarity and their precision vary, and therefore their applicability and usefulness depend on the objectives of the analysis [Novak et al. 2019].

Although high similarity between two projects does not mean that one is plagiarized from the other, it is often a good indicator. The students that submit the work of other students usually make only small, superficial adjustments. They often do not change the structure of the code, but only rename variables and functions or reorder statements that they know will not change the program’s correctness.
In this paper we describe an approach to similarity analysis based on string complexity and compression. We have chosen this method because the initial study presented in [Chen et al. 2004] showed promising results and a fresh perspective on settled practices, but the system is no longer available for use or extension.

Other approaches to measuring similarity and detecting plagiarism have concentrated on feature comparison and program structure matching. Early attempts are usually based on computing different software metrics and comparing the results. More recent tools usually use a form of greedy string tiling to compare two streams of tokens instead of a summary indicator. Both types of approaches have their strengths and weaknesses, with attribute counting systems performing better in the case of very small changes, and structure comparison systems exhibiting better performance when only a part of the code is copied.

In Section 2 we describe the background and motivation for our work, which lies in the earlier third-party solution [Chen et al. 2004] that introduced this approach and showed promise but is no longer available. A description of our approach and our contribution to the previous results can be found in Section 3. The results of our contribution, demonstrated on both artificial cases and real-life student solutions, are described in Section 4, followed by conclusions and future work in Section 5.

2. BACKGROUND AND RELATED WORK

The Kolmogorov complexity K(s) of a string s [Zvonkin and Levin 1970] is the length of the shortest computer program that, given no input, produces the string s as output. It can be considered a measure of the amount of information contained in the string s.

The conditional Kolmogorov complexity K(s|t) of two strings s and t [Zvonkin and Levin 1970] is the length of the shortest computer program that produces the string s as output when the string t is given as input. It can be viewed as a measure of the amount of information contained in the string s that is not present in the string t.

The chain Kolmogorov complexity K(s, t) of two strings s and t [Zvonkin and Levin 1970] is the length of the shortest computer program that produces the string s followed by the string t, together with the information on how to separate them. The chain rule for Kolmogorov complexity [Li et al. 2008] states that

    K(s, t) ≤ K(s) + K(t|s) + O(log K(s, t)),    i.e.    K(s, t) ≈ K(s) + K(t|s).
In other words, the length of the shortest computer program that produces the string s followed by the string t is approximately the same as the length of the shortest computer program that produces the string s, plus the length of the shortest computer program that produces the string t when given the string s as input.

The algorithmic mutual information IK(s, t) is defined as IK(s, t) = K(s) − K(s|t) and represents the amount of information about the string s that is contained in the string t. This quantity is symmetric up to a logarithmic factor, as shown in [Gács 1974], which gives us IK(s, t) ≈ IK(t, s). Having all this in mind, we get K(s, t) ≈ K(s) + K(t|s) ≈ K(t) + K(s|t) ≈ K(t, s).

If we take the absolute amount of mutual information in the strings s and t and divide it by the total amount of information in both strings, as in formula 1, we get a normalized measure of similarity between the strings s and t.

    sim(s, t) = IK(s, t) / K(s, t)    (1)

If the strings s and t are identical, this value will be equal to one; if the two strings are completely unrelated, the value will be zero. However, the sim function is not a properly defined metric. On the other hand, the function dsim(s, t) = 1 − sim(s, t) satisfies all the necessary conditions for a metric, up to a logarithmic error, as shown in [Li et al. 2001].

Unfortunately, Kolmogorov complexity is not computable. It is not possible to compute even a lower bound [Chaitin 1974]. However, a reasonable upper bound can be expressed as the length of the compressed representation of the string, as produced by a reasonable compression algorithm. This is the approach taken in the tool SID [Chen et al. 2004], which employs algorithmic complexity to check for similarities in source code. As an approximation of this value, a special compression algorithm called TokenCompress is used. However, this tool is no longer available for use, which is one of the motivating factors behind our work.

3. DESCRIPTION OF THE APPROACH

Our tool is also based on measuring Kolmogorov complexity, or, more accurately, approximating it using compression, similarly to the work done in [Chen et al. 2004]. A compressed version of a string can be viewed as a program written in a language executed by the corresponding decompressor. Interpreted by the decompressor, the output of such a program is the original string. Such a program also tends to be short, as the main goal of a compression algorithm is to produce the shortest possible sequence that decompresses into the original input.

As the original compression algorithm described in [Chen et al. 2004] is not available, we have chosen the GNU gzip library to approximate K(s). This library uses Lempel-Ziv coding (LZ77) [Ziv and Lempel 1977] and Huffman coding [Huffman 1952], and is considered the industry standard for general compression, with a good balance of time and compression ratio. We define |s| as the length in bytes of the compressed form of the string s when compressed using this library. As stated earlier, the value |s| is a reasonable approximation of K(s). Similarly, K(s, t) is approximated by |st|, and K(s|t) is approximated using the formula K(s|t) ≈ K(t, s) − K(t), based on the chain rule for Kolmogorov complexity. Having this in mind, the final similarity is calculated using formula 2.

    sim(s, t) = (|s| + |t| − |ts|) / |st|    (2)
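To make the computation concrete, the following is a minimal sketch in Python, assuming the standard zlib module as the compressor (zlib implements the same DEFLATE scheme, LZ77 plus Huffman coding, that gzip uses); the function names c and sim are ours and purely illustrative.

    import zlib

    def c(x: bytes) -> int:
        """Approximation of K(x): the length in bytes of the compressed
        form of x. Compression level 9 gives the tightest upper bound
        this compressor can offer."""
        return len(zlib.compress(x, 9))

    def sim(s: bytes, t: bytes) -> float:
        """Formula 2: the numerator |s| + |t| - |ts| approximates the
        mutual information IK(s, t), and the denominator |st|
        approximates the joint complexity K(s, t)."""
        return (c(s) + c(t) - c(t + s)) / c(s + t)

Compressing with zlib directly, rather than writing a full gzip container, avoids the fixed header and trailer bytes that would otherwise bias the measured lengths of short inputs.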
In order to calculate the percentage of common information, we have to further interpret this similarity metric, as was done in [Chen et al. 2004]. If we assume that the two strings s and t are of approximately the same length, as is common among student solutions, then sim(s, t) = p / (2 − p), where p is the percentage of the information common to both strings [Chen et al. 2002]. Solving for p gives formula 3.

    p = 2 · sim(s, t) / (1 + sim(s, t))    (3)

However, when students try to hide similarities, one of the methods they employ is adding redundant or unused code, thus increasing the total length of the copy. In such cases, when compared with each other, the copy and the original can differ considerably in length. To combat this, we propose two new measures in addition to the one already described in [Chen et al. 2004]. These measures represent the percentage of the string s that is present in t, and vice versa, and are given by formulas 4 and 5.

    sims(s, t) = (|s| + |t| − |ts|) / |s|    (4)

    simt(s, t) = (|s| + |t| − |ts|) / |t|    (5)

These two measures are not always symmetrical. For example, if we have a string s representing the source code from one student solution, and a string t generated from s by adding twice as much unused code, then sim(s, t) will not give a strong indication of possible plagiarism, as the percentage will be around 50%. However, sims(s, t) will give a percentage around 100%, immediately indicating that one string is present in the other in its entirety. A similar situation can be observed when a part of one solution is copied from the other without much addition or modification.
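Formulas 3 to 5 translate just as directly into code. The sketch below reuses c and sim from the previous sketch; again, the function names are our own.

    def p_common(s: bytes, t: bytes) -> float:
        """Formula 3: estimated share of information common to both
        strings, assuming s and t are of roughly the same length."""
        v = sim(s, t)
        return 2 * v / (1 + v)

    def sim_s(s: bytes, t: bytes) -> float:
        """Formula 4: share of the information in s that is also in t."""
        return (c(s) + c(t) - c(t + s)) / c(s)

    def sim_t(s: bytes, t: bytes) -> float:
        """Formula 5: share of the information in t that is also in s."""
        return (c(s) + c(t) - c(t + s)) / c(t)

In the padding example above, sim_s(s, t) stays near 1.0 even though p_common(s, t) drops to about 0.5, which is exactly the asymmetry the two additional measures are meant to expose.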
4. RESULTS

To explore the viability of our proposed approach, we tested the performance of our tool both in isolation and in comparison with two other similarity detection tools.

4.1 Synthetic cases

In the first step we tested our tool using synthetic benchmarks. We performed 22 separate tests by supplying two randomly generated strings as input to our tool and recording the calculated similarities. Each test was repeated 1000 times and the average similarity was recorded. In each test we generated three sequences of random data of predefined lengths for that test: unique content for the first input string, unique content for the second input string, and content common to both strings. The two inputs supplied to the tool were generated by combining the content from the appropriate random sequences.

The results of the tests are given in Table I. The value unq(s) represents the total amount of data unique to the string s, while unq(t) represents the same for the string t. The value cmn(s, t) represents the total amount of data common to both strings. The next two groups of values represent the expected and calculated percentages of similarity, while the last three columns contain the difference between the expected and calculated values. The value p represents the percentage of data common to both strings. The value ps represents the percentage of data in the string s that is also present in the string t. The value pt represents the percentage of data in the string t that is also present in the string s.

Table I. Similarity percentages for the synthetic benchmarks

      Amount of data     |     Expected similarity     |    Calculated similarity    | Difference in similarities
  unq(s) unq(t) cmn(s,t) |    p        ps       pt     |    p        ps       pt     |    p       ps      pt
      0      0    3000   | 1.00000  1.00000  1.00000   | 0.97311  0.97311  0.97311   | 0.0269  0.0269  0.0269
   3000   3000       0   | 0.00000  0.00000  0.00000   | 0.00100  0.00100  0.00100   | 0.0010  0.0010  0.0010
   3000   3000    3000   | 0.50000  0.50000  0.50000   | 0.48801  0.48801  0.48801   | 0.0120  0.0120  0.0120
   3000   3000    6000   | 0.66667  0.66667  0.66667   | 0.65475  0.65475  0.65475   | 0.0119  0.0119  0.0119
   6000   6000    3000   | 0.33333  0.33333  0.33333   | 0.32538  0.32538  0.32538   | 0.0080  0.0080  0.0080
   3000   3000    9000   | 0.75000  0.75000  0.75000   | 0.73850  0.73850  0.73850   | 0.0115  0.0115  0.0115
   9000   9000    3000   | 0.25000  0.25000  0.25000   | 0.24302  0.24302  0.24302   | 0.0070  0.0070  0.0070
      0   3000    3000   | 0.66667  1.00000  0.50000   | 0.64981  0.97440  0.48744   | 0.0169  0.0256  0.0126
      0   3000    6000   | 0.80000  1.00000  0.66667   | 0.78571  0.98204  0.65481   | 0.0143  0.0180  0.0119
      0   3000    9000   | 0.85714  1.00000  0.75000   | 0.84400  0.98462  0.73852   | 0.0131  0.0154  0.0115
   3000   6000       0   | 0.00000  0.00000  0.00000   | 0.00067  0.00100  0.00050   | 0.0007  0.0010  0.0005
   3000   6000    3000   | 0.40000  0.50000  0.33333   | 0.39051  0.48808  0.32544   | 0.0095  0.0119  0.0079
   3000   6000    6000   | 0.57143  0.66667  0.50000   | 0.56127  0.65479  0.49113   | 0.0102  0.0119  0.0089
   3000   6000    9000   | 0.66667  0.75000  0.60000   | 0.65587  0.73783  0.59029   | 0.0108  0.0122  0.0097
   3000   9000       0   | 0.00000  0.00000  0.00000   | 0.00050  0.00100  0.00033   | 0.0005  0.0010  0.0003
   3000   9000    3000   | 0.33333  0.50000  0.25000   | 0.32549  0.48816  0.24414   | 0.0078  0.0118  0.0059
   3000   9000    6000   | 0.50000  0.66667  0.40000   | 0.48995  0.65321  0.39198   | 0.0101  0.0135  0.0080
   3000   9000    9000   | 0.60000  0.75000  0.50000   | 0.59057  0.73830  0.49210   | 0.0094  0.0117  0.0079
   6000   9000       0   | 0.00000  0.00000  0.00000   | 0.00040  0.00050  0.00033   | 0.0004  0.0005  0.0003
   6000   9000    3000   | 0.28571  0.33333  0.25000   | 0.27714  0.32331  0.24250   | 0.0086  0.0100  0.0075
   6000   9000    6000   | 0.44444  0.50000  0.40000   | 0.43589  0.49037  0.39231   | 0.0086  0.0096  0.0077
   6000   9000    9000   | 0.54545  0.60000  0.50000   | 0.53696  0.59073  0.49216   | 0.0085  0.0093  0.0078

As can be seen from the data, the similarity calculated by our tool is very close to the real values. More precisely, the values of all three functions differ from the expected values by between 0.04% and 2.69%. The calculated values differ from the expected ones on average by 0.94% for the similarity of both strings, 1.1% for the amount of data s has in common with t, and 0.85% for the amount of data t has in common with s, while the standard deviations of the differences are 0.59%, 0.69% and 0.57%, respectively. The retrieved values are consistently slightly lower because of the imperfections of the compression algorithm used to approximate the Kolmogorov complexity. With a compression algorithm closer to perfect compression, these values would probably be closer to the expected ones.
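In outline, one such test case could be generated and scored as in the sketch below, which reuses the functions from the earlier sketches. The way the unique and common sequences are combined here (simple concatenation) is our simplifying assumption; random bytes are used because they are essentially incompressible, so their information content coincides with their length.

    import os

    def synthetic_case(nu_s: int, nu_t: int, nc: int):
        """One synthetic test: nu_s and nu_t bytes unique to each string
        plus nc shared bytes. Returns the calculated and the expected
        (p, ps, pt) triples; assumes both generated strings are
        non-empty, i.e. nu_s + nc > 0 and nu_t + nc > 0."""
        common = os.urandom(nc)
        s = os.urandom(nu_s) + common
        t = os.urandom(nu_t) + common
        calculated = (p_common(s, t), sim_s(s, t), sim_t(s, t))
        expected = (2 * nc / (nu_s + nu_t + 2 * nc),
                    nc / (nu_s + nc),
                    nc / (nu_t + nc))
        return calculated, expected

Averaging the calculated triple over many repetitions of each length combination yields estimates directly comparable to the rows of Table I.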
We have randomly chosen one of the student solutions and changed all variable names without changing the rest of the code. This was done to illustrate a common way students try to mask plagiarism. The results of this test are illustrated by charts in figures 1 and 2. From the starting 14 solutions, all possible pairs were generated. Figure 1 shows 182 points in total representing each pair twice, once for each component in the pair. For each of these components, the figure shows the percentage of the code that is also present in the other component of the pair, as reported by all three tools. Figure 2 contains the similarity values reported for the pair as a whole. As SIM does not report this measure, it is not included in the graph on this figure. The results from our tool are comparable with the results from SIM and JPlag. As can be seen from the charts, percentages are roughly consistent within a margin, both for a single solution and for the whole pair. There are a few instances in which each of the tools give a slightly different result from the others. The discrepancies between our tool and each of the other two are roughly the same as differences between them and all three tools are showing the same trend. Anyhow, in no situation one of the tools gives a strong indication of similarity while the other two are indicating the opposite. However, our tool usually gives a slightly higher percentage of similarity. This is rooted in two funda- mental reasons. Firstly, both SIM and JPlag have a concept of a minimum run length. This parameter dictates how many consecutive matches are needed to consider a part of the program code a potential copy. Our tool does not have such restriction stated explicitly. However, implicitly as a feature of stan- dard compression algorithms, a match that is too small is disregarded on the basis that its compressed representation is longer that the uncompressed. 12:6 • Ivan Pribela et al. Fig. 1. Percentage of similarity within one solution in a pair Fig. 2. Percentage of similarity in a pair of solutions Secondly, our tool is basing the comparison on the level of characters, while SIM and JPlag are comparing token streams. This results in our tool detecting finer similarities that do not span whole tokens, like for example, between Integer, LongInteger and ShortInteger or by capturing a part of a token in front or behind the group detected by other tools. This does not mean that either of the approaches is better that the other. It depends on the particular case and warrants further investigation. 5. CONCLUSIONS AND FUTURE WORK In the area of automatic assessment of students’ programming solutions, one of the open challenges is plagiarism detection. While there are numerous tools for source code similarity analysis and for code clone detection, there is still no tool fully suitable to be used for these purposes. Some of the factors that affect the usefulness of these tools are the applied approach and the used similarity algorithm. In this paper we explore possibilities of employing string complexity and compression in source code similarity measurement. We evaluate gained results on synthetic examples in comparison with the expected values based on the background theory, as well as on real-life student solutions in comparison with alternative tools. In comparison with the expected values, we gain slightly lower similarity values. 
5. CONCLUSIONS AND FUTURE WORK

In the area of automatic assessment of students’ programming solutions, one of the open challenges is plagiarism detection. While there are numerous tools for source code similarity analysis and for code clone detection, there is still no tool fully suitable for these purposes. Some of the factors that affect the usefulness of these tools are the applied approach and the similarity algorithm used.

In this paper we explore the possibilities of employing string complexity and compression in source code similarity measurement. We evaluate the obtained results on synthetic examples, in comparison with the expected values based on the background theory, as well as on real-life student solutions, in comparison with alternative tools.

In comparison with the expected values, we obtain slightly lower similarity values. The cause of this is the imperfection of the chosen compression algorithm, and the fact that it implements lossless compression, which considers the equality of string sequences rather than their actual similarity. Hence, in future work we can experiment with alternative algorithms that are lossy and sacrifice a little accuracy to, for example, better capture attempts at renaming variables.

The results obtained from our tool, when compared to the similarity values acquired from the competing tools, show the same general trend, although our calculated similarity values tend to be slightly higher. The differences are caused by the different observation units employed (character-based versus token-based observation). Experimentation in this direction may bring higher precision, but also lower detection capabilities. Also, one of the important features in similarity analysis is the consideration of base code, as student assignments usually include some predefined elements of the future solution. This is another significant direction for future work.

REFERENCES

Gregory J. Chaitin. 1974. Information-theoretic limitations of formal systems. Journal of the ACM 21, 3 (1974), 403–424.
Xin Chen, Brent Francia, Ming Li, Brian McKinnon, and Amit Seker. 2004. Shared information and program plagiarism detection. IEEE Transactions on Information Theory 50, 7 (2004), 1545–1551.
Xin Chen, Ming Li, Brian McKinnon, and Amit Seker. 2002. A theory of uncheatable program plagiarism detection and its practical implementation. SID website at http://dna.cs.ucsb.edu/SID (2002).
Peter Gács. 1974. On the symmetry of algorithmic information. In Doklady Akademii Nauk, Vol. 218. Russian Academy of Sciences, 1265–1267.
Dick Grune. 2006. The software and text similarity tester SIM. (2006).
David A. Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40, 9 (1952), 1098–1101.
Ming Li, Jonathan H. Badger, Xin Chen, Sam Kwong, Paul Kearney, and Haoyong Zhang. 2001. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 2 (2001), 149–154.
Ming Li, Paul Vitányi, et al. 2008. An Introduction to Kolmogorov Complexity and Its Applications. Vol. 3. Springer.
Matija Novak, Mike Joy, and Dragutin Kermek. 2019. Source-code similarity detection and detection tools used in academia: A systematic review. ACM Transactions on Computing Education 19, 3 (2019), 27.
Lutz Prechelt, Guido Malpohl, Michael Philippsen, et al. 2002. Finding plagiarisms among a set of programs with JPlag. J. UCS 8, 11 (2002), 1016.
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 3 (1977), 337–343.
Alexander K. Zvonkin and Leonid A. Levin. 1970. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys 25, 6 (1970), 83.