<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Qualitative Estimation of Plagiarism Presence in Programming Assignment Submissions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetiana Karnaukh</string-name>
          <email>tkarnaukh@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska Street, Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>210</fpage>
      <lpage>218</lpage>
      <abstract>
        <p>The aim of this paper is to develop an approach for qualitative estimation of plagiarism presence in programming assignment submissions. Desired algorithm should take into account that some parts of assignment can be discussed in class, can implement design patterns or can contain pieces of code provided by instructors. The distinctive feature of processed data is that analyzed submitted programs solve similar tasks and therefore cannot be very different. So, we are trying to find the most similar programs between similar ones. An approach to qualitative estimation of plagiarism presence among the set of programs for similar tasks is formulated. According to it, all the submissions should be processed simultaneously. To analyze a submission, a multiset formed by the numbers of n-grams common with other submissions is computed. The decision about submission originality, i.e., plagiarism absence, is based upon the density of this multiset largest elements. The proposed technique is simple to implement for any programming language course and is quite effective to help an instructor to recognize signs of borrowings in programming assignments. It is used by the author since 2018. Plagiarism, programming assignment, similarity, n-grams, density</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The problem of text document similarity estimation has large applied importance today. Due to a
large amount of information, people need to automate the processes of distinguishing unique elements
as well as the dual processes of finding out similarities. As a result, numerous methods of similarity
estimation are used in the searching systems and so on. One branch of such methods is focused on the
analysis of texts on plagiarism. Among all texts, one can mark out program texts as texts with some
specifics. And, finally, estimation of arbitrary programs similarity is not a goal of this paper. More
precisely, this paper is concerned about how given a set of quite similar programs, one can find out the
most similar ones. The input data consist of all the submissions of some programming assignment. A
key point is that this assignment can be only slightly different for different students. For each program
from the set, the problem is to decide if a given program is original or it borrows some essential parts
from another program that we have.</p>
      <p>In general, the methods of similarity estimation are well known. But among them, there are no
universal method that would give an adequate similarity estimation in all the domains.</p>
      <p>
        A large family of methods for texts similarity evaluation is based upon n-grams (in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are mentioned
as n-shingles) technique and Jaccard similarity of sets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Different n-grams techniques are considered
in [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1-4</xref>
        ]. Another group is based upon finding common subsequences, for example [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Methods of both
groups can do text preprocessing. For natural languages it can be, for example, lemmatization.
Considering the specifics of program code, methods for code analyzing transform code into a sequence
of tokens, which are not necessary tokens of corresponding programming language, but semantically
quite close to them [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], and then analyze induced sequence. There are methods using abstract syntax
      </p>
      <p>
        2022 Copyright for this paper by its authors.
tree [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and programs behavior [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8-10</xref>
        ]. Programs similarity is interconnected with different plagiarism
attacks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or different ways of code cloning [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. One of the modern tendencies is applying
methods of AI to problem of program similarities [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], but these techniques use significant
computing resources and need a lot of data for training.
      </p>
      <p>
        In methods [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5-8</xref>
        ] after lexical analysis, when a sequence of programming language tokens is
derived, one should make additional transformation, and this transformation depends on the
programming language. For modern popular programming languages, there are tools for lexical analysis
so that it is possible to use existing libraries to transform code into programming language token
sequence. But additional transformation should consider syntactic as well as semantic programming
language specifics. Thus, such transformation should depend on programming language too. In case of
evolutive programming languages, for example Python, making frequent update of this transformation
for a more modern language dialect can be quite inconvenient. Sometimes language changes can invoke
minor changes in the additional transformation. But sometimes one can think of more significant
changes which could not be easily worked out and implemented.
      </p>
      <p>It is worth of pointing out that methods which simply estimate similarity of two programs are not
very applicable to the situation in which it is necessary to define the most identical objects among the
moderate amount of almost identical ones. In such circumstances, one should consider that the
similarity of submissions with no plagiarism can be quite large.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Strategy of programming assignments composing and usage of proposed technique</title>
      <p>Each programming course is accompanied with series of programming assignments for
unsupervised work. These assignments can be very simple tasks for learning certain programming
techniques and constructions as well as more complicated programming projects for acquiring skills
not only in coding, but in planning work, designing a project, testing and so on.</p>
      <p>Modern programming is built upon patterns and libraries. Following the notion of good code,
programmers use meaningful names, which for typical programming entities are quite typical. As usual,
the best programs for very easy assignment do not vary too much due to using well-known patterns,
and this is normal. Overachiever students write almost identical version of, for example, matrix
multiplication. If such assignment is given to 100 students, their programs can be grouped into some
not very large number of clusters. Moreover, for a very simple program we often could not assert that
it was done with collaboration with someone else even in case of 90% program text coincidence. In
most cases such simple assignments are not supposed to be graded. For more complicated assignment,
a student has more alternatives. He or she should decide which patterns to use and how to implement
these patterns. Also, a student should make other decisions about the code and its structure. As a result,
such programs can include not only implementations of standard design patterns but some original parts.
Their code can be near about 10 kB (the author deals with first-year students). A great problem with
such assignments is that some students tend not to program them on his or her own. Sometimes students
understand borrowed code but sometimes not, and the last happens much more often. An instructor
should be able to identify not written by one’s own programs because such assignments should not be
graded positively. One can think that if we assign each student a substantially different task, then we
can eliminate the problem of code plagiarism. From one side, this is true; but from the other side, this
is not a solution. First, it is a problem for instructors to invent 100+ different but equivalent in
complexity tasks each year. Then, the programming course includes some classes where the lab
instructor and students discuss some design patterns and the instructor guides students through their
tasks gradually. In case of very distinct assignments, there can be no common guides suitable to all
tasks simultaneously. Finally, if tasks are very different and there are no guides, then students will begin
to find external help with more probability. Our teaching goal is to stimulate students to do the work by
themselves and to teach them to solve not only the simplest problems so that the above-mentioned
approach is not suitable. An opposite approach is to give each student the same assignment. This is not
a good idea too because provokes students to borrow code from each other.</p>
      <p>A possible solution is to use similar, but not the same tasks. This let instructors not only guide
students through their tasks but let them automate testing (to be more precise, automate test generation
because automation of testing is not a problem if we have tests). Different in details tasks make students
deep into the samples and guides because they cannot be used literally. Unfortunately, some students
prefer to ask other students for code anyway. It does not mean that one student copies the whole program
of other. He or she can copy some parts, also he or she can slightly modify borrowed code and he or
she can have quite different set of the translation units. Borrowed parts do not always coincide literally.
From the other side, students never write the same code after discussions in the class. Their programs
are more different than in case of simple copy-paste method.</p>
      <p>Instructions, guides, requirements to code, and other like that have great influence on the design and
code of the resulting programs making them quite similar. But these are objective facts which could not
be eliminated for educational projects. Unfortunately, from the legal point of view, the reason that some
program is marked as plagiarism by some plagiarism detection tool is not enough to reject it. Very often
its “author” insists that the program is a product of his or her own and that coincidence of code parts is
accidental or is the result of implementing design patterns. The only way to prove that the plagiarism
detection tool was right is to demonstrate student that he or she is not aware of “own” code details. In
such circumstances there are no sense to find out all common parts.</p>
      <p>It should be also noted that programs of the first-year students can be grouped easily by certain errors
in the project and/or code structure and other irrelevances. During code borrowing, exactly these
unsuccessful elements migrate from one student to other. And this helps instructors to prove that code
is plagiarized. Other, and more effective, plagiarism proving method is based upon student’s ability to
make some minor changes in own code. Its advantage is that it does not suppose an instructor be
acquainted with minor details of the disputable submission in advance (instructor can firstly test
modified program as “black box”). But this is not a subject of this paper.</p>
      <p>This paper discusses the technique which was successfully used in practice and let the author quite
effectively figure out code with elements of plagiarism. Or, the dual problem, solved by this technique,
is to help an instructor decide if some program is quite original to be positively graded or not and should
its author be interviewed or not.
3. Proposed
methodology
of
programming
assignment
submissions
processing
3.1.</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental data</title>
      <p>The main research was conducted in 2018. Its main goal was to formulate a methodology for
qualitative estimation of programming assignment code originality or plagiarism. At time of grading,
automation of such methodology should help instructors to find out submissions exposing signs of
borrowing. At that moment, the author had data consisted of three programming assignments coded in
C++ language by approximately 100 students:
1. In a numerical sequence find the maximal subsequence that satisfies some condition.
2. Implement certain traversal of some area of a square matrix.</p>
      <p>3. Build class which models some device, then build this class client code so that a user can manage
this device interactively.</p>
      <p>
        First two tasks were more concerned about algorithms, but the third one dealt mainly with code
organization. Individual tasks for each assignment were similar, but not identical. They could use some
universal decisions for certain parts (for example, sequence input), and these common parts of code
were not separated at time of submissions processing (and this approach is opposed to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). Also, our
teaching staff never restricts students to using fixed IDE for code developing.
      </p>
      <p>In spring 2020, the programming course was concerned about Python. Due to specifics of distance
learning, we should refine decision rules. As experimental data were used programming assignment
about processing text file with some table data. Thus, the input consists of all the submissions of some
programming assignment. (Different programming assignments are processed separately.) Let us
denote the input as D and suppose there are 
submissions. Without loss of generality, suppose</p>
      <p>= { 1,  2, … ,   }, where each submission   is a set of text files with program code.
3.2.</p>
    </sec>
    <sec id="sec-4">
      <title>Short description of the algorithm</title>
      <p>The final algorithm has three main logical steps. At step 1 each submission is transformed into a set
of  -grams. To implement this, some text preprocessor, as well as  -gram length (i.e., value of  ),
should be specified. Let 
( ,  ) denote the set of  -grams constructed for submission  .</p>
      <p>Step 2 deals with all the submissions converted to sets of  -grams. Let us define the intersection
number of submissions  ′ and  ′′ as the number of elements in intersection of their corresponding sets
of  -grams and denote it as  ′ ⊗  ′′. By definition,
s
′ ⊗ s′′ = |grams(s′, n) ∩  rams(s′′, n)|
(1)</p>
      <p>At step 2 intersection numbers are calculated for each pair of submissions. For each submission, in
the output of step 2 there is a multiset of its intersection numbers with the other submissions.</p>
      <p>At step 3 constructed multisets are tested. The key points of these processes will be described in
detail later.
3.3.</p>
    </sec>
    <sec id="sec-5">
      <title>Step 1: transforming submissions into sets of  -grams</title>
      <p>Let 
Let</p>
      <p>Transforming submissions into sets of  -grams involves given text files preprocessing. There are
different approaches to the problem what should be considered as a symbol while  -gram forming: a
word (i.e., token) or literally a symbol of the text. A chosen approach and its details can be encapsulated
in a subroutine which implements code preprocessing.</p>
      <p>Subroutines A and B were selected as preprocessing transformations for later analysis.</p>
      <p>Subroutine A eliminates comments, white symbols, and string literals (except f-strings for Python)
from code. Disposing of literals and comments (as containing natural language elements) let us slightly
reduce amount of main memory needed for storing all possible n-grams. But the identifiers are essential
to our goal. Students usually do not change them in borrowed code, and, nevertheless, if they do, then
these changes are hardly noticeable. Another reason is that those who use borrowed code without
understanding do not feel themselves free to update identifiers properly, i.e., saving meaningful names.</p>
      <p>As for white symbols, atypical same spacing could help see borrowings more definitely.
Nevertheless, analyzing spaces in code is only waste of time and memory because many modern IDEs
format code automatically. Subroutine B converts text into a programming language token sequence.
This transformation converts all number literals into the same token as well as all identifiers. Having
token numbered, the resulting sequence is a sequence of corresponding token numbers. From technical
point of view, our implementation casts these numbers to characters.</p>
      <p>be preprocessing transformation, then for any file  let us denote the result of applying
subroutine  to  as  ( ). The value of  ( ) is a string anyway.</p>
      <p>
        ( ,  ) denote the set of all  -grams of string 
=  1 2 …   . The definition from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
can be formally written in our notation in a such way
grams( ,  ) = {    +1 …   + −1 |   ∈ ̅1̅,̅̅̅̅̅−̅̅̅̅̅+̅̅̅1̅}
      </p>
      <p>⋃ 
 =1

( ,  ) =
( (  ),  )</p>
      <p>For clearness, it should be stated that a set (not a multiset) is considered. Then, assuming
preprocessing subroutine  is chosen, let us form the set of n-grams of submission from all n-grams of
its files. Formally, let submission  contain files  1,  2, …,   , i.e.,  = { 1,  2, … ,   }, then
(2)
(3)</p>
    </sec>
    <sec id="sec-6">
      <title>Adjustment of n-gram length</title>
      <p>During preliminary analysis, different  -gram length values were examined. Our main goal was to
find balance between quality of plagiarism determination, optimization of the space complexity of
analyzer and time for analyzer code development. It was noticed that at transition from  = 5 to  =
6 the number of  -grams increases not so drastically as before. Therefore, it was suggested that next
small increase of  -gram length would not change the general estimations of originality/similarity
qualitatively. Clear, that given a large value of  , many partial borrowings could be skipped. Also, a
small value of  leads to large intersection numbers regardless of similarity presence.</p>
      <p>Finally, the decision was to use value</p>
      <p>= 5 for subroutine A. For subroutine B, the choice was</p>
      <p>
        = 9 on the same considerations about increasing the number of  -grams. Let us describe one more
reason. The if, for, while, and other statements like that can be found almost in every program and these
statements have specific syntax and a lot of standard usage cases. That is why small values of  can
result in plenty of false positive results in case of token sequences analysis by  -grams methods. The
trigram method, described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], is not the best one for program code. It is obvious that two different
almost trivial functions can produce the same sequence of programming language tokens. For similar
programs, most corresponding short if statements produce very similar token sequences as well. One
can state the same about other types of statements.
3.5.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Step 2: calculation of intersection numbers</title>
      <p>Now, for each submission from  (the set of the submissions), we should compute its intersection
numbers. More formally, multiset { ⊗  ′| ′ ∈  ∖ { }} should be obtained. For step 3 (analysis), it is
convenient to store such multiset as a nonincreasing sequence of its elements. Let  ( ) denote such
sequence for submission  . The output of step 2 consists of such sequences for the given submissions.
3.6.</p>
    </sec>
    <sec id="sec-8">
      <title>Analysis of obtained data</title>
      <p>The main idea of determining if a given submission is original is based upon density of its largest
intersection numbers. Before formulating the decision rule, let us see some observations.</p>
      <p>In Table 1 and Table 2 for the submission about which is known for certain that it is not original,
the greatest intersection numbers, their dynamics and corresponding Jaccard similarity values are
presented. It should be noted that the greatest intersection numbers for 5-grams and 9-grams were
achieved on the different pairs of packages.</p>
      <p>As it turned out, for packages with borrowings, the distance between the greatest value and next to
it was substantially more than other distances between successive intersection numbers. Also, there
were situations when substantial decreasing of distances, i.e., some jump, took place somewhere on the
fifth largest value. Such situation is possible if an almost identical code is submitted not by two but by
the greater number of students. Tables 3 and 4 present the same data for the submission about which
is known for certain that it is original. One can see that for original submission nearby distances between
the largest values after preprocessing with subroutine A differ far less than for submissions with
plagiarism. After preprocessing with subroutine B these distances can be almost stable.
3.7.</p>
    </sec>
    <sec id="sec-9">
      <title>Step 3: making a decision</title>
      <p>In spring 2018, the decision about originality of submission  was made after analyzing the first
components of  ( ) (with  =  and  = 5).</p>
      <p>Suppose  ( ) = ( 1,  2, … ,   −1) and   is a submission which corresponds to intersection number
  (i ∈ ̅1̅,̅̅̅̅−̅̅̅1̅). (Recall that sequence  1,  2, … ,   −1 is nonincreasing.)</p>
      <p>If distance between two largest intersection numbers was more than 15% of the submission greatest
intersection number, i.e., ( 1 −  2)/ 1 &gt; 0.15, then this submission was marked as suspicious.
Submission  1, on which the greatest intersection value was achieved, was marked as suspicious too.
In some cases, submission  1 did not fulfill criterion of 15%, but the value of ( 1 −  2)/ 1 was quite
significant anyway. If this distance was less than 15% of the greatest intersection number, then, in
addition, dynamics of the ten largest intersection numbers  1,  2, … ,  10 was considered in search of
a jump. At average, intersection numbers for  =  and  = 9 was less than the same for  =  and
 = 5. Thus, the jumps of the nearby distances were not so obvious in case of tokens processing
(subroutine B) as in case of single characters processing (subroutine A). Investigation of numerical
series was limited to the first ten largest elements on considerations that code borrowings were not total
among students and that among approximately 100 students from different academical groups of
different lab instructors, hardly more than 10 students simultaneously would take the same source.
Examination of the greatest intersection numbers obtained for X = A and n = 5 allowed effectively find
out submissions with borrowings: suspicious submissions were analyzed by humans and their authors
were interviewed. There were almost no false positive results (almost all suspicious submissions were
proved to have some plagiarism). As for false negative results, it is impossible to estimate that number
exactly due to this process should involve human comparison of all pairs of submission and interview
with students. Also, it should be stated, that the number of proved plagiarism cases was approximately
25% greater that it seems to be after code inspection by only humans. From the other side, an instructor
usually remembers typical solutions found out in the students' programs. During grading internal
program characteristics, no more submissions with plagiarism were found (all the submissions were
reviewed by the same instructor). So, one can suppose that the number of false positive (without
plagiarism) results were not too much. The only goal was to determine would be the submission original
or not, so estimations were done not quantitively but qualitatively.</p>
    </sec>
    <sec id="sec-10">
      <title>4. Decision procedure refinement</title>
      <p>Due to distance learning, in spring 2020 students actively communicate by the internet. As a result,
some students coded their individual tasks in small commands. The consequence of this was that the
distance from the greatest intersection number to the nearest one more often appears not the largest
distance between two neighbor intersection numbers. The main decision-making algorithm should be
modified. For submission  and its sequence  ( ) = ( 1,  2, … ,   −1), another characteristic was used
instead the distance between only two largest intersection numbers. The 11 largest intersection numbers
were considered, and a jump was searched. Let</p>
      <p>j = arg max{a −   +1|  ∈ ̅1̅,̅1̅̅0̅} (4)</p>
      <p>Then the value of relative accumulated jump, i.e., ( 1 −   +1)/ 1, was examined as well as relative
jump, i.e., (  −   +1)/ 1, and relative distance between the two largest intersection numbers, i.e.
( 1 −  2)/ 1. If
and
( 1 −   +1)/  1 &gt; 0.15,
(  −   +1)/  1  &gt; 0.10,</p>
      <p>( 1 −  2)/ 1 &gt; 0.10, (7)
then submission  was considered as suspicious, corresponding to values  1,  2, … ,   submissions were
marked as suspicious too.</p>
      <p>If relative accumulated jump is large, but relative jump is not, then here potentially can be situation
when some parts was discussed by some academic group in class. So, the condition (6) is designed to
except such submissions from the set of suspicious ones. The purpose of the condition (7) is the same.
In most cases of borrowing presence, the relative distances between the two first intersection numbers
are sufficiently large even if the same code parts are submitted not by two but by more students. (Here
we should mention that the last condition should be skipped if student tasks are identical.)</p>
      <p>The nearer to the maximal value there is a jump (if any) in the sorted sequence of intersection
numbers, the less set of students simultaneously writes such submitted code. The greater is the
difference between the nearby greatest values, the more code of the project is borrowed.</p>
      <p>Let us note that consideration the first 25% of largest intersection numbers in place of only the
first 11 values did not change anything. In the most degree, the same threshold 15% as in 2018 is the
result of the same teaching approach and, surely, can have other value for some other assignment.</p>
    </sec>
    <sec id="sec-11">
      <title>5. Discussion</title>
      <p>It is worth to say some words about Jaccard similarity of submissions. In 2018, Jaccard similarity
was calculated for each assignment. Dependency between presence of borrowings and the Jaccard
similarity appeared not high. If we analyzed the Jaccard similarity only, then quite much plagiarism
would not be found. Processing assignments about maximal subsequence and matrix traversal was
uninformative. No valid programs with signs of borrowings were found. Using borrowed main code for
the algorithmic assignment is almost impossible: needed on its modification time and skills are much
more significant than development from the very beginning. The main attention was devoted to the
assignment about device modelling. This assignment is more complicated in the sense that it needs
some code organization and incremental design to make the program works right.</p>
      <p>For device modelling programs, pairwise calculation of Jaccard similarity showed very surprisingly
results. It appeared that means of Jaccard similarities for 5-grams after subroutine A applying and for
9-grams after subroutine B applying were approximately 0.15 and 0.105. At the same moment, the
greatest values of similarity were 0.67 and 0.91 accordingly. But there were unoriginal submissions
with maximal similarity to the other packages about 0.21 as well as original ones with maximal
similarity to the other packages about 0.24 (after preprocessing subroutine A). The same situation was
in case subroutine B applying. It was impossible to settle any Jaccard similarity threshold for all pairs
to separate original submissions from plagiarized ones. This phenomenon is a consequence of
assignments similarity, discussions that took place in classes, and varying packages length.</p>
      <p>Also, some experiments for subroutine  were conducted in 2020. It was interesting what influence
increasing the value of  from 5 to 6 would have. 21 suspicious packages were found with the threshold
15% for a relative accumulated jump and  = 5. After changing to  = 6, the number of suspicious
packages became 28. From 21 suspicious packages only one was not marked as suspicious after
6-grams analysis. The packages marked suspicious only with  = 6 were examined not only by
programs. It turned out that threshold 15% for  = 6 was not accurate, and the result was approximately
equal to the computation with threshold 11% and  = 5. Having threshold 11% for relative
accumulated jump, we should compare the values of relative jump and relative distance between the
two first intersection numbers not with 0.10 but with less values. Three discussed thresholds can depend
not only on the nature of the problems and their parts discussed in classes but on n-gram length. The
largest intersection numbers are slightly reducing with growth of  so the thresholds should be changed.</p>
      <p>Like neural networks are learned on data, after adjusting the value of  , the thresholds can be adapted
for each programming assignment personally. Assuming that the vast majority of students do not use
borrowed code, one can found out thresholds which cut off only 70%–80% of programs as original.
(The percentage should be a little less than expected percent of original programs.) Then their values
can be gradually reduced with simultaneous code inspection of some “new” suspicious submissions and
their matches. If such inspection discovers no code that looks as borrowed, then thresholds are found
(the latest reducing of the thresholds should be discarded). Remember, that the final decision is made
after interview with a student. So, too small thresholds can lead to interviewing almost all students that
could be unacceptable due to office hours limitations. For more accuracy it is possible to build a set of
decision rules based upon different preprocessing transformations and discussed relative values as well
as apply their AND and OR combinations. The discussed in the paper decision rules work well if
cheaters did not use more than one source to borrow significant amount of code. This is true in practice
because to compile two or more programs with different structures in one is not easier than to write
code without assistance.</p>
    </sec>
    <sec id="sec-12">
      <title>6. Conclusions</title>
      <p>The described methodology assumes simultaneous processing of all the submissions. The proposed
approach is not sensible to mutual location of code parts, to distribution of code between translation
units, and to presence of dummy code. It can be easily adapted to different programming languages and
their dialects by using preprocessing routines appropriate to a language. Moreover, it is acceptable that
some parts of assignment are discussed in class or contain pieces of code provided by instructors. An
important point here is that these parts should not be identified and removed before processing.</p>
      <p>Also, the described technique is quite agile in the sense that can be specialized not only with
preprocessing subroutine and n-gram length, but with thresholds for decision-making rules. This
methodology can be used with slightly different tasks of programming assignments as well as with
identical ones. The better results are exposed on quite complex tasks which need thoroughly code
organization and incremental design.</p>
      <p>The proposed technique only helps instructors to find out plagiarism among students but gives no
guarantees. The same thing is true for other methods in this field. In any case there could be some
students who successfully submit third-party code or code with elements of plagiarism. To reject
original code is impossible due to plagiarism proving procedure moderated by instructors.</p>
      <p>Further research can deal with applying the proposed density analysis not to intersection numbers
but to other pairwise similarity measures. The analogous situation with density of pairwise similarity
series is expected. But there are strong doubts that a lot of time wasted on programming more
sophisticated pairwise characteristics gives enhancement adequate to coding time or gives any
enhancement at all. Highly likely, it will be pure theoretical research. Another branch of research can
deal with code compiled from two or more sufficiently dissimilar sources.</p>
    </sec>
    <sec id="sec-13">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , Mining of massive datasets, 2nd ed., University Press, Cambridge,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>MAC</given-names>
            <surname>Jiffriya</surname>
          </string-name>
          , MAC Jahan,
          <string-name>
            <given-names>R.G.</given-names>
            <surname>Ragel</surname>
          </string-name>
          , S. Deegalla,
          <article-title>AntiPlag: plagiarism detection on electronic submissions of text based assignments</article-title>
          ,
          <source>in: Proceedings of the 8th International Conference on Industrial and Information Systems, ICIIS</source>
          <year>2013</year>
          , IEEE, NJ,
          <year>2013</year>
          , pp.
          <fpage>376</fpage>
          -
          <lpage>380</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICIIS31010.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Minaei-Bidgoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niknam</surname>
          </string-name>
          ,
          <article-title>An n-gram based method for nearly copy detection in plagiarism systems</article-title>
          ,
          <source>in: Proceedings of the 6th Workshop on Awareness and Reflection in Technology Enhanced Learning (ARTEL</source>
          <year>2016</year>
          ), Lyon, France,
          <year>September 13</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>On automatic plagiarism detection based on n-grams comparison</article-title>
          , in: M.
          <string-name>
            <surname>Boughanem</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berrut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          Soule-Dupuy (Eds.),
          <source>Advances in Information Retrieval</source>
          ,
          <string-name>
            <surname>ECIR</surname>
          </string-name>
          <year>2009</year>
          , volume
          <volume>5478</volume>
          of Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg,
          <year>2009</year>
          , pp.
          <fpage>696</fpage>
          -
          <lpage>700</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Grune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huntjens</surname>
          </string-name>
          ,
          <article-title>Het detecteren van kopieёn bij informatica-practica</article-title>
          .
          <source>Informatie</source>
          <volume>31</volume>
          (
          <issue>11</issue>
          ) (
          <year>1989</year>
          ):
          <fpage>864</fpage>
          -
          <lpage>867</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Prechelt</surname>
          </string-name>
          , G. Malpohl,
          <string-name>
            <given-names>M.</given-names>
            <surname>Philippsen</surname>
          </string-name>
          .
          <article-title>Finding plagiarisms among a set of programs with JPlag</article-title>
          .
          <source>J Univers. Comput Sci</source>
          .
          <volume>8</volume>
          (
          <issue>11</issue>
          ) (
          <year>2002</year>
          ):
          <fpage>1016</fpage>
          -
          <lpage>1038</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Yang. WASTK:</surname>
          </string-name>
          <article-title>An weighted abstract syntax tree kernel method for source code plagiarism detection</article-title>
          .
          <source>Scientific Programming</source>
          (
          <year>2017</year>
          ). doi:
          <volume>10</volume>
          .1155/
          <year>2017</year>
          /7809047.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.P.</given-names>
            <surname>Smith.</surname>
          </string-name>
          <article-title>Academic source code plagiarism detection by measuring program behavioral similarity</article-title>
          .
          <source>IEEE</source>
          <volume>9</volume>
          (
          <year>2021</year>
          ):
          <fpage>50391</fpage>
          -
          <lpage>50412</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2102.03995.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Anjali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Swapna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jayaraman</surname>
          </string-name>
          .
          <article-title>Plagiarism detection for Java programs without source codes</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>46</volume>
          (
          <year>2015</year>
          ):
          <fpage>749</fpage>
          -
          <lpage>758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>J.-H. Ji</surname>
            , G. Woo,
            <given-names>H.-G.</given-names>
          </string-name>
          <string-name>
            <surname>Cho</surname>
          </string-name>
          .
          <article-title>A plagiarism detection technique for Java program using bytecode analysis</article-title>
          ,
          <source>in: 2008 Third International Conference on Convergence and Hybrid Information Technology</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1092</fpage>
          -
          <lpage>1098</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCIT.
          <year>2008</year>
          .
          <volume>267</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Karnalim</surname>
          </string-name>
          .
          <article-title>Python source code plagiarism attacks on introductory programming course assignments</article-title>
          .
          <source>Themes in Science and Technology Education</source>
          ,
          <volume>10</volume>
          (
          <issue>1</issue>
          ),
          <year>2017</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q. U.</given-names>
            <surname>Ain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Anwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maqbool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Systematic</given-names>
            <surname>Review</surname>
          </string-name>
          <article-title>on Code Clone Detection</article-title>
          , in: IEEE Access,
          <year>2019</year>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>86121</fpage>
          -
          <lpage>86144</lpage>
          . doi:
          <volume>10</volume>
          .1109/ACCESS.
          <year>2019</year>
          .
          <volume>2918202</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Buratti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pujar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          .
          <article-title>Towards learning (dis)-similarity of source code from program contrasts</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, May 22-27</source>
          ,
          <year>2022</year>
          , v. 1, pp.
          <fpage>6300</fpage>
          -
          <lpage>6312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yasaswi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          .
          <article-title>Plagiarism detection in programming assignments using deep features</article-title>
          ,
          <source>in: 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>652</fpage>
          -
          <lpage>657</lpage>
          . doi:
          <volume>10</volume>
          .1109/ACPR.
          <year>2017</year>
          .
          <volume>146</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>