<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Normalization based Stop-Word approach to Source Code Plagiarism Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Saimadhav</forename><surname>Heblikar</surname></persName>
							<email>saimadhavheblikar@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">PES Institute of Technology Bangalore</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Poorva</forename><surname>Sharma</surname></persName>
							<email>poorvasharma0615@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">PES Institute of Technology Bangalore</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Manogna</forename><surname>Munnangi</surname></persName>
							<email>manogna08@gmail.com</email>
							<affiliation key="aff2">
								<orgName type="institution">PES Institute of Technology Bangalore</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Channa</forename><surname>Bankapur</surname></persName>
							<email>channabankapur@pes.edu</email>
							<affiliation key="aff3">
								<orgName type="institution">PES University Bangalore</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Normalization based Stop-Word approach to Source Code Plagiarism Detection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F9F1E080BA12AA5F0A4F798C5CAB744C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>CCS Concepts</term>
					<term>Information systems → Similarity measures</term>
					<term>Clustering and classification</term>
					<term>Source code reuse, Plagiarism detection</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper is a report of PES Institute of Technology's participation in the Cross Language Detection of Source Code Reuse (CL-SOCO) task at FIRE 2015 <ref type="bibr" target="#b1">[1]</ref>. We approach this task as a text document plagiarism task, without considering the formal grammatical structure of the programming languages. We normalize commonly used identifiers to detect pairs of programs which have the same objective. We also find that entirely removing these normalized operations improves the system.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Vast amounts of software code have become easily available on the Internet. Sites like Stack Overflow make solutions to common problems available. In such a scenario, software developers are tempted to copy and paste code from one place to another. This could cause the owners of the software legal, ethical, licensing and maintenance problems in the long run. Software plagiarism also affects competitive programming competitions like ACM ICPC. The sheer scale of available resources to plagiarize from and the possible number of plagiarized documents make source code plagiarism detection a daunting task.</p><p>Plagiarism detection in software source code differs from the text plagiarism detection task. One of the popular approaches to text plagiarism detection is the bag-of-words model <ref type="bibr" target="#b4">[4]</ref>. However, this is not useful in a software source code context, as a small set of programming constructs is bound to be reused repeatedly in programs that do altogether different things.</p><p>There currently exist tools like MOSS <ref type="bibr" target="#b2">[2]</ref> and JPlag <ref type="bibr" target="#b3">[3]</ref> which try to solve this problem. MOSS stands for "Measure Of Software Similarity" and is a system for detecting similarity in software. JPlag is a system for detecting software similarity considering text features as well as programming language features. Both JPlag and MOSS are used in academic environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">TASK DESCRIPTION</head><p>The Cross-Language Source Code Reuse (CL-SOCO) track of FIRE 2015 deals with the detection of plagiarism in software source code. The cross-language aspect deals with detecting plagiarism between C and Java sources.</p><p>The training set given to us consisted of 599 C and 599 Java files. These files were numbered from 001.c to 599.c and 001.java to 599.java. Files with the same number represent a plagiarized case. That is, 012.c and 012.java represent a reuse case, while 012.c and 021.java do not. Since some of these files were generated using a tool, they contained parse errors. This was true for both the C and Java data sets.</p><p>The test set given to us consisted of 79 C files and 79 Java files. These files were numbered from 300.c to 378.c and 000.java to 078.java.</p><p>Both the training and test corpora are available at <ref type="bibr" target="#b1">[1]</ref>.</p><p>It is important to note that we do not have to state the direction of reuse, that is, whether the reuse was from C to Java or from Java to C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">CURRENT WORK</head><p>In this section, we describe the state of research in the field of plagiarism detection in general, and source code plagiarism detection in particular.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Bag-of-words model</head><p>In this model, the document is represented as a bag of words. In practice, it is a multi-set. It disregards order and grammar, but accounts for multiplicity. Bag of words is shown to work well for the text plagiarism detection task <ref type="bibr" target="#b4">[4]</ref>. Its performance is not satisfactory for the programming language plagiarism detection task <ref type="bibr" target="#b5">[5]</ref>. The reason is that similar programming constructs are bound to be used a very high number of times in programs, even though these programs may be doing entirely different things.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">NLP Techniques</head><p>Common Natural Language Processing techniques like word n-grams are used to detect similarity between documents. Some works also consider using features of the text like number of white-spaces, average indentation, and other stylistic features for evaluation. A popular tool is XPLAG <ref type="bibr" target="#b6">[6]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Longest common substring</head><p>Tools like JPlag <ref type="bibr" target="#b3">[3]</ref> make use of the longest common substring (LCS) approach. This is a pairwise approach. The similarity between a pair is decided by the length of the longest common substring.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">SYSTEM DESCRIPTION</head><p>We build upon the existing work described in Section 3.1. We work on the bag-of-words model, modifying it to support term weighting. We then use word 1-grams as features. Our approach differs from XPLAG <ref type="bibr" target="#b6">[6]</ref> in the sense that we do not consider any of the other NLP techniques described in Section 3.2.</p><p>In this section we describe our approach. We present three iterative runs, each built upon and improving over its predecessor. Only the preprocessing stage varies between runs. The first run is the baseline run. The second run is the normalization run. The third run builds upon the second, and removes every normalized operation or identifier. We refer to this third run as stop-word removal, where the stop-words are the normalized operations or identifiers.</p><p>We divide the workflow into four stages: preprocessing, vectorizer, similarity phase, and deciding phase.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Preprocessing</head><p>The preprocessing stage is divided into two parts. The first part is the same for all approaches and is described below.</p><p>In the first part of preprocessing, every run of consecutive whitespace is converted to a single whitespace character. The code is then converted to lower case. Any accents in the text are stripped. The source code is then passed to a lexer. The lexer removes lexemes like +, -, *, / etc. The output of the lexer is a stream of tokens. Subsections 4.1.1 to 4.1.3 describe in detail the second part of the preprocessing stage for each run.</p></div>
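<div xmlns="http://www.tei-c.org/ns/1.0"><p>The first part of preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the lexer used in our runs: the token pattern, the function names and the choice to keep dotted names such as system.out.println as one token are all assumptions made for this example.</p><p>
```python
import re
import unicodedata

# Hypothetical token pattern: identifiers (optionally dotted) and numbers.
# Operators and punctuation never match, so they are dropped.
TOKEN_PATTERN = re.compile(r"[a-z_][a-z0-9_]*(?:\.[a-z_][a-z0-9_]*)*|\d+")


def strip_accents(text):
    """Remove combining marks after compatibility decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(source):
    """Collapse whitespace, lower-case, strip accents, then tokenize
    while discarding operator and punctuation lexemes."""
    text = re.sub(r"\s+", " ", source)
    text = strip_accents(text.lower())
    return TOKEN_PATTERN.findall(text)
```
</p><p>Because the pattern only matches identifiers and numbers, lexemes such as +, -, * and / are removed as a side effect of tokenization rather than in a separate pass.</p></div>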
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">Baseline</head><p>In this stage, no work is done. That is, there is no transformation of the tokenized stream obtained from part one of preprocessing. This approach serves as a baseline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">Normalization</head><p>The input to this approach is the output from the baseline approach. In this stage, we study the language usage features from the training data. We obtain frequency statistics about the most commonly used identifiers in both languages, C and Java. The identifiers were sorted by frequency in non-increasing order, and the list was pruned to those in the top-n positions. We also considered keywords as identifiers.</p><p>We then manually mapped similar identifiers/functions to new operation identifiers or op-codes. For example, printf is used as an output function in C, and System.out.println is used as an output function in Java. Both these functions perform similar operations. Therefore, we replace all occurrences of printf and System.out.println with a single op-code.</p></div>
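<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the normalization step, assuming the token stream has already been lower-cased. The dictionary below is only an excerpt of Table 1, and the names OP_CODES, top_identifiers and normalize are hypothetical, introduced for illustration.</p><p>
```python
from collections import Counter

# Excerpt of Table 1, lower-cased to match the lower-cased token stream.
OP_CODES = {
    "op12": ["print", "fprintf", "printf", "sprintf", "println",
             "system.out.println", "system.out.print", "system.out.printf",
             "puts", "putchar", "fputs"],
    "op7": ["new", "malloc"],
    "op14": ["ret", "return"],
}

# Invert the table: identifier -> op-code.
IDENTIFIER_TO_OP = {
    name: op for op, names in OP_CODES.items() for name in names
}


def top_identifiers(token_streams, n):
    """Return the n most frequent tokens over all documents,
    in non-increasing order of frequency."""
    counts = Counter(tok for stream in token_streams for tok in stream)
    return [tok for tok, _ in counts.most_common(n)]


def normalize(tokens):
    """Replace every mapped identifier with its op-code."""
    return [IDENTIFIER_TO_OP.get(tok, tok) for tok in tokens]
```
</p><p>In this sketch top_identifiers only produces the candidate list; deciding which candidates share an op-code remains a manual step, as in our system.</p></div>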
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3">Removing stopwords</head><p>The input to this approach is the output from the preprocessing of the normalization approach. All op-codes which were generated in the previous stage are removed. This can also be seen as using a stop-word list consisting of the op-codes generated earlier.</p></div>
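<div xmlns="http://www.tei-c.org/ns/1.0"><p>The removal step can be illustrated with a short sketch. It assumes that op-codes have the form "op" followed by digits, as in Table 1; the function name remove_op_codes is hypothetical.</p><p>
```python
import re

# Assumption: op-codes look like "op" followed by digits, as in Table 1.
OP_CODE_PATTERN = re.compile(r"op\d+")


def remove_op_codes(tokens):
    """Drop every op-code token, treating the op-codes as a stop-word list."""
    return [tok for tok in tokens if not OP_CODE_PATTERN.fullmatch(tok)]
```
</p><p>A source identifier that happens to be spelled like an op-code would also be dropped by this sketch, so a real implementation might prefer op-codes that cannot collide with identifiers.</p></div>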
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Vectorizer</head><p>This stage receives as input a set of tokenized documents. Each document is converted to a vector. The features chosen to create the vector are word 1-grams and 2-grams. The weighting factor used is term frequency-inverse document frequency (tf-idf).</p><formula xml:id="formula_0">\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f(t, d)}{\max\{f(t', d) : t' \in d\}}<label>(1)</label></formula><p>The tf-idf weight of a term increases proportionally to the number of times the term appears in a document, but is offset by the term's frequency in the corpus. We require the offsetting weights because a certain set of programming language constructs is used a very high number of times, almost always in programs which do very different things. Inverse document frequency (idf) serves this purpose. The output of this stage is a set of tf-idf vectors, each vector representing a document.</p><p>We group the output into a training set and a testing set. The training set is passed to the similarity phase. The testing set is passed to the deciding phase. </p></div>
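<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighting can be sketched as below. The term frequency follows Equation (1); the idf variant, log(N / df), is an assumption, since the exact idf formula is not stated here. Vectors are stored as sparse dictionaries, and the function names are hypothetical.</p><p>
```python
import math
from collections import Counter


def augmented_tf(tokens):
    """tf(t, d) = 0.5 + 0.5 * f(t, d) / max f(t', d), as in Equation (1)."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}


def tfidf_vectors(documents):
    """Map each non-empty token list to a sparse {term: weight} vector.
    Assumption: idf(t) = log(N / df(t)); the idf variant is not given."""
    n_docs = len(documents)
    doc_freq = Counter(t for doc in documents for t in set(doc))
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in augmented_tf(doc).items()}
        for doc in documents
    ]
```
</p><p>With this idf, a term that occurs in every document receives weight zero, which is the offsetting effect described above. Word 2-grams could be handled by appending joined adjacent tokens to each token list before calling tfidf_vectors.</p></div>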
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Similarity Phase</head><p>The input to this stage is a set of vectors representing the documents in the training set. The set of vectors corresponding to the training set was divided into sets corresponding to C files and Java files. A cross product was taken between these two sets. This cross product set represents every possible (C, Java) pair from the training data, that is, comparing every C file with every Java file. For every pair, the cosine similarity is computed.</p><formula xml:id="formula_1">\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}<label>(2)</label></formula><p>As mentioned in Section 2, we know that a (C, Java) file pair is plagiarized if and only if both files have the same file number. In any other case, the pair is not plagiarized. We use the above statistic to record the mean and median cosine similarity values for all plagiarized cases and all non-plagiarized cases.</p></div>
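<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the similarity phase for the plagiarized cases, assuming sparse {term: weight} vectors keyed by file number. The function names and the convention of returning 0 for an all-zero vector are choices made for this illustration.</p><p>
```python
import math
from statistics import mean, median


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def plagiarized_statistics(c_vectors, java_vectors):
    """c_vectors and java_vectors map a file number to its vector.
    Pairs sharing a file number are the plagiarized cases (Section 2)."""
    shared = sorted(set(c_vectors).intersection(java_vectors))
    sims = [cosine_similarity(c_vectors[k], java_vectors[k]) for k in shared]
    return mean(sims), median(sims)
```
</p><p>The same cosine_similarity function can be applied to the pairs with differing file numbers to obtain the statistics for the non-plagiarized cases.</p></div>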
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Deciding Phase</head><p>The input to this stage is a set of vectors from the testing data and similarity statistics, such as the mean or median for the plagiarized cases, from the similarity phase. The set of vectors corresponding to the test data was divided into sets corresponding to C files and Java files. A cross product is taken between these two sets. This cross product set represents every possible (C, Java) pair from the test data. For every pair in the cross product set, cosine similarity is computed. The mean or median obtained from the similarity phase is chosen as a threshold, and any pair whose similarity is above this threshold is considered plagiarized. For all three runs, we chose the mean value from the previous phase as the threshold. The reason for choosing the mean over the median is given below.</p></div>
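<div xmlns="http://www.tei-c.org/ns/1.0"><p>The deciding phase can be sketched as a thresholded cross product. This is an illustration under the assumption of sparse {term: weight} vectors keyed by file name; decide, dot and cosine are hypothetical helper names.</p><p>
```python
from itertools import product


def dot(a, b):
    """Dot product of two sparse {term: weight} vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector has zero norm."""
    denom = (dot(a, a) * dot(b, b)) ** 0.5
    return dot(a, b) / denom if denom else 0.0


def decide(c_vectors, java_vectors, threshold):
    """Return every (C file, Java file) pair whose cosine similarity
    is above the threshold taken from the similarity phase."""
    return [
        (c_name, j_name)
        for (c_name, c_vec), (j_name, j_vec)
        in product(c_vectors.items(), java_vectors.items())
        if cosine(c_vec, j_vec) > threshold
    ]
```
</p><p>Calling decide with the mean similarity of the plagiarized training pairs as threshold corresponds to the setting used for the three runs.</p></div>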
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.1">Choosing threshold for deciding</head><p>The statistics in Table <ref type="table" target="#tab_1">2</ref> are results from the baseline approach. The input documents were the training set itself. There were 599 × 599 (C, Java) pairs. The number of plagiarized cases was 599. We measured the mean and median values of cosine similarity for all the plagiarized cases.</p><p>We see that using the mean as threshold gives us slightly lower precision, but a far better recall, and therefore a better F1 value. The number of false positives is higher when using the mean as threshold, but the difference between the two thresholds (51 false positives) is negligible when expressed as a fraction of the 599 × 599 pairs (about 0.00014). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">RESULTS AND ANALYSIS</head><p>Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved.</p><p>F1 score is the harmonic mean of precision and recall. Since it takes both precision and recall into equal consideration, it's an overall measure of relevance.</p><p>As we can see in the results in Table <ref type="table" target="#tab_2">3</ref>, precision of all the three runs is 1. This means all the retrieved results are relevant i.e., there are no false positives.</p><p>Next, we compare the different approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Comparison between baseline and normalization</head><p>The reuse cases of &lt;345.c and 033.java&gt;, &lt;351.c and 043.java&gt; and &lt;368.c and 061.java&gt; are detected in normalization but not in baseline. Our reasoning about this is as follows: since the baseline applies no further transformation to the tokens obtained from the lexer after preprocessing, certain tokens in C and Java that have the same meaning (perform similar actions) might not have been considered the same. This reduces the cosine similarity between the files under consideration.</p><p>For example, the statements strcpy(url, "") (in 345.c) and url = "" (in 033.java) do the same thing, but they are not considered the same by the lexer. The reuse case of &lt;312.c and 005.java&gt; is detected in baseline but not in normalization. Since only the top-n positions of the frequency statistics regarding commonly used identifiers/keywords in C and Java were considered for the normalization, it would have missed out on certain identifiers/keywords that perform the same actions in both C and Java.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Comparison between baseline and stop-word removal</head><p>The reuse case of &lt;337.c and 034.java&gt; is detected in baseline but not in stop-word removal, for reasons similar to those mentioned above (recall that the output of the normalization preprocessing is fed to the stop-word removal preprocessing stage).</p><p>The reuse cases of &lt;321.c and 049.java&gt;, &lt;331.c and 065.java&gt;, &lt;345.c and 033.java&gt;, &lt;351.c and 043.java&gt;, &lt;359.c and 029.java&gt;, &lt;368.c and 051.java&gt;, &lt;373.c and 058.java&gt;, &lt;374.c and 006.java&gt;, &lt;375.c and 042.java&gt; and &lt;376.c and 074.java&gt; are all detected in stop-word removal but not in baseline. A possible reason is that, since the op-codes are removed and a direct mapping between the identifiers or keywords of both languages is done, there is a higher possibility of matching similar constructs and identifiers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Comparison between normalization and stop-word removal</head><p>The reuse cases in bold letters above along with &lt;312.c and 005.java&gt; are all detected in the stop-word removal run but not in the normalization run. The reuse case of &lt;337.c and 034.java&gt; is detected in normalization but not in stop-word removal. This may be because in normalization, there are certain op-codes for similar identifiers. In stop-word removal, we remove the op-codes. If a large number of identifiers across both languages have similar meaning, it might become difficult or inaccurate to keep track of them without using op-codes. (If we are using op-codes, all of them will have a single op-code.)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">FUTURE SCOPE AND CONCLUSION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Improving recall</head><p>In order to improve the recall, we may do the following. A combination of the methods used for normalization and stop-word removal can be used: in cases where a large number of similar identifiers exist, an op-code can be used, and in others, direct mapping can be used. More identifiers can be grouped under one op-code. For example, whenever we encounter an identifier that has the same meaning as the identifiers in an already defined set, it can be added to the set. The value of 'n' used to select the top-n identifiers by frequency can be varied.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Improving the normalization and stop-word removal procedure</head><p>We mention certain aspects which were lacking in our system and possible suggestions on how it can be improved in the future. These suggestions keep in mind how the system can be made generic, that is, able to support as many language pairs as possible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1">Same language normalization automation</head><p>Most programming languages provide many functions or identifiers to do the same thing. For example, C provides printf, fprintf and puts to output a string. Contextual information could be used to automate the process of detecting such functions or identifiers. Once contextually similar pairs are detected, they may be assigned op-codes if they are indeed doing similar things.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2">Cross language normalization automation</head><p>The approach of Section 6.2.1 may be used to decide whether it would be feasible to automate the process of generating op-codes across languages for similar operations.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Normalization List</figDesc><table><row><cell>Op-Code</cell><cell>Identifiers assigned to Op-Code</cell></row><row><cell>op1</cell><cell>len,strlen,length,size</cell></row><row><cell>op2</cell><cell>stdio,stdlib,system</cell></row><row><cell>op3</cell><cell>size,sizeof</cell></row><row><cell>op4</cell><cell>struct,typedef,class,object</cell></row><row><cell>op5</cell><cell>string,str,StringFunctions</cell></row><row><cell>op6</cell><cell>list,iter</cell></row><row><cell>op7</cell><cell>new,malloc</cell></row><row><cell>op8</cell><cell>argv,argValue</cell></row><row><cell>op9</cell><cell>rand</cell></row><row><cell>op10</cell><cell>argc,args</cell></row><row><cell>op11</cell><cell>pthis,pthread</cell></row><row><cell>op12</cell><cell>print, fprintf, printf, sprintf, println, System.out.println, System.out.print, System.out.printf, puts, putchar, fputs</cell></row><row><cell>op13</cell><cell>array,charAt</cell></row><row><cell>op14</cell><cell>ret,return</cell></row><row><cell>op15</cell><cell>file,fd</cell></row><row><cell>op16</cell><cell>int,integer</cell></row><row><cell>op17</cell><cell>char,character</cell></row><row><cell>op18</cell><cell>bool,Boolean,boolean</cell></row><row><cell>op19</cell><cell>float,Float</cell></row><row><cell>op20</cell><cell>scanf,scanner,gets,getch,getchar</cell></row></table><note>All occurrences of printf and System.out.println are replaced with op12. Refer to Table 1 for a full list of such replacements. The output from the stage is fed to the vectorizer.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Comparison of mean vs. median for deciding threshold</figDesc><table><row><cell></cell><cell>Mean</cell><cell>Median</cell></row><row><cell>Threshold (similarity value)</cell><cell>0.644</cell><cell>0.776</cell></row><row><cell>False positives</cell><cell>139</cell><cell>88</cell></row><row><cell>F1</cell><cell>0.450</cell><cell>0.324</cell></row><row><cell>Precision</cell><cell>0.738</cell><cell>0.775</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Results for the cross language collection</figDesc><table><row><cell>Preprocessing approach</cell><cell>F1</cell><cell>Precision</cell><cell>Recall</cell></row><row><cell>Baseline</cell><cell>0.683</cell><cell>1.000</cell><cell>0.519</cell></row><row><cell>Normalization</cell><cell>0.697</cell><cell>1.000</cell><cell>0.534</cell></row><row><cell>Removing stopwords</cell><cell>0.740</cell><cell>1.000</cell><cell>0.603</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">PAN@FIRE 2015: Overview of CL-SOCO Track on the Detection of Cross-Language SOurce COde Re-use</title>
		<author>
			<persName><forename type="first">E</forename><surname>Flores</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Moreno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Villatoro-Tello</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Forum for Information Retrieval Evaluation (FIRE 2015)</title>
				<meeting>the Seventh Forum for Information Retrieval Evaluation (FIRE 2015)<address><addrLine>Gandhinagar, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-12">December 2015</date>
			<biblScope unit="page" from="4" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Plagiarism Detection</title>
		<ptr target="https://theory.stanford.edu/aiken/moss/" />
		<imprint>
			<date type="published" when="2015-10-18">2015-10-18</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Finding plagiarisms among a set of programs with JPlag</title>
		<author>
			<persName><forename type="first">L</forename><surname>Prechelt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Malpohl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Philippsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Universal Computer Science</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1016" to="1038" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">On Automatic Plagiarism Detection Based on n-Grams Comparison</title>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-642-00958-7_69</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval (ECIR &apos;09)</title>
				<editor>
			<persName><forename type="first">Mohand</forename><surname>Boughanem</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Catherine</forename><surname>Berrut</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Josiane</forename><surname>Mothe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Chantal</forename><surname>Soule-Dupuy</surname></persName>
		</editor>
		<meeting>the 31th European Conference on IR Research on Advances in Information Retrieval (ECIR &apos;09)<address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="696" to="700" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">DCU@FIRE-2014: An Information Retrieval Approach for Source Code Plagiarism Detection</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ganguly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jones</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Plagiarism detection across programming languages</title>
		<author>
			<persName><forename type="first">C</forename><surname>Arwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M M</forename><surname>Tahaghoghi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th Australasian Computer Science Conference</title>
				<meeting>the 29th Australasian Computer Science Conference</meeting>
		<imprint>
			<publisher>Australian Computer Society, Inc</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">48</biblScope>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
