<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Overview of the 1st International Competition on Plagiarism Detection *</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Martin</forename><surname>Potthast</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Eiselt</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alberto</forename><surname>Barrón-Cedeño</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Overview of the 1st International Competition on Plagiarism Detection *</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">DCF49AF8DE76D19ECF57EA58CFC3EF53</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Plagiarism Detection</term>
					<term>Competition</term>
					<term>Evaluation Framework</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism detection and intrinsic plagiarism detection, which were tackled by 13 participating groups.</p><p>An important by-product of the competition is an evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and detection quality measures. The framework may serve as a unified test environment to compare future plagiarism detection research. In this paper we describe the corpus design and the quality measures, survey the detection approaches developed by the participants, and compile the achieved performance results of the competitors.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Plagiarism and its automatic retrieval have attracted considerable attention from research and industry: various papers have been published on the topic, and many commercial software systems are being developed. However, when asked to name the best algorithm or the best system for plagiarism detection, hardly any evidence can be found to make an educated guess among the alternatives. One reason for this is that the research field of plagiarism detection lacks a controlled evaluation environment. This leads researchers to devise their own experimentation and methodologies, which are often not reproducible or comparable across papers. Furterhmore, it is unknown which detection quality can at least be expected from a plagiarism detection system.</p><p>To close this gap we have organized an international competition on plagiarism detection. We have set up, presumably for the first time, a controlled evaluation environment for plagiarism detection which consists of a largescale corpus of artificial plagiarism and de-</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Related Work</head><p>Research on plagiarism detection has been surveyed by <ref type="bibr" target="#b7">Maurer, Kappe, and Zaka (2006)</ref> and <ref type="bibr" target="#b2">Clough (2003)</ref>. Particularly the latter provides well thought-out insights into, even today, "[...] new challenges in automatic plagiarism detection", among which the need for a standardized evaluation framework is already mentioned.</p><p>With respect to the evaluation of commercial plagiarism detection systems, <ref type="bibr" target="#b17">Weber-Wulff and Köhler (2008)</ref> have conducted a manual evaluation: 31 handmade cases of plagiarism were submitted to 19 systems. The sources for the plagiarism cases were selected from the Web and the systems were judged by their capability to retrieve them. Due to the use of the Web, the experiment is not controlled which limits reproducibility, and since each case is only about two pages long there are concerns with respect to the study's representativeness. However, com-Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 1-9, 2009. mercial systems are usually not available for a close inspection which may leave no other choice to evaluate them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Plagiarism Detection</head><p>The literature on the subject often puts plagiarism detection on a level with the identification of highly similar sections in texts or other objects. But this does not show the whole picture. From our point of view plagiarism detection divides into two major problem classes, namely external plagiarism detection and intrinsic plagiarism detection.</p><p>Both of which include a number of subproblems and the frequently mentioned step-bystep comparison of two documents is only one of them.</p><p>For external plagiarism detection <ref type="bibr" target="#b15">Stein, Meyer zu Eissen, and Potthast (2007)</ref> introduce a generic three-step retrieval process. The authors consider that the source of a plagiarism case may be hidden in a large reference collection, as well as that the detection results may not be perfectly accurate. Figure <ref type="figure" target="#fig_0">1</ref> illustrates this retrieval process. In fact, all detection approaches submitted by the competition participants can be explained in terms of these building blocks (cf. Section 4).</p><p>The process starts with a suspicious document d q and a collection D of documents from which d q 's author may have plagiarized. Within a so-called heuristic retrieval step a small number of candidate documents D x , which are likely to be sources for plagiarism, are retrieved from D. Note that D is usually very large, e.g., in the size of the Web, so that it is impractical to compare d q one after the other with each document in D. Then, within a so-called detailed analysis step, d q is compared section-wise with the retrieved candidates. All pairs of sections (s q , s x ) with s q ∈ d q and s x ∈ d x , d x ∈ D x , are to be retrieved such that s q and s x have a high similarity under some retrieval model. 
In a knowledge-based post-processing step those sections are filtered for which certain exclusion criteria hold, such as the use of proper citation or literal speech. The remaining suspicious sections are presented to a human, who may decide whether or not a plagiarism offense is given.</p><p>Intrinsic plagiarism detection has been studied in detail by Meyer zu Eissen and <ref type="bibr" target="#b8">Stein (2006)</ref>. In this setting one is given a suspicious document d q but no reference collection D. Technology that tackles instances of this problem class resembles the human ability to spot potential cases of plagiarism just by reading d q .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Competition Agenda</head><p>We have set up a large-scale corpus (D q , D, S) of "artificial plagiarism" cases for the competition, where D q is a collection of suspicious documents, D is a collection of source documents, and S is the set of annotations of all plagiarism cases between D q and D. The competition divided into two tasks and into two phases for which the corpus was split up into 4 parts; one part for each combination of tasks and phases. For simplicity the sub-corpora are not denoted by different symbols.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Competition tasks and phases:</head><p>• External Plagiarism Detection Task.</p><p>Given D q and D the task is to identify the sections in D q which are plagiarized, and their source sections in D.</p><p>• Intrinsic Plagiarism Detection Task.</p><p>Given only D q the task is to identify the plagiarized sections.</p><p>• Training Phase. Release of a training corpus (D q , D, S) to allow for the development of a plagiarism detection system.</p><p>• Competition Phase. Release of a competition corpus (D q , D) whose plagiarism cases were to be detected and submitted as detection annotations, R.</p><p>Participants were allowed to compete in either of the two tasks or both. After the competition phase the participants' detections were evaluated, and the winner of each task as well as an overall winner was determined as that participant whose detections R best matched S in the respective competition corpora.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Plagiarism Corpus</head><p>The PAN plagiarism corpus, PAN-PC-09, comprises 41 223 text documents in which 94 202 cases of artificial plagiarism have been inserted automatically (Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia, 2009). The corpus is based on 22 874 book-length documents from the Project Gutenberg.<ref type="foot" target="#foot_0">1</ref> All documents are, to the best of our knowledge, public domain; therefore the corpus is available free of charge to other researchers. Important parameters of the corpus are the following:</p><p>• Document Length. 50% of the documents are small (1-10 pages), 35% medium (10-100 pages), and 15% large (100-1000 pages).</p><p>• Suspicious-to-Source Ratio. 50% of the documents are designated as suspicious documents D q , and 50% are designated as source documents D (see Figure <ref type="figure" target="#fig_1">2</ref>).</p><p>• Plagiarism Percentage. The percentage θ of plagiarism per suspicious document d q ∈ D q ranges from 0% to 100%, whereas 50% of the suspicious documents contain no plagiarism at all. Figure <ref type="figure" target="#fig_3">3</ref> shows the distribution of the plagiarized documents for the external test corpus. For the intrinsic test corpus applies the hashed part of the distribution.</p><p>• Plagiarism Length. The length of a plagiarism case is evenly distributed between 50 words and 5000 words. • Plagiarism Languages. 90% of the cases are monolingual English plagiarism, the remainder of the cases are cross-lingual plagiarism which were translated automatically from German and Spanish to English.</p><p>• Plagiarism Obfuscation. The monolingual portion of the plagiarism in the external test corpus was obfuscated (cf. Section 2.1). 
The degree of obfuscation ranges evenly from none to high.</p><p>Note that for the estimation of the parameter distributions one cannot fall back on large case studies on real plagiarism cases. Hence, we decided to construct more simple cases than complex ones, where "simple" refers to short lengths, a small percentage θ, and less obfuscation. However, complex cases are overrepresented to allow for a better judgement whether a system detects them properly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Obfuscation Synthesis</head><p>Plagiarists often modify or rewrite the sections they copy in order to obfuscate the plagiarism. In this respect, the automatic synthesis of plagiarism obfuscation we applied when constructing the corpus is of particular interest. The respective synthesis task reads  as follows: given a section of text s x , create a section s q which has a high content similarity to s x under some retrieval model but with a (substantially) different wording than s x .</p><p>An optimal obfuscation synthesizer, i.e., an automatic plagiarist, takes an s x and creates an s q which is human-readable and which creates the same ideas in mind as s x does when read by a human. Today, such a synthesizer cannot be constructed. Therefore, we approach the task from the basic understanding of content similarity in information retrieval, namely the bag-of-words model. By allowing our obfuscation synthesizers to construct texts which are not necessarily human-readable they can be greatly simplified. We have set up three heuristics to construct s q from s x :</p><p>• Random Text Operations. Given s x , s q is created by shuffling, removing, inserting, or replacing words or short phrases at random. Insertions and replacements are, for instance, taken from the document d q , the new context of s q .</p><p>• Semantic Word Variation. Given s x , s q is created by replacing each word by one of its synonyms, antonyms, hyponyms, or hypernyms, chosen at random. A word is retained if neither are available.</p><p>• POS-preserving Word Shuffling. Given s x its sequence of parts of speech (POS) is determined. Then, s q is created by shuffling words at random while the original POS sequence is maintained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Critical Remarks</head><p>The corpus has been conceived and constructed only just in time for the competition so that there may still be errors in it. For instance, the participants pointed out that there are a number of unintended overlaps between unrelated documents. These accidental similarities do not occur frequently, so that an additional set of annotations solves this problem.</p><p>The obfuscation synthesizer based on random text operations produces anomalies in some of the obfuscated texts, such as sequences of punctuation marks and stop words. These issues were not entirely resolved so that it is possible to find some of the plagiarism cases by applying a kind of anomaly detection. Nevertheless, this was not observed during the competition.</p><p>Finally, by construction the corpus does not accurately simulate a heuristic retrieval situation in which the Web is used as reference collection. The source documents in the corpus do not resemble the Web appropriately. Note, however, that sampling the Web is also a problem for many ranking evaluation frameworks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Detection Quality Measures</head><p>A measure that quantifies the performance of a plagiarism detection algorithm will resemble concepts in terms of precision and recall. However, these concepts cannot be transferred one-to-one from the classical information retrieval situation to plagiarism detection. This section explains the underlying connections and introduces a reasonable measure that accounts for the particularities.</p><p>Let d q be a plagiarized document; d q defines a sequence of characters each of which is either labeled as plagiarized or nonplagiarized. A plagiarized section s forms a contiguous sequence of plagiarized characters in d q . The set of all plagiarized sections in d q is denoted by S, where ∀s i , s j ∈ S : i = j → (s i ∩ s j = ∅), i.e., the plagiarized sections do not intersect. Likewise, the set of all sections r ⊂ d q found by a plagiarism detection algorithm is denoted by R. See Figure <ref type="figure" target="#fig_4">4</ref> for an illustration. If the characters in d q are considered as basic retrieval units, precision and recall for a given d q , S, R compute straightforwardly. This view may be called micro-averaged or system-oriented. For the situation shown in Figure <ref type="figure" target="#fig_4">4</ref> the micro-averaged precision is 8/16, likewise, the micro-averaged recall is 8/13. The advantage of a micro-averaged view is its clear computational semantics, which comes at a price: given an imbalance in the lengths of the elements in S-which usually correlates with the detection difficulty of a plagiarized section-the explanatory power of the computed measures is limited.</p><p>It is more natural to treat the contiguous sequences of plagiarized characters as basic retrieval units. In this sense each s i ∈ S defines a query q i for which a plagiarism detection algorithm returns a result set R i ⊆ R. This view may be called macro-averaged or user-oriented. 
The recall of a plagiarism detection algorithm, r ec PDA , is then defined as the mean of the returned fractions of the plagiarized sections, averaged over all sections in S:</p><formula xml:id="formula_0">r ec PDA (S, R) = 1 |S| s∈S |s r∈R r| |s| ,<label>(1)</label></formula><p>where computes the positionally overlapping characters. Problem 1. The precision of a plagiarism detection algorithm is not defined under the macro-averaged view, which is rooted in the fact that a detection algorithm does not return a unique result set for each plagiarized section s ∈ S. This deficit can be resolved by switching the reference basis. Instead of the plagiarized sections, S, the algorithmically determined sections, R, become the targets: the precision with which the queries in S are answered is identified with the recall of R under S.<ref type="foot" target="#foot_2">2</ref> By computing the mean average over the r ∈ R one obtains a definite computation rule that captures the concept of retrieval precision for S:</p><formula xml:id="formula_1">prec PDA (S, R) = 1 |R| r∈R |r s∈S s| |r| ,<label>(2)</label></formula><p>where computes the positionally overlapping characters. The domain of prec PDA is [0, 1]; in particular it can be shown that this definition quantifies the necessary properties of a precision statistic. Problem 2. Both the micro-averaged view and the macro-averaged view are insensitive to the number of times an s ∈ S is detected in a detection result R, i.e., the granularity of R. We define the granularity of R for a set of plagiarized sections S by the average size of the existing covers: a detection r ∈ R belongs to the cover C s of an s ∈ S iff s and r overlap. Let S R ⊆ S denote the set of cases so that for each s ∈ S : |C s | &gt; 0. 
The granularity of R given S is defined as follows:</p><formula xml:id="formula_2">gran PDA (S, R) = 1 |S R | s∈S R |C s |,<label>(3)</label></formula><p>where</p><formula xml:id="formula_3">S R = {s | s ∈ S ∧ ∃r ∈ R : s ∩ r = ∅} and C s = {r | r ∈ R ∧ s ∩ r = ∅}. The domain of the granularity is [1, |R|],</formula><p>where 1 marks the desireable one-to-one correspondence between R and S, and where |R| marks the worst case, when a single s ∈ S is detected over an over again.</p><p>The measures ( <ref type="formula" target="#formula_0">1</ref>), (2), and ( <ref type="formula" target="#formula_2">3</ref>) are combined to an overall score:</p><formula xml:id="formula_4">overall PDA (S, R) = F log 2 (1 + gran PDA )</formula><p>,</p><p>where F denotes the F-Measure, i.e., the harmonic mean of the precision prec PDA and the recall r ec PDA . To smooth the influence of the granularity on the overall score we take its logarithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Survey of Detection Approaches</head><p>For the competition, 13 participants developed plagiarism detection systems to tackle one or both of the tasks external plagiarism detection and intrinsic plagiarism detection. The questions that naturally arise: how do they work and how well? To give an answer, we survey the approaches in a unified way and report on their detection quality in the competition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">External Plagiarism Detection</head><p>Most of the participants competed in the external plagiarism detection task of the competition; detection results were submitted for 10 systems. As it turns out, all systems are based on common approaches-although they perform very differently. As explained at the outset, external plagiarism detection divides into three steps (cf. Figure <ref type="figure" target="#fig_0">1</ref>): the heuristic retrieval step, the detailed analysis step, and the post-processing step. Table <ref type="table" target="#tab_0">1</ref> summarizes the participants' detection approaches in terms of these steps. However, the post-processing step was omitted here since neither of the participants applied noteworthy post-processing. Each row of the table summarizes one system; we restrict the survey to the top 6 systems since the overall performance of the remaining systems is negligible. Nevertheless, these systems also implement the generic three-step process. The focus of this survey is on describing algorithmic and retrieval aspects rather than implementation details. The latter are diverse in terms of applied languages, software, and their runtime efficiency; descriptions can be found in the respective references.</p><p>The heuristic retrieval step (column 1 of Table <ref type="table" target="#tab_0">1</ref>) involves the comparison of the corpus' suspicious documents D q with the source documents D. For this, each participant em- ploys a specific retrieval model, a comparison strategy, and a heuristic to select the candidate documents D x from the D. Most of the participants use a variation of the well-known vector space model (VSM) as retrieval model, whereas, the tokens are often character-or word-n-grams instead of single words. 
As comparison strategy, the top 3 approaches perform an exhaustive comparison of D q and D, i.e., each d q ∈ D q is compared with each</p><formula xml:id="formula_5">d x ∈ D in time O(|D q | • |D|)</formula><p>, while the remaining approaches employ data partitioning and space partitioning technologies to achieve lower runtime complexities. To select the candidate documents D x for a d q either its k nearest neighbors are selected or the documents which exceed a certain similarity threshold.</p><p>The detailed analysis step (column 2 of Table <ref type="table" target="#tab_0">1</ref>) involves the comparison of each d q ∈ D q with its respective candidate documents D x in order to extract pairs of sections (s q , s x ), where s q ∈ d q and s x ∈ d x , d x ∈ D x , from them which are highly similar, if any. For this, each participant first extracts all exact matches between d q and d x and then merges the matches heuristically to form suspicious sections (s q , s x ). While each participant uses the same type of token to extract exact matches as his respective retrieval model of the heuristic retrieval step, the match merging heuristics differ largely from one another. However, it can be said that in most approaches a kind of distance between exact matches is measured first, and then a custom algorithm is employed which clusters them to sections.</p><p>Table <ref type="table" target="#tab_1">2</ref> lists the detection performance results of all approaches, computed with the quality measures introduced in Section 3. Observe that the approach with top precision is the one on rank 6 which is based on fingerprinting, the approach with top recall is the one on rank 2, and the approach with top granularity is the one on rank 1. The latter is also the winner of this task since it provides the best trade off between the three quality measures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Intrinsic Plagiarism Detection</head><p>The intrinsic plagiarism detection task has gathered less attention than external plagiarism detection; detection results were submitted for 4 systems. Table <ref type="table" target="#tab_2">3</ref> lists their detection performance results. Unlike in external plagiarism detection, in this task the baseline performance is not 0. The reason for this is that intrinsic plagiarism detection is a one- </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Generic retrieval process for external plagiarism detection.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of suspicious documents (with and without plagiarism) and source documents.</figDesc><graphic coords="3,375.59,76.27,134.59,103.57" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Distribution of the plagiarism percentage θ in the external test corpus. For the intrinsic test corpus applies the hashed part only.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: A document as character sequence, including plagiarized sections S and detections R returned by a plagiarism detection algorithm. The figure is drawn at scale 1 : n chars, n 1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Unified summary of the detection approaches of the participants.Comparison of Dq and D. ExhaustiveCandidates Dx ⊂ D for a dq. The 10 documents nearest to dq.</figDesc><table><row><cell cols="2">External Plagiarism Detection Approach</cell><cell></cell></row><row><cell>Heuristic Retrieval</cell><cell>Detailed Analysis</cell><cell>Participant</cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Grozea, Gehl, and</cell></row><row><cell>Character-16-gram VSM</cell><cell>Character-16-grams</cell><cell>Popescu (2009)</cell></row><row><cell>(frequency weights, cosine similarity)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Computation of the distances of adjacent</cell><cell></cell></row><row><cell>Exhaustive</cell><cell>matches. Joining of the matches based on a</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. The 51 documents most similar to dq.</cell><cell>Monte Carlo optimization. Refinement of the obtained section pairs, e.g., by discarding too small sections.</cell><cell></cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Kasprzak, Brandejs,</cell></row><row><cell>Word-5-gram VSM</cell><cell>Word-5-grams</cell><cell>and Křipač (2009)</cell></row><row><cell>(boolean weights, Jaccard similarity)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Extraction of the pairs of sections (sq, sx) of</cell><cell></cell></row><row><cell>Exhaustive</cell><cell>maximal size which share at least 20</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. 
Documents which share at least 20 n-grams with dq.</cell><cell>matches, including the first and the last n-gram of sq and sx, and for which 2 adjacent matches are at most 49 not-matching n-grams apart.</cell><cell></cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Basile et al. (2009)</cell></row><row><cell>Word-8-gram VSM</cell><cell>Word-8-grams</cell><cell></cell></row><row><cell>(frequency weights, custom distance)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell></cell><cell>Extraction of the pairs of sections (sq, sx)</cell><cell></cell></row><row><cell></cell><cell>which are obtained by greedily joining</cell><cell></cell></row><row><cell></cell><cell>consecutive matches if their distance is not</cell><cell></cell></row><row><cell></cell><cell>too high.</cell><cell></cell></row><row><cell cols="3">Using the commercial system Plagiarism Detector (http://plagiarism-detector.com) Palkovskii, Belov,</cell></row><row><cell></cell><cell></cell><cell>and Muzika (2009)</cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Muhr et al. (2009)</cell></row><row><cell>Word-1-gram VSM</cell><cell>Sentences</cell><cell></cell></row><row><cell>(frequency weights, cosine similarity)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Extraction of the pairs of sections (sq, sx)</cell><cell></cell></row><row><cell>Clustering-based data-partitioning of</cell><cell>which are obtained by greedily joining</cell><cell></cell></row><row><cell>D's sentences. Comparison of Dq's</cell><cell>consecutive sentences. Gaps are allowed if</cell><cell></cell></row><row><cell>sentences with each partitions' centroid.</cell><cell>the respective sentences are similar to the</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. 
For each sentence of dq, the documents</cell><cell>corresponding sentences in the other document.</cell><cell></cell></row><row><cell>from the 2 most similar partitions which</cell><cell></cell><cell></cell></row><row><cell>share similar sentences.</cell><cell></cell><cell></cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Scherbinin and</cell></row><row><cell>Winnowing fingerprinting</cell><cell>Fingerprint chunks</cell><cell>Butakov (2009)</cell></row><row><cell>50 char chunks with 30 char overlap</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Extraction of the pairs of sections (sq, sx)</cell><cell></cell></row><row><cell>Exhaustive</cell><cell>which are obtained by enlarging matches</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. Documents whose fingerprints share at</cell><cell>and joining adjacent matches. Gaps must be below a certain Levenshtein distance.</cell><cell></cell></row><row><cell>least one value with dq's fingerprint.</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Performance results for the external plagiarism detection task.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="3">External Detection Quality</cell></row><row><cell cols="2">Rank Overall</cell><cell>F</cell><cell cols="4">Precision Recall Granularity Participant</cell></row><row><cell>1</cell><cell cols="2">0.6957 0.6976</cell><cell>0.7418</cell><cell>0.6585</cell><cell>1.0038</cell><cell>Grozea, Gehl, and Popescu (2009)</cell></row><row><cell>2</cell><cell cols="2">0.6093 0.6192</cell><cell>0.5573</cell><cell>0.6967</cell><cell>1.0228</cell><cell>Kasprzak, Brandejs, and Křipač (2009)</cell></row><row><cell>3</cell><cell cols="2">0.6041 0.6491</cell><cell>0.6727</cell><cell>0.6272</cell><cell>1.1060</cell><cell>Basile et al. (2009)</cell></row><row><cell>4</cell><cell cols="2">0.3045 0.5286</cell><cell>0.6689</cell><cell>0.4370</cell><cell>2.3317</cell><cell>Palkovskii, Belov, and Muzika (2009)</cell></row><row><cell>5</cell><cell cols="2">0.1885 0.4603</cell><cell>0.6051</cell><cell>0.3714</cell><cell>4.4354</cell><cell>Muhr et al. 
(2009)</cell></row><row><cell>6</cell><cell cols="2">0.1422 0.6190</cell><cell>0.7473</cell><cell>0.5284</cell><cell>19.4327</cell><cell>Scherbinin and Butakov (2009)</cell></row><row><cell>7</cell><cell cols="2">0.0649 0.1736</cell><cell>0.6552</cell><cell>0.1001</cell><cell>5.3966</cell><cell>Pereira, Moreira, and Galante (2009)</cell></row><row><cell>8</cell><cell cols="2">0.0264 0.0265</cell><cell>0.0136</cell><cell>0.4586</cell><cell>1.0068</cell><cell>Vallés Balaguer (2009)</cell></row><row><cell>9</cell><cell cols="2">0.0187 0.0553</cell><cell>0.0290</cell><cell>0.6048</cell><cell>6.7780</cell><cell>Malcolm and Lane (2009)</cell></row><row><cell>10</cell><cell cols="2">0.0117 0.0226</cell><cell>0.3684</cell><cell>0.0116</cell><cell>2.8256</cell><cell>Allen (2009)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Performance results for the intrinsic plagiarism detection task.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="3">Intrinsic Detection Quality</cell></row><row><cell cols="2">Rank Overall</cell><cell>F</cell><cell cols="4">Precision Recall Granularity Participant</cell></row><row><cell>1</cell><cell cols="2">0.2462 0.3086</cell><cell>0.2321</cell><cell>0.4607</cell><cell>1.3839</cell><cell>Stamatatos (2009)</cell></row><row><cell>2</cell><cell cols="2">0.1955 0.1956</cell><cell>0.1091</cell><cell>0.9437</cell><cell>1.0007</cell><cell>Hagbi and Koppel (2009)</cell><cell>(Baseline)</cell></row><row><cell>3</cell><cell cols="2">0.1766 0.2286</cell><cell>0.1968</cell><cell>0.2724</cell><cell>1.4524</cell><cell>Muhr et al. (2009)</cell></row><row><cell>4</cell><cell cols="2">0.1219 0.1750</cell><cell>0.1036</cell><cell>0.5630</cell><cell>1.7049</cell><cell><ref type="bibr" target="#b12">Seaward and Matwin (2009)</ref></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.gutenberg.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">In<ref type="bibr" target="#b14">(Stein, 2007)</ref> this idea is mathematically derived as "precision stress" and "recall stress".</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>We thank Yahoo! Research and the University of the Basque Country for their sponsorship. This work was also partially funded by the Text-Enterprise 2.0 TIN2009-13391-C04-03 project and the CONACYT-MEXICO 192021 grant. Our special thanks go to the participants of the competition for their devoted work.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>class classification problem in which it has to be decided for each section of a document whether or not it is plagiarized. The baseline performance in such problems is commonly computed under the naive assumption that everything belongs to the target class, which is also what <ref type="bibr" target="#b4">Hagbi and Koppel (2009)</ref> did: they classified almost everything as plagiarized. Interestingly, this baseline approach ranks second, and two approaches perform worse than it; only the approach of <ref type="bibr" target="#b13">Stamatatos (2009)</ref> performs better than the baseline.</p></div>
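The effect of this naive baseline can be made concrete with a small sketch (hypothetical per-character labels for illustration, not the competition's actual evaluation code): flagging everything as plagiarized drives recall to 1 while precision collapses to the true plagiarism fraction, which mirrors the baseline's high recall and low precision in Table 3.

```python
# "Everything is plagiarized" baseline on hypothetical per-character labels.
truth = [False] * 90 + [True] * 10   # assume 10% of the text is plagiarized
pred = [True] * 100                  # baseline: flag every character

tp = sum(t and p for t, p in zip(truth, pred))
precision = tp / sum(pred)           # 10 / 100 = 0.1, the plagiarism fraction
recall = tp / sum(truth)             # 10 / 10  = 1.0 by construction
print(precision, recall)
```

Any detector that flags less than everything trades some of that guaranteed recall for precision, which is exactly the trade-off the ranking in Table 3 reflects.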
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Overall Detection Results</head><p>To determine the overall winner of the competition, we computed the combined detection performance of each participant on the competition corpora of both tasks. Table <ref type="table">4</ref> shows the results. Note that the competition corpus of the external plagiarism detection task is considerably larger than that of the intrinsic plagiarism detection task, which is why the top-ranked approaches are those that performed best in the former task. The overall winner of the competition is the approach of <ref type="bibr" target="#b3">Grozea, Gehl, and Popescu (2009)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Summary</head><p>The 1st International Competition on Plagiarism Detection fostered research and yielded a number of new insights into the problems of automatic plagiarism detection and its evaluation. An important by-product of the competition is a controlled large-scale evaluation framework consisting of a corpus of artificial plagiarism cases and new detection quality measures. The corpus contains more than 40 000 documents and about 94 000 cases of plagiarism. Furthermore, in this paper we give a comprehensive overview of the competition and in particular of the plagiarism detection approaches of the competition's 13 participants. It turns out that all of the detection approaches follow a generic retrieval process scheme consisting of three steps: heuristic retrieval, detailed analysis, and knowledge-based post-processing. To substantiate this, we have compiled a unified summary of the top approaches in Table <ref type="table">1</ref>.</p><p>The competition was divided into two tasks: external plagiarism detection and intrinsic plagiarism detection. The winning approach for the former task achieves 0.74 precision at 0.65 recall at 1.00 granularity. The winning approach for the latter task improves 26% upon the baseline and achieves 0.23 precision at 0.46 recall at 1.38 granularity.</p></div>			</div>
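The overall scores in Tables 2 and 3 combine the F-measure with the granularity; a formula consistent with every row of Table 2 (inferred from the published numbers, not quoted from the organizers' code) is overall = F / log2(1 + granularity), so a granularity of 1 leaves F unchanged and coarser detections are penalized logarithmically. A minimal Python sketch:

```python
import math

def overall_score(f_measure: float, granularity: float) -> float:
    """Combine F-measure and granularity into one rank-determining score.

    Assumed form: F / log2(1 + granularity). With granularity 1 the score
    equals F; larger granularity (fragmented detections) lowers the score.
    """
    return f_measure / math.log2(1 + granularity)

# Top-ranked external detection entry (F = 0.6976, granularity = 1.0038):
print(round(overall_score(0.6976, 1.0038), 4))  # matches the listed 0.6957
```

Checking a few more rows of Table 2 (e.g. F = 0.6190 at granularity 19.4327 gives 0.1422) reproduces the listed overall scores to four decimals.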
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">James</forename><surname>Allen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Dallas, USA</pubPlace>
		</imprint>
		<respStmt>
			<orgName>From the Southern Methodist University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">A Plagiarism Detection Procedure in Three Steps: Selection, Matches and &quot;Squares&quot;</title>
		<author>
			<persName><forename type="first">Chiara</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Benedetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emanuele</forename><surname>Caglioti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giampaolo</forename><surname>Cristadoro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mirko</forename><surname>Degli Esposti</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Old and new challenges in automatic plagiarism detection</title>
		<author>
			<persName><forename type="first">Paul</forename><surname>Clough</surname></persName>
		</author>
		<ptr target="http://ir.shef.ac.uk/cloughie/papers/pasplagiarism.pdf" />
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
		<respStmt>
			<orgName>National UK Plagiarism Advisory Service</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Cristian</forename><surname>Grozea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Gehl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marius</forename><surname>Popescu</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Barak</forename><surname>Hagbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Moshe</forename><surname>Koppel</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Israel</pubPlace>
		</imprint>
		<respStmt>
			<orgName>From the Bar Ilan University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Finding Plagiarism by Evaluating Document Similarities</title>
		<author>
			<persName><forename type="first">Jan</forename><surname>Kasprzak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michal</forename><surname>Brandejs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miroslav</forename><surname>Křipač</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Tackling the PAN&apos;09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector</title>
		<author>
			<persName><forename type="first">James</forename><forename type="middle">A</forename><surname>Malcolm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><forename type="middle">C R</forename><surname>Lane</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Plagiarism -a survey</title>
		<author>
			<persName><forename type="first">Hermann</forename><surname>Maurer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Kappe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bilal</forename><surname>Zaka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Universal Computer Science</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1050" to="1084" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Intrinsic plagiarism detection</title>
		<author>
			<persName><forename type="first">Sven</forename><surname>Meyer Zu Eissen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Information Retrieval (ECIR 2006)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">Mounia</forename><surname>Lalmas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Andy</forename><surname>Macfarlane</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Stefan</forename><forename type="middle">M</forename><surname>Rüger</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Anastasios</forename><surname>Tombros</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Theodora</forename><surname>Tsikrika</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Alexei</forename><surname>Yavlinsky</surname></persName>
		</editor>
		<meeting>the European Conference on Information Retrieval (ECIR 2006)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">3936</biblScope>
			<biblScope unit="page" from="565" to="569" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">External and Intrinsic Plagiarism Detection Using Vector Space Models</title>
		<author>
			<persName><forename type="first">Markus</forename><surname>Muhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mario</forename><surname>Zechner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roman</forename><surname>Kern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Granitzer</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Yurii</forename><forename type="middle">Anatol'yevich</forename><surname>Palkovskii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexei</forename><forename type="middle">Vitalievich</forename><surname>Belov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Irina</forename><forename type="middle">Alexandrovna</forename><surname>Muzika</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Ukraine</pubPlace>
		</imprint>
		<respStmt>
			<orgName>From the Zhytomyr State University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Rafael</forename><forename type="middle">C</forename><surname>Pereira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Moreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Galante</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
		<respStmt>
			<orgName>From the Universidade Federal do Rio Grande do Sul, Brazil</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Using Microsoft SQL Server Platform for Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Vladislav</forename><surname>Scherbinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Butakov</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Intrinsic Plagiarism Detection Using Complexity Analysis</title>
		<author>
			<persName><forename type="first">Leanne</forename><surname>Seaward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Matwin</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Intrinsic Plagiarism Detection Using Character n-gram Profiles</title>
		<author>
			<persName><forename type="first">Efstathios</forename><surname>Stamatatos</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Principles of hash-based text retrieval</title>
		<author>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th Annual International ACM SIGIR Conference</title>
				<editor>
			<persName><forename type="first">Charles</forename><surname>Clarke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Norbert</forename><surname>Fuhr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Noriko</forename><surname>Kando</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Wessel</forename><surname>Kraaij</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Arjen</forename><surname>De Vries</surname></persName>
		</editor>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007-07">July 2007</date>
			<biblScope unit="page" from="527" to="534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Strategies for Retrieving Plagiarized Documents</title>
		<author>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sven</forename><surname>Meyer Zu Eissen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Potthast</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th Annual International ACM SIGIR Conference</title>
				<editor>
			<persName><forename type="first">Charles</forename><surname>Clarke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Norbert</forename><surname>Fuhr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Noriko</forename><surname>Kando</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Wessel</forename><surname>Kraaij</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Arjen</forename><surname>De Vries</surname></persName>
		</editor>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007-07">July 2007</date>
			<biblScope unit="page" from="825" to="826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m">Proceedings of the SEPLN Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, PAN&apos;09</title>
		<editor>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Efstathios</forename><surname>Stamatatos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Moshe</forename><surname>Koppel</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Eneko</forename><surname>Agirre</surname></persName>
		</editor>
		<meeting><address><addrLine>Donostia-San Sebastián, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Universidad Politécnica de Valencia</publisher>
			<date type="published" when="2009-09-10">September 10, 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Putting Ourselves in SME&apos;s Shoes: Automatic Detection of Plagiarism by the WCopyFind tool</title>
		<author>
			<persName><forename type="first">Enrique</forename><surname>Vallés Balaguer</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Plagiarism detection software test</title>
		<author>
			<persName><forename type="first">Debora</forename><surname>Weber-Wulff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katrin</forename><surname>Köhler</surname></persName>
		</author>
		<ptr target="http://plagiat.htw-berlin.de/software/2008/" />
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia</title>
		<ptr target="http://www.webis.de/research/" />
	</analytic>
	<monogr>
		<title level="m">PAN Plagiarism Corpus PAN-PC-09</title>
				<editor>
			<persName><forename type="first">Martin</forename><surname>Potthast</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Andreas</forename><surname>Eiselt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Alberto</forename><surname>Barrón-Cedeño</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
