<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the 1st International Competition on Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Eiselt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Web Technology &amp; Information Systems Group, Bauhaus-Universität Weimar</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <volume>3936</volume>
      <fpage>1</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism detection and intrinsic plagiarism detection, which were tackled by 13 participating groups. An important by-product of the competition is an evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and detection quality measures. The framework may serve as a unified test environment to compare future plagiarism detection research. In this paper we describe the corpus design and the quality measures, survey the detection approaches developed by the participants, and compile the achieved performance results of the competitors.</p>
      </abstract>
      <kwd-group>
        <kwd>Plagiarism Detection</kwd>
        <kwd>Competition</kwd>
        <kwd>Evaluation Framework</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Plagiarism and its automatic retrieval have
attracted considerable attention from
research and industry: various papers have
been published on the topic, and many
commercial software systems are being
developed. However, when asked to name the best
algorithm or the best system for plagiarism
detection, one finds hardly any evidence
on which to base an educated guess among the
alternatives. One reason for this is that the
research field of plagiarism detection lacks
a controlled evaluation environment. This
leads researchers to devise their own
experimentation and methodologies, which are
often not reproducible or comparable across
papers. Furthermore, it is unknown what
detection quality can at least be expected
from a plagiarism detection system.</p>
      <p>To close this gap we have organized an
international competition on plagiarism
detection. We have set up, presumably for the first
time, a controlled evaluation environment for
plagiarism detection
(http://www.webis.de/research/workshopseries/pan-09)
which consists of a
large-scale corpus of artificial plagiarism and
detection quality measures. In what follows we
overview the corpus, the quality measures,
the participants’ detection approaches, and
their obtained results.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Research on plagiarism detection has been
surveyed by Maurer, Kappe, and Zaka (2006)
and Clough (2003). Particularly the latter
provides well thought-out insights into, even
today, “[...] new challenges in automatic
plagiarism detection”, among which the need for
a standardized evaluation framework is
already mentioned.</p>
      <p>
        With respect to the evaluation of
commercial plagiarism detection systems,
        <xref ref-type="bibr" rid="ref8">Weber-Wulff and Köhler (2008</xref>
        ) have conducted a
manual evaluation: 31 handmade cases of
plagiarism were submitted to 19 systems.
The sources for the plagiarism cases were
selected from the Web and the systems were
judged by their capability to retrieve them.
Due to the use of the Web, the experiment
is not controlled which limits reproducibility,
and since each case is only about two pages
long there are concerns with respect to the
study’s representativeness. However,
commercial systems are usually not available for
a close inspection, which may leave no other
choice to evaluate them.</p>
      <p>[Figure 1: The generic three-step retrieval process for external plagiarism detection: given a suspicious document dq and a reference collection D, heuristic retrieval selects candidate documents, a detailed analysis compares dq with them, and a knowledge-based post-processing filters the resulting suspicious sections.]</p>
    </sec>
    <sec id="sec-3">
      <title>Plagiarism Detection</title>
      <p>The literature on the subject often puts
plagiarism detection on a level with the
identification of highly similar sections in texts
or other objects. But this does not show
the whole picture. From our point of view
plagiarism detection divides into two major
problem classes, namely external plagiarism
detection and intrinsic plagiarism detection.
Both include a number of
subproblems, and the frequently mentioned
step-by-step comparison of two documents is only one
of them.</p>
      <p>For external plagiarism detection Stein,
Meyer zu Eissen, and Potthast (2007)
introduce a generic three-step retrieval process.
The authors consider that the source of a
plagiarism case may be hidden in a large
reference collection, as well as that the detection
results may not be perfectly accurate.
Figure 1 illustrates this retrieval process. In fact,
all detection approaches submitted by the
competition participants can be explained in
terms of these building blocks (cf. Section 4).</p>
      <p>The process starts with a suspicious
document dq and a collection D of documents
from which dq’s author may have plagiarized.
Within a so-called heuristic retrieval step a
small number of candidate documents Dx,
which are likely to be sources for plagiarism,
are retrieved from D. Note that D is usually
very large, e.g., in the size of the Web, so
that it is impractical to compare dq one after
the other with each document in D. Then,
within a so-called detailed analysis step, dq
is compared section-wise with the retrieved
candidates. All pairs of sections (sq, sx) with
sq ∈ dq and sx ∈ dx, dx ∈ Dx, are to be
retrieved such that sq and sx have a high
similarity under some retrieval model. In a
knowledge-based post-processing step those
sections are filtered for which certain
exclusion criteria hold, such as the use of proper
citation or literal speech. The remaining
suspicious sections are presented to a human,
who may decide whether or not a plagiarism
offense is given.</p>
      <p>Intrinsic plagiarism detection has been
studied in detail by Meyer zu Eissen and
Stein (2006). In this setting one is given a
suspicious document dq but no reference
collection D. Technology that tackles instances
of this problem class resembles the human
ability to spot potential cases of plagiarism
just by reading dq.</p>
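      <p>One simple realization of this idea, given as a sketch (this is not the approach of Meyer zu Eissen and Stein; the character-trigram style model, window size, and threshold are assumptions), flags windows of dq whose writing style deviates from the document as a whole:</p>

```python
from collections import Counter
import math

def profile(text: str, n: int = 3) -> Counter:
    # Character n-gram frequencies as a crude style model.
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def dissimilarity(p: Counter, q: Counter) -> float:
    # Cosine dissimilarity between two style profiles.
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return 1.0 - (dot / norm if norm else 0.0)

def suspicious_windows(dq: str, size: int = 200, threshold: float = 0.6):
    # Flag windows whose style deviates strongly from all of dq.
    doc = profile(dq)
    flags = []
    for start in range(0, max(1, len(dq) - size + 1), size):
        window = dq[start:start + size]
        if dissimilarity(profile(window), doc) > threshold:
            flags.append((start, start + len(window)))
    return flags
```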
    </sec>
    <sec id="sec-4">
      <title>Competition Agenda</title>
      <p>We have set up a large-scale corpus
(Dq, D, S) of “artificial plagiarism” cases for
the competition, where Dq is a collection of
suspicious documents, D is a collection of
source documents, and S is the set of
annotations of all plagiarism cases between Dq and
D. The competition was divided into two tasks
and two phases, for which the corpus was
split into four parts, one for each
combination of task and phase. For simplicity
the sub-corpora are not denoted by different
symbols.</p>
      <sec id="sec-4-1">
        <title>Competition tasks and phases:</title>
        <p>• External Plagiarism Detection Task.</p>
        <p>Given Dq and D the task is to identify
the sections in Dq which are plagiarized,
and their source sections in D.
• Intrinsic Plagiarism Detection Task.</p>
        <p>Given only Dq the task is to identify the
plagiarized sections.
• Training Phase. Release of a training
corpus (Dq, D, S) to allow for the
development of a plagiarism detection system.
• Competition Phase. Release of a
competition corpus (Dq, D) whose plagiarism
cases were to be detected and submitted
as detection annotations, R.</p>
        <p>Participants were allowed to compete in
either of the two tasks or both. After the
competition phase the participants’
detections were evaluated, and the winner of each
task as well as an overall winner was
determined as that participant whose detections R
best matched S in the respective competition
corpora.
</p>
        <sec id="sec-4-1-1">
          <title>Plagiarism Corpus</title>
          <p>
            The PAN plagiarism corpus, PAN-PC-09,
comprises 41 223 text documents in which
94 202 cases of artificial plagiarism have been
inserted automatically
            <xref ref-type="bibr" rid="ref3 ref5">(Webis at
Bauhaus-Universität Weimar and NLEL at
Universidad Politécnica de Valencia, 2009)</xref>
            . The
corpus is based on 22 874 book-length
documents from Project Gutenberg. All
documents are, to the best of our knowledge,
public domain; therefore the corpus is
available free of charge to other researchers.
Important parameters of the corpus are the
following:
• Document Length. 50% of the
documents are small (1-10 pages), 35%
medium (10-100 pages), and 15% large
(100-1000 pages).
• Suspicious-to-Source Ratio. 50% of the
documents are designated as suspicious
documents Dq, and 50% are designated
as source documents D (see Figure 2).
• Plagiarism Percentage. The
percentage θ of plagiarism per suspicious
document dq ∈ Dq ranges from 0% to 100%,
whereas 50% of the suspicious
documents contain no plagiarism at all.
Figure 3 shows the distribution of the
plagiarized documents for the external test
corpus. For the intrinsic test corpus, the
hatched part of the distribution applies.
• Plagiarism Length. The length of a
plagiarism case is evenly distributed
between 50 words and 5000 words.
</p>
          <p>[Figure 2: Corpus partition into suspicious documents (with and without plagiarism) and source documents.]</p>
          <p>• Plagiarism Languages. 90% of the cases
are monolingual English plagiarism, the
remainder of the cases are cross-lingual
plagiarism which were translated
automatically from German and Spanish to
English.
• Plagiarism Obfuscation. The
monolingual portion of the plagiarism in the
external test corpus was obfuscated (cf.
Section 2.1). The degree of obfuscation
ranges evenly from none to high.</p>
          <p>Note that for the estimation of the
parameter distributions one cannot fall back on
large case studies on real plagiarism cases.
Hence, we decided to construct more
simple cases than complex ones, where “simple”
refers to short lengths, a small percentage
θ, and less obfuscation. However, complex
cases are overrepresented to allow for a
better judgement of whether a system detects them
properly.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Obfuscation Synthesis</title>
      <p>Plagiarists often modify or rewrite the
sections they copy in order to obfuscate the
plagiarism. In this respect, the automatic
synthesis of plagiarism obfuscation we applied
when constructing the corpus is of particular
interest. The respective synthesis task reads
as follows: given a section of text sx, create a
section sq which has a high content similarity
to sx under some retrieval model but with a
(substantially) different wording than sx.</p>
      <p>An optimal obfuscation synthesizer, i.e.,
an automatic plagiarist, takes an sx and
creates an sq which is human-readable and
which evokes the same ideas as sx
does when read by a human. Today, such
a synthesizer cannot be constructed.
Therefore, we approach the task from the basic
understanding of content similarity in
information retrieval, namely the bag-of-words
model. By allowing our obfuscation
synthesizers to construct texts which are not
necessarily human-readable they can be greatly
simplified. We have set up three heuristics to
construct sq from sx:
• Random Text Operations. Given sx, sq
is created by shuffling, removing,
inserting, or replacing words or short phrases
at random. Insertions and replacements
are, for instance, taken from the
document dq, the new context of sq.
• Semantic Word Variation. Given sx, sq
is created by replacing each word by one
of its synonyms, antonyms, hyponyms,
or hypernyms, chosen at random. A
word is retained if no alternative is available.
• POS-preserving Word Shuffling. Given
sx, its sequence of parts of speech (POS)
is determined. Then, sq is created by
shuffling words at random while the
original POS sequence is maintained.</p>
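      <p>The first heuristic can be sketched as follows. This is an illustrative sketch, not the corpus construction code; the number of operations, the uniform choice among them, and the fixed seed are assumptions:</p>

```python
import random

def random_text_operations(sx: str, dq_words: list, ops: int = 10,
                           seed: int = 0) -> str:
    # Obfuscate sx by shuffling, removing, inserting, or replacing words
    # at random.
    rng = random.Random(seed)
    words = sx.split()
    for _ in range(ops):
        op = rng.choice(["shuffle", "remove", "insert", "replace"])
        if op == "shuffle" and len(words) > 2:
            i = rng.randrange(len(words) - 1)
            words[i], words[i + 1] = words[i + 1], words[i]
        elif op == "remove" and len(words) > 1:
            words.pop(rng.randrange(len(words)))
        elif op == "insert":
            # Insertions are taken from dq, the new context of sq.
            words.insert(rng.randrange(len(words) + 1), rng.choice(dq_words))
        elif op == "replace" and words:
            words[rng.randrange(len(words))] = rng.choice(dq_words)
    return " ".join(words)

sq = random_text_operations("one two three four five", ["context"])
```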
    </sec>
    <sec id="sec-6">
      <title>Critical Remarks</title>
      <p>The corpus has been conceived and
constructed only just in time for the competition
so that there may still be errors in it. For
instance, the participants pointed out that
there are a number of unintended overlaps
between unrelated documents. These
accidental similarities do not occur frequently, so
that an additional set of annotations solves
this problem.</p>
      <p>The obfuscation synthesizer based on
random text operations produces anomalies in
some of the obfuscated texts, such as
sequences of punctuation marks and stop
words. These issues were not entirely
resolved so that it is possible to find some of
the plagiarism cases by applying a kind of
anomaly detection. Nevertheless, this was
not observed during the competition.</p>
      <p>Finally, by construction the corpus does
not accurately simulate a heuristic retrieval
situation in which the Web is used as
reference collection. The source documents in the
corpus do not resemble the Web
appropriately. Note, however, that sampling the Web
is also a problem for many ranking evaluation
frameworks.
</p>
    </sec>
    <sec id="sec-6b">
      <title>Detection Quality Measures</title>
      <p>A measure that quantifies the performance of
a plagiarism detection algorithm will
resemble concepts in terms of precision and recall.
However, these concepts cannot be
transferred one-to-one from the classical
information retrieval situation to plagiarism
detection. This section explains the underlying
connections and introduces a reasonable
measure that accounts for the particularities.</p>
      <p>Let dq be a plagiarized document; dq
defines a sequence of characters each of
which is either labeled as plagiarized or
nonplagiarized. A plagiarized section s forms a
contiguous sequence of plagiarized characters
in dq. The set of all plagiarized sections in dq
is denoted by S, where ∀si, sj ∈ S : i ≠ j →
(si ∩ sj = ∅), i.e., the plagiarized sections do
not intersect. Likewise, the set of all sections
r ⊂ dq found by a plagiarism detection
algorithm is denoted by R. See Figure 4 for an
illustration.</p>
      <p>[Figure 4: A document dq as character sequence, with plagiarized sections s1, s2, s3 ∈ S and detected sections r1, ..., r5 ∈ R.]</p>
      <p>If the characters in dq are considered as
basic retrieval units, precision and recall for
a given dq, S, R compute straightforwardly.
This view may be called micro-averaged or
system-oriented. For the situation shown in
Figure 4 the micro-averaged precision is 8/16,
likewise, the micro-averaged recall is 8/13.
The advantage of a micro-averaged view is its
clear computational semantics, which comes
at a price: given an imbalance in the lengths
of the elements in S—which usually
correlates with the detection difficulty of a
plagiarized section—the explanatory power of the
computed measures is limited.</p>
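      <p>With characters as basic retrieval units, the micro-averaged measures reduce to set arithmetic on character positions. In the sketch below, the intervals are chosen so that, like the example values for Figure 4, the precision is 8/16 and the recall is 8/13; the concrete intervals themselves are assumptions:</p>

```python
def chars(sections):
    # Expand [start, end) character intervals into a set of positions.
    return {i for start, end in sections for i in range(start, end)}

def micro_precision(S, R):
    detected = chars(R)
    return len(chars(S) & detected) / len(detected) if detected else 0.0

def micro_recall(S, R):
    plagiarized = chars(S)
    return len(plagiarized & chars(R)) / len(plagiarized) if plagiarized else 0.0

S = [(0, 10), (20, 23)]   # plagiarized sections, 13 characters in total
R = [(5, 15), (20, 26)]   # detections, 16 characters in total
```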
      <p>It is more natural to treat the contiguous
sequences of plagiarized characters as basic
retrieval units. In this sense each si ∈ S
defines a query qi for which a plagiarism
detection algorithm returns a result set Ri ⊆ R.
This view may be called macro-averaged or
user-oriented. The recall of a plagiarism
detection algorithm, r ecPDA, is then defined as
the mean of the returned fractions of the
plagiarized sections, averaged over all sections
in S:
r ecPDA(S, R) =
1
|S| s∈S
|s</p>
      <p>r∈R r| ,
|s|
(1)
where computes the positionally
overlapping characters.</p>
      <p>Problem 1. The precision of a plagiarism
detection algorithm is not defined under the
macro-averaged view, which is rooted in the
fact that a detection algorithm does not
return a unique result set for each plagiarized
section s ∈ S. This deficit can be resolved
by switching the reference basis. Instead of
the plagiarized sections, S, the
algorithmically determined sections, R, become the
targets: the precision with which the queries in
S are answered is identified with the recall
of R under S. By computing the mean
average over the r ∈ R one obtains a definite
computation rule that captures the concept
of retrieval precision for S:</p>
      <p>precPDA(S, R) = (1/|R|) · Σ_{r∈R} |r ⊓ (∪_{s∈S} s)| / |r|,   (2)</p>
      <p>where ⊓ computes the positionally
overlapping characters. The domain of precPDA is
[0, 1]; in particular it can be shown that this
definition quantifies the necessary properties
of a precision statistic.</p>
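      <p>Equations (1) and (2) can be sketched as follows, representing each section as a [start, end) character interval. The sketch assumes that the detections in R (and likewise the sections in S) do not overlap each other, so the union inside the numerator reduces to a sum:</p>

```python
def overlap(a, b):
    # Positionally overlapping characters of two intervals [start, end).
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def rec_pda(S, R):
    # Equation (1): mean detected fraction over all plagiarized sections s.
    return sum(sum(overlap(s, r) for r in R) / (s[1] - s[0]) for s in S) / len(S)

def prec_pda(S, R):
    # Equation (2): mean correct fraction over all detections r.
    return sum(sum(overlap(r, s) for s in S) / (r[1] - r[0]) for r in R) / len(R)

S = [(0, 10), (20, 23)]   # plagiarized sections
R = [(5, 15), (20, 26)]   # detections
```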
      <p>Problem 2. Both the micro-averaged view
and the macro-averaged view are insensitive
to the number of times an s ∈ S is detected
in a detection result R, i.e., the granularity of
R. We define the granularity of R for a set of
plagiarized sections S by the average size of
the existing covers: a detection r ∈ R belongs
to the cover Cs of an s ∈ S iff s and r overlap.
(In (Stein, 2007) this idea is mathematically
derived as “precision stress” and “recall stress”.)
Let SR ⊆ S denote the set of detected cases,
i.e., for each s ∈ SR : |Cs| &gt; 0. The granularity of
R given S is defined as follows:
granPDA(S, R) = (1/|SR|) · Σ_{s∈SR} |Cs|,   (3)</p>
      <p>where SR = {s | s ∈ S ∧ ∃r ∈ R : s ∩ r ≠ ∅}
and Cs = {r | r ∈ R ∧ s ∩ r ≠ ∅}. The
domain of the granularity is [1, |R|], where
1 marks the desirable one-to-one
correspondence between R and S, and where |R| marks
the worst case, when a single s ∈ S is
detected over and over again.</p>
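      <p>Equation (3), together with the overall score that combines it with precision and recall, can be sketched as follows; returning a granularity of 1 for an empty SR is an assumption of this sketch:</p>

```python
import math

def overlaps(a, b):
    # Two intervals [start, end) overlap iff their intersection is non-empty.
    return min(a[1], b[1]) > max(a[0], b[0])

def gran_pda(S, R):
    # Equation (3): average number of detections covering a detected case.
    covers = [sum(1 for r in R if overlaps(s, r)) for s in S]
    detected = [c for c in covers if c > 0]
    return sum(detected) / len(detected) if detected else 1.0

def overall_pda(prec, rec, gran):
    # Overall score: F-Measure damped by the logarithm of the granularity.
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f / math.log2(1 + gran)
```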
      <p>The measures (1), (2), and (3) are
combined into an overall score:</p>
      <p>overallPDA(S, R) = F / log2(1 + granPDA),</p>
      <p>where F denotes the F-Measure, i.e., the
harmonic mean of the precision precPDA and the
recall recPDA. To smooth the influence of the
granularity on the overall score we take its
logarithm.
</p>
    </sec>
    <sec id="sec-6c">
      <title>Survey of Detection Approaches</title>
      <p>For the competition, 13 participants
developed plagiarism detection systems to tackle
one or both of the tasks external plagiarism
detection and intrinsic plagiarism detection.
The questions that naturally arise are: how do
they work, and how well? To give an answer,
we survey the approaches in a unified way
and report on their detection quality in the
competition.
</p>
    </sec>
    <sec id="sec-7">
      <title>External Plagiarism Detection</title>
      <p>Most of the participants competed in the
external plagiarism detection task of the
competition; detection results were submitted for
10 systems. As it turns out, all systems
are based on common approaches—although
they perform very differently.</p>
      <p>As explained at the outset, external
plagiarism detection divides into three steps (cf.
Figure 1): the heuristic retrieval step, the
detailed analysis step, and the post-processing
step. Table 1 summarizes the participants’
detection approaches in terms of these steps.
However, the post-processing step is
omitted here since none of the participants
applied noteworthy post-processing. Each row
of the table summarizes one system; we
restrict the survey to the top 6 systems since
the overall performance of the remaining
systems is negligible. Nevertheless, these
systems also implement the generic three-step
process. The focus of this survey is on
describing algorithmic and retrieval aspects
rather than implementation details; the
latter are diverse in terms of applied languages,
software, and runtime efficiency, and
descriptions can be found in the respective
references.</p>
      <p>[Table 1: External plagiarism detection approaches of the top systems. (1) Heuristic retrieval: character-16-gram VSM (frequency weights, cosine similarity), exhaustive comparison of Dq and D, candidates Dx ⊂ D: the 51 documents most similar to dq. Detailed analysis: exact character-16-gram matches; the distances of adjacent matches are computed, matches are joined based on a Monte Carlo optimization, and the obtained section pairs are refined, e.g., by discarding too small sections. (2) Heuristic retrieval: word-5-gram VSM (boolean weights, Jaccard similarity), exhaustive comparison, candidates: documents which share at least 20 n-grams with dq. Detailed analysis: exact word-5-gram matches; pairs of sections (sq, sx) of maximal size are extracted which share at least 20 matches, including the first and the last n-gram of sq and sx, and for which 2 adjacent matches are at most 49 non-matching n-grams apart. (3) Heuristic retrieval: word-8-gram VSM (frequency weights, custom distance), exhaustive comparison, candidates: the 10 documents nearest to dq. Detailed analysis: exact word-8-gram matches; section pairs are obtained by greedily joining consecutive matches if their distance is not too high.]</p>
      <p>[Table 1, further systems: (4) Heuristic retrieval candidates: for each sentence of dq, the documents from the 2 most similar partitions which share similar sentences. (5) Heuristic retrieval: winnowing fingerprinting over 50-character chunks with 30 characters overlap, exhaustive comparison, candidates: documents whose fingerprints share at least one value with dq’s fingerprint. Detailed analysis: exact matches of fingerprint chunks; pairs of sections (sq, sx) are obtained by enlarging matches and joining adjacent matches, where gaps must be below a certain Levenshtein distance.]</p>
      <p>The heuristic retrieval step (column 1 of
Table 1) involves the comparison of the
corpus’ suspicious documents Dq with the source
documents D. For this, each participant
employs a specific retrieval model, a comparison
strategy, and a heuristic to select the
candidate documents Dx from D. Most of the
participants use a variation of the well-known
vector space model (VSM) as retrieval model,
where the tokens are often character or
word n-grams instead of single words. As
comparison strategy, the top 3 approaches
perform an exhaustive comparison of Dq and
D, i.e., each dq ∈ Dq is compared with each
dx ∈ D in time O(|Dq| · |D|), while the
remaining approaches employ data
partitioning and space partitioning techniques to
achieve lower runtime complexities. To
select the candidate documents Dx for a dq,
either its k nearest neighbors are selected or
the documents which exceed a certain
similarity threshold.</p>
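      <p>The two candidate selection strategies can be sketched as follows. The word-5-gram tokens and the threshold of 20 shared n-grams mirror Table 1, while everything else, including the toy documents, is an assumption:</p>

```python
def word_ngrams(text, n=5):
    # Word n-grams as retrieval model tokens.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def candidates_by_threshold(dq, D, min_shared=20, n=5):
    # Select documents which share at least min_shared n-grams with dq.
    q = word_ngrams(dq, n)
    return [name for name, dx in D.items()
            if len(q & word_ngrams(dx, n)) >= min_shared]

def candidates_knn(dq, D, k=10, n=5):
    # Alternatively, select the k documents sharing the most n-grams with dq.
    q = word_ngrams(dq, n)
    return sorted(D, key=lambda name: len(q & word_ngrams(D[name], n)),
                  reverse=True)[:k]

common = " ".join(f"w{i}" for i in range(30))   # 26 shared word-5-grams
D = {"a": common, "b": "x y z"}
```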
      <p>The detailed analysis step (column 2 of
Table 1) involves the comparison of each
dq ∈ Dq with its respective candidate
documents Dx in order to extract pairs of
sections (sq, sx), where sq ∈ dq and sx ∈ dx,
dx ∈ Dx, from them which are highly
similar, if any. For this, each participant first
extracts all exact matches between dq and dx
and then merges the matches heuristically to
form suspicious sections (sq, sx). While each
participant uses the same type of token for
extracting exact matches as in his respective
retrieval model of the heuristic retrieval step,
the match merging heuristics differ largely
from one another. However, it can be said
that in most approaches a kind of distance
between exact matches is measured first, and
then a custom algorithm is employed which
clusters the matches into sections.</p>
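      <p>A minimal sketch of such a merging heuristic, in the spirit of the approaches in Table 1: matches are given by their character positions in dq, and consecutive matches are joined greedily. The gap threshold and the minimum number of matches per section are assumptions:</p>

```python
def merge_matches(positions, max_gap=50, min_len=2):
    # Greedily join consecutive match positions into sections if their
    # distance stays within max_gap; discard sections with too few matches.
    sections = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_len:
                sections.append((current[0], current[-1]))
            current = []
        current.append(pos)
    if len(current) >= min_len:
        sections.append((current[0], current[-1]))
    return sections
```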
      <p>Table 2 lists the detection performance
results of all approaches, computed with the
quality measures introduced in Section 3.
Observe that the approach with top
precision is the one on rank 6 which is based on
fingerprinting, the approach with top recall is
the one on rank 2, and the approach with top
granularity is the one on rank 1. The latter is
also the winner of this task since it provides
the best trade-off between the three quality
measures.</p>
    </sec>
    <sec id="sec-8">
      <title>Intrinsic Plagiarism Detection</title>
      <p>
The intrinsic plagiarism detection task has
attracted less attention than external
plagiarism detection; detection results were
submitted for 4 systems. Table 3 lists their
detection performance results. Unlike in external
plagiarism detection, in this task the baseline
performance is not 0. The reason for this is
that intrinsic plagiarism detection is a
one-class classification problem in which it has
to be decided for each section of a document
whether it is plagiarized, or not. The baseline
performance in such problems is commonly
computed as the naive assumption that
everything belongs to the target class, which
is also what Hagbi and Koppel (2009) did,
classifying almost everything as
plagiarized. Interestingly, the baseline approach
is on rank 2 while two approaches perform
worse than the baseline. Only the approach
of
        <xref ref-type="bibr" rid="ref7">Stamatatos (2009)</xref>
        performs better than
the baseline.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Overall Detection Results</title>
      <p>To determine the overall winner of the
competition, we have computed the combined
detection performance of each participant on
the competition corpora of both tasks.
Table 4 shows the results. Note that the
competition corpus of the external plagiarism
detection task is a lot bigger than the one for
the intrinsic plagiarism detection task, which
is why the top-ranked approaches are those
that performed best in the former task. The
overall winner of the competition is the approach
of Grozea, Gehl, and Popescu (2009).</p>
      <sec id="sec-9-1">
        <title>Summary</title>
        <p>The 1st International Competition on
Plagiarism Detection fostered research and brought
a number of new insights into the problems of
automatic plagiarism detection and its
evaluation. An important by-product of the
competition is a controlled large-scale evaluation
framework which consists of a corpus of
artificial plagiarism cases and new detection
quality measures. The corpus contains more than
40 000 documents and about 94 000 cases of
plagiarism.</p>
        <p>Furthermore, in this paper we give a
comprehensive overview about the competition
and in particular about the plagiarism
detection approaches of the competition’s 13
participants. It turns out that all of the
detection approaches follow a generic retrieval
process scheme which consists of the three
steps heuristic retrieval, detailed analysis,
and knowledge-based post-processing. To
ascertain this fact we have compiled a unified
summary of the top approaches in Table 1.</p>
        <p>The competition was divided into the two
tasks external plagiarism detection and
intrinsic plagiarism detection. The winning
approach for the former task achieves 0.74
precision at 0.65 recall at 1.00 granularity.
The winning approach for the latter task
improves 26% above the baseline approach and
achieves 0.23 precision at 0.46 recall at 1.38
granularity.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Acknowledgements</title>
        <p>We thank Yahoo! Research and the
University of the Basque Country for their
sponsorship. This work was also partially funded by
the Text-Enterprise 2.0
TIN2009-13391-C0403 project and the CONACYT-MEXICO
192021 grant. Our special thanks go to the
participants of the competition for their
devoted work.</p>
        <p>
Clough, Paul. 2003. Old and new challenges in
automatic plagiarism detection. National UK
Plagiarism Advisory Service,
http://ir.shef.ac.uk/cloughie/papers/pas plagiarism.pdf.</p>
        <p>Stein, Benno. 2007. Principles of hash-based
text retrieval. In Charles Clarke, Norbert Fuhr,
Noriko Kando, Wessel Kraaij, and Arjen de
Vries, editors, 30th Annual International ACM
SIGIR Conference, pages 527–534. ACM, July.</p>
        <p>Grozea, Cristian, Christian Gehl, and Marius
Popescu. 2009. ENCOPLOT: Pairwise Sequence
Matching in Linear Time Applied to Plagiarism
Detection. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
        <p>Hagbi, Barak and Moshe Koppel. 2009.
Submission to the 1st International Competition
on Plagiarism Detection. From the Bar Ilan
University, Israel.</p>
        <p>
          Kasprzak, Jan, Michal Brandejs, and Miroslav
Křipač. 2009. Finding Plagiarism by Evaluating
Document Similarities. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et
al., 2009)</xref>
          .
        </p>
        <p>Malcolm, James A. and Peter C. R. Lane. 2009.
          Tackling the PAN’09 External Plagiarism
Detection Corpus with a Desktop Plagiarism
Detector. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
        <p>Maurer, Hermann, Frank Kappe, and Bilal
Zaka. 2006. Plagiarism - a survey. Journal of
Universal Computer Science, 12(8):1050–1084.</p>
        <p>
          Muhr, Markus, Mario Zechner, Roman Kern,
and Michael Granitzer. 2009. External and
Intrinsic Plagiarism Detection Using Vector
Space Models. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
        <p>Palkovskii, Yurii Anatol’yevich,
Alexei Vitalievich Belov, and
Irina Alexandrovna Muzika. 2009. Submission
to the 1st International Competition on
Plagiarism Detection. From the Zhytomyr State
University, Ukraine.</p>
        <p>Stein, Benno, Sven Meyer zu Eissen, and Martin
Potthast. 2007. Strategies for Retrieving
Plagiarized Documents. In Charles Clarke,
Norbert Fuhr, Noriko Kando, Wessel Kraaij, and
Arjen de Vries, editors, 30th Annual
International ACM SIGIR Conference, pages
825–826. ACM, July.</p>
        <p>Stein, Benno, Paolo Rosso, Efstathios
Stamatatos, Moshe Koppel, and Eneko Agirre,
editors. 2009. Proceedings of the SEPLN
Workshop on Uncovering Plagiarism,
Authorship, and Social Software Misuse,
PAN’09, September 10 2009, Donostia-San
Sebastián, Spain. Universidad Politécnica de
Valencia.</p>
        <p>Vallés Balaguer, Enrique. 2009. Putting
Ourselves in SME’s Shoes: Automatic Detection
of Plagiarism by the WCopyFind tool. In Stein
et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Allen</surname>
          </string-name>
          , James.
          <year>2009</year>
          .
          <article-title>Submission to the 1st International Competition on Plagiarism Detection</article-title>
          . From the Southern Methodist University in Dallas, USA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Basile</surname>
          </string-name>
          , Chiara, Dario Benedetto, Emanuele Caglioti, Giampaolo Cristadoro, and Mirko Degli Esposti.
          <year>2009</year>
          .
          <article-title>A Plagiarism Detection Procedure in Three Steps: Selection, Matches and “Squares”</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Scherbinin</surname>
          </string-name>
          , Vladislav and Sergey
          <string-name>
            <surname>Butakov</surname>
          </string-name>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>Using Microsoft SQL Server Platform for Plagiarism Detection</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Seaward</surname>
          </string-name>
          , Leanne and Stan
          <string-name>
            <surname>Matwin</surname>
          </string-name>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>Intrinsic Plagiarism Detection Using Complexity Analysis</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , Efstathios.
          <year>2009</year>
          .
          <article-title>Intrinsic Plagiarism Detection Using Character n-gram Profiles</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Weber-Wulff</surname>
          </string-name>
          , Debora and Katrin Köhler.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          Plagiarism Detection Software Test
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>http://plagiat.htw-berlin.de/software/2008/.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <year>2009</year>
          .
          <article-title>PAN Plagiarism Corpus PAN-PC-09</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>