<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Overview of the 1st International Competition on Plagiarism Detection *</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Martin</forename><surname>Potthast</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Eiselt</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alberto</forename><surname>Barrón-Cedeño</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Web Technology &amp; Information Systems Group Natural Language Engineering Lab</orgName>
								<orgName type="institution">ELiRF Bauhaus-Universität Weimar Universidad Politécnica de Valencia</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Overview of the 1st International Competition on Plagiarism Detection *</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">DCF49AF8DE76D19ECF57EA58CFC3EF53</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Plagiarism Detection</term>
					<term>Competition</term>
					<term>Evaluation Framework</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism detection and intrinsic plagiarism detection, which were tackled by 13 participating groups.</p><p>An important by-product of the competition is an evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and detection quality measures. The framework may serve as a unified test environment to compare future plagiarism detection research. In this paper we describe the corpus design and the quality measures, survey the detection approaches developed by the participants, and compile the achieved performance results of the competitors.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Plagiarism and its automatic retrieval have attracted considerable attention from research and industry: various papers have been published on the topic, and many commercial software systems are being developed. However, when asked to name the best algorithm or the best system for plagiarism detection, hardly any evidence can be found to make an educated guess among the alternatives. One reason for this is that the research field of plagiarism detection lacks a controlled evaluation environment. This leads researchers to devise their own experimentation and methodologies, which are often not reproducible or comparable across papers. Furterhmore, it is unknown which detection quality can at least be expected from a plagiarism detection system.</p><p>To close this gap we have organized an international competition on plagiarism detection. We have set up, presumably for the first time, a controlled evaluation environment for plagiarism detection which consists of a largescale corpus of artificial plagiarism and de-</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Related Work</head><p>Research on plagiarism detection has been surveyed by <ref type="bibr" target="#b7">Maurer, Kappe, and Zaka (2006)</ref> and <ref type="bibr" target="#b2">Clough (2003)</ref>. Particularly the latter provides well thought-out insights into, even today, "[...] new challenges in automatic plagiarism detection", among which the need for a standardized evaluation framework is already mentioned.</p><p>With respect to the evaluation of commercial plagiarism detection systems, <ref type="bibr" target="#b17">Weber-Wulff and Köhler (2008)</ref> have conducted a manual evaluation: 31 handmade cases of plagiarism were submitted to 19 systems. The sources for the plagiarism cases were selected from the Web and the systems were judged by their capability to retrieve them. Due to the use of the Web, the experiment is not controlled which limits reproducibility, and since each case is only about two pages long there are concerns with respect to the study's representativeness. However, com-Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 1-9, 2009. mercial systems are usually not available for a close inspection which may leave no other choice to evaluate them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Plagiarism Detection</head><p>The literature on the subject often puts plagiarism detection on a level with the identification of highly similar sections in texts or other objects. But this does not show the whole picture. From our point of view plagiarism detection divides into two major problem classes, namely external plagiarism detection and intrinsic plagiarism detection.</p><p>Both of which include a number of subproblems and the frequently mentioned step-bystep comparison of two documents is only one of them.</p><p>For external plagiarism detection <ref type="bibr" target="#b15">Stein, Meyer zu Eissen, and Potthast (2007)</ref> introduce a generic three-step retrieval process. The authors consider that the source of a plagiarism case may be hidden in a large reference collection, as well as that the detection results may not be perfectly accurate. Figure <ref type="figure" target="#fig_0">1</ref> illustrates this retrieval process. In fact, all detection approaches submitted by the competition participants can be explained in terms of these building blocks (cf. Section 4).</p><p>The process starts with a suspicious document d q and a collection D of documents from which d q 's author may have plagiarized. Within a so-called heuristic retrieval step a small number of candidate documents D x , which are likely to be sources for plagiarism, are retrieved from D. Note that D is usually very large, e.g., in the size of the Web, so that it is impractical to compare d q one after the other with each document in D. Then, within a so-called detailed analysis step, d q is compared section-wise with the retrieved candidates. All pairs of sections (s q , s x ) with s q ∈ d q and s x ∈ d x , d x ∈ D x , are to be retrieved such that s q and s x have a high similarity under some retrieval model. 
In a knowledge-based post-processing step those sections are filtered for which certain exclusion criteria hold, such as the use of proper citation or literal speech. The remaining suspicious sections are presented to a human, who may decide whether or not a plagiarism offense is given.</p><p>Intrinsic plagiarism detection has been studied in detail by Meyer zu Eissen and <ref type="bibr" target="#b8">Stein (2006)</ref>. In this setting one is given a suspicious document d q but no reference collection D. Technology that tackles instances of this problem class resembles the human ability to spot potential cases of plagiarism just by reading d q .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Competition Agenda</head><p>We have set up a large-scale corpus (D q , D, S) of "artificial plagiarism" cases for the competition, where D q is a collection of suspicious documents, D is a collection of source documents, and S is the set of annotations of all plagiarism cases between D q and D. The competition divided into two tasks and into two phases for which the corpus was split up into 4 parts; one part for each combination of tasks and phases. For simplicity the sub-corpora are not denoted by different symbols.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Competition tasks and phases:</head><p>• External Plagiarism Detection Task.</p><p>Given D q and D the task is to identify the sections in D q which are plagiarized, and their source sections in D.</p><p>• Intrinsic Plagiarism Detection Task.</p><p>Given only D q the task is to identify the plagiarized sections.</p><p>• Training Phase. Release of a training corpus (D q , D, S) to allow for the development of a plagiarism detection system.</p><p>• Competition Phase. Release of a competition corpus (D q , D) whose plagiarism cases were to be detected and submitted as detection annotations, R.</p><p>Participants were allowed to compete in either of the two tasks or both. After the competition phase the participants' detections were evaluated, and the winner of each task as well as an overall winner was determined as that participant whose detections R best matched S in the respective competition corpora.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Plagiarism Corpus</head><p>The PAN plagiarism corpus, PAN-PC-09, comprises 41 223 text documents in which 94 202 cases of artificial plagiarism have been inserted automatically (Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia, 2009). The corpus is based on 22 874 book-length documents from the Project Gutenberg.<ref type="foot" target="#foot_0">1</ref> All documents are, to the best of our knowledge, public domain; therefore the corpus is available free of charge to other researchers. Important parameters of the corpus are the following:</p><p>• Document Length. 50% of the documents are small (1-10 pages), 35% medium (10-100 pages), and 15% large (100-1000 pages).</p><p>• Suspicious-to-Source Ratio. 50% of the documents are designated as suspicious documents D q , and 50% are designated as source documents D (see Figure <ref type="figure" target="#fig_1">2</ref>).</p><p>• Plagiarism Percentage. The percentage θ of plagiarism per suspicious document d q ∈ D q ranges from 0% to 100%, whereas 50% of the suspicious documents contain no plagiarism at all. Figure <ref type="figure" target="#fig_3">3</ref> shows the distribution of the plagiarized documents for the external test corpus. For the intrinsic test corpus applies the hashed part of the distribution.</p><p>• Plagiarism Length. The length of a plagiarism case is evenly distributed between 50 words and 5000 words. • Plagiarism Languages. 90% of the cases are monolingual English plagiarism, the remainder of the cases are cross-lingual plagiarism which were translated automatically from German and Spanish to English.</p><p>• Plagiarism Obfuscation. The monolingual portion of the plagiarism in the external test corpus was obfuscated (cf. Section 2.1). 
The degree of obfuscation ranges evenly from none to high.</p><p>Note that for the estimation of the parameter distributions one cannot fall back on large case studies on real plagiarism cases. Hence, we decided to construct more simple cases than complex ones, where "simple" refers to short lengths, a small percentage θ, and less obfuscation. However, complex cases are overrepresented to allow for a better judgement whether a system detects them properly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Obfuscation Synthesis</head><p>Plagiarists often modify or rewrite the sections they copy in order to obfuscate the plagiarism. In this respect, the automatic synthesis of plagiarism obfuscation we applied when constructing the corpus is of particular interest. The respective synthesis task reads  as follows: given a section of text s x , create a section s q which has a high content similarity to s x under some retrieval model but with a (substantially) different wording than s x .</p><p>An optimal obfuscation synthesizer, i.e., an automatic plagiarist, takes an s x and creates an s q which is human-readable and which creates the same ideas in mind as s x does when read by a human. Today, such a synthesizer cannot be constructed. Therefore, we approach the task from the basic understanding of content similarity in information retrieval, namely the bag-of-words model. By allowing our obfuscation synthesizers to construct texts which are not necessarily human-readable they can be greatly simplified. We have set up three heuristics to construct s q from s x :</p><p>• Random Text Operations. Given s x , s q is created by shuffling, removing, inserting, or replacing words or short phrases at random. Insertions and replacements are, for instance, taken from the document d q , the new context of s q .</p><p>• Semantic Word Variation. Given s x , s q is created by replacing each word by one of its synonyms, antonyms, hyponyms, or hypernyms, chosen at random. A word is retained if neither are available.</p><p>• POS-preserving Word Shuffling. Given s x its sequence of parts of speech (POS) is determined. Then, s q is created by shuffling words at random while the original POS sequence is maintained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Critical Remarks</head><p>The corpus has been conceived and constructed only just in time for the competition so that there may still be errors in it. For instance, the participants pointed out that there are a number of unintended overlaps between unrelated documents. These accidental similarities do not occur frequently, so that an additional set of annotations solves this problem.</p><p>The obfuscation synthesizer based on random text operations produces anomalies in some of the obfuscated texts, such as sequences of punctuation marks and stop words. These issues were not entirely resolved so that it is possible to find some of the plagiarism cases by applying a kind of anomaly detection. Nevertheless, this was not observed during the competition.</p><p>Finally, by construction the corpus does not accurately simulate a heuristic retrieval situation in which the Web is used as reference collection. The source documents in the corpus do not resemble the Web appropriately. Note, however, that sampling the Web is also a problem for many ranking evaluation frameworks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Detection Quality Measures</head><p>A measure that quantifies the performance of a plagiarism detection algorithm will resemble concepts in terms of precision and recall. However, these concepts cannot be transferred one-to-one from the classical information retrieval situation to plagiarism detection. This section explains the underlying connections and introduces a reasonable measure that accounts for the particularities.</p><p>Let d q be a plagiarized document; d q defines a sequence of characters each of which is either labeled as plagiarized or nonplagiarized. A plagiarized section s forms a contiguous sequence of plagiarized characters in d q . The set of all plagiarized sections in d q is denoted by S, where ∀s i , s j ∈ S : i = j → (s i ∩ s j = ∅), i.e., the plagiarized sections do not intersect. Likewise, the set of all sections r ⊂ d q found by a plagiarism detection algorithm is denoted by R. See Figure <ref type="figure" target="#fig_4">4</ref> for an illustration. If the characters in d q are considered as basic retrieval units, precision and recall for a given d q , S, R compute straightforwardly. This view may be called micro-averaged or system-oriented. For the situation shown in Figure <ref type="figure" target="#fig_4">4</ref> the micro-averaged precision is 8/16, likewise, the micro-averaged recall is 8/13. The advantage of a micro-averaged view is its clear computational semantics, which comes at a price: given an imbalance in the lengths of the elements in S-which usually correlates with the detection difficulty of a plagiarized section-the explanatory power of the computed measures is limited.</p><p>It is more natural to treat the contiguous sequences of plagiarized characters as basic retrieval units. In this sense each s i ∈ S defines a query q i for which a plagiarism detection algorithm returns a result set R i ⊆ R. This view may be called macro-averaged or user-oriented. 
The recall of a plagiarism detection algorithm, r ec PDA , is then defined as the mean of the returned fractions of the plagiarized sections, averaged over all sections in S:</p><formula xml:id="formula_0">r ec PDA (S, R) = 1 |S| s∈S |s r∈R r| |s| ,<label>(1)</label></formula><p>where computes the positionally overlapping characters. Problem 1. The precision of a plagiarism detection algorithm is not defined under the macro-averaged view, which is rooted in the fact that a detection algorithm does not return a unique result set for each plagiarized section s ∈ S. This deficit can be resolved by switching the reference basis. Instead of the plagiarized sections, S, the algorithmically determined sections, R, become the targets: the precision with which the queries in S are answered is identified with the recall of R under S.<ref type="foot" target="#foot_2">2</ref> By computing the mean average over the r ∈ R one obtains a definite computation rule that captures the concept of retrieval precision for S:</p><formula xml:id="formula_1">prec PDA (S, R) = 1 |R| r∈R |r s∈S s| |r| ,<label>(2)</label></formula><p>where computes the positionally overlapping characters. The domain of prec PDA is [0, 1]; in particular it can be shown that this definition quantifies the necessary properties of a precision statistic. Problem 2. Both the micro-averaged view and the macro-averaged view are insensitive to the number of times an s ∈ S is detected in a detection result R, i.e., the granularity of R. We define the granularity of R for a set of plagiarized sections S by the average size of the existing covers: a detection r ∈ R belongs to the cover C s of an s ∈ S iff s and r overlap. Let S R ⊆ S denote the set of cases so that for each s ∈ S : |C s | &gt; 0. 
The granularity of R given S is defined as follows:</p><formula xml:id="formula_2">gran PDA (S, R) = 1 |S R | s∈S R |C s |,<label>(3)</label></formula><p>where</p><formula xml:id="formula_3">S R = {s | s ∈ S ∧ ∃r ∈ R : s ∩ r = ∅} and C s = {r | r ∈ R ∧ s ∩ r = ∅}. The domain of the granularity is [1, |R|],</formula><p>where 1 marks the desireable one-to-one correspondence between R and S, and where |R| marks the worst case, when a single s ∈ S is detected over an over again.</p><p>The measures ( <ref type="formula" target="#formula_0">1</ref>), (2), and ( <ref type="formula" target="#formula_2">3</ref>) are combined to an overall score:</p><formula xml:id="formula_4">overall PDA (S, R) = F log 2 (1 + gran PDA )</formula><p>,</p><p>where F denotes the F-Measure, i.e., the harmonic mean of the precision prec PDA and the recall r ec PDA . To smooth the influence of the granularity on the overall score we take its logarithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Survey of Detection Approaches</head><p>For the competition, 13 participants developed plagiarism detection systems to tackle one or both of the tasks external plagiarism detection and intrinsic plagiarism detection. The questions that naturally arise: how do they work and how well? To give an answer, we survey the approaches in a unified way and report on their detection quality in the competition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">External Plagiarism Detection</head><p>Most of the participants competed in the external plagiarism detection task of the competition; detection results were submitted for 10 systems. As it turns out, all systems are based on common approaches-although they perform very differently. As explained at the outset, external plagiarism detection divides into three steps (cf. Figure <ref type="figure" target="#fig_0">1</ref>): the heuristic retrieval step, the detailed analysis step, and the post-processing step. Table <ref type="table" target="#tab_0">1</ref> summarizes the participants' detection approaches in terms of these steps. However, the post-processing step was omitted here since neither of the participants applied noteworthy post-processing. Each row of the table summarizes one system; we restrict the survey to the top 6 systems since the overall performance of the remaining systems is negligible. Nevertheless, these systems also implement the generic three-step process. The focus of this survey is on describing algorithmic and retrieval aspects rather than implementation details. The latter are diverse in terms of applied languages, software, and their runtime efficiency; descriptions can be found in the respective references.</p><p>The heuristic retrieval step (column 1 of Table <ref type="table" target="#tab_0">1</ref>) involves the comparison of the corpus' suspicious documents D q with the source documents D. For this, each participant em- ploys a specific retrieval model, a comparison strategy, and a heuristic to select the candidate documents D x from the D. Most of the participants use a variation of the well-known vector space model (VSM) as retrieval model, whereas, the tokens are often character-or word-n-grams instead of single words. 
As comparison strategy, the top 3 approaches perform an exhaustive comparison of D q and D, i.e., each d q ∈ D q is compared with each</p><formula xml:id="formula_5">d x ∈ D in time O(|D q | • |D|)</formula><p>, while the remaining approaches employ data partitioning and space partitioning technologies to achieve lower runtime complexities. To select the candidate documents D x for a d q either its k nearest neighbors are selected or the documents which exceed a certain similarity threshold.</p><p>The detailed analysis step (column 2 of Table <ref type="table" target="#tab_0">1</ref>) involves the comparison of each d q ∈ D q with its respective candidate documents D x in order to extract pairs of sections (s q , s x ), where s q ∈ d q and s x ∈ d x , d x ∈ D x , from them which are highly similar, if any. For this, each participant first extracts all exact matches between d q and d x and then merges the matches heuristically to form suspicious sections (s q , s x ). While each participant uses the same type of token to extract exact matches as his respective retrieval model of the heuristic retrieval step, the match merging heuristics differ largely from one another. However, it can be said that in most approaches a kind of distance between exact matches is measured first, and then a custom algorithm is employed which clusters them to sections.</p><p>Table <ref type="table" target="#tab_1">2</ref> lists the detection performance results of all approaches, computed with the quality measures introduced in Section 3. Observe that the approach with top precision is the one on rank 6 which is based on fingerprinting, the approach with top recall is the one on rank 2, and the approach with top granularity is the one on rank 1. The latter is also the winner of this task since it provides the best trade off between the three quality measures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Intrinsic Plagiarism Detection</head><p>The intrinsic plagiarism detection task has gathered less attention than external plagiarism detection; detection results were submitted for 4 systems. Table <ref type="table" target="#tab_2">3</ref> lists their detection performance results. Unlike in external plagiarism detection, in this task the baseline performance is not 0. The reason for this is that intrinsic plagiarism detection is a one- </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Generic retrieval process for external plagiarism detection.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of suspicious documents (with and without plagiarism) and source documents.</figDesc><graphic coords="3,375.59,76.27,134.59,103.57" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Distribution of the plagiarism percentage θ in the external test corpus. For the intrinsic test corpus applies the hashed part only.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: A document as character sequence, including plagiarized sections S and detections R returned by a plagiarism detection algorithm. The figure is drawn at scale 1 : n chars, n 1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Unified summary of the detection approaches of the participants.Comparison of Dq and D. ExhaustiveCandidates Dx ⊂ D for a dq. The 10 documents nearest to dq.</figDesc><table><row><cell cols="2">External Plagiarism Detection Approach</cell><cell></cell></row><row><cell>Heuristic Retrieval</cell><cell>Detailed Analysis</cell><cell>Participant</cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Grozea, Gehl, and</cell></row><row><cell>Character-16-gram VSM</cell><cell>Character-16-grams</cell><cell>Popescu (2009)</cell></row><row><cell>(frequency weights, cosine similarity)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Computation of the distances of adjacent</cell><cell></cell></row><row><cell>Exhaustive</cell><cell>matches. Joining of the matches based on a</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. The 51 documents most similar to dq.</cell><cell>Monte Carlo optimization. Refinement of the obtained section pairs, e.g., by discarding too small sections.</cell><cell></cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Kasprzak, Brandejs,</cell></row><row><cell>Word-5-gram VSM</cell><cell>Word-5-grams</cell><cell>and Křipač (2009)</cell></row><row><cell>(boolean weights, Jaccard similarity)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Extraction of the pairs of sections (sq, sx) of</cell><cell></cell></row><row><cell>Exhaustive</cell><cell>maximal size which share at least 20</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. 
Documents which share at least 20 n-grams with dq.</cell><cell>matches, including the first and the last n-gram of sq and sx, and for which 2 adjacent matches are at most 49 not-matching n-grams apart.</cell><cell></cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Basile et al. (2009)</cell></row><row><cell>Word-8-gram VSM</cell><cell>Word-8-grams</cell><cell></cell></row><row><cell>(frequency weights, custom distance)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell></cell><cell>Extraction of the pairs of sections (sq, sx)</cell><cell></cell></row><row><cell></cell><cell>which are obtained by greedily joining</cell><cell></cell></row><row><cell></cell><cell>consecutive matches if their distance is not</cell><cell></cell></row><row><cell></cell><cell>too high.</cell><cell></cell></row><row><cell cols="3">Using the commercial system Plagiarism Detector (http://plagiarism-detector.com) Palkovskii, Belov,</cell></row><row><cell></cell><cell></cell><cell>and Muzika (2009)</cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Muhr et al. (2009)</cell></row><row><cell>Word-1-gram VSM</cell><cell>Sentences</cell><cell></cell></row><row><cell>(frequency weights, cosine similarity)</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Extraction of the pairs of sections (sq, sx)</cell><cell></cell></row><row><cell>Clustering-based data-partitioning of</cell><cell>which are obtained by greedily joining</cell><cell></cell></row><row><cell>D's sentences. Comparison of Dq's</cell><cell>consecutive sentences. Gaps are allowed if</cell><cell></cell></row><row><cell>sentences with each partitions' centroid.</cell><cell>the respective sentences are similar to the</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. 
For each sentence of dq, the documents</cell><cell>corresponding sentences in the other document.</cell><cell></cell></row><row><cell>from the 2 most similar partitions which</cell><cell></cell><cell></cell></row><row><cell>share similar sentences.</cell><cell></cell><cell></cell></row><row><cell>Retrieval Model.</cell><cell>Exact Matches of dq and dx ∈ Dx.</cell><cell>Scherbinin and</cell></row><row><cell>Winnowing fingerprinting</cell><cell>Fingerprint chunks</cell><cell>Butakov (2009)</cell></row><row><cell>50 char chunks with 30 char overlap</cell><cell>Match Merging Heuristic to get (sq, sx).</cell><cell></cell></row><row><cell>Comparison of Dq and D.</cell><cell>Extraction of the pairs of sections (sq, sx)</cell><cell></cell></row><row><cell>Exhaustive</cell><cell>which are obtained by enlarging matches</cell><cell></cell></row><row><cell>Candidates Dx ⊂ D for a dq. Documents whose fingerprints share at</cell><cell>and joining adjacent matches. Gaps must be below a certain Levenshtein distance.</cell><cell></cell></row><row><cell>least one value with dq's fingerprint.</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Performance results for the external plagiarism detection task.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="3">External Detection Quality</cell></row><row><cell cols="2">Rank Overall</cell><cell>F</cell><cell cols="4">Precision Recall Granularity Participant</cell></row><row><cell>1</cell><cell cols="2">0.6957 0.6976</cell><cell>0.7418</cell><cell>0.6585</cell><cell>1.0038</cell><cell>Grozea, Gehl, and Popescu (2009)</cell></row><row><cell>2</cell><cell cols="2">0.6093 0.6192</cell><cell>0.5573</cell><cell>0.6967</cell><cell>1.0228</cell><cell>Kasprzak, Brandejs, and Křipač (2009)</cell></row><row><cell>3</cell><cell cols="2">0.6041 0.6491</cell><cell>0.6727</cell><cell>0.6272</cell><cell>1.1060</cell><cell>Basile et al. (2009)</cell></row><row><cell>4</cell><cell cols="2">0.3045 0.5286</cell><cell>0.6689</cell><cell>0.4370</cell><cell>2.3317</cell><cell>Palkovskii, Belov, and Muzika (2009)</cell></row><row><cell>5</cell><cell cols="2">0.1885 0.4603</cell><cell>0.6051</cell><cell>0.3714</cell><cell>4.4354</cell><cell>Muhr et al. 
(2009)</cell></row><row><cell>6</cell><cell cols="2">0.1422 0.6190</cell><cell>0.7473</cell><cell>0.5284</cell><cell>19.4327</cell><cell>Scherbinin and Butakov (2009)</cell></row><row><cell>7</cell><cell cols="2">0.0649 0.1736</cell><cell>0.6552</cell><cell>0.1001</cell><cell>5.3966</cell><cell>Pereira, Moreira, and Galante (2009)</cell></row><row><cell>8</cell><cell cols="2">0.0264 0.0265</cell><cell>0.0136</cell><cell>0.4586</cell><cell>1.0068</cell><cell>Vallés Balaguer (2009)</cell></row><row><cell>9</cell><cell cols="2">0.0187 0.0553</cell><cell>0.0290</cell><cell>0.6048</cell><cell>6.7780</cell><cell>Malcolm and Lane (2009)</cell></row><row><cell>10</cell><cell cols="2">0.0117 0.0226</cell><cell>0.3684</cell><cell>0.0116</cell><cell>2.8256</cell><cell>Allen (2009)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Performance results for the intrinsic plagiarism detection task.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="3">Intrinsic Detection Quality</cell></row><row><cell cols="2">Rank Overall</cell><cell>F</cell><cell cols="4">Precision Recall Granularity Participant</cell></row><row><cell>1</cell><cell cols="2">0.2462 0.3086</cell><cell>0.2321</cell><cell>0.4607</cell><cell>1.3839</cell><cell>Stamatatos (2009)</cell></row><row><cell>2</cell><cell cols="2">0.1955 0.1956</cell><cell>0.1091</cell><cell>0.9437</cell><cell>1.0007</cell><cell>Hagbi and Koppel (2009)</cell><cell>(Baseline)</cell></row><row><cell>3</cell><cell cols="2">0.1766 0.2286</cell><cell>0.1968</cell><cell>0.2724</cell><cell>1.4524</cell><cell>Muhr et al. (2009)</cell></row><row><cell>4</cell><cell cols="2">0.1219 0.1750</cell><cell>0.1036</cell><cell>0.5630</cell><cell>1.7049</cell><cell><ref type="bibr" target="#b12">Seaward and Matwin (2009)</ref></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.gutenberg.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">In<ref type="bibr" target="#b14">(Stein, 2007)</ref> this idea is mathematically derived as "precision stress" and "recall stress".</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>We thank Yahoo! Research and the University of the Basque Country for their sponsorship. This work was also partially funded by the Text-Enterprise 2.0 TIN2009-13391-C04-03 project and the CONACYT-MEXICO 192021 grant. Our special thanks go to the participants of the competition for their devoted work.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>class classification problem in which it has to be decided for each section of a document whether or not it is plagiarized. The baseline performance in such problems is commonly computed under the naive assumption that everything belongs to the target class, which is also what <ref type="bibr" target="#b4">Hagbi and Koppel (2009)</ref> did: they classified almost everything as plagiarized. Interestingly, this baseline approach ranks second, and two approaches perform worse than it; only the approach of <ref type="bibr" target="#b13">Stamatatos (2009)</ref> performs better than the baseline.</p></div>
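The effect of this naive baseline can be made concrete with a small sketch (hypothetical per-character labels for illustration, not the competition's actual evaluation code): flagging everything as plagiarized drives recall to 1 while precision collapses to the true plagiarism fraction, which mirrors the baseline's high recall and low precision in Table 3.

```python
# "Everything is plagiarized" baseline on hypothetical per-character labels.
truth = [False] * 90 + [True] * 10   # assume 10% of the text is plagiarized
pred = [True] * 100                  # baseline: flag every character

tp = sum(t and p for t, p in zip(truth, pred))
precision = tp / sum(pred)           # 10 / 100 = 0.1, the plagiarism fraction
recall = tp / sum(truth)             # 10 / 10  = 1.0 by construction
print(precision, recall)
```

Any detector that flags less than everything trades some of that guaranteed recall for precision, which is exactly the trade-off the ranking in Table 3 reflects.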
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Overall Detection Results</head><p>To determine the overall winner of the competition, we computed the combined detection performance of each participant on the competition corpora of both tasks. Table <ref type="table">4</ref> shows the results. Note that the competition corpus of the external plagiarism detection task is considerably larger than that of the intrinsic plagiarism detection task, which is why the top-ranked approaches are those that performed best in the former task. The overall winner of the competition is the approach of <ref type="bibr" target="#b3">Grozea, Gehl, and Popescu (2009)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Summary</head><p>The 1st International Competition on Plagiarism Detection fostered research and yielded a number of new insights into the problems of automatic plagiarism detection and its evaluation. An important by-product of the competition is a controlled large-scale evaluation framework consisting of a corpus of artificial plagiarism cases and new detection quality measures. The corpus contains more than 40 000 documents and about 94 000 cases of plagiarism. Furthermore, in this paper we give a comprehensive overview of the competition and in particular of the plagiarism detection approaches of the competition's 13 participants. It turns out that all of the detection approaches follow a generic retrieval process scheme consisting of three steps: heuristic retrieval, detailed analysis, and knowledge-based post-processing. To substantiate this, we have compiled a unified summary of the top approaches in Table <ref type="table">1</ref>.</p><p>The competition was divided into two tasks: external plagiarism detection and intrinsic plagiarism detection. The winning approach for the former task achieves 0.74 precision at 0.65 recall at 1.00 granularity. The winning approach for the latter task improves 26% upon the baseline and achieves 0.23 precision at 0.46 recall at 1.38 granularity.</p></div>			</div>
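The overall scores in Tables 2 and 3 combine the F-measure with the granularity; a formula consistent with every row of Table 2 (inferred from the published numbers, not quoted from the organizers' code) is overall = F / log2(1 + granularity), so a granularity of 1 leaves F unchanged and coarser detections are penalized logarithmically. A minimal Python sketch:

```python
import math

def overall_score(f_measure: float, granularity: float) -> float:
    """Combine F-measure and granularity into one rank-determining score.

    Assumed form: F / log2(1 + granularity). With granularity 1 the score
    equals F; larger granularity (fragmented detections) lowers the score.
    """
    return f_measure / math.log2(1 + granularity)

# Top-ranked external detection entry (F = 0.6976, granularity = 1.0038):
print(round(overall_score(0.6976, 1.0038), 4))  # matches the listed 0.6957
```

Checking a few more rows of Table 2 (e.g. F = 0.6190 at granularity 19.4327 gives 0.1422) reproduces the listed overall scores to four decimals.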
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">James</forename><surname>Allen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Dallas, USA</pubPlace>
		</imprint>
		<respStmt>
			<orgName>From the Southern Methodist University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">A Plagiarism Detection Procedure in Three Steps: Selection, Matches and &quot;Squares&quot;</title>
		<author>
			<persName><forename type="first">Chiara</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Benedetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emanuele</forename><surname>Caglioti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giampaolo</forename><surname>Cristadoro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mirko</forename><surname>Degli Esposti</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Old and new challenges in automatic plagiarism detection</title>
		<author>
			<persName><forename type="first">Paul</forename><surname>Clough</surname></persName>
		</author>
		<ptr target="http://ir.shef.ac.uk/cloughie/papers/pasplagiarism.pdf" />
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
		<respStmt>
			<orgName>National UK Plagiarism Advisory Service</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Cristian</forename><surname>Grozea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Gehl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marius</forename><surname>Popescu</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Barak</forename><surname>Hagbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Moshe</forename><surname>Koppel</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Israel</pubPlace>
		</imprint>
		<respStmt>
			<orgName>From the Bar Ilan University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Finding Plagiarism by Evaluating Document Similarities</title>
		<author>
			<persName><forename type="first">Jan</forename><surname>Kasprzak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michal</forename><surname>Brandejs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miroslav</forename><surname>Křipač</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Tackling the PAN&apos;09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector</title>
		<author>
			<persName><forename type="first">James</forename><forename type="middle">A</forename><surname>Malcolm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><forename type="middle">C R</forename><surname>Lane</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Plagiarism -a survey</title>
		<author>
			<persName><forename type="first">Hermann</forename><surname>Maurer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Kappe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bilal</forename><surname>Zaka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Universal Computer Science</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1050" to="1084" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Intrinsic plagiarism detection</title>
		<author>
			<persName><forename type="first">Sven</forename><surname>Meyer Zu Eissen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Information Retrieval (ECIR 2006)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">Mounia</forename><surname>Lalmas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Andy</forename><surname>Macfarlane</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Stefan</forename><forename type="middle">M</forename><surname>Rüger</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Anastasios</forename><surname>Tombros</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Theodora</forename><surname>Tsikrika</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Alexei</forename><surname>Yavlinsky</surname></persName>
		</editor>
		<meeting>the European Conference on Information Retrieval (ECIR 2006)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">3936</biblScope>
			<biblScope unit="page" from="565" to="569" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">External and Intrinsic Plagiarism Detection Using Vector Space Models</title>
		<author>
			<persName><forename type="first">Markus</forename><surname>Muhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mario</forename><surname>Zechner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roman</forename><surname>Kern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Granitzer</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Yurii</forename><forename type="middle">Anatol'yevich</forename><surname>Palkovskii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexei</forename><forename type="middle">Vitalievich</forename><surname>Belov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Irina</forename><forename type="middle">Alexandrovna</forename><surname>Muzika</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Ukraine</pubPlace>
		</imprint>
		<respStmt>
			<orgName>From the Zhytomyr State University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Submission to the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Rafael</forename><forename type="middle">C</forename><surname>Pereira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Moreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Galante</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
		<respStmt>
			<orgName>From the Universidade Federal do Rio Grande do Sul, Brazil</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Using Microsoft SQL Server Platform for Plagiarism Detection</title>
		<author>
			<persName><forename type="first">Vladislav</forename><surname>Scherbinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Butakov</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Intrinsic Plagiarism Detection Using Complexity Analysis</title>
		<author>
			<persName><forename type="first">Leanne</forename><surname>Seaward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Matwin</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Intrinsic Plagiarism Detection Using Character n-gram Profiles</title>
		<author>
			<persName><forename type="first">Efstathios</forename><surname>Stamatatos</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Principles of hash-based text retrieval</title>
		<author>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th Annual International ACM SIGIR Conference</title>
				<editor>
			<persName><forename type="first">Charles</forename><surname>Clarke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Norbert</forename><surname>Fuhr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Noriko</forename><surname>Kando</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Wessel</forename><surname>Kraaij</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Arjen</forename><surname>De Vries</surname></persName>
		</editor>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007-07">July 2007</date>
			<biblScope unit="page" from="527" to="534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Strategies for Retrieving Plagiarized Documents</title>
		<author>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sven</forename><surname>Meyer Zu Eissen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Potthast</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th Annual International ACM SIGIR Conference</title>
				<editor>
			<persName><forename type="first">Charles</forename><surname>Clarke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Norbert</forename><surname>Fuhr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Noriko</forename><surname>Kando</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Wessel</forename><surname>Kraaij</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Arjen</forename><surname>De Vries</surname></persName>
		</editor>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007-07">July 2007</date>
			<biblScope unit="page" from="825" to="826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m">Proceedings of the SEPLN Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, PAN&apos;09</title>
		<editor>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Efstathios</forename><surname>Stamatatos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Moshe</forename><surname>Koppel</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Eneko</forename><surname>Agirre</surname></persName>
		</editor>
		<meeting><address><addrLine>Donostia-San Sebastián, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Universidad Politécnica de Valencia</publisher>
			<date type="published" when="2009-09-10">September 10, 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Putting Ourselves in SME&apos;s Shoes: Automatic Detection of Plagiarism by the WCopyFind tool</title>
		<author>
			<persName><forename type="first">Enrique</forename><surname>Vallés Balaguer</surname></persName>
		</author>
		<editor>Stein et al.</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Plagiarism detection software test</title>
		<author>
			<persName><forename type="first">Debora</forename><surname>Weber-Wulff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katrin</forename><surname>Köhler</surname></persName>
		</author>
		<ptr target="http://plagiat.htw-berlin.de/software/2008/" />
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia</title>
		<ptr target="http://www.webis.de/research/" />
	</analytic>
	<monogr>
		<title level="m">PAN Plagiarism Corpus PAN-PC-09</title>
				<editor>
			<persName><forename type="first">Martin</forename><surname>Potthast</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Andreas</forename><surname>Eiselt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Alberto</forename><surname>Barrón-Cedeño</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
