<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the 1st International Competition on Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Eiselt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Web Technology &amp; Information Systems Group, Bauhaus-Universität Weimar</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <volume>3936</volume>
      <fpage>1</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism detection and intrinsic plagiarism detection, which were tackled by 13 participating groups. An important by-product of the competition is an evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and detection quality measures. The framework may serve as a unified test environment to compare future plagiarism detection research. In this paper we describe the corpus design and the quality measures, survey the detection approaches developed by the participants, and compile the achieved performance results of the competitors.</p>
      </abstract>
      <kwd-group>
        <kwd>Plagiarism Detection</kwd>
        <kwd>Competition</kwd>
        <kwd>Evaluation Framework</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Plagiarism and its automatic retrieval have
attracted considerable attention from
research and industry: various papers have
been published on the topic, and many
commercial software systems are being
developed. However, when asked to name the best
algorithm or the best system for plagiarism
detection, one finds hardly any evidence
on which to base an educated guess among the
alternatives. One reason for this is that the
research field of plagiarism detection lacks
a controlled evaluation environment. This
leads researchers to devise their own
experimentation and methodologies, which are
often not reproducible or comparable across
papers. Furthermore, it is unknown what
detection quality can at least be expected
from a plagiarism detection system.</p>
      <p>To close this gap we have organized an
international competition on plagiarism
detection. We have set up, presumably for the first
time, a controlled evaluation environment for
plagiarism detection
(http://www.webis.de/research/workshopseries/pan-09)
which consists of a
large-scale corpus of artificial plagiarism and
detection quality measures. In what follows we
overview the corpus, the quality measures,
the participants’ detection approaches, and
their obtained results.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Research on plagiarism detection has been
surveyed by Maurer, Kappe, and Zaka (2006)
and Clough (2003). Particularly the latter
provides well thought-out insights into, even
today, “[...] new challenges in automatic
plagiarism detection”, among which the need for
a standardized evaluation framework is
already mentioned.</p>
      <p>
        With respect to the evaluation of
commercial plagiarism detection systems,
        <xref ref-type="bibr" rid="ref8">Weber-Wulff and Köhler (2008</xref>
        ) have conducted a
manual evaluation: 31 handmade cases of
plagiarism were submitted to 19 systems.
The sources for the plagiarism cases were
selected from the Web and the systems were
judged by their capability to retrieve them.
Due to the use of the Web, the experiment
is not controlled which limits reproducibility,
and since each case is only about two pages
long there are concerns with respect to the
study’s representativeness. However,
commercial systems are usually not available for
a close inspection, which may leave no other
choice to evaluate them.</p>
      <p>[Figure 1: The generic three-step retrieval process for external plagiarism detection: given a suspicious document dq and a reference collection D, heuristic retrieval selects candidate documents, a detailed analysis compares dq with them, and a knowledge-based post-processing filters the resulting suspicious sections.]</p>
    </sec>
    <sec id="sec-3">
      <title>Plagiarism Detection</title>
      <p>The literature on the subject often puts
plagiarism detection on a level with the
identification of highly similar sections in texts
or other objects. But this does not show
the whole picture. From our point of view
plagiarism detection divides into two major
problem classes, namely external plagiarism
detection and intrinsic plagiarism detection.
Both include a number of
subproblems, and the frequently mentioned
step-by-step comparison of two documents is only one
of them.</p>
      <p>For external plagiarism detection Stein,
Meyer zu Eissen, and Potthast (2007)
introduce a generic three-step retrieval process.
The authors consider that the source of a
plagiarism case may be hidden in a large
reference collection, as well as that the detection
results may not be perfectly accurate.
Figure 1 illustrates this retrieval process. In fact,
all detection approaches submitted by the
competition participants can be explained in
terms of these building blocks (cf. Section 4).</p>
      <p>The process starts with a suspicious
document dq and a collection D of documents
from which dq’s author may have plagiarized.
Within a so-called heuristic retrieval step a
small number of candidate documents Dx,
which are likely to be sources for plagiarism,
are retrieved from D. Note that D is usually
very large, e.g., in the size of the Web, so
that it is impractical to compare dq one after
the other with each document in D. Then,
within a so-called detailed analysis step, dq
is compared section-wise with the retrieved
candidates. All pairs of sections (sq, sx) with
sq ∈ dq and sx ∈ dx, dx ∈ Dx, are to be
retrieved such that sq and sx have a high
similarity under some retrieval model. In a
knowledge-based post-processing step those
sections are filtered for which certain
exclusion criteria hold, such as the use of proper
citation or literal speech. The remaining
suspicious sections are presented to a human,
who may decide whether or not a plagiarism
offense is given.</p>
      <p>Intrinsic plagiarism detection has been
studied in detail by Meyer zu Eissen and
Stein (2006). In this setting one is given a
suspicious document dq but no reference
collection D. Technology that tackles instances
of this problem class resembles the human
ability to spot potential cases of plagiarism
just by reading dq.</p>
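      <p>One simple realization of this idea, given as a sketch (this is not the approach of Meyer zu Eissen and Stein; the character-trigram style model, window size, and threshold are assumptions), flags windows of dq whose writing style deviates from the document as a whole:</p>

```python
from collections import Counter
import math

def profile(text: str, n: int = 3) -> Counter:
    # Character n-gram frequencies as a crude style model.
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def dissimilarity(p: Counter, q: Counter) -> float:
    # Cosine dissimilarity between two style profiles.
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return 1.0 - (dot / norm if norm else 0.0)

def suspicious_windows(dq: str, size: int = 200, threshold: float = 0.6):
    # Flag windows whose style deviates strongly from all of dq.
    doc = profile(dq)
    flags = []
    for start in range(0, max(1, len(dq) - size + 1), size):
        window = dq[start:start + size]
        if dissimilarity(profile(window), doc) > threshold:
            flags.append((start, start + len(window)))
    return flags
```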
    </sec>
    <sec id="sec-4">
      <title>Competition Agenda</title>
      <p>We have set up a large-scale corpus
(Dq, D, S) of “artificial plagiarism” cases for
the competition, where Dq is a collection of
suspicious documents, D is a collection of
source documents, and S is the set of
annotations of all plagiarism cases between Dq and
D. The competition was divided into two tasks
and two phases, for which the corpus was
split into four parts, one for each
combination of task and phase. For simplicity
the sub-corpora are not denoted by different
symbols.</p>
      <sec id="sec-4-1">
        <title>Competition tasks and phases:</title>
        <p>• External Plagiarism Detection Task.</p>
        <p>Given Dq and D the task is to identify
the sections in Dq which are plagiarized,
and their source sections in D.
• Intrinsic Plagiarism Detection Task.</p>
        <p>Given only Dq the task is to identify the
plagiarized sections.
• Training Phase. Release of a training
corpus (Dq, D, S) to allow for the
development of a plagiarism detection system.
• Competition Phase. Release of a
competition corpus (Dq, D) whose plagiarism
cases were to be detected and submitted
as detection annotations, R.</p>
        <p>Participants were allowed to compete in
either of the two tasks or both. After the
competition phase the participants’
detections were evaluated, and the winner of each
task as well as an overall winner was
determined as that participant whose detections R
best matched S in the respective competition
corpora.
</p>
        <sec id="sec-4-1-1">
          <title>Plagiarism Corpus</title>
          <p>
            The PAN plagiarism corpus, PAN-PC-09,
comprises 41 223 text documents in which
94 202 cases of artificial plagiarism have been
inserted automatically
            <xref ref-type="bibr" rid="ref3 ref5">(Webis at
Bauhaus-Universität Weimar and NLEL at
Universidad Politécnica de Valencia, 2009)</xref>
            . The
corpus is based on 22 874 book-length
documents from Project Gutenberg. All
documents are, to the best of our knowledge,
public domain; therefore the corpus is
available free of charge to other researchers.
Important parameters of the corpus are the
following:
• Document Length. 50% of the
documents are small (1-10 pages), 35%
medium (10-100 pages), and 15% large
(100-1000 pages).
• Suspicious-to-Source Ratio. 50% of the
documents are designated as suspicious
documents Dq, and 50% are designated
as source documents D (see Figure 2).
• Plagiarism Percentage. The
percentage θ of plagiarism per suspicious
document dq ∈ Dq ranges from 0% to 100%,
whereas 50% of the suspicious
documents contain no plagiarism at all.
Figure 3 shows the distribution of the
plagiarized documents for the external test
corpus. For the intrinsic test corpus, the
hatched part of the distribution applies.
• Plagiarism Length. The length of a
plagiarism case is evenly distributed
between 50 words and 5000 words.
</p>
          <p>[Figure 2: Corpus partition into suspicious documents (with and without plagiarism) and source documents.]</p>
          <p>• Plagiarism Languages. 90% of the cases
are monolingual English plagiarism, the
remainder of the cases are cross-lingual
plagiarism which were translated
automatically from German and Spanish to
English.
• Plagiarism Obfuscation. The
monolingual portion of the plagiarism in the
external test corpus was obfuscated (cf.
Section 2.1). The degree of obfuscation
ranges evenly from none to high.</p>
          <p>Note that for the estimation of the
parameter distributions one cannot fall back on
large case studies on real plagiarism cases.
Hence, we decided to construct more
simple cases than complex ones, where “simple”
refers to short lengths, a small percentage
θ, and less obfuscation. However, complex
cases are overrepresented to allow for a
better judgement of whether a system detects them
properly.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Obfuscation Synthesis</title>
      <p>Plagiarists often modify or rewrite the
sections they copy in order to obfuscate the
plagiarism. In this respect, the automatic
synthesis of plagiarism obfuscation we applied
when constructing the corpus is of particular
interest. The respective synthesis task reads
as follows: given a section of text sx, create a
section sq which has a high content similarity
to sx under some retrieval model but with a
(substantially) different wording than sx.</p>
      <p>An optimal obfuscation synthesizer, i.e.,
an automatic plagiarist, takes an sx and
creates an sq which is human-readable and
which evokes the same ideas as sx
does when read by a human. Today, such
a synthesizer cannot be constructed.
Therefore, we approach the task from the basic
understanding of content similarity in
information retrieval, namely the bag-of-words
model. By allowing our obfuscation
synthesizers to construct texts which are not
necessarily human-readable they can be greatly
simplified. We have set up three heuristics to
construct sq from sx:
• Random Text Operations. Given sx, sq
is created by shuffling, removing,
inserting, or replacing words or short phrases
at random. Insertions and replacements
are, for instance, taken from the
document dq, the new context of sq.
• Semantic Word Variation. Given sx, sq
is created by replacing each word by one
of its synonyms, antonyms, hyponyms,
or hypernyms, chosen at random. A
word is retained if no alternative is available.
• POS-preserving Word Shuffling. Given
sx, its sequence of parts of speech (POS)
is determined. Then, sq is created by
shuffling words at random while the
original POS sequence is maintained.</p>
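      <p>The first heuristic can be sketched as follows. This is an illustrative sketch, not the corpus construction code; the number of operations, the uniform choice among them, and the fixed seed are assumptions:</p>

```python
import random

def random_text_operations(sx: str, dq_words: list, ops: int = 10,
                           seed: int = 0) -> str:
    # Obfuscate sx by shuffling, removing, inserting, or replacing words
    # at random.
    rng = random.Random(seed)
    words = sx.split()
    for _ in range(ops):
        op = rng.choice(["shuffle", "remove", "insert", "replace"])
        if op == "shuffle" and len(words) > 2:
            i = rng.randrange(len(words) - 1)
            words[i], words[i + 1] = words[i + 1], words[i]
        elif op == "remove" and len(words) > 1:
            words.pop(rng.randrange(len(words)))
        elif op == "insert":
            # Insertions are taken from dq, the new context of sq.
            words.insert(rng.randrange(len(words) + 1), rng.choice(dq_words))
        elif op == "replace" and words:
            words[rng.randrange(len(words))] = rng.choice(dq_words)
    return " ".join(words)

sq = random_text_operations("one two three four five", ["context"])
```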
    </sec>
    <sec id="sec-6">
      <title>Critical Remarks</title>
      <p>The corpus has been conceived and
constructed only just in time for the competition
so that there may still be errors in it. For
instance, the participants pointed out that
there are a number of unintended overlaps
between unrelated documents. These
accidental similarities do not occur frequently, so
that an additional set of annotations solves
this problem.</p>
      <p>The obfuscation synthesizer based on
random text operations produces anomalies in
some of the obfuscated texts, such as
sequences of punctuation marks and stop
words. These issues were not entirely
resolved so that it is possible to find some of
the plagiarism cases by applying a kind of
anomaly detection. Nevertheless, this was
not observed during the competition.</p>
      <p>Finally, by construction the corpus does
not accurately simulate a heuristic retrieval
situation in which the Web is used as
reference collection. The source documents in the
corpus do not resemble the Web
appropriately. Note, however, that sampling the Web
is also a problem for many ranking evaluation
frameworks.
</p>
    </sec>
    <sec id="sec-6b">
      <title>Detection Quality Measures</title>
      <p>A measure that quantifies the performance of
a plagiarism detection algorithm will
resemble concepts in terms of precision and recall.
However, these concepts cannot be
transferred one-to-one from the classical
information retrieval situation to plagiarism
detection. This section explains the underlying
connections and introduces a reasonable
measure that accounts for the particularities.</p>
      <p>Let dq be a plagiarized document; dq
defines a sequence of characters each of
which is either labeled as plagiarized or
nonplagiarized. A plagiarized section s forms a
contiguous sequence of plagiarized characters
in dq. The set of all plagiarized sections in dq
is denoted by S, where ∀si, sj ∈ S : i ≠ j →
(si ∩ sj = ∅), i.e., the plagiarized sections do
not intersect. Likewise, the set of all sections
r ⊂ dq found by a plagiarism detection
algorithm is denoted by R. See Figure 4 for an
illustration.</p>
      <p>[Figure 4: A document dq as character sequence, with plagiarized sections s1, s2, s3 ∈ S and detected sections r1, ..., r5 ∈ R.]</p>
      <p>If the characters in dq are considered as
basic retrieval units, precision and recall for
a given dq, S, R compute straightforwardly.
This view may be called micro-averaged or
system-oriented. For the situation shown in
Figure 4 the micro-averaged precision is 8/16,
likewise, the micro-averaged recall is 8/13.
The advantage of a micro-averaged view is its
clear computational semantics, which comes
at a price: given an imbalance in the lengths
of the elements in S—which usually
correlates with the detection difficulty of a
plagiarized section—the explanatory power of the
computed measures is limited.</p>
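      <p>With characters as basic retrieval units, the micro-averaged measures reduce to set arithmetic on character positions. In the sketch below, the intervals are chosen so that, like the example values for Figure 4, the precision is 8/16 and the recall is 8/13; the concrete intervals themselves are assumptions:</p>

```python
def chars(sections):
    # Expand [start, end) character intervals into a set of positions.
    return {i for start, end in sections for i in range(start, end)}

def micro_precision(S, R):
    detected = chars(R)
    return len(chars(S) & detected) / len(detected) if detected else 0.0

def micro_recall(S, R):
    plagiarized = chars(S)
    return len(plagiarized & chars(R)) / len(plagiarized) if plagiarized else 0.0

S = [(0, 10), (20, 23)]   # plagiarized sections, 13 characters in total
R = [(5, 15), (20, 26)]   # detections, 16 characters in total
```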
      <p>It is more natural to treat the contiguous
sequences of plagiarized characters as basic
retrieval units. In this sense each si ∈ S
defines a query qi for which a plagiarism
detection algorithm returns a result set Ri ⊆ R.
This view may be called macro-averaged or
user-oriented. The recall of a plagiarism
detection algorithm, r ecPDA, is then defined as
the mean of the returned fractions of the
plagiarized sections, averaged over all sections
in S:
r ecPDA(S, R) =
1
|S| s∈S
|s</p>
      <p>r∈R r| ,
|s|
(1)
where computes the positionally
overlapping characters.</p>
      <p>Problem 1. The precision of a plagiarism
detection algorithm is not defined under the
macro-averaged view, which is rooted in the
fact that a detection algorithm does not
return a unique result set for each plagiarized
section s ∈ S. This deficit can be resolved
by switching the reference basis. Instead of
the plagiarized sections, S, the
algorithmically determined sections, R, become the
targets: the precision with which the queries in
S are answered is identified with the recall
of R under S. By computing the mean
average over the r ∈ R one obtains a definite
computation rule that captures the concept
of retrieval precision for S:</p>
      <p>precPDA(S, R) = (1/|R|) · Σ_{r∈R} |r ⊓ (∪_{s∈S} s)| / |r|,   (2)</p>
      <p>where ⊓ computes the positionally
overlapping characters. The domain of precPDA is
[0, 1]; in particular it can be shown that this
definition quantifies the necessary properties
of a precision statistic.</p>
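      <p>Equations (1) and (2) can be sketched as follows, representing each section as a [start, end) character interval. The sketch assumes that the detections in R (and likewise the sections in S) do not overlap each other, so the union inside the numerator reduces to a sum:</p>

```python
def overlap(a, b):
    # Positionally overlapping characters of two intervals [start, end).
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def rec_pda(S, R):
    # Equation (1): mean detected fraction over all plagiarized sections s.
    return sum(sum(overlap(s, r) for r in R) / (s[1] - s[0]) for s in S) / len(S)

def prec_pda(S, R):
    # Equation (2): mean correct fraction over all detections r.
    return sum(sum(overlap(r, s) for s in S) / (r[1] - r[0]) for r in R) / len(R)

S = [(0, 10), (20, 23)]   # plagiarized sections
R = [(5, 15), (20, 26)]   # detections
```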
      <p>Problem 2. Both the micro-averaged view
and the macro-averaged view are insensitive
to the number of times an s ∈ S is detected
in a detection result R, i.e., the granularity of
R. We define the granularity of R for a set of
plagiarized sections S by the average size of
the existing covers: a detection r ∈ R belongs
to the cover Cs of an s ∈ S iff s and r overlap.
(In (Stein, 2007) this idea is mathematically
derived as “precision stress” and “recall stress”.)
Let SR ⊆ S denote the set of detected cases,
i.e., for each s ∈ SR : |Cs| &gt; 0. The granularity of
R given S is defined as follows:
granPDA(S, R) = (1/|SR|) · Σ_{s∈SR} |Cs|,   (3)</p>
      <p>where SR = {s | s ∈ S ∧ ∃r ∈ R : s ∩ r ≠ ∅}
and Cs = {r | r ∈ R ∧ s ∩ r ≠ ∅}. The
domain of the granularity is [1, |R|], where
1 marks the desirable one-to-one
correspondence between R and S, and where |R| marks
the worst case, when a single s ∈ S is
detected over and over again.</p>
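      <p>Equation (3), together with the overall score that combines it with precision and recall, can be sketched as follows; returning a granularity of 1 for an empty SR is an assumption of this sketch:</p>

```python
import math

def overlaps(a, b):
    # Two intervals [start, end) overlap iff their intersection is non-empty.
    return min(a[1], b[1]) > max(a[0], b[0])

def gran_pda(S, R):
    # Equation (3): average number of detections covering a detected case.
    covers = [sum(1 for r in R if overlaps(s, r)) for s in S]
    detected = [c for c in covers if c > 0]
    return sum(detected) / len(detected) if detected else 1.0

def overall_pda(prec, rec, gran):
    # Overall score: F-Measure damped by the logarithm of the granularity.
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f / math.log2(1 + gran)
```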
      <p>The measures (1), (2), and (3) are
combined into an overall score:</p>
      <p>overallPDA(S, R) = F / log2(1 + granPDA),</p>
      <p>where F denotes the F-Measure, i.e., the
harmonic mean of the precision precPDA and the
recall recPDA. To smooth the influence of the
granularity on the overall score we take its
logarithm.
</p>
    </sec>
    <sec id="sec-6c">
      <title>Survey of Detection Approaches</title>
      <p>For the competition, 13 participants
developed plagiarism detection systems to tackle
one or both of the tasks external plagiarism
detection and intrinsic plagiarism detection.
The questions that naturally arise are: how do
they work, and how well? To give an answer,
we survey the approaches in a unified way
and report on their detection quality in the
competition.
</p>
    </sec>
    <sec id="sec-7">
      <title>External Plagiarism Detection</title>
      <p>Most of the participants competed in the
external plagiarism detection task of the
competition; detection results were submitted for
10 systems. As it turns out, all systems
are based on common approaches—although
they perform very differently.</p>
      <p>As explained at the outset, external
plagiarism detection divides into three steps (cf.
Figure 1): the heuristic retrieval step, the
detailed analysis step, and the post-processing
step. Table 1 summarizes the participants’
detection approaches in terms of these steps.
However, the post-processing step is
omitted here since none of the participants
applied noteworthy post-processing. Each row
of the table summarizes one system; we
restrict the survey to the top 6 systems since
the overall performance of the remaining
systems is negligible. Nevertheless, these
systems also implement the generic three-step
process. The focus of this survey is on
describing algorithmic and retrieval aspects
rather than implementation details; the
latter are diverse in terms of applied languages,
software, and runtime efficiency, and
descriptions can be found in the respective
references.</p>
      <p>[Table 1: External plagiarism detection approaches of the top systems. (1) Heuristic retrieval: character-16-gram VSM (frequency weights, cosine similarity), exhaustive comparison of Dq and D, candidates Dx ⊂ D: the 51 documents most similar to dq. Detailed analysis: exact character-16-gram matches; the distances of adjacent matches are computed, matches are joined based on a Monte Carlo optimization, and the obtained section pairs are refined, e.g., by discarding too small sections. (2) Heuristic retrieval: word-5-gram VSM (boolean weights, Jaccard similarity), exhaustive comparison, candidates: documents which share at least 20 n-grams with dq. Detailed analysis: exact word-5-gram matches; pairs of sections (sq, sx) of maximal size are extracted which share at least 20 matches, including the first and the last n-gram of sq and sx, and for which 2 adjacent matches are at most 49 non-matching n-grams apart. (3) Heuristic retrieval: word-8-gram VSM (frequency weights, custom distance), exhaustive comparison, candidates: the 10 documents nearest to dq. Detailed analysis: exact word-8-gram matches; section pairs are obtained by greedily joining consecutive matches if their distance is not too high.]</p>
      <p>[Table 1, further systems: (4) Heuristic retrieval candidates: for each sentence of dq, the documents from the 2 most similar partitions which share similar sentences. (5) Heuristic retrieval: winnowing fingerprinting over 50-character chunks with 30 characters overlap, exhaustive comparison, candidates: documents whose fingerprints share at least one value with dq’s fingerprint. Detailed analysis: exact matches of fingerprint chunks; pairs of sections (sq, sx) are obtained by enlarging matches and joining adjacent matches, where gaps must be below a certain Levenshtein distance.]</p>
      <p>The heuristic retrieval step (column 1 of
Table 1) involves the comparison of the
corpus’ suspicious documents Dq with the source
documents D. For this, each participant
employs a specific retrieval model, a comparison
strategy, and a heuristic to select the
candidate documents Dx from D. Most of the
participants use a variation of the well-known
vector space model (VSM) as retrieval model,
where the tokens are often character or
word n-grams instead of single words. As
comparison strategy, the top 3 approaches
perform an exhaustive comparison of Dq and
D, i.e., each dq ∈ Dq is compared with each
dx ∈ D in time O(|Dq| · |D|), while the
remaining approaches employ data
partitioning and space partitioning techniques to
achieve lower runtime complexities. To
select the candidate documents Dx for a dq,
either its k nearest neighbors are selected or
the documents which exceed a certain
similarity threshold.</p>
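      <p>The two candidate selection strategies can be sketched as follows. The word-5-gram tokens and the threshold of 20 shared n-grams mirror Table 1, while everything else, including the toy documents, is an assumption:</p>

```python
def word_ngrams(text, n=5):
    # Word n-grams as retrieval model tokens.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def candidates_by_threshold(dq, D, min_shared=20, n=5):
    # Select documents which share at least min_shared n-grams with dq.
    q = word_ngrams(dq, n)
    return [name for name, dx in D.items()
            if len(q & word_ngrams(dx, n)) >= min_shared]

def candidates_knn(dq, D, k=10, n=5):
    # Alternatively, select the k documents sharing the most n-grams with dq.
    q = word_ngrams(dq, n)
    return sorted(D, key=lambda name: len(q & word_ngrams(D[name], n)),
                  reverse=True)[:k]

common = " ".join(f"w{i}" for i in range(30))   # 26 shared word-5-grams
D = {"a": common, "b": "x y z"}
```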
      <p>The detailed analysis step (column 2 of
Table 1) involves the comparison of each
dq ∈ Dq with its respective candidate
documents Dx in order to extract pairs of
sections (sq, sx), where sq ∈ dq and sx ∈ dx,
dx ∈ Dx, from them which are highly
similar, if any. For this, each participant first
extracts all exact matches between dq and dx
and then merges the matches heuristically to
form suspicious sections (sq, sx). While each
participant uses the same type of token for
extracting exact matches as in his respective
retrieval model of the heuristic retrieval step,
the match merging heuristics differ largely
from one another. However, it can be said
that in most approaches a kind of distance
between exact matches is measured first, and
then a custom algorithm is employed which
clusters the matches into sections.</p>
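      <p>A minimal sketch of such a merging heuristic, in the spirit of the approaches in Table 1: matches are given by their character positions in dq, and consecutive matches are joined greedily. The gap threshold and the minimum number of matches per section are assumptions:</p>

```python
def merge_matches(positions, max_gap=50, min_len=2):
    # Greedily join consecutive match positions into sections if their
    # distance stays within max_gap; discard sections with too few matches.
    sections = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_len:
                sections.append((current[0], current[-1]))
            current = []
        current.append(pos)
    if len(current) >= min_len:
        sections.append((current[0], current[-1]))
    return sections
```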
      <p>Table 2 lists the detection performance
results of all approaches, computed with the
quality measures introduced in Section 3.
Observe that the approach with top
precision is the one on rank 6 which is based on
fingerprinting, the approach with top recall is
the one on rank 2, and the approach with top
granularity is the one on rank 1. The latter is
also the winner of this task since it provides
the best trade-off between the three quality
measures.</p>
    </sec>
    <sec id="sec-8">
      <title>Intrinsic Plagiarism Detection</title>
      <p>
The intrinsic plagiarism detection task has
attracted less attention than external
plagiarism detection; detection results were
submitted for 4 systems. Table 3 lists their
detection performance results. Unlike in external
plagiarism detection, in this task the baseline
performance is not 0. The reason for this is
that intrinsic plagiarism detection is a
one-class classification problem in which it has
to be decided for each section of a document
whether it is plagiarized, or not. The baseline
performance in such problems is commonly
computed as the naive assumption that
everything belongs to the target class, which
is also what Hagbi and Koppel (2009) did,
classifying almost everything as
plagiarized. Interestingly, the baseline approach
is on rank 2 while two approaches perform
worse than the baseline. Only the approach
of
        <xref ref-type="bibr" rid="ref7">Stamatatos (2009)</xref>
        performs better than
the baseline.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Overall Detection Results</title>
      <p>To determine the overall winner of the
competition, we have computed the combined
detection performance of each participant on
the competition corpora of both tasks.
Table 4 shows the results. Note that the
competition corpus of the external plagiarism
detection task is a lot bigger than the one for
the intrinsic plagiarism detection task, which
is why the top-ranked approaches are those
that performed best in the former task. The
overall winner of the competition is the approach
of Grozea, Gehl, and Popescu (2009).</p>
      <sec id="sec-9-1">
        <title>Summary</title>
        <p>The 1st International Competition on
Plagiarism Detection fostered research and brought
a number of new insights into the problems of
automatic plagiarism detection and its
evaluation. An important by-product of the
competition is a controlled large-scale evaluation
framework which consists of a corpus of
artificial plagiarism cases and new detection
quality measures. The corpus contains more than
40 000 documents and about 94 000 cases of
plagiarism.</p>
        <p>Furthermore, in this paper we give a
comprehensive overview about the competition
and in particular about the plagiarism
detection approaches of the competition’s 13
participants. It turns out that all of the
detection approaches follow a generic retrieval
process scheme which consists of the three
steps heuristic retrieval, detailed analysis,
and knowledge-based post-processing. To
ascertain this fact we have compiled a unified
summary of the top approaches in Table 1.</p>
        <p>The competition was divided into the two
tasks external plagiarism detection and
intrinsic plagiarism detection. The winning
approach for the former task achieves 0.74
precision at 0.65 recall at 1.00 granularity.
The winning approach for the latter task
improves 26% above the baseline approach and
achieves 0.23 precision at 0.46 recall at 1.38
granularity.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Acknowledgements</title>
        <p>We thank Yahoo! Research and the
University of the Basque Country for their
sponsorship. This work was also partially funded by
the Text-Enterprise 2.0
TIN2009-13391-C0403 project and the CONACYT-MEXICO
192021 grant. Our special thanks go to the
participants of the competition for their
devoted work.</p>
        <p>
Clough, Paul. 2003. Old and new challenges in
automatic plagiarism detection. National UK
Plagiarism Advisory Service,
http://ir.shef.ac.uk/cloughie/papers/pas plagiarism.pdf.</p>
        <p>Stein, Benno. 2007. Principles of hash-based
text retrieval. In Charles Clarke, Norbert Fuhr,
Noriko Kando, Wessel Kraaij, and Arjen de
Vries, editors, 30th Annual International ACM
SIGIR Conference, pages 527–534. ACM, July.</p>
        <p>Grozea, Cristian, Christian Gehl, and Marius
Popescu. 2009. ENCOPLOT: Pairwise Sequence
Matching in Linear Time Applied to Plagiarism
Detection. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
        <p>Hagbi, Barak and Moshe Koppel. 2009.
Submission to the 1st International Competition
on Plagiarism Detection. From the Bar Ilan
University, Israel.</p>
        <p>
          Kasprzak, Jan, Michal Brandejs, and Miroslav
Křipač. 2009. Finding Plagiarism by Evaluating
Document Similarities. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et
al., 2009)</xref>
          .
        </p>
        <p>Malcolm, James A. and Peter C. R. Lane. 2009.
          Tackling the PAN’09 External Plagiarism
Detection Corpus with a Desktop Plagiarism
Detector. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
        <p>Maurer, Hermann, Frank Kappe, and Bilal
Zaka. 2006. Plagiarism - a survey. Journal of
Universal Computer Science, 12(8):1050–1084.</p>
        <p>
          Muhr, Markus, Mario Zechner, Roman Kern,
and Michael Granitzer. 2009. External and
Intrinsic Plagiarism Detection Using Vector
Space Models. In Stein et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
        <p>Palkovskii, Yurii Anatol’yevich,
Alexei Vitalievich Belov, and
Irina Alexandrovna Muzika. 2009. Submission
to the 1st International Competition on
Plagiarism Detection. From the Zhytomyr State
University, Ukraine.</p>
        <p>Stein, Benno, Sven Meyer zu Eissen, and Martin
Potthast. 2007. Strategies for Retrieving
Plagiarized Documents. In Charles Clarke,
Norbert Fuhr, Noriko Kando, Wessel Kraaij, and
Arjen de Vries, editors, 30th Annual
International ACM SIGIR Conference, pages
825–826. ACM, July.</p>
        <p>Stein, Benno, Paolo Rosso, Efstathios
Stamatatos, Moshe Koppel, and Eneko Agirre,
editors. 2009. Proceedings of the SEPLN
Workshop on Uncovering Plagiarism,
Authorship, and Social Software Misuse,
PAN’09, September 10 2009, Donostia-San
Sebastián, Spain. Universidad Politécnica de
Valencia.</p>
        <p>Vallés Balaguer, Enrique. 2009. Putting
Ourselves in SME’s Shoes: Automatic Detection
of Plagiarism by the WCopyFind tool. In Stein
et al.
          <xref ref-type="bibr" rid="ref6">(Stein et al., 2009)</xref>
          .
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Allen</surname>
          </string-name>
          , James.
          <year>2009</year>
          .
          <article-title>Submission to the 1st International Competition on Plagiarism Detection</article-title>
          . From the Southern Methodist University in Dallas, USA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Basile</surname>
          </string-name>
          , Chiara, Dario Benedetto, Emanuele Caglioti, Giampaolo Cristadoro, and Mirko Degli Esposti.
          <year>2009</year>
          .
          <article-title>A Plagiarism Detection Procedure in Three Steps: Selection, Matches and “Squares”</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Scherbinin</surname>
          </string-name>
          , Vladislav and Sergey
          <string-name>
            <surname>Butakov</surname>
          </string-name>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>Using Microsoft SQL Server Platform for Plagiarism Detection</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Seaward</surname>
          </string-name>
          , Leanne and Stan
          <string-name>
            <surname>Matwin</surname>
          </string-name>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>Intrinsic Plagiarism Detection Using Complexity Analysis</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , Efstathios.
          <year>2009</year>
          .
          <article-title>Intrinsic Plagiarism Detection Using Character n-gram Profiles</article-title>
          . In Stein et al. (Stein et al., 2009).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Weber-Wulff</surname>
          </string-name>
          , Debora and Katrin Köhler.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          Plagiarism Detection Software Test
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>http://plagiat.htw-berlin.de/software/2008/.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <year>2009</year>
          .
          <article-title>PAN Plagiarism Corpus PAN-PC-09</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>