Tackling the PAN’09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector∗

James A Malcolm (University of Hertfordshire, College Lane, Hatfield, Herts, j.a.malcolm@herts.ac.uk)
Peter C R Lane (University of Hertfordshire, College Lane, Hatfield, Herts, p.c.lane@herts.ac.uk)

Abstract: Ferret is a fast and effective tool for detecting similarities in a group of files. Applying it to the PAN’09 corpus required modifications to meet the requirements of the competition, mainly to deal with the very large number of files and the large size of some of them, and to automate some of the decisions that would normally be made by a human operator. Ferret was able to detect numerous files in the development corpus that contain substantial similarities not marked as plagiarism, but it also identified quite a lot of pairs where random similarities masked actual plagiarism. An improved metric is therefore indicated if the “plagiarised” or “not plagiarised” decision is to be automated.
         Keywords: Ferret, n-grams, plagiarism detection, similarity




∗ We thank Bob Dickerson, who developed the core from which the original Ferret code was derived.

1   Introduction

In this paper we describe how we approached the challenge of the PAN’09 Plagiarism Detection Competition using the Ferret plagiarism detection software. We outline Ferret’s strengths in normal use, highlight the difficulties we had in using Ferret for the competition task, and describe the results of the improvements that we made as a result of entering the competition.

The “external plagiarism analysis” task of the PAN’09 Plagiarism Detection Competition (International Competition on Plagiarism Detection, 2009) is an example of category 1 plagiarism (Lyon and Malcolm, 2002), as we have the source(s) in our hand. This suggests that Ferret is the tool for the job.

Ferret (Lyon, Barrett, and Malcolm, 2004; Lyon, Malcolm, and Barrett, 2005) is a tool that (when used on student work) is primarily good for detecting collusion rather than plagiarism (though it has been extended to generate search terms to drive an Internet search (Malcolm and Lane, 2008b)). It is a desktop plagiarism detector, which means that it is fast and interactive. It has to be fast, because a human is waiting for the results, and because it is interactive, human input is available and appropriate: after all, plagiarism is an academic judgement, not something that can be measured by a machine (Flint, Clegg, and Macdonald, 2006; Lyon, Barrett, and Malcolm, 2004). Before tackling the competition we had done little to automate the decision-making process (“is this plagiarism or not?”); Ferret tells its user which pairs to look at (and helps in reviewing those pairs) but leaves the actual decision to him or her. There is therefore no need for the software to draw a dividing line – a ranked list is sufficient.

The competition did highlight what we knew to be Ferret’s strengths and weaknesses: we came second on recall, but precision was poor as Ferret is too fine-grained in its identification of similarities.

2   Ferret’s Strengths

Assuming, as in the competition, that we already have the sources of all the copying, there are two tasks in identifying plagiarism in a collection of documents: finding which pairs of documents to look at and (once a pair has been selected for further examination) finding the blocks of matching text.


In the case of the competition there appear to be 7214 × 7214 comparisons to be made. In the more general case usually considered by Ferret, every file needs to be compared with every other, giving n(n−1)/2 comparisons (which for source and suspicious files together is 104,076,378 pairs). But it is important to note that, because of the way Ferret works, these comparisons are not made explicitly; as the documents are read by Ferret it creates a data structure which enables the most similar pair of documents to be selected without making a direct comparison of those two documents. To be more specific, it remembers every three-word sequence (triple) that it has seen, and which input files that triple appears in. Similar files are those with the largest number of common triples. This simple approach is what makes Ferret fast.

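To make this data structure concrete, the following Python sketch builds a triple index of the kind described above: a map from each triple to the set of documents containing it, from which shared-triple counts for candidate pairs fall out directly. The names and the naive tokenisation are ours; this is an illustration of the idea, not Ferret’s actual implementation.

    from collections import defaultdict
    from itertools import combinations

    def triples(words):
        """Yield every consecutive three-word sequence (triple)."""
        for i in range(len(words) - 2):
            yield tuple(words[i:i + 3])

    def build_index(documents):
        """Map each triple to the set of document ids it occurs in.
        `documents` is a dict of document id -> text."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for t in triples(text.lower().split()):
                index[t].add(doc_id)
        return index

    def common_triple_counts(index):
        """Count shared triples for every pair of documents that shares at
        least one triple; pairs sharing nothing never appear, so no explicit
        all-against-all comparison is made."""
        counts = defaultdict(int)
        for doc_ids in index.values():
            for a, b in combinations(sorted(doc_ids), 2):
                counts[(a, b)] += 1
        return counts

Sorting the resulting counts in descending order gives the “most similar pairs” list that the operator works through.
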
The Ferret user interface then allows the operator to display the similar documents side by side with the similar text highlighted. For large documents, it can take as long to display one pair as it does to find all the similarities in the set (we mention this to highlight the speed of the first phase, rather than as a deficiency of the user interface).

There is no specific support in Ferret for finding the sections of a source that have been copied. This is done by the operator (although he or she can click on any one of the matching triples to find where in the two compared documents it appears).

We expect to look at the most similar pairs (in descending order of similarity) and stop when we judge that plagiarism is no longer occurring. In effect this is a corpus-specific threshold (partly mitigating the problem mentioned in section 3.2).

3   Adapting for the Competition

The problems that have to be solved in order to use Ferret for the competition relate firstly to the large volume of data to be examined and secondly to the difference between how we would normally use Ferret and how the competition is run.

3.1   Scale

To deal with the large volume of data in the competition, we had to divide the input into batches that were processed in turn, as we did not have a machine with sufficient RAM to deal with all the data in one go. We estimate that 32GB would be enough; our machine had 9GB (for normal use, Ferret runs in 512KB or less). If batching is necessary, it is most efficient if half the available memory is used for source documents, and half for suspicious documents.

If M is the available memory, the number of batches of source documents, N_O, will be 2|O|/M (where |O| is the total size of all the source documents in the corpus). The number of batches of suspicious documents, N_U, should be 2|U|/M. The number of runs, N_O · N_U, will thus be 4|O||U|/M², which is quadratic in corpus size; doubling the memory available will make the system four times faster.

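As a worked illustration of these formulas (the corpus sizes in the comment are invented for the example, not the real PAN’09 figures):

    import math

    def number_of_runs(mem, total_sources, total_suspicious):
        """Runs needed when half the available memory holds a batch of source
        documents and half holds a batch of suspicious documents (all sizes
        in bytes)."""
        half = mem / 2
        n_o = math.ceil(total_sources / half)     # roughly 2|O|/M
        n_u = math.ceil(total_suspicious / half)  # roughly 2|U|/M
        return n_o * n_u                          # roughly 4|O||U|/M^2

    # Doubling the memory roughly quarters the number of runs, e.g.
    # number_of_runs(9e9, 30e9, 30e9)  -> 7 * 7 = 49
    # number_of_runs(18e9, 30e9, 30e9) -> 4 * 4 = 16
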
3.2   Automation

We needed to automate the decision between “plagiarised” and “innocent similarity”. At present Ferret supports two possible metrics: a count of the number of common triples, and the Jaccard coefficient, ‘R’.

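For a single pair both metrics can be computed from the two triple sets; a minimal sketch, taking R to be the set resemblance (intersection over union) of the triple sets, which is our reading of Ferret’s R:

    def pair_metrics(triples_a, triples_b):
        """Return the two scores Ferret currently offers for a document pair:
        the number of common triples and the resemblance R."""
        common = len(triples_a & triples_b)
        union = len(triples_a | triples_b)
        resemblance = common / union if union else 0.0
        return common, resemblance

Both scores are sensitive to document size, which is the difficulty discussed next.
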
Some of the files are very big, but these are mixed with quite small files, so our current metric does not work very well. The copied chunks vary from a couple of sentences upwards, but it is the huge size of some of the files that causes the problem, because the number of randomly matching triples in a huge file is bigger than the size of one of the smaller copied chunks.

Examining results for the first 100 suspicious files in the Development corpus, we found that we should take the 50 most similar pairs to catch all the suspicious files where artificial plagiarism had been introduced.

Ferret picks out many very small similarities. Eliminating these “accidental” similarities (common triples, typically isolated) was a problem that we had to address with code if we were to have success in the competition. It was not a problem we had seen before because of the way we use Ferret.

3.3   Improvements

Our submission was the first run of the complete system.

We later revised it to take some account of the order of triples in the source document when deciding if matching triples in the suspicious document are part of a matching block or just a random match. This code is quite slow, and there are now a lot of parameters that can be fiddled with to change how well Ferret would perform in the competition (a simplified sketch of the block-merging step follows the parameter list):

• How many matching triples are considered too few to be worth passing to the second phase: currently fewer than 50 (or R < 0.007).

• How many documents to keep in the “most similar pairs” list: 5 on the first (submitted) run, but considering 50.

• The unmatched gap between matching sections that can be merged: 1 in the first run; considering up to 4.

• How many sections before or after the current section we can jump when merging: no restriction in the first run; considering a range from 5 before to 10 after.

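The sketch below is a much simplified rendering of that block-merging step, with triple positions standing in for Ferret’s sections; the parameter names are ours and the defaults follow the values listed above, but the code is not Ferret’s.

    def merge_matches(matches, max_gap=1, max_back=5, max_forward=10):
        """Group matched triple positions into candidate blocks.

        `matches` is a list of (suspicious_pos, source_pos) pairs for the
        triples shared by one suspicious/source pair, sorted by
        suspicious_pos. A match joins the current block when the unmatched
        gap on the suspicious side is within max_gap and the source position
        does not jump too far backwards or forwards; otherwise a new block is
        started. Isolated one-triple blocks can then be discarded as random
        matches."""
        blocks = []
        for susp, src in matches:
            if blocks:
                prev_susp, prev_src = blocks[-1][-1]
                if (susp - prev_susp <= max_gap + 1
                        and -max_back <= src - prev_src <= max_forward):
                    blocks[-1].append((susp, src))
                    continue
            blocks.append([(susp, src)])
        return blocks
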
4   How the system operates

4.1   Identifying Similar Pairs

First we run Ferret on the complete corpus. A bash script, make-input-document-list, creates an input file for the ferret -f (definition file) option. Several copies of the make-input-document-list script are combined in a script that does multiple runs of Ferret, comparing (small) groups of suspicious files against (largish) groups of source files to produce a set of output files.

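As an illustration of this first step, here is a Python rendering of the batching loop (the scripts we actually use are bash, as noted above). The definition-file format, the batch sizes and the absence of further ferret options are assumptions made for the sketch, not a description of Ferret’s interface.

    import subprocess
    from pathlib import Path

    def run_ferret_batches(source_files, suspicious_files,
                           sources_per_run=4000, suspicious_per_run=100,
                           workdir=Path("runs")):
        """Write one definition file per source-batch/suspicious-batch
        combination and invoke 'ferret -f' on it, assuming the definition
        file is simply a list of document paths, one per line."""
        workdir.mkdir(parents=True, exist_ok=True)
        for i in range(0, len(source_files), sources_per_run):
            for j in range(0, len(suspicious_files), suspicious_per_run):
                deffile = workdir / f"batch_{i}_{j}.txt"
                docs = (source_files[i:i + sources_per_run]
                        + suspicious_files[j:j + suspicious_per_run])
                deffile.write_text("\n".join(str(d) for d in docs) + "\n")
                # Options directing where ferret writes its results are omitted.
                subprocess.run(["ferret", "-f", str(deffile)], check=True)
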
4.2   Identifying Sources

We read the output files generated by step 1 to select the likely sources for a given suspicious document. This uses a ruby script, process-output, that runs ferret -x to produce an XML file highlighting the similarities in a particular suspicious-source pair where the number of matches and/or the resemblance metric meets hard-coded (but easy to change) constraints. This produces a set of XML files (one for each ferret -x run); these are scanned by another ruby script, read-ferret-xml-files, to produce output in the required XML format. Formatting the results to meet the competition requirements raised a minor difficulty: the source offset required was not available in our system; fortunately it did not appear to be part of the evaluation metric.

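The selection constraint itself is small; a hedged sketch using the thresholds quoted in section 3.3 (at least 50 common triples, or R of at least 0.007), with the way the two tests are combined left as a parameter:

    def worth_examining(common_triples, resemblance,
                        min_triples=50, min_r=0.007, require_both=False):
        """Decide whether a suspicious/source pair goes on to the 'ferret -x'
        stage, using the hard-coded constraints mentioned above."""
        tests = (common_triples >= min_triples, resemblance >= min_r)
        return all(tests) if require_both else any(tests)
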
5   Resource Analysis

Tackling more than about 500 files at a time on a 1GB laptop led to it thrashing. On a 9GB server, about 5000 files at a time could be dealt with comfortably.

For the development corpus, the output from Ferret was divided into 146 files: each file has results for about half of the source files (numbers 1-3999 or 4000-7214) and about 100 suspicious files. As explained above, this is not the best way of organising the data, but it was initially easier to test a few suspicious files at a time.

Step 1 produces a lot of output. As an indication of the scale of the problem, running the Unix wc (word count) utility on the complete set of output files took about 37 minutes, involving 7 minutes of CPU time. The average size of the files is around 45MB, and the total size is about 6.5GB.

This emphasises the quadratic nature of plagiarism detection: every suspicious file has to be checked against every possible source file. In the case of the competition this is 52,041,796 comparisons. As mentioned earlier, Ferret usually compares every file with every other, but fortunately we had already implemented a grouping facility (it has several other applications) whereby files in a group are not compared with each other, but only with files in other groups. We should have filtered out the least similar pairs before generating the output from phase 1, as the set of phase 1 output files is considerably larger than the set of suspicious documents. In normal use, Ferret displays a list of the most similar pairs; the user only looks at the most similar, and because the rest of the list is in memory there is little extra cost involved.

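The grouping facility can be pictured as a small change to the pair-counting sketch given earlier in section 2: give each document a group label and skip pairs that fall within the same group (again our own rendering, not Ferret’s code).

    from collections import defaultdict
    from itertools import combinations

    def cross_group_counts(index, group_of):
        """Count shared triples only for pairs whose documents are in
        different groups, e.g. group_of[doc] in {'source', 'suspicious'};
        source-source and suspicious-suspicious pairs are skipped."""
        counts = defaultdict(int)
        for doc_ids in index.values():
            for a, b in combinations(sorted(doc_ids), 2):
                if group_of[a] != group_of[b]:
                    counts[(a, b)] += 1
        return counts
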
6   Results

Here are our observations on running the system we built around Ferret on the development corpus (we do not yet know the “correct answers” for the competition corpus).

We group the pairs into five types depending on the kind of artificial plagiarism:

• raw plagiarism (without obfuscation)
• low obfuscation
• high obfuscation
• plagiarism by translation
• no artificial plagiarism

We plotted the value of R for each of these types: figure 1 shows (for the first 500 suspicious documents of each type in the development corpus) how documents which were identified as plagiarised (with various degrees of obfuscation) differed from those where no similarity was intended.

We see immediately that Ferret is (as expected) ineffective at detecting plagiarism by translation (the line on top of the x-axis), so this is left out of the later graphs.


Figure 1: R for the first 500 pairs of each type

We also note some very high values for R, and not just where there is introduced plagiarism. Some of the pairs where there was no artificial plagiarism showed very high similarity using Ferret; there are some very high values of R before the graph flattens off. In total 49 pairs (in the development corpus) have R > 0.5. Twelve of these are pairs where no plagiarism is alleged. We looked at each of these pairs in detail, and present the results in Table 1.

  Suspicious   Source     R       Cause
  00569        04692    0.533     editions
  01302        05069    0.917     read me
  01656        04163    0.501     editions
  01756        03972    0.662     editions
  01862        03766    0.640     editions
  02483        05054    0.550     fragments
  02730        01431    0.629     fragments
  04740        04973    0.717     editions
  05096        00620    0.622     fragments
  05959        06555    0.706     editions
  05964        06148    0.816     editions
  06188        05357    0.798     plagiarism?

Table 1: Most similar non-plagiarised pairs

The worst case was suspicious document 1302, which had R = 0.91 when compared to source document 5069. It turned out that these were both Project Gutenberg (1971-2009) “READ ME” documents with no other text. We suspect this could be viewed as an accident on the part of the compilers of the corpus (Potthast, Martin et al. (editors), 2009). Some of the other similarities are more interesting, such as R = 0.80 between “The Impossibles” and “Out Like a Light”, both by the same author. Despite the considerable number of small changes between the two documents, Wikipedia Authors (2009) suggest this is a re-publication of the same work under a new title.

Most of the high similarities in documents not alleged to be plagiarised were different editions of the same work, such as a volume which is repeated (with numerous spelling corrections) in the collected works, or a “Second Edition Revised and Enlarged” with obviously a considerable overlap with the first edition.

In most of these 12 cases, we have different editions of the same work, or different collections containing most of the same material. A few are incomplete fragments, possibly arising from the way in which Project Gutenberg used to be distributed, so it is hard to tell how the similarity to other fragments came about.

We would presume that R values below 0.5 also indicate similar situations, as previous work has shown that R > 0.04 is the limit for “accidental similarity”, but maybe a larger value is appropriate for this corpus.

Figure 2: R for the first 100 pairs of each type (series: no obfuscation, low obfuscation, high obfuscation, truncated, no plagiarism)

In figure 2 (where R is plotted for only the first 100 pairs of each type), it can be seen more clearly that the graphs for obfuscated plagiarism are higher than for raw plagiarism, maybe because the obfuscated cases are longer: the average amount of source material in suspicious documents containing raw (un-obfuscated) plagiarism was 20,628, whereas for low and high obfuscation the average lengths were about 50% higher at 30,402 and 33,330 respectively.

Figure 3 compares R for the four types across the full range of pairs. Here the x-axis is a percentage of the total number of pairs of the type, and the most similar pairs of each type (R > 0.4) have been omitted. The big difference between plagiarised and non-plagiarised is evident, but we also note that at the right-hand side of the diagram these lines merge. This is because small pieces of plagiarism in large documents have rather too little influence on the R metric.

Figure 3: R for all the pairs of each type (series: no plagiarism, no obfuscation, low obfuscation, high obfuscation)

7   Conclusions and Future Work

Producing a corpus with no similar text except that which has been added deliberately is hard: there is far too much duplicate data on the Internet to make it easy. It is for this reason that this paper concentrates on the Development corpus, as we know which parts of it are supposed to be plagiarised. However, the competition organisers have done an excellent job in encouraging research.

The competition has clearly shown us that a better metric than R is needed. As we have suggested before (Malcolm and Lane, 2008a), a metric that takes account of the order of the similar features looks promising. Calculating the longest common subsequence of triples would probably work well, but is computationally costly; we want to take care not to slow Ferret down.

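For reference, a straightforward dynamic-programming sketch of that order-aware measure over two triple sequences; even with the rolling row used here it is quadratic in time, which is exactly the cost that concerns us:

    def lcs_of_triples(seq_a, seq_b):
        """Length of the longest common subsequence of two triple sequences,
        using a rolling row: O(len(seq_b)) memory but still
        O(len(seq_a) * len(seq_b)) time."""
        prev = [0] * (len(seq_b) + 1)
        for ta in seq_a:
            curr = [0]
            for j, tb in enumerate(seq_b, start=1):
                if ta == tb:
                    curr.append(prev[j - 1] + 1)
                else:
                    curr.append(max(prev[j], curr[-1]))
            prev = curr
        return prev[-1]
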
We need to develop better approaches to spanning gaps caused by obfuscation, especially in very long files, as our current algorithm can still merge two isolated triples that happen to be in the right order, for example a b c X k l m turning into a 7-word match, which is probably long enough to appear in the output. We should also optimise the other parameters listed in subsection 3.3.

Ferret’s strength is its speed: we were able to upgrade our machine from 9 to 32GB of RAM, so we can now process the entire competition corpus in a single Ferret run of 1h42m. This works out at an effective rate of comparison of 50,000 pairs per second. The input is read at 450kB/s (this includes all of Ferret’s processing, including writing out the very large results file).

References

Flint, Abbi, Sue Clegg, and Ranald Macdonald. 2006. Exploring staff perceptions of student plagiarism. Journal of Further and Higher Education, 30(2):145–156, May.

International Competition on Plagiarism Detection. 2009. http://www.webis.de/pan-09/competition.php.

Lyon, Caroline, Ruth Barrett, and James Malcolm. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies Conference, June.

Lyon, Caroline and James Malcolm. 2002. Experience of plagiarism detection and prevention in higher education. In Proceedings of the World Congress, Networked Learning in a Global Environment: Challenges and Solutions for Virtual Education. ICSC-NAISO Academic Press.

Lyon, Caroline, James Malcolm, and Ruth Barrett. 2005. The Ferret copy detector: finding similar passages in large quantities of text. Submitted to the 43rd Annual Meeting of the Association for Computational Linguistics.

Malcolm, James A. and Peter C. R. Lane. 2008a. An approach to detecting article spinning. In Proceedings of the Third International Conference on Plagiarism.

Malcolm, James A. and Peter C. R. Lane. 2008b. Efficient search for plagiarism on the web. In Proceedings of i-TCE 2008.

Potthast, Martin et al. (editors). 2009. PAN Plagiarism Corpus PAN-PC-09. http://www.webis.de/research/corpora.

Project Gutenberg. 1971-2009. http://www.gutenberg.org/.

Wikipedia Authors. 2009. http://en.wikipedia.org/wiki/Mark_Phillips_(author).