=Paper=
{{Paper
|id=Vol-502/paper-5
|storemode=property
|title=Tackling the PAN'09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector
|pdfUrl=https://ceur-ws.org/Vol-502/paper5.pdf
|volume=Vol-502
}}
==Tackling the PAN'09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector==
James A. Malcolm (j.a.malcolm@herts.ac.uk) and Peter C. R. Lane (p.c.lane@herts.ac.uk)
University of Hertfordshire, College Lane, Hatfield, Herts

We thank Bob Dickerson, who developed the core from which the original Ferret code was developed.
Abstract: Ferret is a fast and effective tool for detecting similarities in a group of files. Applying it to the PAN'09 corpus required modifications to meet the requirements of the competition, mainly to deal with the very large number of files, the large size of some of them, and to automate some of the decisions that would normally be made by a human operator. Ferret was able to detect numerous files in the development corpus that contain substantial similarities not marked as plagiarism, but it also identified quite a lot of pairs where random similarities masked actual plagiarism. An improved metric is therefore indicated if the "plagiarised" or "not plagiarised" decision is to be automated.

Keywords: Ferret, n-grams, plagiarism detection, similarity
1 Introduction

In this paper we describe how we approached the challenge of the PAN'09 Plagiarism Detection Competition using the Ferret plagiarism detection software. We outline Ferret's strengths in normal use, highlight the difficulties we had in using Ferret for the competition task, and describe the results of the improvements that we made as a result of entering the competition.

The "external plagiarism analysis" task of the PAN'09 Plagiarism Detection Competition (International Competition on Plagiarism Detection, 2009) is an example of category 1 plagiarism (Lyon and Malcolm, 2002), as we have the source(s) in our hand. This suggests that Ferret is the tool for the job. Ferret (Lyon, Barrett, and Malcolm, 2004; Lyon, Malcolm, and Barrett, 2005) is a tool that (when used on student work) is primarily good for detecting collusion rather than plagiarism (though it has been extended to generate search terms to drive an Internet search (Malcolm and Lane, 2008b)). It is a desktop plagiarism detector, which means that it is fast and interactive. It has to be fast, because a human is waiting for the results, and because it is interactive, human input is available and appropriate: after all, plagiarism is an academic judgement, not something that can be measured by a machine (Flint, Clegg, and Macdonald, 2006; Lyon, Barrett, and Malcolm, 2004). Before tackling the competition we had done little to automate the decision-making process: "is this plagiarism or not?"; Ferret tells its user which pairs to look at (and helps in reviewing those pairs) but leaves the actual decision to him or her. There is therefore no need for the software to draw a dividing line – a ranked list is sufficient.

The competition did highlight what we knew to be Ferret's strengths and weaknesses: we came second on recall, but precision was poor as Ferret is too fine-grained in its identification of similarities.

2 Ferret's Strengths

Assuming, as in the competition, that we already have the sources of all the copying, there are two tasks in identifying plagiarism in a collection of documents: finding which pairs of documents to look at and (once a pair has been selected for further examination) finding the blocks of matching text.
In the case of the competition there appear to be 7214² comparisons to be made. In the more general case usually considered by Ferret, every file needs to be compared with every other, giving n(n-1)/2 comparisons (which for source and suspicious files together is 104,076,378 pairs). But it is important to note that, because of the way Ferret works, these comparisons are not made explicitly; as the documents are read by Ferret it creates a data structure which enables the most similar pair of documents to be selected without making a direct comparison of those two documents. To be more specific, it remembers every three-word sequence (triple) that it has seen, and which input files that triple appears in. Similar files are those with the largest number of common triples. This simple approach is what makes Ferret fast.
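To make the idea concrete, here is a minimal sketch, in Ruby, of one way to realise such a triple index; it is an illustration of the approach rather than Ferret's actual implementation, and the tokenisation is deliberately simplified.

```ruby
# Sketch of the triple-index idea: an inverted index from each three-word
# sequence (triple) to the set of documents containing it, from which
# common-triple counts for candidate pairs fall out without any direct
# pairwise comparison of document texts.
require 'set'

# Lowercase word tokens (a simplification of real tokenisation).
def tokens(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# The set of distinct word triples in a text.
def triples(text)
  tokens(text).each_cons(3).map { |t| t.join(' ') }.to_set
end

# triple => set of document ids in which that triple occurs.
def triple_index(docs)
  index = Hash.new { |h, k| h[k] = Set.new }
  docs.each { |id, text| triples(text).each { |tr| index[tr] << id } }
  index
end

# Number of common triples for every pair of documents that shares a triple.
def common_triple_counts(index)
  counts = Hash.new(0)
  index.each_value do |ids|
    ids.to_a.combination(2) { |a, b| counts[[a, b].sort] += 1 }
  end
  counts
end

docs = { 'suspicious-1' => 'the cat sat on the mat today',
         'source-1'     => 'yesterday the cat sat on the mat' }
common_triple_counts(triple_index(docs)).each do |pair, n|
  puts "#{pair.join(' / ')}: #{n} common triples"
end
```

Ranking pairs by this count (or by the resemblance metric R discussed in section 3.2) gives the "most similar pairs" list without ever comparing two documents directly.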
The Ferret user interface then allows the operator to display the similar documents side by side with the similar text highlighted. For large documents, it can take as long to display one pair as it does to find all the similarities in the set (we mention this to highlight the speed of the first phase, rather than as a deficiency of the user interface).

There is no specific support in Ferret for finding the sections of a source that have been copied. This is done by the operator (although he or she can click on any one of the matching triples to find where it appears in the two compared documents).

We expect to look at the most similar pairs (in descending order of similarity) and stop when we judge that plagiarism is no longer occurring. In effect this is a corpus-specific threshold (partly mitigating the problem mentioned in section 3.2).

3 Adapting for the Competition

The problems that have to be solved in order to use Ferret for the competition relate firstly to the large volume of data to be examined and secondly to the difference between how we would normally use Ferret and how the competition is run.

3.1 Scale

To deal with the large volume of data in the competition, we had to divide the input into batches that were processed in turn, as we did not have a machine with sufficient RAM to deal with all the data in one go. We estimate that 32GB would be enough; our machine had 9GB (for normal use, Ferret runs in 512KB or less). If batching is necessary, it is most efficient if half the available memory is used for source documents, and half for suspicious documents.

If M is the available memory, the number of batches of source documents, N_O, will be 2|O|/M (where |O| is the total size of all the source documents in the corpus). The number of batches of suspicious documents, N_U, should be 2|U|/M. The number of runs, N_O · N_U, will thus be 4|O||U|/M², which is quadratic in corpus size; doubling the memory available will make the system four times faster.
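As a concrete illustration of this arithmetic (the corpus sizes below are made up for the example, not measurements from the paper), a few lines of Ruby:

```ruby
# Batching arithmetic from section 3.1: half the memory goes to each side,
# so each run handles M/2 of source data and M/2 of suspicious data.
# All sizes in GB; the figures are purely illustrative.
def batches(total_size, memory)
  (2.0 * total_size / memory).ceil
end

def runs(source_size, suspicious_size, memory)
  batches(source_size, memory) * batches(suspicious_size, memory)
end

puts runs(27, 27, 9)   # 6 * 6 = 36 Ferret runs
puts runs(27, 27, 18)  # 3 * 3 = 9 runs: twice the memory, a quarter of the runs
```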
3.2 Automation

We needed to automate the decision between "plagiarised" and "innocent similarity". At present Ferret supports two possible metrics: a count of the number of common triples, and the Jaccard coefficient, 'R'.
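The definition of R is not restated in this paper; assuming the set-resemblance definition used in the earlier Ferret work (Lyon, Barrett, and Malcolm, 2004), it is:

```latex
% Assumed definition of the resemblance metric R: set resemblance
% (Jaccard coefficient) over the word triples of the two documents.
R(A, B) = \frac{\lvert T(A) \cap T(B) \rvert}{\lvert T(A) \cup T(B) \rvert}
```

where T(X) is the set of distinct word triples in document X, so R ranges from 0 (no shared triples) to 1 (identical triple sets).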
Some of the files are very big, but these are mixed with quite small files, so our current metric does not work very well. The copied chunks vary from a couple of sentences up, but it is the huge size of some of the files that causes the problem, because the number of randomly matching triples in a huge file is bigger than the size of one of the smaller copied chunks.

Examining results for the first 100 suspicious files in the development corpus, we found that we should take the 50 most similar pairs to catch all the suspicious files where artificial plagiarism had been introduced.

Ferret picks out many very small similarities. Eliminating these "accidental" similarities (common triples, typically isolated) was a problem that we had to address with code if we were to have success in the competition. It was not a problem we had seen before because of the way we use Ferret.

3.3 Improvements

Our submission was the first run of the complete system.

We later revised it to take some account of the order of triples in the source document when deciding whether matching triples in the suspicious document are part of a matching block or just a random match. This code is quite slow, and there are now a lot of parameters that can be fiddled to change how well Ferret would perform in the competition (a sketch of the block-merging idea follows the list):

• How many matching triples is considered too small to be worth considering as input to the second phase: currently less than 50 (or R < 0.007).
• How many documents to keep in the "most similar pairs" list: 5 on the first (submitted) run, but considering 50.

• The unmatched gap between matching sections that can be merged: 1 in the first run; considering up to 4.

• How many sections before or after the current section we can jump when merging: no restriction in the first run; considering a range from 5 before to 10 after.
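The block-merging step these parameters control can be pictured with a short Ruby sketch; the function and parameter names are ours, and this is an illustration of the idea rather than the code used in the competition runs.

```ruby
# Merge the positions of matching triples in the suspicious document into
# blocks: a new position joins the current block if the unmatched gap is at
# most max_gap; blocks with too few matching triples are discarded as
# probable random matches.
def merge_into_blocks(matches, max_gap: 1, min_block: 3)
  blocks = []
  current = nil
  matches.sort.each do |pos|
    if current && pos - current.last <= max_gap + 1
      current << pos
    else
      blocks << current if current
      current = [pos]
    end
  end
  blocks << current if current
  blocks.select { |b| b.size >= min_block }
end

# Two genuine runs of matches plus one isolated random match at position 40.
p merge_into_blocks([3, 4, 5, 6, 40, 70, 71, 73], max_gap: 2)
# => [[3, 4, 5, 6], [70, 71, 73]]
```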
4 How the system operates

4.1 Identifying Similar Pairs

First we run Ferret on the complete corpus. A bash script, make-input-document-list, creates an input file for the ferret -f (definition file) option. Several copies of the make-input-document-list script are combined in a script that does multiple runs of Ferret, comparing (small) groups of suspicious files against (largish) groups of source files to produce a set of output files.
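The batching loop behind that driver might look roughly as follows (a sketch in Ruby rather than bash; the list format, file layout, group sizes, and the assumption that the similarity report is written to standard output are all ours, not taken from the actual make-input-document-list script).

```ruby
# Illustrative driver: run Ferret repeatedly on small groups of suspicious
# files against largish groups of source files, producing one output file
# per run. Everything below is a simplification of the real bash pipeline.
require 'fileutils'

SUSPICIOUS_PER_RUN = 100   # "small" groups of suspicious files (our choice)
SOURCE_PER_RUN     = 4000  # "largish" groups of source files (our choice)

def run_ferret(suspicious_batch, source_batch, out_path)
  list = "#{out_path}.list"
  # Assumed one-path-per-line definition file for the ferret -f option.
  File.write(list, (suspicious_batch + source_batch).join("\n"))
  system('ferret', '-f', list, out: out_path)
end

suspicious = Dir.glob('suspicious-documents/*.txt').sort
sources    = Dir.glob('source-documents/*.txt').sort
FileUtils.mkdir_p('phase1-output')

suspicious.each_slice(SUSPICIOUS_PER_RUN).with_index do |sus, i|
  sources.each_slice(SOURCE_PER_RUN).with_index do |src, j|
    run_ferret(sus, src, "phase1-output/run-#{i}-#{j}.txt")
  end
end
```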
4.2 Identifying Sources

We read the output files generated by step 1 to select the likely sources for a given suspicious document. This uses a ruby script, process-output, that runs ferret -x to produce an XML file highlighting the similarities in a particular suspicious-source pair where the number of matches and/or the resemblance metric meets hard-coded (but easy to change) constraints. This produces a set of XML files (one for each ferret -x run); these are scanned by another ruby script, read-ferret-xml-files, to produce output in the required XML format. Formatting the results to meet the competition requirements raised a minor difficulty in that the source offset required was not available in our system; fortunately it did not appear to be part of the evaluation metric.
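A sketch of the selection step in that script, in Ruby: the line format is invented for the illustration (the real Ferret output format differs), and the thresholds echo the parameter values quoted in section 3.3; we read "less than 50 (or R < 0.007)" as a pair being kept when either condition is met, although process-output may combine them differently.

```ruby
# Keep, for each suspicious document, the most similar candidate sources
# whose common-triple count or resemblance R is large enough; these pairs
# would then be passed to `ferret -x` for detailed matching.
# Assumed input format, one pair per line:
#   <suspicious-file> <source-file> <common-triples> <R>
MIN_TRIPLES = 50
MIN_R       = 0.007
MAX_SOURCES = 5      # 5 in the submitted run; 50 was being considered

candidates = Hash.new { |h, k| h[k] = [] }

ARGF.each_line do |line|
  suspicious, source, triples, r = line.split
  next unless r
  triples = triples.to_i
  r = r.to_f
  next unless triples >= MIN_TRIPLES || r >= MIN_R
  candidates[suspicious] << [r, triples, source]
end

candidates.each do |suspicious, list|
  list.sort.reverse.first(MAX_SOURCES).each do |r, triples, source|
    puts [suspicious, source, triples, format('%.3f', r)].join("\t")
  end
end
```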
5 Resource Analysis

Tackling more than about 500 files at a time on a 1GB laptop led to it thrashing. On a 9GB server, about 5000 files at a time could be dealt with comfortably.

For the development corpus, the output from Ferret was divided into 146 files: each file has results for about half of the source files (numbers 1-3999 or 4000-7214) and about 100 suspicious files. As explained above, this is not the best way of organising the data, but it was initially easier to test a few suspicious files at a time.

Step 1 produces a lot of output. As an indication of the scale of the problem, running the Unix wc (word count) utility on the complete set of output files took about 37 minutes, involving 7 minutes of CPU time. The average size of the files is around 45MB, and the total size about 6.5GB.

This emphasises the quadratic nature of plagiarism detection: every suspicious file has to be checked against every possible source file. In the case of the competition this is 52,041,796 comparisons. As mentioned earlier, Ferret usually compares every file with every other, but fortunately we had already implemented a grouping facility (it has several other applications) whereby files in a group are not compared with each other, but only with files in other groups. We should have filtered out the least similar pairs before generating the output from phase 1, as the set of phase 1 output files is considerably larger than the set of suspicious documents.

In normal use, Ferret displays a list of the most similar pairs; the user only looks at the most similar, and because the rest of the list is in memory there is little extra cost involved.

6 Results

Here now are our observations on running the system we built around Ferret on the development corpus (we do not yet know the "correct answers" for the competition corpus).

We group the pairs into 5 types depending on the kind of artificial plagiarism:

• raw plagiarism (without obfuscation)

• low obfuscation

• high obfuscation

• plagiarism by translation

• no artificial plagiarism

We plotted the value of R for each of these types: figure 1 shows (for the first 500 suspicious documents of each type in the development corpus) how documents which were identified as plagiarised (with various degrees of obfuscation) differed from those where no similarity was intended.

[Figure 1: R for the first 500 pairs of each type]

We see immediately that Ferret is (as expected) ineffective at detecting plagiarism by translation (the line on top of the x-axis), so this is left out of the later graphs.
We also note some very high values for R, and not just where there is introduced plagiarism. Some of the pairs where there was no artificial plagiarism showed very high similarity using Ferret; there are some very high values of R before the graph flattens off. In total 49 pairs (in the development corpus) have R > 0.5. Twelve of these are pairs where no plagiarism is alleged. We looked at each of these pairs in detail, and present the results in Table 1.

Table 1: Most similar non-plagiarised pairs

Suspicious  Source  R     Cause
00569       04692   .533  editions
01302       05069   .917  read me
01656       04163   .501  editions
01756       03972   .662  editions
01862       03766   .640  editions
02483       05054   .550  fragments
02730       01431   .629  fragments
04740       04973   .717  editions
05096       00620   .622  fragments
05959       06555   .706  editions
05964       06148   .816  editions
06188       05357   .798  plagiarism?

The worst case was suspicious document 01302, which had R = 0.91 when compared to source document 05069. It turned out that these were both Project Gutenberg (1971-2009) "READ ME" documents with no other text. We guess this could be viewed as an accident on the part of the compilers of the corpus (Potthast et al., 2009).

Some of the other similarities are more interesting, such as R = 0.80 between "The Impossibles" and "Out Like a Light", both by the same author. Despite the considerable number of small changes between the two documents, Wikipedia Authors (2009) suggest this is a re-publication of the same work under a new title.

Most of the high similarities in documents not alleged to be plagiarised were different editions of the same work, such as a volume which is repeated (with numerous spelling corrections) in the collected works, or a "Second Edition Revised and Enlarged" with, obviously, a considerable overlap with the first edition.

In most of these 12 cases, we have different editions of the same work, or different collections containing most of the same material. A few are incomplete fragments, possibly arising from the way in which Project Gutenberg used to be distributed, so it is hard to tell how the similarity to other fragments came about.

We would presume that R values below 0.5 also indicate similar situations, as previous work has shown that R > 0.04 is the limit for "accidental similarity", but maybe a larger value is appropriate for this corpus.

[Figure 2: R for the first 100 pairs of each type (legend: No obfuscation, Low obfuscation, High obfuscation, Truncated, No plagiarism)]

In figure 2 (where R is plotted for only the first 100 pairs of each type), it can be seen more clearly that the graphs for obfuscated plagiarism are higher than for raw, maybe because the obfuscated cases are longer: the average amount of source material in suspicious documents containing raw (un-obfuscated) plagiarism was 20,628, whereas for low and high obfuscation the average lengths were about 50% higher at 30,402 and 33,330 respectively.
Figure 3 compares R for the four types across the full range of pairs. Here the x-axis is a percentage of the total number of pairs of the type, and the most similar pairs of each type (R > 0.4) have been omitted. The big difference between plagiarised and non-plagiarised is evident, but we also note that at the right-hand side of the diagram these lines merge. This is because the influence of small pieces of plagiarism in large documents on the R metric is rather too small.

[Figure 3: R for all the pairs of each type (legend: No plagiarism, No obfuscation, Low obfuscation, High obfuscation)]

7 Conclusions and Future Work

Producing a corpus with no similar text except that which has been added deliberately is hard: there is far too much duplicate data on the Internet to make it easy. It is for this reason that this paper concentrates on the development corpus, as we know which parts of it are supposed to be plagiarised. However, the competition organisers have done an excellent job in encouraging research.

The competition has clearly shown us that a better metric than R is needed. As we have suggested before (Malcolm and Lane, 2008a), a metric that takes account of the order of the similar features looks promising. Calculating the longest common subsequence of triples would probably work well, but is computationally costly; we want to take care not to slow Ferret down.
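For reference, the standard dynamic-programming calculation over the two documents' triple sequences makes the cost concern concrete: a sketch (not something Ferret currently does) is quadratic in the lengths of the sequences.

```ruby
# Longest common subsequence of two triple sequences, O(n*m) time with
# O(m) space; the n*m term is what makes this expensive on very large files.
def triples(words)
  words.each_cons(3).map { |t| t.join(' ') }
end

def lcs_length(a, b)
  prev = Array.new(b.size + 1, 0)
  a.each do |x|
    curr = [0]
    b.each_with_index do |y, j|
      curr << (x == y ? prev[j] + 1 : [curr[j], prev[j + 1]].max)
    end
    prev = curr
  end
  prev.last
end

suspicious = %w[a b c x k l m]
source     = %w[a b c d e k l m]
puts lcs_length(triples(suspicious), triples(source))  # => 2 ("a b c" and "k l m")
```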
We need to develop better approaches to spanning gaps caused by obfuscation, especially in very long files, as our current algorithm can still let two isolated triples that happen to be in the right order, for example "a b c" and "k l m" in "a b c X k l m", turn into a 7-word match, which is probably long enough to appear in the output. We should also optimise the other parameters listed in subsection 3.3.

Ferret's strength is its speed: we were able to upgrade our machine from 9 to 32GB of RAM, and can now process the entire competition corpus in a single Ferret run of 1h42m. This works out at an effective rate of comparison of 50,000 pairs per second. The input is read at 450kB/s (this includes all Ferret's processing, including writing out the very large results file).

References

Flint, Abbi, Sue Clegg, and Ranald Macdonald. 2006. Exploring staff perceptions of student plagiarism. Journal of Further and Higher Education, 30(2):145–156, May.

International Competition on Plagiarism Detection. 2009. http://www.webis.de/pan-09/competition.php.

Lyon, Caroline, Ruth Barrett, and James Malcolm. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies Conference, June.

Lyon, Caroline and James Malcolm. 2002. Experience of plagiarism detection and prevention in higher education. In Proceedings of the World Congress, Networked Learning in a Global Environment: Challenges and Solutions for Virtual Education. ICSC-NAISO Academic Press.

Lyon, Caroline, James Malcolm, and Ruth Barrett. 2005. The ferret copy detector: finding similar passages in large quantities of text. In Submitted to the 43rd Annual Meeting of the Association for Computational Linguistics.

Malcolm, J.A. and P.C.R. Lane. 2008a. An approach to detecting article spinning. In Proceedings of the Third International Conference on Plagiarism.

Malcolm, James A. and Peter C. R. Lane. 2008b. Efficient search for plagiarism on the web. In Proceedings of i-TCE 2008.

Potthast, Martin et al. (editors). 2009. PAN Plagiarism Corpus PAN-PC-09. http://www.webis.de/research/corpora.

Project Gutenberg. 1971-2009. http://www.gutenberg.org/.

Wikipedia Authors. 2009. http://en.wikipedia.org/wiki/Mark_Phillips_(author).