Tackling the PAN’09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector∗

James A Malcolm (University of Hertfordshire, College Lane, Hatfield, Herts, j.a.malcolm@herts.ac.uk)
Peter C R Lane (University of Hertfordshire, College Lane, Hatfield, Herts, p.c.lane@herts.ac.uk)

Abstract: Ferret is a fast and effective tool for detecting similarities in a group of files. Applying it to the PAN’09 corpus required modifications to meet the requirements of the competition, mainly to deal with the very large number of files and the large size of some of them, and to automate some of the decisions that would normally be made by a human operator. Ferret was able to detect numerous files in the development corpus that contain substantial similarities not marked as plagiarism, but it also identified quite a lot of pairs where random similarities masked actual plagiarism. An improved metric is therefore indicated if the “plagiarised” or “not plagiarised” decision is to be automated.
         Keywords: Ferret, n-grams, plagiarism detection, similarity




∗ We thank Bob Dickerson, who developed the core from which the original Ferret code was derived.

1   Introduction

In this paper we describe how we approached the challenge of the PAN’09 Plagiarism Detection Competition using the Ferret plagiarism detection software. We outline Ferret’s strengths in normal use, highlight the difficulties we had in using Ferret for the competition task, and describe the results of the improvements that we made as a result of entering the competition.

The “external plagiarism analysis” task of the PAN’09 Plagiarism Detection Competition (International Competition on Plagiarism Detection, 2009) is an example of category 1 plagiarism (Lyon and Malcolm, 2002), as we have the source(s) in our hand. This suggests that Ferret is the tool for the job.

Ferret (Lyon, Barrett, and Malcolm, 2004; Lyon, Malcolm, and Barrett, 2005) is a tool that (when used on student work) is primarily good for detecting collusion rather than plagiarism (though it has been extended to generate search terms to drive an Internet search (Malcolm and Lane, 2008b)). It is a desktop plagiarism detector, which means that it is fast and interactive. It has to be fast, because a human is waiting for the results, and because it is interactive, human input is available and appropriate: after all, plagiarism is an academic judgement, not something that can be measured by a machine (Flint, Clegg, and Macdonald, 2006; Lyon, Barrett, and Malcolm, 2004). Before tackling the competition we had done little to automate the decision-making process (“is this plagiarism or not?”); Ferret tells its user which pairs to look at (and helps in reviewing those pairs) but leaves the actual decision to him or her. There is therefore no need for the software to draw a dividing line – a ranked list is sufficient.

The competition did highlight what we knew to be Ferret’s strengths and weaknesses: we came second on recall, but precision was poor as Ferret is too fine-grained in its identification of similarities.

2   Ferret’s Strengths

Assuming, as in the competition, that we already have the sources of all the copying, there are two tasks in identifying plagiarism in a collection of documents: finding which pairs of documents to look at and (once a pair has been selected for further examination) finding the blocks of matching text.


In the case of the competition there appear to be 7214 × 7214 comparisons to be made. In the more general case usually considered by Ferret, every file needs to be compared with every other, giving n(n−1)/2 comparisons (which for source and suspicious files together is 104,076,378 pairs). But it is important to note that, because of the way Ferret works, these comparisons are not made explicitly; as the documents are read by Ferret it creates a data structure which enables the most similar pair of documents to be selected without making a direct comparison of those two documents. To be more specific, it remembers every three-word sequence (triple) that it has seen, and which input files that triple appears in. Similar files are those with the largest number of common triples. This simple approach is what makes Ferret fast.

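To make this data structure concrete, the following Python sketch builds a triple index of the kind described above: a map from each triple to the set of documents containing it, from which shared-triple counts for candidate pairs fall out directly. The names and the naive tokenisation are ours; this is an illustration of the idea, not Ferret’s actual implementation.

    from collections import defaultdict
    from itertools import combinations

    def triples(words):
        """Yield every consecutive three-word sequence (triple)."""
        for i in range(len(words) - 2):
            yield tuple(words[i:i + 3])

    def build_index(documents):
        """Map each triple to the set of document ids it occurs in.
        `documents` is a dict of document id -> text."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for t in triples(text.lower().split()):
                index[t].add(doc_id)
        return index

    def common_triple_counts(index):
        """Count shared triples for every pair of documents that shares at
        least one triple; pairs sharing nothing never appear, so no explicit
        all-against-all comparison is made."""
        counts = defaultdict(int)
        for doc_ids in index.values():
            for a, b in combinations(sorted(doc_ids), 2):
                counts[(a, b)] += 1
        return counts

Sorting the resulting counts in descending order gives the “most similar pairs” list that the operator works through.
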
The Ferret user interface then allows the operator to display the similar documents side by side with the similar text highlighted. For large documents, it can take as long to display one pair as it does to find all the similarities in the set (we mention this to highlight the speed of the first phase, rather than as a deficiency of the user interface).

There is no specific support in Ferret for finding the sections of a source that have been copied. This is done by the operator (although he or she can click on any one of the matching triples to find where in the two compared documents it appears).

We expect to look at the most similar pairs (in descending order of similarity) and stop when we judge that plagiarism is no longer occurring. In effect this is a corpus-specific threshold (partly mitigating the problem mentioned in section 3.2).

3   Adapting for the Competition

The problems that have to be solved in order to use Ferret for the competition relate firstly to the large volume of data to be examined and secondly to the difference between how we would normally use Ferret and how the competition is run.

3.1   Scale

To deal with the large volume of data in the competition, we had to divide the input into batches that were processed in turn, as we did not have a machine with sufficient RAM to deal with all the data in one go. We estimate that 32GB would be enough; our machine had 9GB (for normal use, Ferret runs in 512KB or less). If batching is necessary, it is most efficient if half the available memory is used for source documents, and half for suspicious documents.

If M is the available memory, the number of batches of source documents, N_O, will be 2|O|/M (where |O| is the total size of all the source documents in the corpus). The number of batches of suspicious documents, N_U, should be 2|U|/M. The number of runs, N_O · N_U, will thus be 4|O||U|/M², which is quadratic in corpus size; doubling the memory available will make the system four times faster.

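As a worked illustration of these formulas (the corpus sizes in the comment are invented for the example, not the real PAN’09 figures):

    import math

    def number_of_runs(mem, total_sources, total_suspicious):
        """Runs needed when half the available memory holds a batch of source
        documents and half holds a batch of suspicious documents (all sizes
        in bytes)."""
        half = mem / 2
        n_o = math.ceil(total_sources / half)     # roughly 2|O|/M
        n_u = math.ceil(total_suspicious / half)  # roughly 2|U|/M
        return n_o * n_u                          # roughly 4|O||U|/M^2

    # Doubling the memory roughly quarters the number of runs, e.g.
    # number_of_runs(9e9, 30e9, 30e9)  -> 7 * 7 = 49
    # number_of_runs(18e9, 30e9, 30e9) -> 4 * 4 = 16
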
3.2   Automation

We needed to automate the decision between “plagiarised” and “innocent similarity”. At present Ferret supports two possible metrics: a count of the number of common triples, and the Jaccard coefficient, ‘R’.

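For a single pair both metrics can be computed from the two triple sets; a minimal sketch, taking R to be the set resemblance (intersection over union) of the triple sets, which is our reading of Ferret’s R:

    def pair_metrics(triples_a, triples_b):
        """Return the two scores Ferret currently offers for a document pair:
        the number of common triples and the resemblance R."""
        common = len(triples_a & triples_b)
        union = len(triples_a | triples_b)
        resemblance = common / union if union else 0.0
        return common, resemblance

Both scores are sensitive to document size, which is the difficulty discussed next.
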
Some of the files are very big, but these are mixed with quite small files, so our current metric does not work very well. The copied chunks vary from a couple of sentences upwards, but it is the huge size of some of the files that causes the problem, because the number of randomly matching triples in a huge file is bigger than the size of one of the smaller copied chunks.

Examining results for the first 100 suspicious files in the Development corpus, we found that we should take the 50 most similar pairs to catch all the suspicious files where artificial plagiarism had been introduced.

Ferret picks out many very small similarities. Eliminating these “accidental” similarities (common triples, typically isolated) was a problem that we had to address with code if we were to have success in the competition. It was not a problem we had seen before because of the way we use Ferret.

3.3   Improvements

Our submission was the first run of the complete system.

We later revised it to take some account of the order of triples in the source document when deciding if matching triples in the suspicious document are part of a matching block or just a random match. This code is quite slow, and there are now a lot of parameters that can be fiddled with to change how well Ferret would perform in the competition (a simplified sketch of the block-merging step follows the parameter list):

• How many matching triples are considered too few to be worth passing to the second phase: currently fewer than 50 (or R < 0.007).

• How many documents to keep in the “most similar pairs” list: 5 on the first (submitted) run, but considering 50.

• The unmatched gap between matching sections that can be merged: 1 in the first run; considering up to 4.

• How many sections before or after the current section we can jump when merging: no restriction in the first run; considering a range from 5 before to 10 after.

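The sketch below is a much simplified rendering of that block-merging step, with triple positions standing in for Ferret’s sections; the parameter names are ours and the defaults follow the values listed above, but the code is not Ferret’s.

    def merge_matches(matches, max_gap=1, max_back=5, max_forward=10):
        """Group matched triple positions into candidate blocks.

        `matches` is a list of (suspicious_pos, source_pos) pairs for the
        triples shared by one suspicious/source pair, sorted by
        suspicious_pos. A match joins the current block when the unmatched
        gap on the suspicious side is within max_gap and the source position
        does not jump too far backwards or forwards; otherwise a new block is
        started. Isolated one-triple blocks can then be discarded as random
        matches."""
        blocks = []
        for susp, src in matches:
            if blocks:
                prev_susp, prev_src = blocks[-1][-1]
                if (susp - prev_susp <= max_gap + 1
                        and -max_back <= src - prev_src <= max_forward):
                    blocks[-1].append((susp, src))
                    continue
            blocks.append([(susp, src)])
        return blocks
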
4   How the system operates

4.1   Identifying Similar Pairs

First we run Ferret on the complete corpus. A bash script, make-input-document-list, creates an input file for the ferret -f (definition file) option. Several copies of the make-input-document-list script are combined in a script that does multiple runs of Ferret, comparing (small) groups of suspicious files against (largish) groups of source files to produce a set of output files.

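As an illustration of this first step, here is a Python rendering of the batching loop (the scripts we actually use are bash, as noted above). The definition-file format, the batch sizes and the absence of further ferret options are assumptions made for the sketch, not a description of Ferret’s interface.

    import subprocess
    from pathlib import Path

    def run_ferret_batches(source_files, suspicious_files,
                           sources_per_run=4000, suspicious_per_run=100,
                           workdir=Path("runs")):
        """Write one definition file per source-batch/suspicious-batch
        combination and invoke 'ferret -f' on it, assuming the definition
        file is simply a list of document paths, one per line."""
        workdir.mkdir(parents=True, exist_ok=True)
        for i in range(0, len(source_files), sources_per_run):
            for j in range(0, len(suspicious_files), suspicious_per_run):
                deffile = workdir / f"batch_{i}_{j}.txt"
                docs = (source_files[i:i + sources_per_run]
                        + suspicious_files[j:j + suspicious_per_run])
                deffile.write_text("\n".join(str(d) for d in docs) + "\n")
                # Options directing where ferret writes its results are omitted.
                subprocess.run(["ferret", "-f", str(deffile)], check=True)
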
4.2   Identifying Sources

We read the output files generated by step 1 to select the likely sources for a given suspicious document. This uses a ruby script, process-output, that runs ferret -x to produce an XML file highlighting the similarities in a particular suspicious-source pair where the number of matches and/or the resemblance metric meets hard-coded (but easy to change) constraints. This produces a set of XML files (one for each ferret -x run); these are scanned by another ruby script, read-ferret-xml-files, to produce output in the required XML format. Formatting the results to meet the competition requirements raised a minor difficulty: the source offset required was not available in our system; fortunately it did not appear to be part of the evaluation metric.

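The selection constraint itself is small; a hedged sketch using the thresholds quoted in section 3.3 (at least 50 common triples, or R of at least 0.007), with the way the two tests are combined left as a parameter:

    def worth_examining(common_triples, resemblance,
                        min_triples=50, min_r=0.007, require_both=False):
        """Decide whether a suspicious/source pair goes on to the 'ferret -x'
        stage, using the hard-coded constraints mentioned above."""
        tests = (common_triples >= min_triples, resemblance >= min_r)
        return all(tests) if require_both else any(tests)
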
5   Resource Analysis

Tackling more than about 500 files at a time on a 1GB laptop led to it thrashing. On a 9GB server, about 5000 files at a time could be dealt with comfortably.

For the development corpus, the output from Ferret was divided into 146 files: each file has results for about half of the source files (numbers 1-3999 or 4000-7214) and about 100 suspicious files. As explained above, this is not the best way of organising the data, but it was initially easier to test a few suspicious files at a time.

Step 1 produces a lot of output. As an indication of the scale of the problem, running the Unix wc (word count) utility on the complete set of output files took about 37 minutes, involving 7 minutes of CPU time. The average size of the files is around 45MB, and the total size is about 6.5GB.

This emphasises the quadratic nature of plagiarism detection: every suspicious file has to be checked against every possible source file. In the case of the competition this is 52,041,796 comparisons. As mentioned earlier, Ferret usually compares every file with every other, but fortunately we had already implemented a grouping facility (it has several other applications) whereby files in a group are not compared with each other, but only with files in other groups. We should have filtered out the least similar pairs before generating the output from phase 1, as the set of phase 1 output files is considerably larger than the set of suspicious documents. In normal use, Ferret displays a list of the most similar pairs; the user only looks at the most similar, and because the rest of the list is in memory there is little extra cost involved.

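The grouping facility can be pictured as a small change to the pair-counting sketch given earlier in section 2: give each document a group label and skip pairs that fall within the same group (again our own rendering, not Ferret’s code).

    from collections import defaultdict
    from itertools import combinations

    def cross_group_counts(index, group_of):
        """Count shared triples only for pairs whose documents are in
        different groups, e.g. group_of[doc] in {'source', 'suspicious'};
        source-source and suspicious-suspicious pairs are skipped."""
        counts = defaultdict(int)
        for doc_ids in index.values():
            for a, b in combinations(sorted(doc_ids), 2):
                if group_of[a] != group_of[b]:
                    counts[(a, b)] += 1
        return counts
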
6   Results

Here are our observations on running the system we built around Ferret on the development corpus (we do not yet know the “correct answers” for the competition corpus).

We group the pairs into five types depending on the kind of artificial plagiarism:

• raw plagiarism (without obfuscation)
• low obfuscation
• high obfuscation
• plagiarism by translation
• no artificial plagiarism

We plotted the value of R for each of these types: figure 1 shows (for the first 500 suspicious documents of each type in the development corpus) how documents which were identified as plagiarised (with various degrees of obfuscation) differed from those where no similarity was intended.

We see immediately that Ferret is (as expected) ineffective at detecting plagiarism by translation (the line on top of the x-axis), so this is left out of the later graphs.


Figure 1: R for the first 500 pairs of each type

We also note some very high values for R, and not just where there is introduced plagiarism. Some of the pairs where there was no artificial plagiarism showed very high similarity using Ferret; there are some very high values of R before the graph flattens off. In total 49 pairs (in the development corpus) have R > 0.5. Twelve of these are pairs where no plagiarism is alleged. We looked at each of these pairs in detail, and present the results in Table 1.

  Suspicious   Source     R       Cause
  00569        04692    0.533     editions
  01302        05069    0.917     read me
  01656        04163    0.501     editions
  01756        03972    0.662     editions
  01862        03766    0.640     editions
  02483        05054    0.550     fragments
  02730        01431    0.629     fragments
  04740        04973    0.717     editions
  05096        00620    0.622     fragments
  05959        06555    0.706     editions
  05964        06148    0.816     editions
  06188        05357    0.798     plagiarism?

Table 1: Most similar non-plagiarised pairs

The worst case was suspicious document 1302, which had R = 0.91 when compared to source document 5069. It turned out that these were both Project Gutenberg (1971-2009) “READ ME” documents with no other text. We suspect this could be viewed as an accident on the part of the compilers of the corpus (Potthast, Martin et al. (editors), 2009). Some of the other similarities are more interesting, such as R = 0.80 between “The Impossibles” and “Out Like a Light”, both by the same author. Despite the considerable number of small changes between the two documents, Wikipedia Authors (2009) suggest this is a re-publication of the same work under a new title.

Most of the high similarities in documents not alleged to be plagiarised were different editions of the same work, such as a volume which is repeated (with numerous spelling corrections) in the collected works, or a “Second Edition Revised and Enlarged” with obviously a considerable overlap with the first edition.

In most of these 12 cases, we have different editions of the same work, or different collections containing most of the same material. A few are incomplete fragments, possibly arising from the way in which Project Gutenberg used to be distributed, so it is hard to tell how the similarity to other fragments came about.

We would presume that R values below 0.5 also indicate similar situations, as previous work has shown that R > 0.04 is the limit for “accidental similarity”, but maybe a larger value is appropriate for this corpus.

Figure 2: R for the first 100 pairs of each type (series: no obfuscation, low obfuscation, high obfuscation, truncated, no plagiarism)

In figure 2 (where R is plotted for only the first 100 pairs of each type), it can be seen more clearly that the graphs for obfuscated plagiarism are higher than for raw plagiarism, maybe because the obfuscated cases are longer: the average amount of source material in suspicious documents containing raw (un-obfuscated) plagiarism was 20,628, whereas for low and high obfuscation the average lengths were about 50% higher at 30,402 and 33,330 respectively.

Figure 3 compares R for the four types across the full range of pairs. Here the x-axis is a percentage of the total number of pairs of the type, and the most similar pairs of each type (R > 0.4) have been omitted. The big difference between plagiarised and non-plagiarised is evident, but we also note that at the right-hand side of the diagram these lines merge. This is because small pieces of plagiarism in large documents have rather too little influence on the R metric.

Figure 3: R for all the pairs of each type (series: no plagiarism, no obfuscation, low obfuscation, high obfuscation)

7   Conclusions and Future Work

Producing a corpus with no similar text except that which has been added deliberately is hard: there is far too much duplicate data on the Internet to make it easy. It is for this reason that this paper concentrates on the Development corpus, as we know which parts of it are supposed to be plagiarised. However, the competition organisers have done an excellent job in encouraging research.

The competition has clearly shown us that a better metric than R is needed. As we have suggested before (Malcolm and Lane, 2008a), a metric that takes account of the order of the similar features looks promising. Calculating the longest common subsequence of triples would probably work well, but is computationally costly; we want to take care not to slow Ferret down.

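For reference, a straightforward dynamic-programming sketch of that order-aware measure over two triple sequences; even with the rolling row used here it is quadratic in time, which is exactly the cost that concerns us:

    def lcs_of_triples(seq_a, seq_b):
        """Length of the longest common subsequence of two triple sequences,
        using a rolling row: O(len(seq_b)) memory but still
        O(len(seq_a) * len(seq_b)) time."""
        prev = [0] * (len(seq_b) + 1)
        for ta in seq_a:
            curr = [0]
            for j, tb in enumerate(seq_b, start=1):
                if ta == tb:
                    curr.append(prev[j - 1] + 1)
                else:
                    curr.append(max(prev[j], curr[-1]))
            prev = curr
        return prev[-1]
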
We need to develop better approaches to spanning gaps caused by obfuscation, especially in very long files, as our current algorithm can still merge two isolated triples that happen to be in the right order, for example a b c X k l m turning into a 7-word match, which is probably long enough to appear in the output. We should also optimise the other parameters listed in subsection 3.3.

Ferret’s strength is its speed: we were able to upgrade our machine from 9 to 32GB of RAM, so we can now process the entire competition corpus in a single Ferret run of 1h42m. This works out at an effective rate of comparison of 50,000 pairs per second. The input is read at 450kB/s (this includes all of Ferret’s processing, including writing out the very large results file).

References

Flint, Abbi, Sue Clegg, and Ranald Macdonald. 2006. Exploring staff perceptions of student plagiarism. Journal of Further and Higher Education, 30(2):145–156, May.

International Competition on Plagiarism Detection. 2009. http://www.webis.de/pan-09/competition.php.

Lyon, Caroline, Ruth Barrett, and James Malcolm. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies Conference, June.

Lyon, Caroline and James Malcolm. 2002. Experience of plagiarism detection and prevention in higher education. In Proceedings of the World Congress, Networked Learning in a Global Environment: Challenges and Solutions for Virtual Education. ICSC-NAISO Academic Press.

Lyon, Caroline, James Malcolm, and Ruth Barrett. 2005. The Ferret copy detector: finding similar passages in large quantities of text. Submitted to the 43rd Annual Meeting of the Association for Computational Linguistics.

Malcolm, James A. and Peter C. R. Lane. 2008a. An approach to detecting article spinning. In Proceedings of the Third International Conference on Plagiarism.

Malcolm, James A. and Peter C. R. Lane. 2008b. Efficient search for plagiarism on the web. In Proceedings of i-TCE 2008.

Potthast, Martin et al. (editors). 2009. PAN Plagiarism Corpus PAN-PC-09. http://www.webis.de/research/corpora.

Project Gutenberg. 1971-2009. http://www.gutenberg.org/.

Wikipedia Authors. 2009. http://en.wikipedia.org/wiki/Mark_Phillips_(author).