       A high-performance plagiarism detection system
                        Notebook for PAN at CLEF 2011

         Neil Cooke, Lee Gillam, Peter Wrobel, Henry Cooke, Fahad Al-Obaidli

                                      University of Surrey

          N.Cooke@surrey.ac.uk, L.Gillam@surrey.ac.uk, Peter.Wrobel@gmail.com,
             Henry.Cooke@students.plymouth.ac.uk, fa00057@surrey.ac.uk



       Abstract. In this paper we report on our high-performance plagiarism detection
       system which is able to process the PAN plagiarism corpus for the external
       plagiarism detection task within relatively short timescales in contrast to the
       previously reported state of the art, while still producing a reasonable degree of
       performance (PAN 11, 4th place, PlagDet=0.2467329, Recall=0.1500480,
       Precision=0.7106536, Granularity=1.0058894). At the core of our system is a
       simple method which avoids the use of hash-type approaches, but about which
       we are unable to disclose too many details due to a patent application in
       progress. We optimised our performance using the PAN10 collection, and used
       the best parameters for the final submission. We anticipated a relatively similar
       performance at PAN11, modulo changes to the plagiarism cases, and 4th place
       this year put us between participants who had been 5th and 6th in PAN 10.




1 Introduction

According to a Detica and Cabinet Office report on the UK, the annual cost of intellectual
property theft is £9.2 billion, and “this cyber criminal activity is greatly assisted
by an ‘insider’”1. Whilst organisations may have rules and policies geared towards
prevention, without computational prevention the emphasis remains on detection.
Here, difficulty arises in that companies need to be able to search both the web and
the document repositories of collaborators, partners, competitors etc. without
revealing the content of their query, and in a highly efficient way. In principle, a
proxy could hold the corporate repositories of different companies and be searched to
identify shared content that might be of concern to collaborating organisations.
However, such a proxy is unlikely to be trusted by all parties without considerable
and provable security constraints placed upon it. A means is required that allows each
organisation to hold its own repository, to obtain indications of shared content, and
yet not to reveal its query. Further, an indication of shared content might suggest an
IP infringement that would require investigation, yet the indication must not reveal
the IP involved; it must be left to the organisations concerned to determine how to
resolve this.

1 http://www.cabinetoffice.gov.uk/sites/default/files/resources/the-cost-of-cyber-crime-full-report.pdf
   The reason we have been investigating the applicability of plagiarism detection
approaches to such a problem should be apparent. However, typical plagiarism
detection approaches prove unsuitable for a number of reasons including (i)
computational inefficiency in producing useful patterns; (ii) brittleness to single
character variations in strings; (iii) key proximity not offering a useful indication of
data similarity; (iv) susceptibility to brute force attacks. Some approaches use
small-stepped overlaps (shingles) to attempt to address brittleness; this requires
additional hashing operations which increase computational inefficiency, increase
susceptibility to brute force attack, and do nothing for key proximity. We have been attempting to develop
an approach that overcomes all four of these problems, and our research to date
suggests we are succeeding. When we set aside systemic overheads (computer
setup, translation, XML production, remapping for translation, zip and send), it takes
a mere 12 minutes to process the entire PAN 11 collection using a moderately
specified server; we have addressed certain kinds of brittleness; our key proximity is
meaningful; and brute force attacks are rather more difficult. However, due to a patent
application in progress it is not presently possible to disclose how the approach works
or to offer code that would demonstrate it. We intend to inform interested parties
related to the PAN competitions of how this works at a later time, and we have some
confidence that the simplicity of the approach will be refreshing. Our approach draws inspiration from
early work in information retrieval [1].


2 External Plagiarism Detection

Our system has been built on Amazon Web Services (AWS), so we can deploy it as
required, have dedicated use of the necessary resources for that period, and release
them when they are no longer needed, with the costs of participating in PAN
quantifiable in relation to our resource use. This offers clear incentives for a
computationally efficient approach, though we were fortunate in obtaining a grant
from Amazon for the use of AWS, so we also did not need to pay for our usage.
     We used one virtual core of a single High-Memory Quadruple Extra Large
(m2.4xlarge) instance. An Amazon m2.4xlarge costs $2 per hour and offers
the following specification2: 68.4 GB of memory; 26 EC2 Compute Units (8 virtual
cores with 3.25 EC2 Compute Units each); 1690 GB of instance storage; and a 64-bit
platform. A key benefit of using AWS was that we could rapidly scale the computing
resource to the problem space and identify the right size of server needed for
competition purposes via trial and error. Programs, data, and results were placed into
a 100GB machine-independent Elastic Block Store (EBS) volume to allow for this. Had we
only needed to use AWS for the time from release of data until our first results were
submitted, without treatment of translations, we would certainly have spent no more
than $4. Even with overheads, our subsequent runs are relatively inexpensive.


2
    http://aws.amazon.com/ec2/instance-types/
   During each PAN 11 run, the memory load increases to about 17GB, which is safely
below that available. Furthermore, the process runs on a single core: whilst parts of
the work could be parallelized across the cores, we did not see an immediate
performance gain that would justify the implementation effort, nor did we see a need
to construct a cluster for our purposes. Our system has been composed from a
variety of shell scripts, Python, Java and C++, and external libraries and APIs, and is
configured and run as follows:
        1. Create and start an Amazon EC2 instance that uses a Canonical Ubuntu
             image, has ports 80 and 22 open to allow for http and ssh traffic from
             required hosts as necessary, has associated keypair information, and
             access to the working team
        2. Use Ubuntu’s Advanced Packaging Tool (APT) to install required
             external software libraries
        3. Add a 100GB EBS volume to this instance and unpack the competition
             data.
        4. Download, install, and use language identification software such as
             TextCat against the complete PAN 11 dataset
        5. Use a machine translation engine such as Google Translate to convert text
             to English, replacing translated texts into the source files and noting size
             differences from originals by number of characters.
        6. Index the source files.
        7. Run the detector to discover instances of plagiarism against the index.
         8. Run the stitcher to reduce granularity. This is a blind stitching
              approach, i.e. it uses only metadata (filename, run length and
              position): runs longer than a given size and lying within a maximum
              spanning distance of each other are stitched into one run.
        9. Produce competition XML applying cosine distance measures on the fly
             to validate passages.
        10. Modify the XML to adjust offset and length information for translated
             texts.
        11. Submit results
   Steps 1 to 3, which set up the system, can readily be completed in under 15
minutes, and need only be undertaken once.
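   For illustration only, steps 1 and 3 can be scripted. The sketch below uses the
present-day boto3 SDK rather than the tooling available to us at the time, and the
AMI ID, key pair name and security group name are placeholders rather than our
actual configuration.

# Sketch of steps 1 and 3: launch an Ubuntu m2.4xlarge instance with ports
# 22 and 80 open, and attach a 100GB EBS volume. Illustrative only; the AMI
# ID, key name and group name below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Security group allowing ssh (22) and http (80) traffic.
sg = ec2.create_security_group(GroupName="pan11-sg",
                               Description="PAN11 detection server")
for port in (22, 80):
    ec2.authorize_security_group_ingress(GroupId=sg["GroupId"],
                                         IpProtocol="tcp", FromPort=port,
                                         ToPort=port, CidrIp="0.0.0.0/0")

# High-memory instance from a Canonical Ubuntu image (placeholder AMI).
run = ec2.run_instances(ImageId="ami-xxxxxxxx", InstanceType="m2.4xlarge",
                        KeyName="pan11-keypair", MinCount=1, MaxCount=1,
                        SecurityGroupIds=[sg["GroupId"]])
instance_id = run["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 100GB EBS volume for programs, data and results, created in the same
# availability zone as the instance (step 3).
az = run["Instances"][0]["Placement"]["AvailabilityZone"]
vol = ec2.create_volume(Size=100, AvailabilityZone=az)
ec2.attach_volume(VolumeId=vol["VolumeId"], InstanceId=instance_id,
                  Device="/dev/sdf")
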
   Steps 4 and 5 ensure comparability with others using such translation. Language
identification requires each text to be processed, and runs at about 90 texts/minute. On
a single CPU, this means about 5 hours (26,000 texts ~ 289 minutes) for language
identification and results in some 890 texts identified for translation. Translating 890
identified texts took 5 hours 15 minutes via Google Translate. This need only be
undertaken once.
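   As an illustration of step 4, the sketch below uses the langdetect package as a
stand-in for TextCat (which is what we actually ran), flagging non-English texts for
translation and recording their original character counts so that offsets can later be
rescaled in step 10. The directory layout and output file are assumptions made for
the example.

# Sketch of step 4: flag non-English texts for translation. langdetect here
# stands in for TextCat; the paths and output format are illustrative.
import glob
import os
from langdetect import detect  # pip install langdetect

to_translate = []
for path in sorted(glob.glob("pan11/source-document/*.txt")):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if text.strip() and detect(text[:5000]) != "en":   # sample the start for speed
        to_translate.append((os.path.basename(path), len(text)))

# Original character counts are kept so that, after translation, offsets and
# lengths can be rescaled in the ratio of translated to original size (step 10).
with open("to_translate.tsv", "w", encoding="utf-8") as out:
    for name, n_chars in to_translate:
        out.write("%s\t%d\n" % (name, n_chars))
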
   Steps 6 to 8 represent the core of the method for detection purposes, and require 12
minutes to process the entire collection. This speed means various values of the
different parameters can be tested quickly by iterating these steps. Stitch time per
PAN11 run is approximately 30 seconds.
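   To illustrate the blind stitching of step 8: each detection is treated purely as
metadata, and runs above the minimum length (RR) that lie within the maximum
spanning distance (SS) of another run from the same source are merged, with merged
runs shorter than the post-stitch minimum (PR) then discarded. The sketch below is a
simplified, hypothetical reimplementation, not our production stitcher; the field
names and one-dimensional treatment of positions are assumptions.

# Simplified sketch of blind stitching (step 8): merge detection runs that
# share a source file and lie within a maximum spanning distance of each
# other. Illustrative only; this is not the stitcher we actually ran.

def stitch(runs, min_len=50, max_span=900, min_post_len=50):
    """runs: list of dicts with 'source', 'offset' and 'length' fields."""
    runs = [r for r in runs if r["length"] >= min_len]            # RR
    runs.sort(key=lambda r: (r["source"], r["offset"]))
    stitched = []
    for r in runs:
        prev = stitched[-1] if stitched else None
        gap = r["offset"] - (prev["offset"] + prev["length"]) if prev else None
        if prev and prev["source"] == r["source"] and gap <= max_span:   # SS
            end = max(prev["offset"] + prev["length"], r["offset"] + r["length"])
            prev["length"] = end - prev["offset"]                 # extend the run
        else:
            stitched.append(dict(r))
    return [r for r in stitched if r["length"] >= min_post_len]   # PR
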
   Step 9 adds about 45 minutes per run in conversion, computation and filtering to
create the valid XML for each run. Cosine calculations are applied only to short runs
(long detections typically imply high confidence) and tap the data during the
conversion, so they add minimal processing time. Our use of cosine distance also has
a novel aspect: we use a fixed set of frequent words and prepare counts during
mapping, which saves file overheads and avoids the cost of producing
detection-specific cosine vocabularies.
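   The cosine filter of step 9 can be sketched as follows: both passages of a short
detection are projected onto a fixed list of frequent words, and the cosine of the two
count vectors is compared with the threshold (CS). The particular word list and
helper functions below are hypothetical; only the use of a fixed vocabulary reflects
our system.

# Sketch of the step 9 filter: cosine similarity over counts of a fixed set
# of frequent words. The word list is illustrative, not the one we use.
import math
import re

FREQUENT_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "was", "he", "for", "it", "with", "as", "his", "on"]

def counts(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tokens.count(w) for w in FREQUENT_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keep_detection(susp_passage, src_passage, threshold=0.75):
    # Long detections already imply high confidence; only short runs are checked.
    return cosine(counts(susp_passage), counts(src_passage)) >= threshold
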
   Step 10 complements steps 4 and 5, transforming character positions and counts
for translated texts in the ratio indicated by the respective (original and translated)
file sizes. This adds 5 minutes per run.
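   To illustrate steps 9 and 10 together: detections are written in the PAN XML
format, and for any document that was translated the detected character offsets and
lengths (which refer to the English translation) are rescaled back to the original text
in the ratio of the two character counts noted in step 5. The attribute names follow
the PAN detection format as we understand it; the functions and data layout below
are otherwise assumptions for illustration.

# Sketch of steps 9-10: write detections in the PAN XML format, rescaling
# offsets/lengths for translated documents in the ratio of original to
# translated character counts (noted in step 5). Attribute names follow the
# PAN format as we understand it; everything else is illustrative.
import xml.etree.ElementTree as ET

def remap(value, translated_chars, original_chars):
    # Scale a position or length from the translation back to the original.
    return int(round(value * original_chars / float(translated_chars)))

def write_detections(susp_name, detections, out_path, scale=None):
    """detections: (this_offset, this_length, src_name, src_offset, src_length).
    scale: optional dict mapping a translated document name to a pair
    (translated_chars, original_chars) used to rescale its offsets/lengths."""
    scale = scale or {}
    root = ET.Element("document", reference=susp_name)
    for t_off, t_len, src, s_off, s_len in detections:
        if src in scale:                       # this source text was translated
            s_off = remap(s_off, *scale[src])
            s_len = remap(s_len, *scale[src])
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(t_off), "this_length": str(t_len),
            "source_reference": src,
            "source_offset": str(s_off), "source_length": str(s_len)})
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
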
   Step 11 adds about another 5 minutes to the entire process per run, depending on
the speed of the submitting user.
   The most time-consuming work is being done in relation to translation, requiring
almost 10½ hours to complete relevant processes – over 10 times the duration of all
other required efforts combined.
   In contrast to PAN 10 competitors, our indexing, detection and format conversion
take half the time of the PAN10 competition winners and use just 1 core instead of
their 24 [3]. Our resource requirements are also lower than the threaded use of 32
cores by last year's runners-up [4]; we cannot comment on the run times of the 3rd
placed competitors, as runtime is discussed but not made explicit [5]; and we are more
efficient than the 40 hours taken using 8 cores by the 4th placed competitors [6].
   Our competition entry used detection runs of over 50 words and a cosine distance
threshold of 0.757. One of our pre-final submissions could have achieved a marginally
better overall score (0.2485299), but this would not have changed the final rankings.


3 Evaluation

    We initially tested our approach internally using just 100 of the PAN09 suspicious
files with their associated reference material (i.e. not the whole PAN09 sources). On
completion of this phase, we transferred our system to AWS and carried out
inspectable experiments on PAN10 data using either the first 1000 suspect files, or the
100 largest suspect files, before undertaking full runs.
    Parameters that we could tune were:
        • Minimum detection run length (RR)
        • Maximum stitch distance (SS)
        • Minimum post-stitch run length (PR)
        • Minimum cosine score (CS)
    Tests with PAN10 data suggested initial values for our parameters of RR=50,
SS=900, PR=50, CS=0.757, or PR=55 with CS=0. Following translation of PAN10,
we ran parameter sweeps around these values to determine the best performance. This
suggested RR=50 (the minimum suggested length of plagiarism), SS=900, PR=50 and
CS=0.75. This combination offered a PAN10 score of 0.658429611476, with many
other scores above 0.65, suggesting there was little more that could be optimized at
this point. Notionally, this overall score would have achieved 4th place in PAN10 (3rd:
0.6984; 4th: 0.6209).
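   The parameter sweep itself is straightforward to script because steps 6 to 8 are so
cheap to iterate; the sketch below shows its shape. The helpers detect, stitch and
plagdet_score are stubs standing in for our detector (steps 6 and 7), the stitcher of
step 8, and the PAN evaluation script; they are not real interfaces.

# Sketch of the PAN10 parameter sweep around the initial values. The three
# helpers are placeholder stubs, not real interfaces.
from itertools import product

def detect(corpus_dir):                                   # stub: steps 6-7
    return []

def stitch(runs, min_len, max_span, min_post_len, min_cosine):   # stub: step 8
    return runs

def plagdet_score(detections, annotations_dir):           # stub: PAN measure
    return 0.0

RR_values = [45, 50, 55]                # minimum detection run length
SS_values = [700, 900, 1100]            # maximum stitch distance
PR_values = [50, 55]                    # minimum post-stitch run length
CS_values = [0.0, 0.70, 0.75, 0.757]    # minimum cosine score

raw = detect("pan10/")                  # detection is run once
best_score, best_params = 0.0, None
for rr, ss, pr, cs in product(RR_values, SS_values, PR_values, CS_values):
    stitched = stitch(raw, min_len=rr, max_span=ss,
                      min_post_len=pr, min_cosine=cs)
    score = plagdet_score(stitched, "pan10/annotations/")
    if score > best_score:
        best_score, best_params = score, (rr, ss, pr, cs)

print("best PlagDet %.4f with (RR, SS, PR, CS) = %s" % (best_score, best_params))
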
    These parameter values were used for our final PAN11 run, giving us 4th place in
external detection with:
                PlagDet        Recall         Precision      Granularity
                0.2467329      0.1500480      0.7106536      1.0058894
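   As a check on these figures, the PAN evaluation framework defines PlagDet as the
harmonic mean of precision and recall divided by log2(1 + granularity); plugging in
the values above reproduces the overall score:

# Recomputing PlagDet from the reported precision, recall and granularity,
# using the PAN definition: plagdet = F1 / log2(1 + granularity).
import math

precision, recall, granularity = 0.7106536, 0.1500480, 1.0058894
f1 = 2 * precision * recall / (precision + recall)
plagdet = f1 / math.log2(1 + granularity)
print(round(plagdet, 7))   # ~0.246733, matching the reported 0.2467329
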
  Against PAN11 annotations, one of our pre-final submissions could have achieved
a marginally better overall score (0.2485299) using a different approach to stitching.
This had been a speculative approach and had not produced compelling results against
PAN10; it also would not have changed the final rankings.


4 Conclusion

   In our first foray into this competition, we are reasonably satisfied with achieving
4th place in external detection. We placed ourselves between the 5th and 6th placed
competitors from PAN 10, suggesting that we have a reasonably competitive
approach, one that also offers computational efficiency, lowered brittleness, key
proximity that indicates data similarity, and robustness to brute force attacks. Our
approach is able to undertake the core activity of plagiarism detection at high speed:
12 minutes for the entire PAN11 dataset, with similar efficiency on previous PAN data,
whilst some reportedly find it difficult to cope with processing this volume of data.
We would like to think that if we can deal with a few gigabytes within such a time on
one core of a large server, we could readily scale our approach to much bigger data
collections and potentially the entire web. In another experiment, reported in [2], we
investigated news reuse in texts from Reuters Corpus Volume 1 (RCV1 [7]) which
comprises a year’s worth of English news output in 1996/7. We compared 750,000 x
750,000 files, using an m2.xlarge EC2 instance, and reported all cross-copying
between news reports. Processing took approximately 36 minutes, and there were a
reasonable number of indications of content reuse by Reuters’ journalists.
   There was a tension between processing the data quickly and performing as well
as possible within finite time. Without translation, we were ready to submit our
initial results within two hours of the data release, which would have cost a mere $4
in compute time. It would have been possible to complicate our approach with more
demanding computations, but processing time and cost would be a critical factor for
an internet-scaled application, and we prioritized achieving good results quickly
rather than, say, waiting for vastly extended periods only to achieve slightly better
results.
   There are many possibilities for improving our system's detection performance. A
more effective translation alignment approach would be one, but such alignment
would slow our system per run. There are also different search strategies we might deploy in
order to deal better with some of the obfuscation approaches, but again this will come
at a cost. We can clearly improve our XML production processes and filtering
approach. Whether the benefits of these improvements outweigh the costs of
development and speed is something that we cannot yet answer. Clearly there is
plenty of room for improvement to our detections, and deeper investigation into the
characteristics of the PAN11 dataset will help here. One thing we are convinced of is
that we know ways in which to improve, and we may have several possibilities for the
exploitation of such a system.
5 Acknowledgements

   This work has been supported in part by the EPSRC and JISC (EP/I034408/1) and
we are very grateful to Amazon Web Services (AWS) for providing a supporting
grant for this research and for competition use of both EC2 and EBS services.
   We also gratefully acknowledge the sterling efforts of the PAN11 organizers in
managing the competitions, and the helpful contributions of Surrey students David
Cheung and Ankush Sharma to our understanding of stitching approaches.


6 References

[1] Luhn, H. P.: A statistical approach to mechanized encoding and searching of literary
    information. IBM Journal of Research and Development, 1(4), pp309-317, 1957.
[2] Cooke, N. and Gillam, L.: Clowns, Crowds and Clouds: A Cross-Enterprise Approach to
    Detecting Information Leakage without Leaking Information. In Mahmood, Z. and Hill, R.
    (eds.) Cloud Computing for Enterprise Architectures. Springer, 2011. In press.
[3] Kasprzak, J. and Brandejs, M.: Improving the Reliability of the Plagiarism Detection
    System - Lab Report for PAN at CLEF 2010.
[4] Zou, D., Long, W-j. and Ling, Z.: A Cluster-Based Plagiarism Detection Method - Lab
    Report for PAN at CLEF 2010.
[5] Muhr, M., Kern, R., Zechner, M. and Granitzer, M.: External and Intrinsic Plagiarism
    Detection Using a Cross-Lingual Retrieval and Segmentation System - Lab Report for
    PAN at CLEF 2010.
[6] Grozea, C. and Popescu, M.: Encoplot - Performance in the Second International
    Plagiarism Detection Challenge - Lab Report for PAN at CLEF 2010.
[7] Rose T., Stevenson M., and Whitehead M.: The Reuters Corpus Volume 1 – from
    Yesterday’s News to Tomorrow’s Language Resources. In Proceedings of the Third
    International Conference on Language Resources and Evaluation, 2002.