=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-Pan-GillamEt2014
|storemode=property
|title=Evaluating Robustness for 'IPCRESS': Surrey's Text Alignment for Plagiarism Detection
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-GillamEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/GillamN14
}}
==Evaluating Robustness for 'IPCRESS': Surrey's Text Alignment for Plagiarism Detection==
Evaluating robustness for 'IPCRESS': Surrey's text alignment for plagiarism detection

Notebook for PAN at CLEF 2014

Lee Gillam, Scott Notley
Department of Computing, University of Surrey, UK
l.gillam@surrey.ac.uk

Abstract. This paper briefly describes the approach taken to the subtask of Text Alignment in the Plagiarism Detection track at PAN 14. We have now re-implemented our PAN 12 approach in a consistent programmatic manner, courtesy of secured research funding. PAN 14 offers us the first opportunity to evaluate the performance and consistency of this re-implementation. We present results from this re-implementation with respect to various PAN collections, although it is important to note that our target is to be able to undertake plagiarism detection in such a way as would be impervious to a range of attempts to discover the content being matched against: a kind of privacy-preserving plagiarism detection.

1 Introduction

As reported in our PAN 13 notebook paper, having secured funding from the UK government-backed Technology Strategy Board for 18 months, the University of Surrey has been working on the Intellectual Property Protecting Cloud Services in Supply Chains (IPCRESS) project, a collaboration with Jaguar Land Rover and GeoLang Ltd. The IPCRESS project is focused on the difficulty of entrusting valuable Intellectual Property (IP) to third parties, through the Cloud, as is necessary to allow for the construction of components in the supply chain. The key innovation is the ability to track high-value IP without having to reveal that IP, so approaches need to avoid being reversible to text in clear. Such tracking is then suited to the tasks of (i) preventing IP leakage; (ii) detecting IP leakage or theft; and (iii) identifying retention beyond allowed review periods. The project builds from the formulation of such a system proposed in Cooke and Gillam (2011).
This can be formulated as a kind of plagiarism detection, hence the relevance to PAN, but with a more challenging aim: to be able to generate reliable detections without access to the textual content, allowing matches to be undertaken in public without exposing the content of high-value documents that ought to be locked in secure electronic vaults. As such, only those with suitable access to the document in the vault should be able to verify the match.

In this paper, we briefly discuss the simplification of the code-base from our original submissions to the present, much more self-contained, setup, and demonstrate the consistency of results obtained. We also hint at improvements in our treatment of obfuscation that are likely to become a focal point for future work. Section 2 provides a brief summary of results found with re-used software applied to the PAN 2011, PAN 2012 and PAN 2013 datasets. Section 3 discusses the IPCRESS re-implementation. Section 4 presents results of applying IPCRESS to the datasets for PAN 2012 and PAN 2013, and preliminary results found using initial obfuscation handling approaches. Section 5 comments on the PAN 2014 results and future work.

2 Previous PAN results

We have discussed in previous PAN efforts (e.g. Cooke et al., 2011) how our intention is to be able to find matching text without revealing the textual content. In PAN 11, the approach brought us 4th place, with PlagDet=0.2467329, Recall=0.1500480, Precision=0.7106536, Granularity=1.0058894. In 2012, we showed good granularity, with high recall and precision for non-obfuscated text, but not such great recall in the face of obfuscation (see the table below).
Test                        Plagdet Score  Recall   Precision  Granularity
02_no_obfuscation           0.92530        0.90449  0.94709    1.0
03_artificial_low           0.09837        0.05374  0.93852    1.04688
04_artificial_high          0.01508        0.00867  0.96822    1.20313
06_simulated_paraphrase     0.11229        0.05956  0.97960    1.0

In 2013, precision and granularity figures remained high, though recall had dropped. For the different kinds of obfuscation from 2012, recall remains low, though perhaps surprisingly it is better for random obfuscation than for translation or summary.

Test                        Plagdet Score  Recall   Precision  Granularity
02_no_obfuscation           0.85884        0.83788  0.88088    1.0
03_random_obfuscation       0.04191        0.02142  0.95968    1.0
04_translation_obfuscation  0.01224        0.00616  0.97273    1.0
05_summary_obfuscation      0.00218        0.00109  0.99591    1.0

3 The IPCRESS implementation

For the IPCRESS project, the previous codebase needed to be homogenized and developed in such a manner as to be scalable to very large datasets. The previous version/s were memory-based and thus not suitable for use at real scale (hundreds of gigabytes or more). The IPCRESS code has been fully re-designed as a disk-based approach, as an object-oriented implementation in C++. A new stitching algorithm has also been developed.

The IPCRESS approach generates what we refer to as secure stamps from whole documents. From each stamp, we derive a set of shorter individual hash-like codes, I, from sets of words. These codes are considered irreversible. Individual codes are generated by using a sliding window of length W_len across the document stamp to extract that portion of the stamp, in a manner similar to creating shingles. From this set of hash-like codes an index is populated with information related to the current window position within the document stamp and a document ID. This process is illustrated in Figure 1.

Figure 1.
The indexing process using IPCRESS codes

A query for a suspicious document is similarly generated, using the sliding window of length W_len over the stamp of the suspicious document to generate a set of IPCRESS code queries, Q. (Document ID, code position) pairs, i_i ∈ I, are retrieved from the index, I, for each IPCRESS code, q_i ∈ Q, and sorted by document ID to give a set of results, D. Each element, d_srcID ∈ D, relates to a source document and is itself a set of results, T_srcID, where srcID is a source document ID. Each set, T_srcID, is composed of information related to text segments, t_j ∈ T_srcID, each of length W_len; each element, t_j, is a pair composed of {suspicious position, source position}. This relationship is illustrated in Figure 2.

Each set, T_srcID, is then reduced via a first-stage stitching algorithm to produce a set of runs, R_srcID. Each run, r_k ∈ R_srcID, is a 4-tuple consisting of {suspicious start position, suspicious length, source start position, source length}. This first-stage stitching generates each run by finding consecutive elements of T_srcID that are overlapping or consecutive in source position. Any runs, r_k, that are shorter than a defined minimum run length (MRL) are discarded.

Figure 2. Ordering of results from the document query process

A second-stage stitching algorithm then produces a set of text segments, S_srcID, from the set of runs, R_srcID. The algorithm finds subsets, R ⊆ R_srcID, such that the elements of each R are all within a defined stitch distance (SD) of at least one other element of R, in terms of both suspicious and source position. The size of each subset is maximized so that S_srcID is of minimal length.
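The indexing-and-stitching pipeline described above can be sketched as follows. This is an illustrative sketch only: real IPCRESS codes are irreversible hash-like codes derived from document stamps, whereas here plain word-window hashes stand in for them purely to exercise the indexing and first-stage stitching logic, and the values of W_LEN and MRL are assumptions rather than the project's parameters.

```python
# Illustrative stand-in for the IPCRESS indexing and first-stage stitching.
from collections import defaultdict

W_LEN = 3  # sliding-window length (assumed value)
MRL = 2    # minimum run length (assumed value)

def codes(words):
    """One hash-like code per sliding-window position (shingle-style)."""
    return [hash(" ".join(words[i:i + W_LEN]))
            for i in range(len(words) - W_LEN + 1)]

def build_index(doc_id, words, index=None):
    """Populate the index: code -> list of (window position, document ID)."""
    index = index if index is not None else defaultdict(list)
    for pos, code in enumerate(codes(words)):
        index[code].append((pos, doc_id))
    return index

def query(index, words):
    """Look up each query code and group hits by source document ID,
    giving {srcID: [(suspicious position, source position), ...]}."""
    hits = defaultdict(list)
    for susp_pos, code in enumerate(codes(words)):
        for src_pos, doc_id in index.get(code, []):
            hits[doc_id].append((susp_pos, src_pos))
    return hits

def stitch_runs(pairs):
    """First-stage stitching: merge overlapping/consecutive matches into
    runs (susp start, susp length, src start, src length), discarding
    runs shorter than MRL."""
    runs, cur = [], None
    for sp, rp in sorted(pairs):
        if cur and sp <= cur[0] + cur[1] and rp <= cur[2] + cur[3]:
            cur = (cur[0], sp + W_LEN - cur[0], cur[2], rp + W_LEN - cur[2])
        else:
            if cur:
                runs.append(cur)
            cur = (sp, W_LEN, rp, W_LEN)
    if cur:
        runs.append(cur)
    return [r for r in runs if r[1] >= MRL and r[3] >= MRL]
```

For example, indexing "the quick brown fox jumps over the lazy dog" as document 5 and querying "here is the quick brown fox jumps again" yields matching windows at suspicious positions 2-4 against source positions 0-2, which first-stage stitching merges into the single run (2, 5, 0, 5).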
From each subset, R, a new 4-tuple is formed, s_m ∈ S_srcID, that gives {suspicious start position, suspicious length, source start position, source length}; the start positions are given by the first element of R, and the lengths are determined from the last element of R. Any segments found that are shorter than a defined minimum segment length (MSL) are discarded.

3.1 Obfuscation Handling

We consider two initial obfuscation handling approaches based on transformations of a single query into closely related queries. The hash-like codes mentioned above are formulated such that code similarity can be indicative of data similarity, and as such the ‘closeness’ of any two queries can be based on binary distance measures such as Hamming and Levenshtein.

The first approach, based on Hamming distance, generates transformed queries within a given maximum Hamming distance of the original query. For an original query, q_i ∈ Q, of length W_len, this approach will generate an extra W_len transformations. For a Hamming distance of 1, say, this method involves an extra W_len look-ups for each initial query.

The second approach, based on the Levenshtein distance, similarly generates transformed queries from each original query, q_i. For a query length of W_len, this approach generates two sets of transformations, T_i0 and T_i1, where i refers to a word position within the query, q_i, and lies in the range 0 ≤ i < W_len. This approach requires 2*W_len extra look-ups for each query. For each insertion, the transformed query, T_in, is masked to length W_len for index compatibility.

4 IPCRESS vs previous PAN collections

Prior results offer up a standard to be achieved in re-implementation.
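The Hamming-based expansion of Section 3.1 can be sketched as follows. Treating a query as W_len word-level units and wildcarding one unit per variant is our illustrative assumption about how the codes admit position-wise relaxation; the real codes are binary, but the W_len-extra-look-ups behaviour is the same.

```python
# Sketch of Hamming-style query expansion at distance 1 (assumed
# word-unit granularity). WILD marks the single relaxed position.
W_LEN = 3   # query length in units (assumed value)
WILD = None

def hamming_variants(query):
    """All variants at distance exactly 1: one per position, W_LEN total."""
    variants = []
    for i in range(W_LEN):
        v = list(query)
        v[i] = WILD
        variants.append(tuple(v))
    return variants

def matches(candidate, variant):
    """A candidate code matches if it agrees on every unmasked unit."""
    return all(v is WILD or v == c for c, v in zip(candidate, variant))
```

Each variant then drives one extra index look-up. The Levenshtein approach would analogously shift units in or out of the query and mask the result back to length W_len, giving the 2*W_len extra look-ups noted above.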
The new codebase has been tested against data from PAN 12 and PAN 13, with modifications to the algorithm largely demonstrating slightly improved performance, as shown in the tables below.

IPCRESS raw – PAN 12 data

Test                        Plagdet Score  Recall   Precision  Granularity
02_no_obfuscation           0.9437         0.9045   0.9877     1.0008
03_artificial_low           0.0956         0.0525   0.9942     1.0608
04_artificial_high          0.0200         0.0118   0.9852     1.2459
06_simulated_paraphrase     0.0992         0.0522   0.9922     1.0000

Obfuscation handler #1 (Hamming)

02_no_obfuscation           0.9358         0.9048   0.9703     1.0008
03_artificial_low           0.1970         0.1110   0.9853     1.0178
04_artificial_high          0.0373         0.0201   0.9577     1.0759
06_simulated_paraphrase     0.1512         0.0825   0.9038     1.0000

Obfuscation handler #2 (Levenshtein)

02_no_obfuscation           0.9236         0.9057   0.9423     1.0000
03_artificial_low           0.1888         0.1066   0.9820     1.0266
04_artificial_high          0.0682         0.0368   0.9489     1.0535
06_simulated_paraphrase     0.1345         0.0723   0.9572     1.0000

IPCRESS raw – PAN 13 data

Test                        Plagdet Score  Recall   Precision  Granularity
02_no_obfuscation           0.9253         0.9273   0.9233     1.0000
03_random_obfuscation       0.1356         0.0729   0.9675     1.0000
04_translation_obfuscation  0.0243         0.0123   0.9865     1.0000
05_summary_obfuscation      0.0022         0.0011   0.9959     1.0000

Obfuscation handler #1 (Hamming)

02_no_obfuscation           0.9029         0.9289   0.8783     1.0000
03_random_obfuscation       0.1297         0.1297   0.9120     1.0000
04_translation_obfuscation  0.0244         0.0244   0.8953     1.0000
05_summary_obfuscation      0.0035         0.0017   0.9807     1.0000

Obfuscation handler #2 (Levenshtein)

02_no_obfuscation           0.9058         0.9274   0.8853     1.0000
03_random_obfuscation       0.2151         0.1224   0.8936     1.0000
04_translation_obfuscation  0.0743         0.0386   0.9533     1.0000
05_summary_obfuscation      0.0035         0.0017   0.9920     1.0000

5 IPCRESS vs PAN 2014 collections and Future Work

PAN 2014 test results showed expected granularity and precision, but a surprising difference between values for recall. Investigations led to the discovery of a bug in detecting UTF-8 codes; when applied to the PAN 2012 and 2013 collections, a similar lowering of values was also observed.
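For reference, the Plagdet score used in these evaluations is the standard PAN measure combining the F1 of precision and recall with a granularity penalty:

```python
# Plagdet as defined for the PAN plagiarism detection evaluations:
# plagdet = F1(precision, recall) / log2(1 + granularity)
from math import log2

def plagdet(precision, recall, granularity):
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / log2(1 + granularity)
```

With granularity 1 the penalty term is log2(2) = 1, so Plagdet reduces to plain F1; for instance, plagdet(0.88630, 0.16840, 1.0) reproduces the PAN 2014 Corpus 2 score of 0.28302.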
Further, our initial attempts at handling obfuscation show some promise, but much more rigorous evaluation will be required to determine the fullest extent of impact achievable by these approaches on the hash-like codes.

Test data  Plagdet  Precision  Recall   Granularity  Runtime
Corpus 2   0.28302  0.88630    0.16840  1.00000      00:00:55
Corpus 3   0.44076  0.85744    0.29661  1.00000      00:00:56

Through PAN 2014, we have demonstrated that the IPCRESS code produces results comparable to, and even slightly better than, the previous implementation, and effort has been put into ensuring the implementation is suited to scaling to very large datasets. These results are certainly not going to be anywhere near the best that is possible when evaluating similarity between texts where the content is fully exposed. However, that is not our challenge, and it is important to note again that our specific challenge is to be able to undertake plagiarism detection in such a way as would be impervious to a range of attempts to discover the content being matched against: a kind of privacy-preserving plagiarism detection that can be used against documents whose content should be kept from plain sight.

Acknowledgements

The authors gratefully recognize the prior contributions of Neil Newbold, Neil Cooke, Peter Wrobel and Henry Cooke to the formulation of the codebase used for prior versions of this task, and of Cooke and Wrobel to the patents generated from these efforts. This work has been supported in part by the EPSRC and JISC (EP/I034408/1) and more substantially since PAN 13 by the UK's Technology Strategy Board (TSB, 169201). The authors are also grateful for the efforts of the PAN organizers in system provision and managing the submissions.

References

1. Cooke, N. and Gillam, L.: Clowns, Crowds and Clouds: A Cross-Enterprise Approach to Detecting Information Leakage without Leaking Information. In: Mahmood, Z., Hill, R. (eds.) Cloud Computing for Enterprise Architectures, pp. 301-322.
Springer, London (2011).
2. Cooke, N., Gillam, L., Wrobel, P., Cooke, H., Al-Obaidli, F.: A High Performance Plagiarism Detection System. In: Proceedings of CLEF 2011 Labs and Workshops, Notebook Papers, 3rd PAN Workshop (2011).