From English to Persian: Conversion of Text Alignment for
                  Plagiarism Detection

                           Lee Gillam                                                         Anna Vartapetiance
               Department of Computer Science                                           Department of Computer Science
                      University of Surrey                                                    University of Surrey
                              UK                                                                      UK
                    l.gillam@surrey.ac.uk                                                a.vartapetiance@surrey.ac.uk


ABSTRACT                                                               In PAN 11, this approach gained 4th place, with
This paper briefly describes the approach taken to Persian             PlagDet=0.2467329, Recall=0.1500480, Precision=0.7106536,
Plagiarism Detection based on modification to the approach used        Granularity=1.0058894. In 2012, we showed good granularity,
for PAN between 2011 and 2014 in order to adapt to Persian. This       with high recall and precision for non-obfuscated text, but not
effort has offered us the opportunity to evaluate detection            such great recall when faced with higher orders of obfuscation,
performance for the same approach with another language. A key         and subsequent results are consistent or slightly better.
part of the motivation remains that of undertaking plagiarism          In this paper, we assess our efforts against texts in Persian, with
detection in such a way as to make it highly unlikely that the         some commentary on the effects of the make-up of the dataset.
content being matched against could be determined based on the         Section 2 presents findings in respect to the training data of
matches made, and hence to allow for privacy.                          Persian Plagdet, and Section 3 addresses the test data. Section 4
                                                                       concludes the paper and considers future work.
CCS Concepts
• Information systems → Near-duplicate and plagiarism                  2. TRAINING DATA
detection • Information systems → Evaluation of retrieval
results.
                                                                       The Persian Plagdet training data, dated 2016-07-17 comprises
Keywords                                                               1563 source documents and 1525 suspicious documents. There
Plagiarism detection; text alignment; Persian; PAN                     are some 2749 associated annotations, the breakdown of which is
                                                                       shown below in Table 1.
1. INTRODUCTION
Detection of plagiarism has been shown to be beneficial to both            Table 1. Breakdown of 2749 potential cases of plagiarism
education and research, ensuring that students and researchers                              within training data.
alike are demonstrating their own understanding and findings. The                01-no-plagiarism                 783        28%
same techniques can be used against the document archives of an                 02-no-obfuscation                 208        8%
organization to improve how such information is managed. And                  03-random-obfuscation              1611        59%
we have successfully demonstrated how related techniques can be              04-simulated-obfuscation             147        5%
used to protect by preventing the accidental propagation of                         Total files                  2749
corporate information deemed to be of high value in an Innovate
UK project on Intellectual Property Protecting Cloud Services in
Supply Chains (IPCRESS) collaboratively with Jaguar Land               It is striking that there is such a significant minority of
Rover and GeoLang Ltd [1]. This work was grounded in our               unobfuscated cases, in contrast to 59% ‘random’. A high-
previous PAN efforts, e.g. [2], undertaken with respect to finding     performing system must therefore be able to address the nature of
matching text without, at the time of the match, directly using the    this randomness, and this was uncovered during training as it
textual content itself (e.g. n-grams) or using patterns as could be    seemed difficult to push performance beyond a certain point.
uniquely resolved to the textual content. The approach [3]             Following re-implementation of our approach in Python for ease
produces a minimal representation of an input text by                  of inspection (and necessary upgrading to Python 3.4 so that wide
distinguishing content words from auxiliary words, and producing       characters were handled properly), and a switch to a Persian
matchable binary patterns directly from these dependent on the         stoplist1, initial tests were conducted to ascertain performance
number of classes of interest in each. This acts like hashing, but     against the training data.
no effort is taken to ensure collision-avoidance; indeed, the
approach actively encourages hash collision over short distances,      Key parameters for the system are:
as acts to prevent reverse-engineering of the patterns, and uses the       1. length used for match (windowSize, measured in words)
number of coincident matches to indicate the extent of similarity.         2. distance within which two non-overlapping matches
Such an approach is, therefore, more suited for longer initial                 can be merged (merge_dist, measured in characters)
pattern matching. Further, with the intention to undertake match
without access to the textual content, there is a need for
subsequent verification of potential matches based on access to        1
the text, which can be undertaken automatically here, but would                 Obtained       from:    https://github.com/kharazi/persian-
be anticipated to involve delegation of permissions within the             stopwords/blob/master/persian, comprising some 778 words. An
kinds of system envisaged.                                                 alternative of just 330 words was also considered but not used:
                                                                           http://www.ranks.nl/stopwords/persian
     3.   minimum proportion of shared words in matched                        To begin with, if passages are constructed which would not be
          segments (match_threshold), required due to the                      meaningful within the language, there would seem to be limited
          methodology used as verification of match                            gain from its treatment here, as there is a question over how this
     4.   minimum length of a final match (min_length), used                   reflects the reality of the problem being addressed. In addition, if
          later to filter out short matches.                                   the extent of change is high as would require significant human
                                                                               effort to reproduce, the likelihood of such highly edited passages
Brute force efforts were used to obtain approximate values for                 in the wild would seem to be lessened unless approaches are
parameters 1-3, with:                                                          partially automated and do not undergo post-editing. It also
    1. windowSize from 15 to 35, in increments of 5                            becomes difficult to address the difference between necessary
    2. merge_dist from 100 to 225 in increments of 25, and                     inclusion within a focused discussion, and an act of deliberate
         then honed in increments of 5                                         copying.
    3. match_threshold from 0.5 to 0.95 in increments of 5
                                                                                Table 3. Breakdown of scoring against specific subsets of the
The   best    initial  combination     was    windowSize=30,                                          training data.
merge_dist=210, match_threshold=0.75, producing the following:
                                                                                                         02 before          02           03             04
                                                                                                          content
  Table 2. Best scores on training data following initial brute                                            check
                     force determination.
                                                                                Plagdet Score             0.5740       0.9050           0.0031         0.0426
          Plagdet Score                           0.1894
             Recall                               0.1055                            Recall                0.9817       0.9729           0.0016         0.0357
            Precision                             0.9542                          Precision               0.4055       0.8461           0.0476         0.0605
           Granularity                            1.0043
                                                                                 Granularity              1.0000       1.0000           1.0000         1.0769
Following the investigation into the make-up of the training data,                 % cases                 8%            8%              59%            5%
explained above, the performance against each subset was
evaluated. Initially, we determine the detection capability with               Consider the example passages below with matches between
match_threshold=0 (second column), i.e. without content                        source and suspicious, with colour used to identify those passages
checking, then look at each subset with checking (subsequent                   shared. These passages are not fully matchable: lowermost in the
columns), as shown in Table 3.                                                 table are two matched sub-passages where the maximal fragment
                                                                               length is 5, and the sub-matches are mostly smaller. Clearly, when
Such scores demonstrate acceptable performance for the approach                the initial extents are short this will favour those approaches that
used, which is geared towards copying with low levels of                       address short n-grams. In addition, these passages include extents
obfuscation, as even without subsequent checking of matches                    of text in that are quite different between the two, and this brings
recall is very high for non-obfuscated text. The low recall for                implications to the treatment of the gap between passages as well
obfuscation therefore leads us to explore the nature of the kind of            as to any verification step as addresses word overlap.
obfuscation in use, and here the use of so-called random
obfuscation within the training data certainly merits discussion.

      Table 4. Matching segments in random obfuscation, within annotations provided for the training data. Highlighted parts
       lowermost show differences between matched segments, as assists in identifying favourable n-gram sizes for matching.
Suspicious
                                           . ‫ ساختواى اصلی ًادرستیست کتاب شیعی تشکیل قصیذٍ دادُاًذ‬، ‫ کاٍ است‬، ‫ کذخذا ّ ُوَ خزّص‬، ‫ گٌج‬، ْ‫ٍ عِذآبخ‬
  ‫ اّلیي بیشیلِپیلَ ًاهَ طفزٍ اًذر‬. ‫ابزاُین گلستاى در اعتزاض کزدى اسن غیزقاًًْی تیغ قٌذک ّ جزاحی شذٍ کتابش بَ جوِْری اسالهی ایزاى دّ کار هیٌْیسذ‬
. ‫ اگز خْارّبارفزّشی بزای شزم کٌِسال‬، ‫ شزهآّر اًتشاراتیای‬، ‫ خطاب بَ هذیز است یا کتاب در آى یغوا شذٍ ًْشتَ_شذٍ_است‬۳۱۳۱ ‫ تیز قوز‬۱۳ ‫تپَ تاریخ‬
  ٍ‫ هي ّقتن را اس تلف ًویکٌن هعصیتبار ساّیَ کَ اس هٌافی عالقَ هٌذی کساًی چٌیي کزدُاًذ شکایتی اشغال بَ را هاخذ بَ اقاهتگا‬. ‫ چاپ ًاهَ ایي‬. ‫ایي اقذام شوا‬
                      ‫ ًویشٌاسن … بِتز است ایي باشذ کش رفتٌِا ّ هثلَ کزدًِا بواًذ بزقزار شخ بزای عولَٔ شک‬، ‫ببزم کاٍ چٌیي هقام ّ هزجع را اگز ُن باشذ‬
Source
                                         ‫ ساختارُای اصلی ایي کتاب را تشکیل دادُاًذ‬،‫ کَ هظِز بیذاری است‬،‫ کذخذا ّ اس ُوَ هِوتز خزّص‬،‫ گٌج‬،ٍ‫جشیز‬.
        ‫ خطاب بَ هذیز‬۳۱۳۱ ٍ‫ تیز ها‬۱۳ ‫ اّلیي ًاهَ در تاریخ‬.‫ابزاُین گلستاى در اعتزاض بَ چاپ غیزقاًًْی ّ جزاحی شذٍ کتابش در ایزاى دّ ًاهَ هیٌْیسذ‬
                                                                                                 ‫اًتشاراتیای کَ کتاب در آى چاپ شذٍ ًْشتَ شذٍ است‬:
 َ‫ هي ّقتن را بیش اس تلف ًویکٌن کَ اس کساًی ک‬.‫ اسن ایي کار ًادرستیست‬.‫ ایي اقذام شوا ًادرست بْدٍ است‬.‫ اگز بزای شزم هعٌایی باقیواًذٍ باشذ‬،‫شزهآّر است‬
    ‫ ًویشٌاسن … بِتز است ایي کش رفتٌِا ّ هُثلَ کزدًِا بواًذ بزای‬،‫ کَ چٌیي هقام ّ هزجع را اگز ُن باشذ‬،‫چٌیي کزدُاًذ شکایتی بَ هزجعی یا بَ هقاهی ببزم‬
                                                                                                                                        َ‫عولَٔ شکٌج‬.
                                                                                                ‫گٌج کذخذا ّ *** ُوَ *** خزّص‬
                                                                                                    ‫خزّص‬      ‫هِوتز‬    َ‫ُو‬         ‫اس‬    ّ     ‫کذخذا‬     ‫گٌج‬
      ‫شزهآّر‬    ‫است‬       ٍ‫شذ‬   َ‫ًْشت‬   ٍ‫شذ‬    ‫یغوا‬   ‫آى‬    ‫در‬    ‫کتاب‬    ‫یا‬        ‫است‬      ‫هذیز‬   َ‫ب‬     ‫خطاب‬      ۳۱۳۱        ‫قوز‬    ‫تیز‬    ۱۳       ‫تاریخ‬
      ‫شزهآّر‬    ‫است‬       ٍ‫شذ‬   َ‫ًْشت‬   ٍ‫شذ‬    ‫چاپ‬    ‫آى‬    ‫در‬    ‫کتاب‬   َ‫ک‬     ‫اًتشاراتی‬    ‫هذیز‬   َ‫ب‬     ‫خطاب‬      ۳۱۳۱        ٍ‫ها‬    ‫تیز‬    ۱۳       ‫تاریخ‬
Assuming that the test data would be formulated similarly to the       reasons for the detection performance in respect to random
training data, peak performance would be somewhat constrained,         obfuscation after it was apparent that an improved plagiarism
and with time available, further tests were performed using rather     detection score could not be achieved.
shorter initial word numbers (windowSize values) and lower
values for other parameters.                                           5. ACKNOWLEDGMENTS
                                                                             The authors gratefully recognize prior contributions of Neil
     Table 5. Example scores from further brute force                  Cooke, Scott Notley, Peter Wrobel, Neil Newbold and Henry
   determination, showing increases in recall for smaller              Cooke in respect to the approach in previous competitions, and by
windowSize, at the cost of drops in precision and granularity.         Neil Cooke and Peter Wrobel to the patents generated from these.
                                                                       We recognize, also, prior support from the EPSRC and JISC
      merge_dist              50            50             50
                                                                       (EP/I034408/1) and by Innovate UK (TSB, 169201). The authors
      windowSize              20            10              8          are also grateful for the efforts of the Persian Plagdet organizers in
    match_threshold          0.35          0.35           0.35         formulating proceedings [5] and for system and data provision [6],
      min_length             200           200            200          [7], [8].
     Plagdet Score          0.1478        0.3669         0.4231
                                                                       6. REFERENCES
         Recall             0.0859        0.3055         0.4568        [1] Gillam, L., Notley, S., Broome, S. and Garside, D. 2015
        Precision           0.9514        0.8493         0.7586            IPCRESS: Tracking Intellectual Property through Supply
       Granularity          1.0935        1.3370         1.5453            Chains in Clouds. In Raghavendra Rao, N. (ed.) Enterprise
                                                                           Management Strategies in the Era of Cloud Computing. IGI-
3. TEST DATA                                                               Global.
From Table 4, above, values used for the submission based on test
                                                                       [2] Cooke, N., Gillam, L., Wrobel, P. Cooke, H. and Al-Obaidli,
data are in the final column. Results from all participants are
                                                                           F. 2011 "A high performance plagiarism detection system".
shown below in Table 5. Although only achieving 8 th place of 9,
                                                                           Proceedings of the 3rd PAN workshop.
with all other participants are from Iranian institutions, we are
satisfied that we have managed to maintain the core of our             [3] Cooke, N and Gillam, L. 2012. System, process and method
approach, which we were already aware was only robust to a                 for the detection of common content in multiple documents
certain extent of obfuscation and is not readily tuned to random           in an electronic system. U.S. Patent filing US13/307,428,
effects as may not necessarily be readable unless via a codebook.          filed 30th November 2011.
                                                                       [4] The      Computer    Language       Benchmarks     Game:
               Table 6. Results of all participants.                       http://benchmarksgame.alioth.debian.org/u64q/compare.php?
                                                                           lang=python3&lang2=gpp
       Rank    Plagdet    Granularity    Precision     Recall
                                                                       [5] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and
                                                                           Potthast, M., 2016. Algorithms and Corpora for Persian
                                                                           Plagiarism Detection: Overview of PAN at FIRE 2016. In
         1     0.92204      1.00146       0.92688      0.91919
         2     0.90593          1         0.95927      0.85820             Working notes of FIRE 2016 - Forum for Information
         3     0.87103          1         0.89258      0.85049             Retrieval Evaluation, Kolkata, India, December 7-10, 2016,
         4     0.83015      1.03968       0.92034      0.79602             CEUR Workshop Proceedings, CEUR-WS.org.
         5     0.80083         1.0        0.93337      0.70124         [6] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P.
         6     0.77496      1.22759       0.96383      0.83615             2010. An Evaluation Framework for Plagiarism Detection. In
         7     0.72662          1         0.74962      0.70499             Huang, C-R and Jurafsky, D. (eds.), 23rd International
         8     0.39968      1.52803       0.75484      0.41407             Conference on Computational Linguistics (COLING 10),
         9     0.38994      3.53698       0.90002      0.80659
                                                                           pages 997-1005, Stroudsburg, Pennsylvania, August
                                                                           2010.Association for Computational Linguistics.
4. CONCLUSIONS                                                         [7] Gollub, T., Stein, B. and Burrows, S.. 2012. Ousting Ivory
In this paper, we briefly described the approach taken to Persian          Tower Research: Towards a Web Framework for Providing
Plagiarism Detection, based on modification to the approach used           Experiments as a Service. In Hersh, B., Callan, J., Maarek,
for PAN between 2011 and 2014. Detection performance for                   Y. and Sanderson. M. (eds.), 35th International ACM
Persian is appropriate with respect to the nature of application we        Conference on Research and Development in Information
have in mind, and the large proportion of randomly obfuscated              Retrieval (SIGIR 12), pages 1125-1126, August 2012. ACM.
data, allied to the manner in which obfuscation is conducted,              ISBN 978-1-4503-1472-5.
limits what our approach would achieve. Runtime performance is         [8] Potthast, M., Gollub, T., Rangel,, F., Rosso, P., Stamatatos,
also inadequate, however this is believed due in significant part to       E. and Stein, B.. Improving the Reproducibility of PAN's
using a Python implementation instead of our main C++ codebase             Shared Tasks: Plagiarism Detection, Author Identification,
– and we are aware that some suggest Python 3 is significantly             and Author Profiling. In Kanoulas, E. et al, editors,
slower than C++ for the majority of tasks [4]. Evaluating a                Information Access Evaluation meets Multilinguality,
standard n-gram approach, via C++, would be expected to                    Multimodality, and Visualization. 5th International
improve detection performance against these data.                          Conference of the CLEF Initiative (CLEF 14), pages 268-
It is worth noting that the first author has no familiarity with           299, Berlin Heidelberg New York, September 2014.
Persian languages, and only sought the co-author’s advice on               Springer. ISBN 978-3-319-11381-4.