A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

Nava Ehsan
School of Electrical and Computer Engineering, College of Engineering, University of Tehran
n.ehsan@ece.ut.ac.ir

Azadeh Shakery
School of Electrical and Computer Engineering, College of Engineering, University of Tehran
shakery@ut.ac.ir

ABSTRACT
The task of plagiarism detection entails two main steps: suspicious candidate retrieval and pairwise document similarity analysis, also called detailed analysis. In this paper we focus on the second sub-task. We report our monolingual plagiarism detection system, which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passages, a plagiarism detection method based on a vector space model, insensitive to context reordering, is presented. We evaluate the performance in terms of precision, recall, granularity and the plagdet score.

CCS Concepts
•Information systems → Near-duplicate and plagiarism detection; •Applied computing → Document analysis;

1. INTRODUCTION
The task of plagiarism detection comprises two main steps: candidate document retrieval and pairwise document detection, also called detailed analysis. Some researchers also include a third step, called post-processing, where the extracted passage pairs are cleaned, filtered, and possibly visualized for later presentation [18]. The detailed analysis step of the plagiarism detection task retrieves passages of text which have originated from another document. This process compares source and suspicious pairs and retrieves plagiarized fragments. In this paper we introduce a vector space model for this task. The proximity of words is considered by dividing the text into passages. After creating the passages, we do not take into account the order of the words within each passage. Thus, this approach is insensitive to punctuation, extra white space and permutation of the document context within the passages. We also do not use any language-specific feature, so our approach is applicable to any language. The result we obtained on the training data was about 0.82 with respect to the plagdet score.

The rest of the paper is organized as follows: Section 2 outlines related work in monolingual plagiarism detection. Section 3 describes the pairwise document similarity approach. Experimental results are discussed in Section 4. Conclusions and future work are reported in Section 5.
2. RELATED WORK
Plagiarism detection can be classified into two classes, intrinsic and external, also called detection without reference and with reference, respectively. Intrinsic detection refers to methods that use style analysis to detect parts of the text that are inconsistent in terms of writing style [15, 14]. The aim of external plagiarism detection is not only finding the suspicious text but also finding the reference of the suspicious text. In monolingual plagiarism detection, the suspicious text could be an exact copy or a modified copy [17], and it should be large enough to be more than just a coincidence.

One method for monolingual plagiarism detection is comparing fragments of suspicious and source documents using fingerprint indexing. The winnowing approach [22], which is used in the widely used plagiarism detection tool MOSS, is based on fingerprint indexing.

There are different approaches for monolingual plagiarism detection. Some of these approaches can be classified into fingerprinting [22, 13], string matching [8], stopword-based methods [24], vector space models [12, 5], probability models [3], classification [16], semantic models [6, 7] and structural models [23]. According to the monolingual experiments in [4], paraphrasing can make plagiarism detection more difficult. There has been some work on detecting paraphrased sentences in monolingual texts [1].

In recent years the PAN competition has offered an evaluation environment for plagiarism detection algorithms, along with evaluation corpora. The PAN@FIRE Persian Plagdet 2016 competition [2] provides a set of suspicious and source documents written in Persian, and the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections. This external plagiarism detection task provides a setting to evaluate Persian plagiarism detection systems.

3. PAIRWISE DOCUMENT SIMILARITY ANALYSIS
In this section we deal with pairwise document similarity analysis and introduce a method based on a vector space model to detect plagiarized fragments of a specified set of suspicious and source document pairs. We assume that the source and suspicious pairs are pre-defined for the system. Following the PAN competition terminology, this process is called the detailed analysis stage. The problem of pairwise document analysis is defined as follows: let S' be a suspicious document and S be a source document that is likely to contain passages similar to some passages in S'. S and S' are compared section-wise using a retrieval model. A plagiarism case is reported if, for a pair of sections (S_f, S'_f'), a similarity above a threshold is detected.

According to [18], the detection approaches for this sub-task include three building blocks: (1) seeding, (2) match merging, and (3) extraction filtering. In the following subsections, we describe them in detail.

3.1 Seeding
Given a suspicious document and a source document, matches between the two documents are identified using seed heuristics [18]. In our approach, a preprocessing phase is performed first. Given a document, we substitute the Arabic characters ي and ك with the Persian characters ی and ک, since the two letters have different character encodings. Then, stopwords, punctuation and extra white space are removed and tokens are extracted.

Since plagiarism usually happens in parts of the text, a plagiarism detection method should be able to detect local similarities where only a short passage may be common to both documents. Thus, there is a requirement to segment the texts into fragments. For each document pair, we split the texts into sentences using the ".", "?" and "!" marks. We choose a number of consecutive sentences as the smallest unit of plagiarism. Documents are divided into fragments, each containing n sentences, with one sentence of overlap. The sensitivity of the algorithm with respect to the parameter n is shown in Section 4.

After sentence splitting, a vector space model is created. The terms of the source document are considered as the vocabulary, and a binary weighting scheme is used by setting the i-th index to 1 if the i-th term occurs in the fragment and to 0 otherwise. We regard a source passage S_f and a suspicious passage S'_f' as a pair of plagiarism candidates if their cosine similarity is greater than a threshold t1 and at least three terms of the source text are found in the suspicious fragment. The last criterion is added to avoid retrieving fragments with coincidental similarity. S_f is considered as the plagiarism source of S'_f' if it has the maximum cosine similarity among all source fragments with similarity above the threshold. The vector creation approach is similar to the Eurovoc-based model proposed in [20], except that we use it for monolingual rather than cross-lingual texts. Thus, this approach could be easily adapted to cross-language plagiarism detection.
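To make the seeding stage concrete, the following Python sketch reproduces the procedure described above under stated assumptions; it is not the system's actual code. The threshold T1, the fragment size N, the empty stopword placeholder and all function names are illustrative.

```python
# A minimal sketch of the seeding stage (illustrative, not the authors' implementation).
import re
from math import sqrt

T1 = 0.3          # cosine-similarity threshold t1 (assumed value)
N = 3             # sentences per fragment (assumed value)
STOPWORDS = set() # a real Persian stopword list would be plugged in here

def normalize(text):
    # Unify Arabic and Persian forms of yeh/kaf so identical words match.
    return text.replace("\u064A", "\u06CC").replace("\u0643", "\u06A9")

def tokens(sentence):
    words = re.findall(r"\w+", normalize(sentence))
    return [w for w in words if w not in STOPWORDS]

def fragments(text, n=N):
    # Split on ".", "?", "!" and group n sentences with one-sentence overlap.
    sents = [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]
    step = max(n - 1, 1)
    return [" ".join(sents[i:i + n]) for i in range(0, len(sents), step)]

def cosine(a, b):
    # Binary weighting: each fragment is represented by the set of its terms.
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

def seeds(source_text, susp_text):
    src = [set(tokens(f)) for f in fragments(source_text)]
    sus = [set(tokens(f)) for f in fragments(susp_text)]
    pairs = []
    for j, sv in enumerate(sus):
        best = None
        for i, fv in enumerate(src):
            sim = cosine(fv, sv)
            # Require similarity above t1 and at least three shared source terms,
            # keeping only the best-matching source fragment per suspicious fragment.
            if sim > T1 and len(fv & sv) >= 3:
                if best is None or sim > best[2]:
                    best = (i, j, sim)
        if best:
            pairs.append(best)
    return pairs  # (source fragment index, suspicious fragment index, similarity)
```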
3.2 Match Merging
Once seed matches between a suspicious and a source document have been found, they are merged into aligned passages of maximal length, which are then reported as plagiarism detections [18]. To improve performance with respect to the granularity metric, we merge adjacent suspicious and source fragments and report them as a single plagiarism case. If the number of characters between two detected fragments of the source and suspicious documents is below a threshold t2, those fragments are considered adjacent and are merged into a single plagiarism case.
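A minimal sketch of this merging rule is given below. It assumes detections are represented as pairs of character spans in the source and suspicious documents and that both gaps must fall below t2, a detail the description above does not spell out.

```python
# Hedged sketch of match merging: neighbouring detections are merged when the
# character gap between them is below the adjacency threshold t2 (1500 in Section 4).
T2 = 1500

def merge_adjacent(detections, t2=T2):
    """detections: list of ((src_start, src_end), (susp_start, susp_end)) tuples."""
    merged = []
    for det in sorted(detections, key=lambda d: d[1][0]):  # order by suspicious offset
        if merged:
            (ps, pe), (qs, qe) = merged[-1]
            (cs, ce), (ds, de) = det
            # Merge when both the source-side and suspicious-side gaps are below t2.
            if cs - pe < t2 and ds - qe < t2:
                merged[-1] = ((ps, max(pe, ce)), (qs, max(qe, de)))
                continue
        merged.append(det)
    return merged
```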
3.3 Extraction Filtering
Given a set of aligned passages, a passage filter removes all aligned passages that do not meet certain criteria, for example overlapping or extremely short passages [18]. After retrieving potential plagiarized fragments in the previous steps, the sentences within a fragment are partitioned into non-overlapping n-grams. For the extraction filtering step we applied a method similar to the result filtering approach proposed in [10] to exclude false positive detections and improve precision. We use a dynamic programming algorithm to find the alignments between the n-grams of the source and suspicious texts, and the null alignments are then excluded from the start and end of the reported fragments.
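The sketch below illustrates the filtering idea only approximately: it partitions a detected fragment into non-overlapping 9-term n-grams and trims unmatched n-grams from both ends, but it replaces the dynamic programming n-gram alignment of [10] with a simple shared-term test, so it is an assumption-laden stand-in rather than the actual method.

```python
# Simplified stand-in for extraction filtering (not the alignment algorithm of [10]).
NGRAM = 9

def ngrams(terms, n=NGRAM):
    # Non-overlapping n-grams; the final partition may hold fewer terms.
    return [terms[i:i + n] for i in range(0, len(terms), n)]

def trim_fragment(susp_terms, source_terms, n=NGRAM, min_shared=1):
    source_vocab = set(source_terms)
    grams = ngrams(susp_terms, n)
    # An n-gram "aligns" here if it shares at least min_shared terms with the source
    # (min_shared is an assumed stand-in for the real alignment criterion).
    aligned = [len(set(g) & source_vocab) >= min_shared for g in grams]
    if not any(aligned):
        return []  # the whole detection is filtered out
    first = aligned.index(True)
    last = len(aligned) - 1 - aligned[::-1].index(True)
    # Keep only the span between the first and last aligned n-grams.
    return [t for g in grams[first:last + 1] for t in g]
```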
4. RESULTS
The results of the Persian detailed analysis subtask of PAN@FIRE Persian Plagdet 2016 [2] on the training data are summarized in Tables 1, 2 and 3. The experimentation platform TIRA is used for our evaluations [11, 19].

The parameters t1 and n are, respectively, the similarity threshold and the number of consecutive sentences described in Section 3.1. The adjacency threshold t2, described in Section 3.2, is set to 1500 characters. The n value for n-gram creation described in Section 3.3 is chosen to be 9 for each sentence, except for the final partition of a sentence, which may include fewer or more than 9 terms.

The evaluation metrics are described in [21]. The evaluation is based on macro-averaged precision and recall. In addition, the granularity measure characterizes the power of a detection algorithm: it shows whether a plagiarism case is detected as a whole or in several pieces. The plagdet score combines the three metrics, precision, recall and granularity, and is defined as follows for plagiarism cases S and plagiarism detections R [21], where F1 is the harmonic mean of precision and recall:

    plagdet(S, R) = F1 / log_2(1 + granularity(S, R))                    (1)

                   Precision  Recall  Granularity  Plagdet
(t1 = 0.4, n = 5)  0.8532     0.5357  1            0.6582
(t1 = 0.3, n = 5)  0.7630     0.7486  1            0.7557
(t1 = 0.2, n = 5)  0.4004     0.8151  1            0.5370
(t1 = 0.3, n = 3)  0.7867     0.8304  1            0.8080
(t1 = 0.3, n = 2)  0.8482     0.7876  1            0.8168
Table 1: Results of the detailed analysis sub-task using macro-averaged precision, recall, granularity, and the plagdet score

                   Precision  Recall  F1
(t1 = 0.4, n = 5)  0.9434     0.5478  0.6931
(t1 = 0.3, n = 5)  0.8484     0.7631  0.8035
(t1 = 0.2, n = 5)  0.4379     0.8292  0.5731
(t1 = 0.3, n = 3)  0.8931     0.8518  0.8720
(t1 = 0.3, n = 2)  0.9069     0.8162  0.8592
Table 2: Results of the detailed analysis sub-task at case level

                   Precision  Recall  F1
(t1 = 0.4, n = 5)  0.9457     0.5534  0.6982
(t1 = 0.3, n = 5)  0.8798     0.7701  0.8213
(t1 = 0.2, n = 5)  0.6266     0.8342  0.7156
(t1 = 0.3, n = 3)  0.9312     0.8555  0.8918
(t1 = 0.3, n = 2)  0.9452     0.8220  0.8793
Table 3: Results of the detailed analysis sub-task at document level

The tables show that the plagdet score improves as the number of sentences in a fragment decreases. The reason could be that there are more short plagiarized texts than long ones in the dataset. The increase in precision when comparing different numbers of sentences per fragment at t1 = 0.3 suggests that decreasing the number of sentences may have a higher impact on decreasing rather than increasing the false positive detections.

We also analysed the effect of the extraction filtering part of the approach and found that without this stage, for (t1 = 0.3, n = 3), the precision was 0.6026 and the plagdet score was 0.7087. This shows that this step improved the plagdet score by about 14 percent.

We used 0.3 for t1 and 5 sentences for the parameter n (number of consecutive sentences) on the test set. The results on the test set are shown in Table 4.

                   Precision  Recall  Granularity  Plagdet
(t1 = 0.3, n = 5)  0.7496     0.7050  1            0.7266
Table 4: Results of the detailed analysis sub-task on the test set
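As a quick sanity check of Equation (1), the plagdet value reported in Table 1 for (t1 = 0.3, n = 5) can be recomputed from its precision, recall and granularity:

```python
# Recompute the plagdet score of Equation (1) for one row of Table 1.
from math import log2

def plagdet(precision, recall, granularity):
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return f1 / log2(1 + granularity)

print(round(plagdet(0.7630, 0.7486, 1), 4))  # -> 0.7557, matching Table 1
```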
5. CONCLUSION AND FUTURE WORK
The task of plagiarism detection entails two sub-tasks, suspicious candidate retrieval and pairwise document similarity analysis. We introduce a pairwise document analysis approach for the Persian language. An approach based on a vector space model is described for computing pairwise document similarity. The principle of this approach is that sentences containing more common words are more likely to be a source of plagiarism. The method contains three building blocks: seeding, match merging and extraction filtering. Our work is tested on a Persian corpus which offers an evaluation environment for plagiarism detection algorithms. The proposed approach is insensitive to context reordering and can be applied to any language.

The detailed analysis sub-task could be improved by expanding the representative words of the document to find appropriate substitutes for a word in context, in order to capture intelligent plagiarism. For this purpose, the noise that word expansion may introduce needs to be minimized. Other weighting schemes, such as tf-idf weighting, could be applied in comparing the vectors of the texts. A complete plagiarism detection system could be developed by adding a candidate selection step [9] before pairwise document analysis.

Acknowledgement
We would like to acknowledge the assistance and information provided by Hossein Nasr Esfahani and Mahsa Shahshahani.

6. REFERENCES
[1] I. Androutsopoulos and P. Malakasiotis. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135–187, 2009.
[2] H. Asghari, S. Mohtaj, O. Fatemi, H. Faili, P. Rosso, and M. Potthast. Algorithms and corpora for Persian plagiarism detection: Overview of PAN at FIRE 2016. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] A. Barrón-Cedeño, P. Rosso, and J.-M. Benedí. Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing, pages 523–534. Springer, 2009.
[4] A. Barrón-Cedeño, M. Vila, M. A. Martí, and P. Rosso. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4):917–947, 2013.
[5] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In ACM SIGMOD Record, volume 24, pages 398–409. ACM, 1995.
[6] C.-Y. Chen, J.-Y. Yeh, and H.-R. Ke. Plagiarism detection using ROUGE and WordNet. Journal of Computing, pages 34–44, 2010.
[7] M. Chong and L. Specia. Lexical generalisation for word-level matching in plagiarism detection. In RANLP, pages 704–709, 2011.
[8] P. Clough and M. Stevenson. Developing a corpus of plagiarised short answers. Language Resources and Evaluation, 45(1):5–24, 2011.
[9] N. Ehsan and A. Shakery. Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information. Information Processing & Management, 52(6):1004–1017, 2016.
[10] N. Ehsan, F. W. Tompa, and A. Shakery. Using a dictionary and n-gram alignment to improve fine-grained cross-language plagiarism detection. In Proceedings of the 2016 ACM Symposium on Document Engineering, pages 59–68. ACM, 2016.
[11] T. Gollub, B. Stein, and S. Burrows. Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In B. Hersh, J. Callan, Y. Maarek, and M. Sanderson, editors, 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), pages 1125–1126. ACM, Aug. 2012.
[12] C. Grozea, C. Gehl, and M. Popescu. Encoplot: Pairwise sequence matching in linear time applied to plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse, pages 10–18, 2009.
[13] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150. ACM, 2007.
[14] S. Meyer zu Eißen and B. Stein. Intrinsic Plagiarism Detection. In Advances in Information Retrieval. 28th European Conference on IR Research (ECIR 06), volume 3936 of Lecture Notes in Computer Science, pages 565–569, Berlin Heidelberg New York, 2006. Springer.
[15] G. Oberreuter and J. D. Velásquez. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9):3756–3763, 2013.
[16] R. C. Pereira, V. P. Moreira, and R. Galante. A new approach for cross-language plagiarism analysis. In Multilingual and Multimodal Information Access Evaluation, volume 6360, pages 15–26. 2010.
[17] M. Potthast, A. Barrón-Cedeño, B. Stein, and P. Rosso. Cross-language plagiarism detection. Language Resources and Evaluation, 45(1):45–62, 2011.
[18] M. Potthast, T. Gollub, M. Hagen, J. Kiesel, M. Michel, A. Oberländer, M. Tippmann, A. Barrón-Cedeño, P. Gupta, P. Rosso, et al. Overview of the 4th international competition on plagiarism detection. In CLEF (Online Working Notes/Labs/Workshop), 2012.
[19] M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, and B. Stein. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, and E. Toms, editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), pages 268–299, Berlin Heidelberg New York, Sept. 2014. Springer.
[20] M. Potthast, B. Stein, and M. Anderka. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530. 2008.
[21] M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 997–1005. Association for Computational Linguistics, 2010.
[22] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM, 2003.
[23] A. Si, H. V. Leong, and R. W. Lau. CHECK: a document plagiarism detection system. In Proceedings of the 1997 ACM Symposium on Applied Computing, pages 70–77. ACM, 1997.
[24] E. Stamatatos. Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527, 2011.