A Text Alignment Corpus for Persian Plagiarism Detection

Fatemeh Mashhadirajab
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
f.mashhadirajab@mail.sbu.ac.ir

Mehrnoush Shamsfard
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
m-shams@sbu.ac.ir

Razieh Adelkhah
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
r.adelkhah@yahoo.com

Fatemeh Shafiee
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
f.shafiee@hotmail.com

Chakaveh Saedi
NLX Lab, University of Lisbon, Department of Informatics, Portugal
Ch_saedi@sbu.ac.ir


ABSTRACT
This paper describes how a Persian text alignment corpus was constructed to evaluate plagiarism detection systems. This corpus is in PAN format and contains 11,089 documents and more than 11,603 plagiarism cases. Efforts were made to simulate various types of plagiarism manually, semi-automatically, or automatically in this large-scale corpus.

CCS Concepts
• Information systems → Near-duplicate and plagiarism detection.
• Information systems → Evaluation of retrieval results.

Keywords
Plagiarism detection; Text alignment corpus; Types of plagiarism; Corpus construction.

1. INTRODUCTION
Plagiarism is the use of others’ phrases, solutions, ideas, or results without proper citation. The considerable worldwide growth of plagiarism in recent years emphasizes the importance of dealing with this phenomenon. Plagiarism is an ethical challenge in science to which there are many contributing factors; however, the development of plagiarism detection systems can at least slow its growth. The PAN competition, which has been held yearly since 2009, is a well-known example in the plagiarism detection area. Such competitions provide a suitable framework for comparing researchers’ different approaches and solutions. Having a suitable evaluation corpus is one of the most important requirements in such a competition. This article describes how a corpus was constructed for the text alignment corpus construction task in Persian Plagdet 2016 [1].
Researchers have produced different taxonomies of plagiarism types [19, 20, 21]. The taxonomy of plagiarism presented by Alzahrani et al. [2] is shown in Fig. 1. This taxonomy was used in the current study to construct a data set for evaluating plagiarism detection systems. In the second section, we review available text alignment corpora, and in the third section the method for developing a corpus is described. The fourth section explains how to simulate each mentioned type of plagiarism, and finally, dataset statistics for the constructed corpus are given.

2. RELATED WORK
Numerous text alignment datasets, including the PAN plagiarism corpora, have been employed to evaluate text alignment algorithms in plagiarism detection competitions since 2009 [3, 8, 9, 16, 17, 18]. The first text alignment data set, developed by PAN in 2010 [3], includes 27,073 documents in English and 68,558 cases of plagiarism. In this massive data set, plagiarism cases are generated with two strategies. Simulated Plagiarism is the first strategy, in which 907 people were asked to rewrite the given original texts so that the meaning of the original is not changed but its surface form is replaced with different words and phrases. Artificial Plagiarism is the second strategy, where automated methods are used to change the text. The techniques used in this strategy are divided into three categories. The first category inserts, removes, and replaces words and short phrases; the second category replaces words with their synonyms, antonyms, hyponyms, or hypernyms; and the third category moves words within a sentence while keeping the same POS tag.
Another text alignment corpus was used by PAN to evaluate algorithms in 2013 and 2014 [9, 18]. This corpus includes 3,653 suspicious documents and 4,774 source documents in English and 8,000 cases of plagiarism. It contains three types of obfuscation strategies: random obfuscation, cyclic translation obfuscation, and summary obfuscation. Random obfuscation uses techniques similar to the Artificial Plagiarism strategy. In cyclic translation obfuscation, a text is manually or automatically translated into another language and, after editing, is translated back into the source language. To simulate summary obfuscation, which is considered a plagiarism technique, PAN used the evaluation corpora of automatic summarization systems.
In 2015, instead of inviting text alignment algorithms, PAN called for text alignment data sets, and a total of 8 data sets were submitted to PAN 2015 [22-29]. These data sets are in different languages and use a variety of techniques to obfuscate the text. Alvi’s corpus [22] is one of the submitted corpora; it includes 272 documents in English and 150 plagiarism cases. Alvi uses character-substitution, human-retelling, and synonym-replacement techniques to obfuscate text. Asghari [27] submitted a Persian-English parallel corpus to PAN 2015. This corpus includes 27,115 documents and 11,200 plagiarism cases. Cheema’s
corpus [23] includes 1,000 documents in English and 250 plagiarism cases. To obfuscate texts in this corpus, a number of students from different academic courses were asked to select and rewrite a number of texts related to their fields and to put them inside documents on the same subject, such as Wikipedia documents. A bilingual English-Urdu corpus that includes 1,000 documents and 270 plagiarism cases was also sent to the PAN 2015 competition by Hanif [24]. This corpus uses machine translation with and without manual correction of the results, and applies the random-obfuscation strategy to some translation results to obfuscate the text. Khoshnavataher [26] presented a corpus in Persian that includes 2,111 documents and 823 plagiarism cases. It is obfuscated with the random-obfuscation technique and the no-obfuscation technique, in which a piece of the source document is added to the suspicious document without any change. Kong [25] also took part in the PAN 2015 competition with 160 documents in Chinese and 152 cases of plagiarism. To obfuscate text, Kong asked a number of volunteers to write a paper on predefined topics. Mohtaj’s corpus [28] was also submitted to PAN 2015, with 4,261 documents in English and 2,781 plagiarism cases. In this corpus, the no-obfuscation, random-obfuscation, and simulated-obfuscation techniques are used to obfuscate text. Palkovskii [29] also makes use of the PAN 2013-2014 corpus to prepare a corpus that includes 5,057 documents in English and 4,185 plagiarism cases. Obfuscation is based on the random-obfuscation, no-obfuscation, cyclic-translation-obfuscation, and summary-obfuscation techniques. In the rest of this paper we describe the construction method we employed to develop a text alignment corpus to evaluate Persian plagiarism detection systems.

Fig. 1. A taxonomy of plagiarism [2]

3. TEXT ALIGNMENT CORPUS CONSTRUCTION
The goal in text alignment is to identify plagiarized segments for each given pair of source and suspicious documents [8].
In this study, a text alignment corpus is created to evaluate plagiarism detection systems on Persian scientific documents. The procedure used to build this corpus is described below.

a. Data Source Preparation
We use a subset of the source document collection of the Mahtab plagiarism detection system [15] to construct our text alignment corpus. The Mahtab plagiarism detector is developed at the Shahid Beheshti University NLP Lab. The goal of Mahtab is to detect plagiarized articles in the fields of computer science and engineering. Our text alignment corpus contains 11,089 documents. They are all articles or theses in the fields of computer science and engineering and also electrical engineering, with the following distribution:
• 4,500 documents from Wikipedia articles;
• 1,500 documents from CSICC¹ articles (2004-2015);
• 1,500 documents from articles and theses available from online stores;
• 3,589 documents from free Persian resources including mag-iran², iran-doc³, SID⁴, prozhe⁵, and MatlabSite⁶.

¹ Computer Society of Iran Computer Conference, http://csi.org.ir
² http://mag-iran.com
³ http://www.irandoc.ac.ir
⁴ http://sid.ir
⁵ http://www.prozhe.com
⁶ www.MatlabSite.com

b. Documents Clustering
Since all documents in the corpus are in the field of computer science, there is a general similarity among them. The method proposed for document clustering is to estimate cluster features first, and then perform clustering based on these features. Finally, an optimization process improves the results. To extract features, all words in a document are extracted and stemmed using STeP-1 [4]. Each word is then labeled based on Table 1, which was introduced by Makrehchi [5]. For each document, an n-bit histogram vector $V(v_1, v_2, \ldots, v_n)$ is produced, where n is the number of features. If the i-th feature occurs in the document, $v_i = 1$; otherwise, $v_i = 0$. Afterwards, these vectors are clustered using the K-means algorithm and cosine similarity. To optimize the extracted features in a cluster, the sum of all vectors of the cluster is computed and used to produce $H(h_1, h_2, \ldots, h_n)$, where $h_1$ indicates the number of documents containing the first feature. H is produced for all clusters; Equation (1) can be used to calculate the weight of each feature in the corresponding cluster. In this equation, $f_c$ indicates the number of clusters containing the feature. The features are sorted in descending order based on their weights. Afterwards, the first 100 words of each cluster are considered as the features for that cluster. To improve clustering accuracy, the
membership degree to each cluster must be calculated, and documents must be placed in the most similar cluster. The membership degree for each document is calculated by a formula that involves a square root and combines three quantities: the number of all seen cluster features in the corresponding document (the first 100 words of each cluster, based on their weights, are considered as cluster features), the number of cluster features occurring in the document, and the document length.

Table 1. Three categories of words in a corpus [5]

Frequency of the term      Document frequency
in the corpus              Low          Medium       High
High                       Key word     Feature      Stop Word
Medium                     Key word     Feature      Stop Word
Low                        Stop Word    Stop Word    Stop Word

c. Suspicious Documents Selection
Some documents are randomly selected from each cluster as suspicious documents. Almost half of the documents are employed as source documents and the other half as suspicious documents. Half of the suspicious documents are considered as no-plagiarism documents, and the other half are used to produce plagiarized documents.

d. Source Set for a Plagiarism Document
For each plagiarism document in a cluster, a set of source documents named Dsrc is selected that contains no repeated document and no document that is very similar to the suspicious document. A source document can be used for many suspicious documents: each time, a suspicious document can select any source document from the corresponding cluster, so a selected document may already have been selected by this suspicious document before. Moreover, if the similarity between a source document and the suspicious document is more than 50% before plagiarism passages are added to the suspicious document, it is not a good selection, because even if a hard obfuscation strategy is used, the plagiarism passages may be discovered by simple similarity detection algorithms. To create Dsrc for each plagiarism document, a document from the corresponding cluster is selected randomly; if the similarity based on the SimHash method [10] between the selected document and any document in Dsrc is more than 50%, the document is considered repeated; otherwise, it is included in Dsrc. This step is continued until there are at least 3 documents in Dsrc. A Dsrc contains a suspicious document and at least 2 source documents. The reason for employing the SimHash method is the noticeable results achieved in [11]. In this phase, the source and suspicious document pairs are specified. In this way, 3,867 paired documents (source-suspicious) are produced to be included in the corpus.

e. Source Set for a No-plagiarism Document
For each no-plagiarism document, a source set is selected as described in step d. In this step, however, sentence-level similarity detection is applied to each randomly selected source and suspicious document, based on the Jaccard similarity measure and a threshold of 0.9; if the two files share no identical sentences, the source document is added to Dsrc. Using this method, 2,630 pairs of documents are produced in this phase.

f. Source Documents Segmentation
In this step, a document is first divided into its paragraphs. Each subsequence of paragraphs that contains at least 300 words is considered a segment. If a paragraph contains fewer than 300 words, it is combined with the next paragraph. Ultimately, all segments contain at least 300 words.

g. Determine the Length of Plagiarized Segments in each Suspicious Document
The number of plagiarized segments employed in a suspicious document depends on the source document length and the length of the plagiarized segments. To decide the number of segments to use from a source document in its paired suspicious one, all paired documents are first labeled. Each randomly selected pair of documents is labeled as “entirely,” “much,” “medium,” or “hardly” as described below.
• Entirely: The length of the source document is more than 80% of the length of the suspicious document.
• Much: The length of the source document is between 50% and 80% of the length of the suspicious document.
• Medium: The length of the source document is about 20%-50% of the length of the suspicious document.
• Hardly: The length of the source document is less than 20% of the length of the suspicious document.
If the number of paired documents with the same label is more than one-fourth of the number of paired documents with a label of smaller length that does not have enough paired documents, the label with the smaller length is assigned; thus, a uniform distribution is obtained.

h. Segment Extraction
From each source document, some segments are randomly selected. The number of selected segments is based on the classification performed in step g.

i. Segment Obfuscation
This study offers a strategy to manually, semi-automatically, or automatically produce each type of plagiarism mentioned in Alzahrani’s taxonomy of plagiarism. In this step, each segment is obfuscated based on one strategy and added to one suspicious document. It is noteworthy that all obfuscated segments included in a document must be obfuscated using the same strategy because, according to the PAN corpus format, there is no overlap between suspicious documents in different strategies [9] and only one type of plagiarism should be employed in each suspicious document.

j. Obfuscated Segment Insertion
In this step, each obfuscated segment is inserted into a suspicious document at a randomly chosen position.
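Steps e, f, and g of Section 3 are rule-based, so they can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, words are split on whitespace, a short trailing run of paragraphs is merged into the previous segment, and the handling of the exact 20%, 50%, and 80% boundaries is a guess, since the text does not specify it.

```python
from typing import List

MIN_SEGMENT_WORDS = 300   # step f: minimum number of words per segment
JACCARD_THRESHOLD = 0.9   # step e: sentence pairs at or above this count as "the same"


def segment_paragraphs(paragraphs: List[str], min_words: int = MIN_SEGMENT_WORDS) -> List[str]:
    """Step f: merge consecutive paragraphs until a segment has at least min_words words."""
    segments: List[str] = []
    buffer: List[str] = []
    count = 0
    for paragraph in paragraphs:
        buffer.append(paragraph)
        count += len(paragraph.split())  # assumption: whitespace tokenization
        if count >= min_words:
            segments.append("\n".join(buffer))
            buffer, count = [], 0
    if buffer:
        # Assumption: a short tail is merged into the previous segment.
        # A document shorter than min_words yields one short segment.
        if segments:
            segments[-1] += "\n" + "\n".join(buffer)
        else:
            segments.append("\n".join(buffer))
    return segments


def jaccard(sentence_a: str, sentence_b: str) -> float:
    """Jaccard similarity between the word sets of two sentences."""
    a, b = set(sentence_a.split()), set(sentence_b.split())
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)


def shares_sentence(source: List[str], suspicious: List[str],
                    threshold: float = JACCARD_THRESHOLD) -> bool:
    """Step e: True if any source/suspicious sentence pair reaches the threshold."""
    return any(jaccard(s, t) >= threshold for s in source for t in suspicious)


def length_label(source_len: int, suspicious_len: int) -> str:
    """Step g: label a document pair by the source/suspicious length ratio."""
    ratio = source_len / suspicious_len
    if ratio > 0.8:
        return "entirely"
    if ratio > 0.5:      # assumption: exactly 80% falls into "much"
        return "much"
    if ratio >= 0.2:     # assumption: exactly 20% and 50% fall into "medium"
        return "medium"
    return "hardly"
```

In this sketch, a source document would be admitted to Dsrc for a no-plagiarism document only when `shares_sentence` returns `False`, and `length_label` would be applied to each source-suspicious pair before the rebalancing described in step g.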
4. STRATEGIES FOR PLAGIARISMS                                               Automatic Translation
TYPES                                                                   According to types of plagiarism in Fig .1 translation is a type of
                                                                        plagiarism that is divided into automatic and manual translation.
                                                                        Hanif et al. [24] use the automatic translation strategy to
    Exact Copy                                                         obfuscate documents in their corpus. Moreover in the PAN 2013-
In this strategy, the segments produced in step h were inserted into    2014 corpus [9, 18] use cyclic translation strategy.
a suspicious document with no obfuscation. Using this strategy,
                                                                        We use all of above three strategies in our corpus (described in
324 paired documents were produced.
                                                                        Automatic Translation, Manual Translation and Cyclic
    Near Copy                                                          Translation stages). For the automatic translation strategy, the
                                                                        selected sections are translated from Persian to English by Google
According to Fig .1 a type of plagiarism is Near Copy [2] that
consists insertion, deletion, substitution and sentence split or join   translate and the results are checked by Hunspell. Then they are
                                                                        added to the English suspicious documents. 306 paired documents
methods. To create this kind of plagiarism, the segments produced
                                                                        are produced using this strategy.
in step h are obfuscated through deletion, insertion, sentence
replacement, and sentence division. With this method, some
randomly selected sentences are deleted from the segment and
                                                                            Manual Translation
                                                                        The suspicious documents in this step are English articles in the
replaced with randomly selected sentences from the suspicious
                                                                        field of computer engineering, and the source documents are
document. Then, some randomly selected sentences are swapped.
                                                                        Persian articles in the same field. The English articles are
Finally, complex sentences are identified and broken into main
                                                                        clustered as described in step b, and an equivalent English cluster
simple sentences. To do this, the complex sentence identifier
developed at the Shahid Beheshti University NLP Lab is                  is produced for each Persian one. Then, for each suspicious
                                                                        document, a source document from its equivalent Persian cluster
employed. Each complex sentence in this segment is replaced
                                                                        is randomly selected. Based on what was described in steps f, g
with its main and subordinate clauses, and 457 paired documents
                                                                        and h, some sections of the source document are selected.
are produced based on this strategy.
                                                                        Afterwards, these sections are translated by experts in the fields of
    Modified Copy                                                      computer engineering and are added to the suspicious documents
                                                                        as described in step j. Seven hundred paired documents are
In the taxonomy of plagiarism [2] there is a type of plagiarism
                                                                        produced using this strategy which can be employed to evaluate
called Modified Copy that to obfuscate a text using this strategy,
                                                                        cross-language similarity detection systems (Persian-English).
the Persian sentence understanding and generation system
introduced by Adelkhah et al. [7] is employed. This system
performs a bidirectional conversion between Persian sentences
                                                                            Cyclic Translation
                                                                        With the cyclic translation strategy the selected sections are
and their semantic representation. It changes each sentence to its
                                                                        translated from English to Persian using Google translate, and the
semantic representation and then generates the Persian sentence
                                                                        results are checked by Negar, a Persian spell checker developed at
using semantic representation. To clarify, this system is composed
                                                                        the NLP Lab of Shahid Beheshti University. The selected sections
of 2 sub-systems: 1) semantic representation production for
                                                                        are then translated again from Persian to English. Finally, the
sentences (sentence understanding), and 2) sentence production
                                                                        results are checked by Hunspell and add to the English suspicious
based on semantic representation (sentence generation). It is
                                                                        documents. Using this method, 388 paired documents are created.
noteworthy that in the sentence production phase, in addition to
structural changes, there might be samples of chunk relocations in
a sentence or samples of word relocations in a chunk. The aim of
                                                                            Idea Adoption (semantic-based meaning)
                                                                        The goal in this step is to represent the main idea of a source
this system is to produce sentences with the same meaning (deep
                                                                        document using new words/wording. Since most source
structure) but different surface structures and words. Using this
                                                                        documents are computer related theses and articles, automatic
strategy, 465 paired documents are created.
                                                                        idea extraction would be a complex task here for which no high
    Text Manipulation (Paraphrasing)                                   accurate system is yet available. Hence, the researchers asked
                                                                        computer experts to rewrite the idea of each document in their
Text Manipulation was performed as described earlier in Modified
                                                                        own words. To simplify the task, only important sections of
Copy. The difference here is the word replacement in the sentence
                                                                        source documents, such as the abstract, were considered. Source
generation phase. Each word is replaced with a synonym retrieved
                                                                        documents were distributed among three computer PhD
from FarsNet (Persian WordNet) [14] or FavaNet (WordNet of
                                                                        candidates and 30 computer MS students, and 109 paired
Computer domain) [13]. Hence, in addition to structure
                                                                        documents were produced.
modification, different words are included in the sentence
compared to the main sentence, although the concept remains the         5. DATASET STATISTICS
same. Chunks may be moved inside a sentence; however, there is
                                                                        Overall, employing all the mentioned strategies, 11,603
no movement for words in a chunk. Using this method, 604 paired
                                                                        plagiarism cases and 6,497 paired documents are produced, from
documents are produced.
                                                                        which 2,650 are no-plagiarism, 780 are no obfuscation, and 3,067
                                                                        are obfuscated ones. The dataset statistics are shown in Table 2.
    Text Manipulation (Summarizing)
The goal in this step is to obfuscate a text document using summarization methods. To create
such queries, the automatic Persian summarizer introduced by Shafiee et al. [6] is used, and
506 paired documents are produced.

6. CONCLUSION
This article describes a methodology for building a Persian corpus for evaluating plagiarism
detection systems. This large-scale corpus is in PAN format. To produce this corpus, the
focus is on the simulation of different types of plagiarism. Different strategies
are employed to create obfuscation in each plagiarism category; hence, a large volume of
varied plagiarism types is created.

            Table 2. Dataset statistics for our corpus

      documents                                     11089
      plagiarism cases                              11603
      languages                                     fa

      Document purpose
        source documents                            48%
        suspicious documents
          with plagiarism                           28%
          w/o plagiarism                            24%

      Document length
        short (<10 pages^7)                         64%
        medium (10-100 pages)                       35%
        long (>100 pages)                            1%

      Plagiarism per document
        hardly (<20%)                               25%
        medium (20%-50%)                            20%
        much (50%-80%)                              26%
        entirely (>80%)                             29%

      Case length
        short (<1k characters)                      37%
        medium (1k-3k characters)                   55%
        long (>3k characters)                        8%

      Obfuscation synthesis approaches
        Exact Copy                                   8%
        Near Copy                                   12%
        Modified Copy                               12%
        Paraphrasing                                16%
        Summary                                     13%
        Manual Translation                          18%
        Automatic Translation                        8%
        Cyclic Translation                          10%
        semantic-based meaning                       3%

^7 A page is measured as 1500 chars.

7. REFERENCES
[1] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P. and Potthast, M. 2016.
    Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016.
    In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata,
    India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[2] Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic
    patterns, textual features, and detection methods. IEEE Transactions on Systems, Man,
    and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 2.
[3] Potthast, M., Stein, B. et al. 2010. An Evaluation Framework for Plagiarism Detection.
    Proceedings of the 23rd International Conference on Computational Linguistics,
    COLING 2010, Beijing. ACL.
[4] Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard Text Preparation for Persian
    Language. CAASL3: Third Workshop on Computational Approaches to Arabic Script-based
    Languages.
[5] Makrehchi, M. and Kamel, M. 2004. A fuzzy set approach to extracting keywords from
    abstracts. North American Fuzzy Information Processing Society - NAFIPS 2003, Banff,
    Canada.
[6] Shafiee, F. and Shamsfard, M. 2015. The automatic Persian summarizer. The 20th Computer
    Society of Iran Computer Conference.
[7] Adelkhah, R., Sadeghi, R. and Shamsfard, M. 2016. Persian sentence understanding and
    generation: a mutual conversion. The 21st Computer Society of Iran Computer Conference.
[8] Potthast, M., Göring, S. et al. 2015. Towards Data Submissions for Shared Tasks: First
    Experiences for the Task of Text Alignment. Working Notes Papers of the CLEF 2015
    Evaluation Labs, CEUR Workshop Proceedings, (September 2015), ISSN 1613-0073.
[9] Potthast, M., Hagen, M., Gollub, T. et al. 2013. Overview of the 5th International
    Competition on Plagiarism Detection. Working Notes Papers of the CLEF 2013 Evaluation
    Labs and Workshop, (September 2013), ISBN 978-88-904810-3-1.
[10] Manku, G. S., Jain, A. and Sarma, A. D. 2007. Detecting Near-Duplicates for Web
     Crawling. Data mining.
[11] Kamran, K., Ahmadi, A. and Kazemivanhari, F. 2013. Plagiarism detection in Persian text
     using Fingerprint algorithms. The 21st Iranian Conference on Electrical Engineering.
[12] Davarpanah, M. R., Sanji, M. and Aramideh, M. 2009. Farsi Lexical Analysis and StopWord
     List. Library Hi Tech, vol. 27, pp. 435–449.
[13] Iran Telecommunication Research Center (ITRC), 2013. Buali Sina University.
     http://217.218.62.234:8080/.
[14] Shamsfard, M., Hesabi, A., Fadaei, H. et al. 2010. Semi Automatic Development of
     FarsNet; The Persian WordNet. Proceedings of 5th Global WordNet Conference.
[15] Mashhadirajab, F. and Shamsfard, M. 2014. Plagiarism Detection in Persian documents.
     Master's thesis. Shahid Beheshti University.
[16] Potthast, M., Eiselt, A. et al. 2011. Overview of the 3rd International Competition on
     Plagiarism Detection. Notebook Papers of CLEF 2011 Labs and Workshops, (September
     2011), ISBN 978-88-904810-1-7.
[17] Potthast, M., Gollub, T. et al. 2012. Overview of the 4th International Competition on
     Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers,
     (September 2012), ISBN 978-88-904810-3-1.
[18] Potthast, M., Hagen, M. et al. 2014. Overview of the 6th International Competition on
     Plagiarism Detection. CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers,
     (September 2014).
[19] Joy, M. S., Sinclair, J. E. et al. 2013. Student perspectives on source-code
     plagiarism. International Journal for Educational Integrity, Vol. 9, No. 1, pp. 3–19.
[20] Joy, M. S., Cosma, G. et al. 2009. A Taxonomy of Plagiarism in Computer Science.
     Proceedings of EDULEARN09 Conference, (July 2009), ISBN: 978-84-612-9802-0.
[21] Naik, R. R., Landge, M. B., Mahender, C. N. et al. 2015. A Review on Plagiarism
     Detection Tools. International Journal of Computer Applications, vol. 125, no. 11.
[22] Alvi, F., Stevenson, M., Clough, P. et al. 2015. The short stories corpus. Notebook for
     PAN at CLEF.
[23] Cheema, W., Najib, F., Ahmed, S. et al. 2015. A corpus for analyzing text reuse by
     people of different groups. Notebook for PAN at CLEF.
[24] Hanif, I., Nawab, A., Arbab, A. et al. 2015. Cross-language Urdu-English (CLUE) text
     alignment corpus. Notebook for PAN at CLEF.
[25] Kong, L., Lu, Z., Han, Y. et al. 2015. Source retrieval and text alignment corpus
     construction for plagiarism detection. Notebook for PAN at CLEF.
[26] Khoshnavataher, K., Zarrabi, V., Mohtaj, S. and Asghari, H. 2015. Developing
     monolingual Persian corpus for extrinsic plagiarism detection using artificial
     obfuscation. Notebook for PAN at CLEF.
[27] Asghari, H., Khoshnavataher, K., Fatemi, O. and Faili, H. 2015. Developing bilingual
     plagiarism detection corpus using sentence aligned parallel corpus. Notebook for PAN
     at CLEF.
[28] Mohtaj, S., Asghari, H. and Zarrabi, V. 2015. Developing monolingual English corpus for
     plagiarism detection using human annotated paraphrase corpus. Notebook for PAN at CLEF.
[29] Palkovskii, Y. and Belov, A. 2015. Submission to the 7th international competition on
     plagiarism detection. http://www.uni-weimar.de/medien/webis/events/pan-15,
     http://www.clef-initiative.eu/publication/working-notes, from the Zhytomyr State
     University and SkyLine LLC.