Mahak Samim: A Corpus of Persian Academic Texts for
           Evaluating Plagiarism Detection Systems
Morteza Rezaei Sharifabadi
Computer Research Center of Islamic Sciences
Tehran, I. R. Iran.
m.rezaei@noornet.net

Seyed Ahmad Eftekhari
Computer Research Center of Islamic Sciences
Tehran, I. R. Iran.
s.ahmad.ef@gmail.com


ABSTRACT
In this paper we introduce Mahak Samim, a plagiarism detection corpus that consists of
Persian academic texts in which plagiarism cases are embedded. This corpus, which can be
used for evaluating plagiarism detection systems, consists of more than five thousand
artificial plagiarism cases with various lengths and diverse degrees of obfuscation. The
development process and the features of the corpus are described here.

CCS Concepts
• Information systems ➝ Information retrieval ➝ Retrieval tasks and goals ➝ Near-duplicate
and plagiarism detection.

Keywords
plagiarism detection; evaluation corpus; Persian; academic texts.

1. INTRODUCTION
Plagiarism is defined as “copying or closely imitating the work of another writer,
composer, etc., without permission and with the intention of passing the results off as
original work” [12]. Plagiarism detectors are software programs developed to detect cases
of such misconduct in documents. The PAN evaluation lab series has provided a framework
for evaluating plagiarism detection systems. This framework relies on plagiarism corpora,
which are basically collections of text that include cases of plagiarism. Plagiarism
detection systems receive the corpus texts as input, and their ability to detect the
plagiarism cases embedded in the texts is examined.

Since plagiarism detectors are not entirely language independent, there is a need for
plagiarism corpora in various languages. In recent years a couple of Persian plagiarism
detection systems have been developed. Proper evaluation of these systems depends on
reliable Persian plagiarism corpora.

In this paper we introduce Mahak Samim, a corpus suitable for evaluating Persian
plagiarism detectors. (Mahak means “Touchstone” in Persian; Samim-Noor is a commercial
plagiarism detection system developed by the Computer Research Center of Islamic
Sciences.) We first briefly review previous work in this field and then introduce our own
approach. The paper concludes with a summary and an outlook for further work.

2. RELATED WORK
Prior to the PAN evaluation lab series, plagiarism corpora were rare. The corpora used in
the PAN labs held in 2009 [4], 2010 [5] and 2011 [6] had basically the same structure. The
documents used in these corpora were books from Project Gutenberg. The corpora were used
to evaluate both external and intrinsic plagiarism detection. In external plagiarism
detection, suspicious documents are checked against a collection of source documents,
whereas in intrinsic plagiarism detection, suspicious documents are analyzed in isolation
for changes in writing style etc. Fifty percent of the documents were used as source
documents and fifty percent as suspicious documents. The corpora contain plagiarism cases
with different lengths and various degrees of artificial and simulated obfuscation.
Artificial obfuscation includes techniques such as automatically shuffling and replacing
words, and simulated obfuscation was achieved through crowdsourcing the obfuscation task.
The major shortcoming of the corpora presented in these years was their relatively small
size. The plagiarism detectors were expected to include a stage of heuristic retrieval in
which they selected a group of candidate documents from the total collection of source
documents. However, since the corpora were not large enough, the systems skipped this
stage. In PAN 2012 [7] this issue was addressed and a new approach was adopted for
developing the plagiarism corpus. For this purpose a number of professional writers were
asked to write articles containing plagiarism on a set of topics. A one-billion-document
corpus resembling the web was used as the collection of source documents. The writers
compiled their articles by searching through this huge collection. In PAN 2013 [9] and
PAN 2014 [8] expanded versions of the 2012 corpus were used.

In PAN 2015 [10] a task of corpus construction was introduced. In this task, participants
were asked to provide their own plagiarism corpora. Eight plagiarism corpora were provided
for this task, two of which included Persian documents. Khoshnavataher et al. [3] present
a monolingual Persian corpus based on about 2100 Wikipedia articles, with plagiarism cases
obfuscated artificially and intended for the evaluation of extrinsic plagiarism detection.
Asghari et al. [1] use Wikipedia documents and a Persian-English sentence-aligned corpus
to develop a bilingual plagiarism detection corpus.

3. CORPUS DEVELOPMENT

3.1 Document Collection
Academic papers are one of the major types of texts subject to plagiarism. In order to
cover such texts in our corpus, we collected Persian papers from peer-reviewed journals.
We crawled the websites of journals introduced in the System for Evaluation of Scientific
Journals (http://journals.msrt.ir/), which is affiliated to Iran’s Ministry of Science,
Research and Technology, and we downloaded papers from journal websites that provide free
full-text access to their articles in plain-text format. Table 1 shows the statistics of
the number of
documents in each subject, as grouped by the System for Evaluation of Scientific Journals,
and Table 2 provides information about the document lengths.

Table 1. Statistics of number of documents per subject

Subject                              Number of documents
Humanities                           2697
Science                              1204
Veterinary Science                   469
Agriculture and Natural Resources    281
Engineering                          38
Art and Architecture                 18
Total                                4707

Table 2. Statistics of document lengths

Document Length                      Percent of Documents
short (1-3000 words)                 20 %
medium (3000-6000 words)             50 %
long (6000-30000 words)              30 %

3.2 Source / suspicious documents
In plagiarism corpora, the document collection is usually split into two main subgroups,
i.e. source documents and suspicious documents. Source documents are documents from which
parts of text are selected as plagiarism cases. These parts are then inserted into the
text of so-called suspicious documents. In other words, suspicious documents are documents
which include text used in source documents. We follow PAN's tradition of using half of
the documents as source documents and half as suspicious documents. It is noteworthy that
the subjects of the papers were taken into consideration while dividing the collection
into halves. For example, 50 percent of the papers in humanities were used as source
documents and 50 percent as suspicious documents, and likewise for the other subjects.

3.3 Plagiarism per document
Fifty percent of the suspicious documents have no plagiarism cases. As mentioned in [11],
the documents without plagiarism make it possible to determine whether or not a detector
can distinguish plagiarism cases from overlaps that occur naturally between random
documents. Statistics of plagiarism per document in the rest of the suspicious documents,
i.e. 25 percent of the whole corpus, are available in Table 3.

Table 3. Statistics of plagiarism per document in documents with plagiarism

Plagiarism Per Document              Percent of Documents
hardly (5%-20%)                      30 %
medium (20%-50%)                     25 %
much (50%-80%)                       30 %
entirely (>80%)                      15 %

3.4 Plagiarism case length
Our corpus contains a total of 5862 plagiarism cases with lengths between 50 and 5000
words. Table 4 shows the statistics. Long plagiarism cases may include more than one
sentence.

Table 4. Statistics of lengths of plagiarism cases

Plagiarism Case Length               Percent of Cases
Short (50-150 words)                 34 %
Medium (300-500 words)               33 %
Long (3000-5000 words)               33 %

3.5 Topic match
The six general topic categories of the papers used in our corpus were introduced in
Table 1. Fifty percent of the plagiarism cases were made between papers on the same topic
(intra-topic cases) and fifty percent between papers on different topics (inter-topic
cases).

3.6 Obfuscation types
In many cases, plagiarized texts are manipulated by those committing plagiarism in order
to avoid being detected by plagiarism detection systems or human readers. Developers of
plagiarism corpora use different techniques to include such obfuscations in their
plagiarism cases. An overview of the different types of obfuscation in our plagiarism
cases is available in Table 5.

Table 5. Statistics of obfuscation types

Obfuscation                          Percent of Cases
None                                 40 %
Random Text Operations
  > low obfuscation                  20 %
  > high obfuscation                 20 %
Semantic Word Variation
  > low obfuscation                  10 %
  > high obfuscation                 10 %

As shown in Table 5, 40 percent of the plagiarism cases have no obfuscation. As explained
in [11], since the writing style of the original author is preserved in plagiarism cases
without obfuscation, these cases are especially appropriate for evaluating intrinsic
plagiarism detection. Random text operations are operations such as adding, deleting and
substituting words, which are all done randomly. Semantic word variation, on the other
hand, is the random substitution of words with their synonyms. We use the Comprehensive
Dictionary of Persian Synonyms and Antonyms as a resource for extracting synonyms (the
plain-text version of this dictionary can be downloaded from
http://dadegan.ir/catalog/D3911124a). The terms “low obfuscation” and “high obfuscation”
in Table 5 indicate the degree of obfuscation, i.e. how many words have been added,
deleted or substituted.

4. SUMMARY AND FUTURE WORK
As explained above, Mahak Samim is a plagiarism corpus which can be used for evaluating
both intrinsic and external plagiarism detection systems. In order to preserve overall
balance, many factors (plagiarism per document, plagiarism case length, topic match,
obfuscation type, and obfuscation degree) were taken into consideration while preparing
each plagiarism case. The corpus files are prepared according to the format of previous
PAN
corpora, which include XML files that record the starting point of each plagiarism case in
the relevant source and suspicious documents and the length of the plagiarism case.

Plagiarism cases in our corpus are cases of “artificial plagiarism”. Using “real
plagiarism” cases in plagiarism corpora is problematic due to ethical, legal, and
financial issues [11]. However, we may enrich our corpus by adding cases of simulated
plagiarism. Other types of artificial obfuscation, such as POS-preserving word shuffling,
could also be employed. The corpus may be easily expanded with both academic papers and
other types of documents such as books, web articles, etc.

This paper has been submitted to the PAN@FIRE2016 Shared Task on Persian Plagiarism
Detection and Text Alignment Corpus Construction [2], and the corpus is available through
the Peykaregan website (www.peykaregan.com).

5. ACKNOWLEDGMENTS
Special thanks to Dr. Martin Potthast for his valuable help and to Dr. Mahdi Behnia and
Mr. Amirhossein Rajabzadeh Assarha, our colleagues in the Computer Research Center of
Islamic Sciences, for their support and their comments.

6. REFERENCES
[1] Asghari, H., Khoshnava, K., Fatemi, O. and Faili, H., 2015. Developing Bilingual
    Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus. In Working Notes
    Papers of the CLEF 2015.
[2] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P. and Potthast, M., 2016.
    Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016.
    In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata,
    India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[3] Khoshnavataher, K., Zarrabi, V., Mohtaj, S. and Asghari, H., 2015. Developing
    Monolingual Persian Corpus for Extrinsic Plagiarism Detection Using Artificial
    Obfuscation. In Working Notes Papers of the CLEF 2015.
[4] Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B. and Rosso, P., 2009. Overview of
    the 1st international competition on plagiarism detection. In 3rd PAN Workshop.
    Uncovering Plagiarism, Authorship and Social Software Misuse.
[5] Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B. and Rosso, P., 2010. Overview of
    the 2nd international competition on plagiarism detection. In Notebook Papers of CLEF
    2010 LABs and Workshops.
[6] Potthast, M., Eiselt, A., Barrón-Cedeño, L.A., Stein, B. and Rosso, P., 2011. Overview
    of the 3rd international competition on plagiarism detection. In CEUR Workshop
    Proceedings.
[7] Potthast, M., Gollub, T., Hagen, M., Kiesel, J., Michel, M., Oberländer, A., Tippmann,
    M., Barrón-Cedeño, A., Gupta, P., Rosso, P. and Stein, B., 2012. Overview of the 4th
    international competition on plagiarism detection. In Working Notes Papers of CLEF 2012
    Evaluation Labs and Workshop.
[8] Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P. and Stein, B.,
    2014. Overview of the 6th international competition on plagiarism detection. Working
    Notes Papers of the CLEF 2014 Evaluation Labs, CEUR Workshop Proceedings.
[9] Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E.
    and Stein, B., 2013. Overview of the 5th international competition on plagiarism
    detection. In CLEF Conference on Multilingual and Multimodal Information Access
    Evaluation (pp. 301-331). CELCT.
[10] Potthast, M., Hagen, M., Göring, S., Rosso, P. and Stein, B., 2015. Towards data
    submissions for shared tasks: first experiences for the task of text alignment. Working
    Notes Papers of the CLEF 2015, pp.1613-0073.
[11] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P., 2010. An evaluation framework
    for plagiarism detection. In Proceedings of the 23rd international conference on
    computational linguistics: Posters (pp. 997-1005). Association for Computational
    Linguistics.
[12] Reitz, J.M., 1996. ODLIS: Online dictionary for library and information science.
    Libraries Unlimited.
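
As an illustration of the two artificial obfuscation types described in Section 3.6, the
following is a minimal sketch, not the code used to build the corpus. The function names,
the `degree` parameter (the fraction of words to modify), the values chosen for low and
high obfuscation, and the filler vocabulary are all assumptions made for this example; the
paper does not specify them. The synonym table stands in for the synonym dictionary
mentioned in Section 3.6.

```python
import random

# Assumed fractions of words touched for each obfuscation degree (hypothetical values).
LOW_OBFUSCATION = 0.2
HIGH_OBFUSCATION = 0.6


def random_text_operations(words, degree, filler_vocab, rng):
    """Randomly add, delete or substitute words in a copy of `words`.

    `filler_vocab` must be a non-empty list of words to insert or substitute.
    """
    if not words:
        return []
    result = list(words)
    n_ops = max(1, round(degree * len(result)))
    for _ in range(n_ops):
        op = rng.choice(("add", "delete", "substitute"))
        if op == "add":
            result.insert(rng.randrange(len(result) + 1), rng.choice(filler_vocab))
        elif op == "delete" and len(result) > 1:
            del result[rng.randrange(len(result))]
        else:
            # Substitution; also used when a deletion would empty the text.
            result[rng.randrange(len(result))] = rng.choice(filler_vocab)
    return result


def semantic_word_variation(words, degree, synonyms, rng):
    """Replace randomly chosen words with one of their synonyms.

    `synonyms` maps a word to a list of synonyms; words without an entry are kept.
    """
    result = list(words)
    candidates = [i for i, w in enumerate(result) if synonyms.get(w)]
    n_subs = min(len(candidates), round(degree * len(result)))
    for i in rng.sample(candidates, n_subs):
        result[i] = rng.choice(synonyms[result[i]])
    return result


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that a run can be repeated
    source = "the quick brown fox jumps over the lazy dog".split()
    print(random_text_operations(source, LOW_OBFUSCATION, ["red", "cat"], rng))
    print(semantic_word_variation(source, HIGH_OBFUSCATION,
                                  {"quick": ["fast"], "lazy": ["idle"]}, rng))
```

Passing an explicit `random.Random` instance keeps the obfuscation reproducible, which
would matter if the annotation files have to be regenerated together with the texts.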