Algorithms and Corpora for Persian Plagiarism Detection
Overview of PAN at FIRE 2016

Habibollah Asghari
School of Electrical and Computer Engineering, College of Engineering, University of Tehran
habib.asghari@ictrc.ac.ir

Salar Mohtaj
ICT Research Institute, Academic Center for Education, Culture and Research (ACECR), Iran
salar.mohtaj@ictrc.ac.ir

Omid Fatemi
School of Electrical and Computer Engineering, College of Engineering, University of Tehran
omid@fatemi.net

Heshaam Faili
School of Electrical and Computer Engineering, College of Engineering, University of Tehran
hfaili@ut.ac.ir

Paolo Rosso
PRHLT Research Center, Universitat Politècnica de València, Spain
prosso@dsic.upv.es

Martin Potthast
Bauhaus-Universität Weimar, Germany
martin.potthast@uni-weimar.de

ABSTRACT
The task of plagiarism detection is to find passages of text reuse in a suspicious document. This task is of increasing relevance, since scholars around the world take advantage of the fact that information about nearly any subject can be found on the World Wide Web, reusing existing text instead of writing their own. We organized the Persian PlagDet shared task at PAN 2016 in an effort to promote the comparative assessment of NLP techniques for plagiarism detection, with a special focus on plagiarism that appears in a Persian text corpus. The goal of this shared task is to bring together researchers and practitioners around the topic of plagiarism detection and text reuse detection. We report on the outcome of the shared task, which divides into two subtasks: text alignment and corpus construction. In the first subtask, nine teams participated, and the best result achieved was a PlagDet score of 0.922. For the second subtask of corpus construction, five teams submitted corpora, which were evaluated using the systems submitted for the first subtask. The results show that significant challenges remain in evaluating newly constructed corpora.

CCS Concepts
• General and reference → General conference proceedings.

Keywords
Plagiarism Detection; Evaluation Framework; TIRA Platform; Shared Task; Persian PlagDet.

1. INTRODUCTION
In recent years, a lot of research has been carried out on text reuse and plagiarism detection for English, but the detection of plagiarism in languages other than English has received comparably little attention. Although there have been previous developments on tools and algorithms to assist in detecting text reuse in Persian, little is known about their detection performance. Therefore, to foster research and development on Persian plagiarism detection, we have organized the first corresponding competition, held in conjunction with the PAN evaluation lab at FIRE 2016.

We overview the detection approaches of nine participating teams and evaluate their respective retrieval performance. Participants were asked to submit their software to the TIRA Evaluation-as-a-Service (EaaS) platform [8] instead of just sending run outputs, rendering the shared task more reproducible. The submitted pieces of software are maintained in executable form so that they can be re-run against new corpora later on. To demonstrate this possibility, we asked participants to also submit evaluation corpora of their own design, which were examined using the detection systems submitted by other participants.

In what follows, Section 2 reviews related work with respect to shared tasks on plagiarism detection. Section 3 describes the two subtasks. Section 4 describes the evaluation framework, explaining the TIRA evaluation platform as well as the construction of our training and test datasets alongside the performance measures used. Sections 5 and 6 report the evaluation results of the text alignment and the corpus construction subtasks, respectively.

2. RELATED WORK
This section reviews recent competitions and shared tasks on plagiarism detection in English, Arabic, and Persian.

PAN. Potthast et al. [16] first pointed out the lack of a controlled evaluation environment and corresponding detection quality measures as a major obstacle to evaluating plagiarism detection approaches. To overcome these shortcomings, they organized the first international competition on plagiarism detection in 2009, featuring two subtasks: external plagiarism detection and intrinsic plagiarism detection. An important by-product of this competition was the first evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and a detection quality measure called PlagDet [16, 17].

The PAN competition was continued in the following years, improving the evaluation corpora with each iteration. As of 2012, the competition was revamped in the form of two new subtasks: source retrieval and text alignment. Moreover, at PAN 2015, for the first time, participants were invited to submit their own text alignment corpora. Here, participants were asked to compile corpora comprising artificial, simulated, or even real plagiarism, formatted according to the data format established for the previous shared tasks [20].
AraPlagDet. AraPlagDet is the first international competition on detecting plagiarism in Arabic documents. The competition was held as a PAN shared task at FIRE 2015 and included two subtasks corresponding to the first shared tasks at PAN: external plagiarism detection and intrinsic plagiarism detection [1]. The competition followed the formats used at PAN. One of the main motivations of the organizers for this shared task was to raise awareness in the Arab world of the seriousness of plagiarism, and to promote the development of plagiarism detection approaches that deal with the peculiarities of the Arabic language, providing an evaluation corpus that allows for proper performance comparison between Arabic plagiarism detectors.

PlagDet Task at AAIC. The first competition on Persian plagiarism detection was held as the 3rd AmirKabir Artificial Intelligence Competition (AAIC) in 2015. The competition was the first on plagiarism detection in the Persian language and led to the release of the first plagiarism detection corpus in Persian [10]. Like AraPlagDet, the PAN standard framework for evaluation and corpus annotation was used in this competition.

3. TASK DESCRIPTION
The shared task of Persian plagiarism detection divides into two subtasks: text alignment and corpus construction.

Text alignment is based on the PAN evaluation framework to assess the detection performance of plagiarism detectors: given two documents, the task is to determine all contiguous passages of reused text between them. Nine teams participated in this subtask.

The corpus construction subtask invited participants to submit evaluation corpora of their own design for text alignment, following the standard corpus format. Five corpora were submitted to the competition. Their evaluation consisted of validating the annotations via analyzing corpus statistics, such as the length distribution of the documents, the length distribution of the plagiarized passages, and the ratio of plagiarism per document. Moreover, we report on the performance of the aforementioned nine plagiarism detectors in detecting the plagiarism comprised within the submitted corpora.

4. EVALUATION FRAMEWORK
The text alignment subtask consists of identifying the exact positions of reused text passages in a given pair of suspicious document and source document. This section describes the evaluation platform, corpus, and performance measures that were used in this subtask. Moreover, the submitted detection approaches and their respective evaluation results are presented.

4.1 Evaluation Platform
Establishing an evaluation framework for Persian plagiarism detection was one of the primary goals of our competition, consisting of a large-scale plagiarism detection corpus along with performance measures. The framework may serve as a unified test environment for future activities in Persian plagiarism detection research.

Due to the diverse development environments of participants, it is preferable to set up a common platform that satisfies all their requirements. We decided to use the TIRA experimentation platform [8]. TIRA provides a set of features that facilitate the reproducibility of our shared task while reducing its organizational overhead [6, 7]:

• TIRA provides every participant with a virtual machine that allows for the convenient deployment and execution of submitted software.

• Both Windows and Linux machines are available to participants, whereas deployed software need only be executable from a POSIX command line.

• TIRA offers a convenient web user interface that allows participants to self-evaluate their software by remote-controlling its execution.

• TIRA allows for evaluating submitted software against test datasets hosted at the server side. Test datasets are never visible to participants, providing for a blind evaluation and also allowing sensitive datasets that cannot otherwise be shared publicly to be used for evaluation.

• At the click of a button, the run output of a given software is evaluated against the ground truth of a given dataset. Evaluation results are stored and made accessible on TIRA's web page as well as for download.

TIRA is widely used as an Evaluation-as-a-Service platform for experimenting with information retrieval tasks [9]. In particular, the platform has been in use since the 4th international competition on plagiarism detection at PAN 2012 [18], and it is now a common platform for all of PAN's shared tasks [19].

4.2 Evaluation Corpus Construction
In this section we describe the methodology for compiling the Persian PlagDet evaluation corpus used for our shared task. The corpus comprises cases of simulated, artificial, and real plagiarism. In general, there are a number of reasons why collecting only real plagiarism is not sufficient for evaluating plagiarism detectors. First, collections of real plagiarism that have been detected manually are usually skewed towards ease of detection (i.e., the more difficult a plagiarism case is to detect, the less likely it is to be detected after the fact). Second, collecting real plagiarism is expensive and time consuming. Third, a corpus comprising real plagiarism cases cannot be published due to ethical and legal issues [17]. For these reasons, methods to artificially create plagiarism, or to simulate plagiarism, are often employed to compile plagiarism corpora. These methods aim at emulating humans who try to obfuscate their plagiarism by paraphrasing reused portions of text. An artificial method for compiling plagiarism corpora includes the use of automatic paraphrasing technology to obfuscate plagiarized passages. Simulated passages of plagiarized text are created manually using human resources and crowdsourcing. Simulated methods yield more realistic cases of plagiarism compared to artificial ones, whereas artificial methods are cheaper in terms of both cost and time and hence scalable.

Simulated cases of plagiarism. To create simulated cases of plagiarism, a crowdsourcing approach has been used. For this purpose, a dedicated crowdsourcing platform has been developed, and a paraphrasing task was designed for crowd workers. Paraphrased passages obtained via crowdsourcing were reviewed by experts to ensure quality. All told, about 10% of the crowdsourced paraphrases were rejected because of poor quality. Table 1 gives an overview of the demographics of the crowd workers recruited.
Table 1. Crowd worker demographics.

  Age                25 - 30           41%
                     30 - 40           38%
                     40 - 58           21%
  Education          College           05%
                     BSc.              25%
                     MSc.              58%
                     PhD               12%
  Tasks per worker   Average           19.0
                     Std. deviation    14.5
                     Minimum           01
                     Maximum           54
  Gender             Male              74%
                     Female            26%

Artificial cases of plagiarism. In addition to simulated plagiarism based on manual paraphrasing, a large number of artificial plagiarism cases have been constructed for the corpus. As mentioned above, artificial plagiarism is cheaper and faster to compile than simulated plagiarism. To create artificial plagiarism, the previously proposed method of random obfuscation has been used [16]. The method consists of random text operations (i.e., word addition, deletion, and shuffling), semantic word variation, and POS-preserving word shuffling. A composition of these operations has been used to create low and high degrees of random obfuscation.
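
To make the random text operations concrete, the following Python sketch applies them to a tokenized passage. It is not the exact implementation used to build the corpus; the operation mix, window size, and operation count are illustrative assumptions.

    import random

    def random_obfuscation(words, n_ops=5, seed=None):
        # Apply random text operations (word addition, deletion,
        # shuffling) to a tokenized passage -- a minimal sketch of
        # artificial obfuscation, not the corpus' exact procedure.
        rng = random.Random(seed)
        words = list(words)
        for _ in range(n_ops):
            op = rng.choice(("add", "delete", "shuffle"))
            if op == "add" and words:
                # re-insert a randomly chosen word at a random position
                words.insert(rng.randrange(len(words) + 1),
                             rng.choice(words))
            elif op == "delete" and len(words) > 1:
                del words[rng.randrange(len(words))]
            elif op == "shuffle" and len(words) >= 4:
                # shuffle a short window of words in place
                k = rng.randrange(len(words) - 3)
                window = words[k:k + 4]
                rng.shuffle(window)
                words[k:k + 4] = window
        return words

Semantic word variation and POS-preserving word shuffling would additionally require a synonym resource such as FarsNet [22] and a POS tagger, and are omitted from this sketch.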
                                                                       denotes the set of detections reported by a plagiarism detector,
As a result, after the obfuscation of passages extracted from a set    S_R ⊆ S the cases detected by detections in R, and R_S ⊆ R
of source documents, the simulated and artificial cases of             detections that detect cases in S. Finally, the PlagDet measure is a
plagiarism were inserted into a selection of suspicious documents.     combination of F1, the equally-weighted harmonic mean of
Some key statistics of the plagiarism cases and the final corpus       precision and recall, and granularity:
are shown in the Tables 2 and 3.
                                                                                                (       )                                       ( )
                                                                                                                      (                (   ))
               Table 2. Plagiarism case statistics.
                   Plagiarism Case Statistics
                        Number of cases                     1628
                                                                       5. SUBTASK 1: TEXT ALIGNMENT
                                                                       This section overviews the submitted software and reports on their
 Obfuscation            None (exact copy)                   11%
                                                                       evaluation results.
                        Artificial                          81%
                        SimulatedLow                        08%
                                                            40%        5.1 Survey of Detection Approaches
                        Short (30 High
                                   - 50 words)              35%        Nine of 12 registered teams successfully submitted a software to
 Case length                                                41%        TIRA for the text alignment task. All of the nine participants
                        Medium (100-200 words)              38%
                        Long (200-300 words)                27%        submitted working notes describing their approaches. In what
                                                                       follows, we survey the approaches.
                                                                             Talebpour et al. [23] use -trie trees to index the source
                   Table 3. Corpus statistics.
                                                                       documents after preprocessing. The preprocessing steps are text
                        Corpus Statistics                              tokenization, POS tagging, text cleansing, text normalization to
 Entire corpus        Number of documents                  5830        transform text characters into a unique and normal form, removal
                      Number of plagiarism cases           4118        of stop words and frequent words, and stemming. Moreover,
 Document             Source documents                     48%         FarsNet (the Persian WordNet) [22] is used to find words’
 purpose              Suspicious documents                 52%         synonyms and synsets. This may allow for detecting cases of
                      Short (1-500 words)                  35%         paraphrased plagiarism based on replacing words with their
 Document                                                              synonyms. After preprocessing both documents, all of the words
 length               Medium (500-2500 words)              59%
                                                                       of a source document and their exact positions are inserted into a -
                      Long (2500-21000 words)              06%         trie. After inserting all source documents into a -trie structure, the
                      Small (5% - 20%)                     57%         suspicious document are iteratively analyzed, checking each word
 Plagiarism per       Medium (21% - 50%)                   15%         one by one against the –trie to find potential sources.
 Document             Much (50% - 80%)                     18%
                      Entirely (>80%)                      10%             Minaei et al. [14] employ n-grams as seed heuristic to find
                                                                       primary matches between suspicious and source documents. Cases
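
For reference, the measures can be computed as in the following Python sketch, which simplifies cases and detections to sets of character positions in the suspicious document; the full measure of [17] additionally tracks the source-document side of each case.

    from math import log2

    def plagdet_score(cases, detections):
        # cases, detections: lists of sets of character positions.
        if not cases or not detections:
            return 0.0
        case_chars = set().union(*cases)
        det_chars = set().union(*detections)
        # Equation 1: how much of each detection is true plagiarism
        prec = sum(len(r & case_chars) / len(r)
                   for r in detections) / len(detections)
        # Equation 2: how much of each case is detected
        rec = sum(len(s & det_chars) / len(s)
                  for s in cases) / len(cases)
        # Equation 3: number of detections per detected case
        detected = [s for s in cases if s & det_chars]
        gran = (sum(sum(1 for r in detections if s & r)
                    for s in detected) / len(detected)) if detected else 1.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        # Equation 4: F1 discounted by granularity
        return f1 / log2(1 + gran)

    # One true case found in two overlapping fragments:
    print(plagdet_score([set(range(100, 200))],
                        [set(range(100, 150)), set(range(140, 210))]))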
5. SUBTASK 1: TEXT ALIGNMENT
This section overviews the submitted software and reports on their evaluation results.

5.1 Survey of Detection Approaches
Nine of 12 registered teams successfully submitted software to TIRA for the text alignment task. All of the nine participants submitted working notes describing their approaches. In what follows, we survey the approaches.

Talebpour et al. [23] use trie trees to index the source documents after preprocessing. The preprocessing steps are text tokenization, POS tagging, text cleansing, text normalization to transform text characters into a unique and normal form, removal of stop words and frequent words, and stemming. Moreover, FarsNet (the Persian WordNet) [22] is used to find words' synonyms and synsets. This may allow for detecting cases of paraphrased plagiarism based on replacing words with their synonyms. After preprocessing both documents, all of the words of a source document and their exact positions are inserted into a trie. After inserting all source documents into the trie structure, the suspicious documents are iteratively analyzed, checking each word one by one against the trie to find potential sources.

Minaei et al. [14] employ n-grams as a seed heuristic to find primary matches between suspicious and source documents. Cases of plagiarism without obfuscation and similar parts of paraphrased text can be found this way. In order to detect plagiarized passages, matches closer than a specified threshold are merged. Finally, to decrease false positives, detected cases shorter than a pre-defined threshold are eliminated.
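
A minimal sketch of such n-gram seeding and merging follows; the n-gram length and merge gap are illustrative values, not the participants' actual settings.

    def seed_matches(src_tokens, susp_tokens, n=5):
        # Index source n-grams, then report all (source, suspicious)
        # token positions sharing an n-gram -- the seed heuristic.
        index = {}
        for i in range(len(src_tokens) - n + 1):
            index.setdefault(tuple(src_tokens[i:i + n]), []).append(i)
        return sorted((i, j)
                      for j in range(len(susp_tokens) - n + 1)
                      for i in index.get(tuple(susp_tokens[j:j + n]), ()))

    def merge_seeds(seeds, gap=30):
        # Merge seeds closer than `gap` tokens (in both documents)
        # into contiguous candidate passages, each stored as
        # [src_start, src_end, susp_start, susp_end].
        passages = []
        for i, j in seeds:
            if passages and i - passages[-1][1] <= gap \
                        and abs(j - passages[-1][3]) <= gap:
                s, e, t, u = passages[-1]
                passages[-1] = [s, max(e, i), min(t, j), max(u, j)]
            else:
                passages.append([i, i, j, j])
        return passages

Passages shorter than a length threshold would then be discarded to reduce false positives, as described above.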
Momtaz et al. [15] use sentence boundaries to split source and suspicious documents. After text normalization and removal of stop words and punctuation, the sentences of both documents are turned into graphs, where words represent nodes and an edge is established between each word and its four surrounding words. The graphs obtained from suspicious and source documents are compared and their similarity computed, whereas sentences of high similarity are labeled as plagiarism. Finally, to improve granularity, sentences close to each other are merged to create contiguous cases of detected plagiarism.
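
A sketch of the graph construction is given below, together with one plausible similarity function (Jaccard overlap of edge sets, which is our assumption; the paper's exact comparison may differ).

    def sentence_graph(tokens, window=2):
        # Nodes are words; each word is linked to its four
        # surrounding words (two on either side).
        edges = set()
        for i in range(len(tokens)):
            for j in range(max(0, i - window),
                           min(len(tokens), i + window + 1)):
                if i != j:
                    edges.add(frozenset((tokens[i], tokens[j])))
        return edges

    def graph_similarity(g1, g2):
        # Assumed comparison: Jaccard overlap of the edge sets.
        return len(g1 & g2) / len(g1 | g2) if (g1 or g2) else 0.0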
Gillam et al. [5] use an approach based on their previous PAN efforts. The task of finding textual matches is undertaken without directly using the textual content. The proposed approach produces a minimal representation of text by distinguishing content and auxiliary words, and produces matchable binary patterns directly from these words, depending on the number of classes of interest. Although the approach acts similar to a hashing function, no effort is made to prevent collisions. On the contrary, hash collisions over short distances are encouraged, since they prevent reverse-engineering of the patterns, and the number of coincident matches is used to indicate the extent of similarity.

Mansoorizadeh et al. [11] and Ehsan et al. [2] use sentence boundaries to split source and suspicious documents, like the approach in [15]. In both approaches, each sentence is represented under the vector space model, using TF-IDF as the weighting scheme. Finally, sentence pairs whose corresponding vectors have a cosine similarity greater than a pre-defined threshold are considered cases of plagiarism. In [2], a subsequent match merging stage improves performance with respect to granularity. Moreover, overlapping passages and extremely short passages are removed for the same reason. The lack of such a merging stage in Mansoorizadeh et al.'s [11] approach yields high granularity and therefore a poor PlagDet score.
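
The sentence-comparison step shared by [2] and [11] can be sketched as follows using scikit-learn; the threshold of 0.7 and the default tokenizer are illustrative assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def similar_sentence_pairs(src_sents, susp_sents, threshold=0.7):
        # One TF-IDF vocabulary over both documents, then all-pairs
        # cosine similarity between suspicious and source sentences.
        tfidf = TfidfVectorizer().fit_transform(src_sents + susp_sents)
        src, susp = tfidf[:len(src_sents)], tfidf[len(src_sents):]
        sims = cosine_similarity(susp, src)
        return [(i, j, sims[i, j])
                for i in range(sims.shape[0])
                for j in range(sims.shape[1])
                if sims[i, j] >= threshold]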
Like most of the submitted software, Esteki et al. [3] split documents into sentences to detect plagiarism cases. After a pre-processing phase, which includes normalization, stemming, and stop word removal, a Support Vector Machine (SVM) classifier is used to separate "similar" sentences from non-similar ones. The Levenshtein distance, the Jaccard coefficient, and the Longest Common Subsequence (LCS) are used as features extracted from pairs of sentences. Moreover, synonyms are detected to increase the likelihood of detecting paraphrased sentences.
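
A dependency-free sketch of two of these sentence-pair features follows; the Levenshtein distance (available, e.g., from the python-Levenshtein package) would be computed analogously, and the normalization of the LCS feature is our assumption.

    def jaccard(a, b):
        # Jaccard coefficient over the two sentences' word sets.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def lcs_length(a, b):
        # Longest common subsequence via dynamic programming.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if x == y \
                           else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def pair_features(sent_a, sent_b):
        ta, tb = sent_a.split(), sent_b.split()
        return [jaccard(ta, tb),
                lcs_length(ta, tb) / max(len(ta), len(tb), 1)]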
                                                                                 source passage has been paraphrased before being added
     Gharavi et al. [4] use a deep learning approach to represent                as suspicious passage to the suspicious documents)
sentences of suspicious and source documents as vectors. For this
purpose, they use Word2Vec to extract words’ vectors and to            6.1.1 Dataset Overview
compute sentence vectors as average word vectors. The most             Table 6 shows an overview of the submitted text alignment
similar sentences between pairs of source document and                 corpora in terms of the corpus statistics also reported for our
suspicious document are found using the cosine similarity, the         corpus. Mashhadirajab corpus [13] is the biggest one in terms of
Jaccard coefficient, reporting them as plagiarism cases.               number of documents, whereas Abnar corpus contains the largest
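
A sketch of this sentence representation follows, assuming a mapping `embeddings` from word to vector (for instance, the keyed vectors of a trained gensim Word2Vec model); the dimensionality is illustrative.

    import numpy as np

    def sentence_vector(tokens, embeddings, dim=100):
        # Average the vectors of all in-vocabulary words.
        vecs = [embeddings[w] for w in tokens if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0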
Mashhadirajab et al. [12] use the vector space model (VSM) with TF-IDF weighting to create sentence vectors from source and suspicious documents. To gain better results, they use an SVM neural net to predict the obfuscation type in order to adjust the required parameters. Moreover, to calculate the semantic similarity between sentences, FarsNet [22] is used to extract the synsets of terms. Finally, within extension and filtering steps, similar sentences that are close to each other are merged, while passages that either overlap or are too short are removed.

5.2 Evaluation Results
Table 4 shows the overall performance and runtimes of the nine submitted text alignment approaches. As can be seen, the approach of Mashhadirajab [12] has achieved the highest PlagDet score on the complete corpus and is hence ranked highest. Regarding runtime, the submissions of Gharavi [4] and Minaei [14] are outstanding: they process the entire corpus in only 1:03 and 1:33 minutes, respectively. Table 5 shows the performance of the submitted software dependent on the obfuscation types in the corpus. Although, due to the lack of true positives, no performance values can be computed for the sub-corpus without plagiarism, false positive detections on this sub-corpus at least influence the overall performance of participants on the whole corpus [18]. Gharavi [4] is ranked first in detection performance with the highest PlagDet for "No obfuscation," and Mashhadirajab [12] achieves the best performance for both "Artificial" and "Simulated" plagiarism. Among all participants, Mashhadirajab achieves the best recall across all parts of the corpus, whereas Talebpour [23] and Gharavi [4] outperform it in precision.

                               Table 4. Overall detection performance for the nine approaches submitted.


  Rank / Team                     Runtime (h:m:s)                          Recall          Precision                      Granularity              F-Measure                    PlagDet
  1 Mashhadirajab                            02:22:48                      0.9191            0.9268                         1.0014                  0.9230                          0.9220
  2 Gharavi                                  00:01:03                      0.8582            0.9592                              1                  0.9059                          0.9059
  3 Momtaz                                   00:16:08                      0.8504            0.8925                              1                  0.8710                          0.8710
  4 Minaei                                   00:01:33                      0.7960            0.9203                         1.0396                  0.8536                          0.8301
  5 Esteki                                   00:44:03                      0.7012            0.9333                              1                  0.8008                          0.8008
  6 Talebpour                                02:24:19                      0.8361            0.9638                         1.2275                  0.8954                          0.7749
  7 Ehsan                                    00:24:08                      0.7049            0.7496                              1                  0.7266                          0.7266
  8 Gillam                                   21:08:54                      0.4140            0.7548                         1.5280                  0.5347                          0.3996
  9 Mansourizadeh                            00:02:38                      0.8065            0.9000                         3.5369                  0.8507                          0.3899




Table 5. Detection performance of the nine approaches submitted, dependent on obfuscation type.

                  No obfuscation                        Artificial Obfuscation                Simulated Obfuscation
  Team            Recall  Prec.   Gran.   PlagDet       Recall  Prec.   Gran.   PlagDet       Recall  Prec.   Gran.   PlagDet
  Mashhadirajab   0.9939  0.9403  1       0.9663        0.9473  0.9416  1.0006  0.9440        0.8045  0.9336  1.0047  0.8613
  Gharavi         0.9825  0.9762  1       0.9793        0.8979  0.9647  1       0.9301        0.6895  0.9682  1       0.8054
  Momtaz          0.9532  0.8965  1       0.9240        0.9019  0.8979  1       0.8999        0.6534  0.9119  1       0.7613
  Minaei          0.9659  0.8663  1.0113  0.9060        0.8514  0.9324  1.0240  0.8750        0.5618  0.9110  1.1173  0.6422
  Esteki          0.9781  0.9689  1       0.9735        0.7758  0.9473  1       0.8530        0.3683  0.8982  1       0.5224
  Talebpour       0.9755  0.9775  1       0.9765        0.8971  0.9674  1.2074  0.8149        0.5961  0.9582  1.4111  0.5788
  Ehsan           0.8065  0.7333  1       0.7682        0.7542  0.7573  1       0.7557        0.5154  0.7858  1       0.6225
  Gillam          0.7588  0.6257  1.4857  0.5221        0.4236  0.7744  1.5351  0.4080        0.2564  0.7748  1.5308  0.2876
  Mansourizadeh   0.9615  0.8821  3.7740  0.4080        0.8891  0.9129  3.6011  0.4091        0.4944  0.8791  3.1494  0.3082
6. SUBTASK 2: CORPUS CONSTRUCTION
This section overviews the five submitted text alignment corpora. In the first subsection, we survey the submitted corpora and give a statistical overview of them. In the next subsection, the results of the validation and evaluation of the submitted corpora are presented.

6.1 Survey of Submitted Corpora
All of the submitted corpora consist of Persian mono-lingual plagiarism for the task of text alignment, except for the Mashhadirajab corpus [13], which also contains a set of cross-lingual English-Persian plagiarism cases. All of the corpora are formatted in accordance with the PAN standard annotation format for text alignment corpora. In particular, this includes two sets of documents, namely source documents and suspicious documents, where the latter are to be analyzed for plagiarism from any of the source documents. The annotations of plagiarism cases are stored separately from the text documents within XML documents for each pair of suspicious and source documents. Therein, each plagiarism case is annotated as follows (a sketch of generating such an annotation is given after this list):

• Start position and length of the source passage in the source document

• Start position and length of the suspicious passage in the suspicious document

• Obfuscation type (e.g., indicating the way a source passage has been paraphrased before being added to the suspicious document)
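For illustration, such an annotation can be generated as below. The attribute names follow the conventions of the PAN text alignment corpora as we recall them and are assumptions here; they should be checked against the official format specification before use.

    import xml.etree.ElementTree as ET

    def case_annotation(susp_doc, src_doc, this_offset, this_length,
                        source_offset, source_length, obfuscation):
        # One <feature> element per plagiarism case, stored in the
        # XML file of a suspicious/source document pair (names assumed).
        root = ET.Element("document", reference=susp_doc)
        ET.SubElement(root, "feature", name="plagiarism",
                      obfuscation=obfuscation,
                      this_offset=str(this_offset),
                      this_length=str(this_length),
                      source_reference=src_doc,
                      source_offset=str(source_offset),
                      source_length=str(source_length))
        return ET.tostring(root, encoding="unicode")

    print(case_annotation("suspicious-document00001.txt",
                          "source-document00005.txt",
                          1280, 430, 512, 401, "artificial"))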
6.1.1 Dataset Overview
Table 6 shows an overview of the submitted text alignment corpora in terms of the corpus statistics also reported for our corpus. The Mashhadirajab corpus [13] is the biggest one in terms of the number of documents, whereas the Abnar corpus contains the largest number of plagiarism cases. The Samim corpus [21] includes larger documents compared to the other corpora, whereas a large volume of small documents has been used for the construction of the ICTRC corpus. The Samim corpus and the ICTRC corpus comprise the largest and the smallest plagiarism cases, respectively. A variety of different obfuscation strategies have been employed. No obfuscation (i.e., exact copy) and artificial obfuscation (random text operations) are two common strategies.

The length distributions of documents and plagiarized passages are depicted in Figures 1 and 2. Here, the ICTRC corpus stands out, containing the smallest documents and plagiarized passages among all submitted corpora. Figure 3 shows the distribution of the plagiarism ratio per suspicious document. The ratio of plagiarism per suspicious document in the Samim corpus is distributed more uniformly compared to the other submitted corpora. In what follows, the documents used to compile the corpora as well as the construction approaches are discussed in detail.

Table 6. Corpus statistics for the submitted corpora.

                                                           Niknam   Samim   Mashhadirajab   ICTRC   Abnar
  Entire corpus         Number of documents                  3218    4707       11089        5755    2470
                        Number of plagiarism cases           2308    5862       11603        3745   12061
  Document purpose      Source documents                      52%     50%         48%         49%     20%
                        Suspicious documents                  48%     50%         52%         51%     80%
  Document length       Short (1-10000 words)                 35%      2%         53%         91%     51%
                        Medium (10000-30000 words)            56%     48%         32%          8%     48%
                        Long (>30000 words)                    9%     50%         15%          1%      1%
  Plagiarism per        Hardly (<20%)                         71%     29%         39%         57%     29%
  document              Medium (20%-50%)                      28%     25%         14%         37%     60%
                        Much (50%-80%)                         1%     31%         20%          6%     10%
                        Entirely (>80%)                         -     15%         27%           -      1%
  Case length           Short (1-500 words)                   21%     15%          6%         51%     45%
                        Medium (500-1500 words)               76%     22%         52%         46%     54%
                        Long (>1500 words)                     3%     63%         42%          3%      1%
  Obfuscation types     No obfuscation (exact copy)           25%     40%         17%         10%     22%
                        Artificial (word replacement)         27%       -           -           -       -
                        Artificial (synonym replacement)      25%       -           -           -       -
                        Artificial (POS-preserving shuffling) 23%       -           -           -       -
                        Random                                  -     40%           -         81%       -
                        Semantic                                -     20%           -           -     15%
                        Near Copy                               -       -         28%           -       -
                        Summarizing                             -       -         33%           -       -
                        Paraphrasing                            -       -          6%           -       -
                        Modified Copy                           -       -          4%           -       -
                        Circle Translation                      -       -          3%           -     21%
                        Semantic-based meaning                  -       -          1%           -       -
                        Auto Translation                        -       -          2%           -       -
                        Translation                             -       -          6%           -       -
                        Simulated                               -       -           -          9%       -
                        Shuffle Sentences                       -       -           -           -     21%
                        Combination                             -       -           -           -     21%

6.1.2 Document Sources
The first step in compiling a plagiarism detection corpus is choosing the documents which will be used as the sets of source documents and suspicious documents. Many plagiarism detection corpora intend to simulate plagiarism in technical texts, so Wikipedia articles and scientific papers are often employed as sources of source and suspicious documents in these corpora. This also pertains to the corpora submitted, which mainly employ journal articles and Wikipedia articles. Wikipedia articles have been used as a resource for compiling the ICTRC and Niknam corpora. Niknam used 3000 documents larger than 4000 characters, and ICTRC used about 6000 documents larger than 1500 characters. Abnar used texts from a set of novels that were translated to Persian. Despite the genre of books, the documents found in the corpus are not as large as might be expected. Mashhadirajab [13] and Samim [21] used scientific papers to compile their corpora. Mashhadirajab used a combination of Wikipedia articles (40%), articles from the Computer Society of Iran Computer Conference (CSICC) (13%), theses available online (13%), and Persian open access articles (34%). Samim also collected Persian open access papers from peer-reviewed journals to compile their text alignment corpus. The papers used include papers from the humanities (57%), science (25%), veterinary science (10%), and other related subjects (8%).

6.1.3 Obfuscation Synthesis
The second step in compiling a plagiarism detection corpus is to obfuscate passages selected from source documents and then insert them into suspicious documents. Obfuscating text passages aims at emulating plagiarism cases whose authors try to conceal the fact that they plagiarized, making it more difficult for human reviewers and plagiarism detection systems alike to identify the plagiarized passages afterwards. As discussed above, creating obfuscated plagiarism manually is laborious and expensive, so most participants resorted to automatic obfuscation methods. It is remarkable that two of the corpora (the ones of Mashhadirajab and ICTRC) comprise plagiarism that has been manually created. Otherwise, a variety of different approaches have been employed for obfuscation (see Table 6, rows "Obfuscation types"). All of the submitted corpora also contain a portion of plagiarized passages without any obfuscation to simulate verbatim copying.

Niknam employed a set of text operations consisting of the addition, deletion, and shuffling of words, replacing words with their synonyms, and POS-preserving word replacement. Similar obfuscation strategies have been used to compile Samim's corpus: it contains "Random Text Operations" and "Semantic Word Variation" in addition to "No obfuscation." In addition to these obfuscation types, the authors of the ICTRC corpus used a crowdsourcing platform for paraphrasing text passages. About 30 people of various ages, both genders, and different levels of education participated in the paraphrasing process. Abnar's corpus comprises obfuscation approaches such as replacing words with synonyms, shuffling sentences, circular translation, and combinations of the aforementioned ones. The circular translation approach consists of translating the text to an intermediate language and then translating it back to the original one, hoping that the resulting text will differ significantly from the original while maintaining its meaning. From a diversity point of view, Mashhadirajab's corpus contains the most variety in terms of obfuscation. In addition to artificial and simulated cases, they used summarizing, cyclic translation, and text manipulation approaches to create cases of plagiarism. Moreover, the corpus also comprises cross-lingual plagiarism, where source documents have been translated to Persian using manual and automatic translation.
6.2 Corpus Validation
In order to validate the submitted corpora, we analyzed them quantitatively and qualitatively. For the latter, samples were drawn from each corpus and obfuscation type for manual review. The review involved validating the plagiarism annotations, such as the offsets and lengths of annotated plagiarism in both source and suspicious documents. Moreover, the suspicious passages and their corresponding sources were checked manually to observe the impact of the different obfuscation strategies as well as the level of obfuscation. Altogether, no important issues were found among the studied samples during this review.

In addition to the manual review, we also analyzed the corpora quantitatively: Figures 1 and 2 depict the length distributions of the documents and the plagiarism cases in the corpora. Both Abnar's corpus and the ICTRC corpus have clear expected values, whereas the other corpora are more evenly distributed. Figure 3 depicts the ratio of plagiarism per document, showing that the ratios are quite unevenly distributed across corpora; Niknam's corpus and the ICTRC corpus comprise mostly suspicious documents with a small ratio of plagiarism. Figures 4 and 5 show the distribution of plagiarized passages in terms of where they start within suspicious documents (i.e., their character offset), and where they start within source documents. The distributions of start offsets within suspicious documents are similar across all corpora, with a negative bias against offsets at the beginning of a suspicious document (see Figure 4). The distributions are also similar for the start offsets within source documents, with one notable exception: the source passages of Samim's corpus have almost always been chosen from the same offsets of the source documents, which is a clear bias and may allow for trivial detection.

Finally, we analyzed the plagiarized passages in the submitted corpora with regard to the similarity between source passage and suspicious passage. The experiment consists of comparing source passages with suspicious passages using 10 retrieval models. Each model is an n-gram vector space model (VSM), where n ranges from 1 to 10 words, employing stop word removal, TF-weighting, and the cosine similarity [17]. For high-quality corpora, a pattern similar to that of the PAN corpora is expected. Since there are many obfuscation types to choose from, we only compare a selection: the simulated plagiarism cases of Mashhadirajab and ICTRC are compared to the PAN corpora (Figure 6). Moreover, the artificial parts of all corpora are compared to each other (Figure 7). Abnar's corpus is omitted since it lacks artificial obfuscation. Almost all of the corpora show the same patterns of similarity for the different ranges of n, except Mashhadirajab's corpus, which has a higher range of similarity in comparison with the others.
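
This similarity experiment can be reproduced along the lines of the following sketch; stop word removal is omitted here for brevity.

    from collections import Counter
    from math import sqrt

    def ngram_cosine(src_tokens, susp_tokens, n):
        # Cosine similarity under a word n-gram VSM with TF weighting.
        a = Counter(tuple(src_tokens[i:i + n])
                    for i in range(len(src_tokens) - n + 1))
        b = Counter(tuple(susp_tokens[i:i + n])
                    for i in range(len(susp_tokens) - n + 1))
        dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) \
             * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Similarity profile over n = 1..10, as plotted in Figures 6 and 7:
    # profile = [ngram_cosine(src, susp, n) for n in range(1, 11)]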




Figure 1. Length distribution of documents.

Figure 2. Length distribution of fragments.

Figure 3. Ratio of plagiarism per document.

Figure 4. Start position of plagiarized fragments in suspicious documents.

Figure 5. Start position of plagiarized fragments in source documents.

Figure 6. Comparison of the simulated parts of the Mashhadirajab and ICTRC corpora.

Figure 7. Comparison of the artificial parts of the Niknam, Samim, Mashhadirajab, and ICTRC corpora.


6.3 Corpus Evaluation
Exploiting the virtues of TIRA, our final experiment was to run the nine submitted detection approaches on the five submitted corpora, providing a first impression of how difficult it is to detect plagiarism within these corpora. Table 7 overviews the results of this experiment. Unfortunately, not all submitted approaches succeeded in processing all corpora. One reason was scalability issues: since some of the submitted corpora are significantly larger than our evaluation corpus, it seems participants did not pay a lot of attention to scalability. The approaches of Talebpour, Mashhadirajab, and Gillam failed to process the corpora in time. The approaches of Momtaz and Esteki failed to process some of the corpora at first; the results of the former are only partially reliable to date, whereas the latter could be fixed in time. This shows that submitting datasets to shared tasks presents its own challenges. Participants will be invited to fix their software to make it work on all corpora, so that further results may become available after the publication of this paper, e.g., on TIRA's web page. Considering the detection performance, it can be seen that the PlagDet scores are generally lower compared to our corpus, except for the ICTRC corpus, on which the same performance scores have been reached. This shows that the submitted corpora present their own challenges, rendering them more difficult and presenting future researchers with new opportunities for contributions.

Given the results from all our experiments, the submitted corpora are of reasonable quality. Although some of them are too easy to be solved and comprise a biased sample of plagiarism cases, the diversity of the corpora ensures that future evaluations can be done with confidence as long as all available datasets are employed.


                        Table 7. PlagDet performance of some submitted approaches on the submitted corpora.
 Team                                    Niknam              Samim             Mashhadirajab                 ICTRC             Abnar
 Gharavi                                 0.8657              0.7386                0.5784                    0.9253            0.3927
 Momtaz                                  0.8161                 -                     -                      0.8924               -
 Minaei                                  0.9042              0.6585                0.3877                    0.8633            0.7218
 Esteki                                  0.5758                 -                     -                         -              0.3830
 Ehsan                                   0.7196              0.5367                0.4014                    0.7104            0.5890
 Mansourizadeh                           0.2984                 -                  0.1286                       -              0.2687


7. CONCLUSION
In conclusion, our shared task has attracted considerable attention from the community of scientists working on plagiarism detection. The shared task has served as a means to establish a new state of the art in performance evaluation for Persian plagiarism detection. Altogether, six new evaluation corpora are available now, and nine detection approaches have been evaluated on them. The results show that Persian plagiarism detection is far from being a solved problem. In addition, our contributions broaden the scope of the text alignment task, which has been studied mostly for English until now. This may allow future work on plagiarism detection approaches that work on both languages simultaneously.

8. ACKNOWLEDGMENTS
This work has been funded by the ICT Research Institute, ACECR, under the partial support of the Vice Presidency for Science and Technology of Iran - Grant No. 1164331. The work of Paolo Rosso has been partially funded by the SomEMBED MINECO TIN2015-71147-C2-1-P research project and by the Generalitat Valenciana under the grant ALMAMATER (PrometeoII/2014/030). We would like to thank the participants of the competition for their dedicated work. Our special thanks go to the renowned experts who served on the organizing committee for their contributions and devoted work to make this shared task possible. We would like to thank Javad Rafiei and Khadijeh Khoshnava for their help in the construction of the evaluation corpus. We are also immensely grateful to Vahid Zarrabi for his comments and valuable help along the way, which greatly assisted this challenging shared task.
9. REFERENCES
[1] Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., and Chikhi, S. 2015. Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection. CEUR-WS.org, vol. 1587, pp. 111-122.

[2] Ehsan, N., and Shakery, A. 2016. A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[3] Esteki, F., and Safi Esfahani, F. 2016. A Plagiarism Detection Approach Based on SVM for Persian Texts. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[4] Gharavi, E., Bijari, K., Zahirnia, K., and Veisi, H. 2016. A Deep Learning Approach to Persian Plagiarism Detection. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[5] Gillam, L., and Vartapetiance, A. 2016. From English to Persian: Conversion of Text Alignment for Plagiarism Detection. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[6] Gollub, T., Burrows, S., and Stein, B. 2012. First experiences with TIRA for reproducible evaluation in information retrieval. In SIGIR (Vol. 12, pp. 52-55).

[7] Gollub, T., Stein, B., and Burrows, S. 2012. Ousting ivory tower research: towards a web framework for providing experiments as a service. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1125-1126). ACM.

[8] Gollub, T., Stein, B., Burrows, S., and Hoppe, D. 2012. TIRA: Configuring, executing, and disseminating information retrieval experiments. In 2012 23rd International Workshop on Database and Expert Systems Applications (pp. 151-155). IEEE.

[9] Hopfgartner, F., Hanbury, A., Müller, H., Kando, N., Mercer, S., Kalpathy-Cramer, J., Potthast, M., Gollub, T., Krithara, A., Lin, J., and Balog, K. 2015. Report on the Evaluation-as-a-Service (EaaS) expert workshop. In ACM SIGIR Forum (Vol. 49, No. 1, pp. 57-65). ACM.

[10] Khoshnavataher, K., Zarrabi, V., Mohtaj, S., and Asghari, H. 2015. Developing Monolingual Persian Corpus for Extrinsic Plagiarism Detection Using Artificial Obfuscation. Notebook for PAN at CLEF 2015. CLEF (Working Notes).

[11] Mansoorizadeh, M., and Rahgooy, T. 2016. Persian Plagiarism Detection Using Sentence Correlations. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[12] Mashhadirajab, F., and Shamsfard, M. 2016. A Text Alignment Algorithm Based on Prediction of Obfuscation Types Using SVM Neural Network. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[13] Mashhadirajab, F., Shamsfard, M., Adelkhah, R., Shafiee, F., and Saedi, S. 2016. A Text Alignment Corpus for Persian Plagiarism Detection. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[14] Minaei, B., and Niknam, M. 2016. An n-gram based Method for Nearly Copy Detection in Plagiarism Systems. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[15] Momtaz, M., Bijari, K., Salehi, M., and Veisi, H. 2016. Graph-based Approach to Text Alignment for Plagiarism Detection in Persian Documents. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[16] Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., and Rosso, P. 2009. Overview of the 1st international competition on plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse (p. 1).

[17] Potthast, M., Stein, B., Barrón-Cedeño, A., and Rosso, P. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 997-1005). Association for Computational Linguistics.

[18] Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., and Stein, B. 2012. Overview of the 4th International Competition on Plagiarism Detection. In CLEF (Online Working Notes/Labs/Workshop).

[19] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., and Stein, B. 2014. Improving the Reproducibility of PAN's Shared Tasks. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 268-299). Springer International Publishing.

[20] Potthast, M., Hagen, M., Göring, S., Rosso, P., and Stein, B. 2015. Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF.

[21] Rezaei Sharifabadi, M., and Eftekhari, S. A. 2016. Mahak Samim: A Corpus of Persian Academic Texts for Evaluating Plagiarism Detection Systems. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.

[22] Shamsfard, M. 2008. Developing FarsNet: A Lexical Ontology for Persian. In Proceedings of the 4th Global WordNet Conference.

[23] Talebpour, A., Shirzadi, M., and Aminolroaya, Z. 2016. Plagiarism Detection based on a Novel Trie-based Approach. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.