      PAN 2015 Shared Task on Plagiarism Detection:
       Evaluation of Corpora for Text Alignment ?
                        Notebook for PAN at CLEF 2015

    Marc Franco-Salvador1 , Imene Bensalem2 , Enrique Flores1 , Parth Gupta1 , and
                                   Paolo Rosso1
                         1
                    Universitat Politècnica de València, Spain
     mfranco@prhlt.upv.es, {eflores,pgupta,prosso}@dsic.upv.es
                     2
                       Constantine 2 University, Algeria
                        bens.imene@gmail.com


       Abstract. In this paper we describe and evaluate the corpora submitted to the
       PAN 2015 shared task on plagiarism detection for text alignment. We received
       mono- and cross-language corpora in the following languages (pairs): English,
       Persian, Chinese, and Urdu-English, English-Persian. We present an independent
       section for each submitted corpus including statistics, discussion of the obfusca-
       tion techniques employed, and assessment of the corpus quality.


Keywords: Plagiarism detection, Text re-use detection, Cross-language, Evaluation,
Corpus construction

1   Introduction
Plagiarism detection [1, 4] refers to automatically identifying the plagiarized fragments of
a suspicious document within a set of source documents. When the source of plagiarism is
in a different language, we speak of cross-language (CL) plagiarism detection [5, 2, 3].
Since 2012, the Uncovering Plagiarism, Authorship and Social Software Misuse3 (PAN)
Lab at CLEF has organized the shared task on plagiarism detection, which is divided into
two subtasks: source retrieval and text alignment [6, 7]. Given a suspicious document
and a web search API, the source retrieval subtask consists of retrieving all plagiarized
sources while minimizing retrieval costs. Given a pair of documents, the text alignment
subtask consists of identifying all contiguous maximal-length passages of plagiarized
text between them.
    The PAN 2015 subtask on text alignment4 offered a new challenge to participants:
the submission of corpora. This new initiative met with considerable acceptance,
attracting six participating teams and eight submissions. They applied different
?
   This research has been carried out within the framework of the European Commission WIQ-
   EI IRSES (no. 269180) and DIANA - Finding Hidden Knowledge in Texts (TIN2012-38603-
   C02) projects, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent
   Systems.
 3 http://pan.webis.de/
 4 http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/plagiarism-detection.html
Table 1. Corpus statistics for 426 documents and 193 plagiarism cases in cheema15’s English
corpus.


Document statistics
  Document purpose:        source documents 50 %; suspicious with plagiarism 25 %; suspicious w/o plagiarism 25 %
  Plagiarism per document: hardly (5%-20%) 95 %; medium (20%-50%) 5 %; much (50%-80%) 0 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 100 %; medium (30k-300k characters) 0 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: real plagiarism 100 %
  Case length:  short (50-1k characters) 89 %; medium (1k-3k characters) 11 %; long (3k-30k characters) 0 %



obfuscation techniques to text pairs, or collected real plagiarism fragments, in order
to generate the plagiarism cases of the corpora. Eight corpora were submitted: six
monolingual (Chinese, Persian, and four English) and two CL (Urdu-English and
English-Persian). Evaluating whether a submitted corpus is suitable for evaluation
purposes requires an in-depth analysis of its content. Therefore, in this paper we
report on our manual assessment of the submitted corpora with regard to the quality
and realism of the plagiarism cases.


2      Monolingual Text Alignment Corpora

In this first part we study the submitted monolingual corpora. Each subsection title
gives the name of the team and the language of the plagiarism cases. The PAN 2015
shared subtask on text alignment encouraged participants to submit corpora in languages
with fewer plagiarism detection resources than English. For the analysis of the
plagiarism cases, we used Google Translate to convert the randomly5 selected cases into
English, in order to verify that the plagiarized fragment and the suspicious document
share the same topic and structure.

2.1     cheema15 - English

The corpus statistics are shown in Table 1. We observe that the whole corpus consists
of English paraphrasing cases. PhD, MSc, and undergraduate students collaborated with
the authors to manually generate and annotate the cases. We found some forced
substitutions (e.g., "PC Project" replaced by "computer program"), as well as minor
issues with little impact on plagiarism detection, e.g., source and suspicious
fragments starting mid-sentence or mid-word. Overall, the manual study of several
random samples gave a positive impression of the plagiarism cases and of the corpus's
usability for evaluation.
 5
     In this paper we employed four reviewers and an average of eight cases per dataset and re-
     viewer. Random cases were independently selected for each reviewer.
Table 2. Corpus statistics for 160 documents and 75 plagiarism cases in alvi15’s English corpus.


Document statistics
  Document purpose:        source documents 44 %; suspicious with plagiarism 47 %; suspicious w/o plagiarism 9 %
  Plagiarism per document: hardly (5%-20%) 56 %; medium (20%-50%) 11 %; much (50%-80%) 33 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 99 %; medium (30k-300k characters) 1 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: verbatim copy 33.33 %; artificial obfuscation 33.33 %; real plagiarism 33.33 %
  Case length:  short (50-1k characters) 99 %; medium (1k-3k characters) 1 %; long (3k-30k characters) 0 %



2.2   alvi15 - English

The authors of this English corpus employed three types of plagiarism (see Table 2):
verbatim copies, artificial obfuscation, and real plagiarism cases. The first type
simply inserts copies of fragments of a source into a suspicious document. In the
obfuscated cases, words and nouns were automatically replaced by synonyms and pronouns,
respectively. However, some cases lose semantic relatedness, e.g., "already big enough
to speak" replaced by "already great adequate to say". The authors also used character
substitution for this type of plagiarism. The real plagiarism cases, extracted from the
Bible, show a high level of manual modification while preserving the meaning. On the
other hand, we found some errors in the encoding of the corpus's XML files: wrong case
offsets (some starting mid-word), and the attribute "type" set to "real" in all cases,
instead of only for the real plagiarism cases with "artificial" for the rest. Despite
these errors, our overall opinion of this corpus is positive, especially regarding the
real plagiarism cases. Its quality could be increased in future versions.
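Offset errors like these can be caught mechanically. The following sketch assumes PAN-style XML annotations, with `feature` elements carrying `this_offset` and `this_length` attributes; the attribute names may differ in a given corpus. It flags cases whose offsets fall outside the suspicious document or start mid-word:

```python
import xml.etree.ElementTree as ET

def check_annotations(xml_text, susp_text):
    """Check plagiarism annotations against the suspicious document:
    offsets must lie inside the text and should not start mid-word,
    a common symptom of mis-computed offsets."""
    issues = []
    root = ET.fromstring(xml_text)
    for feat in root.iter("feature"):
        if feat.get("name") != "plagiarism":
            continue
        off = int(feat.get("this_offset"))
        length = int(feat.get("this_length"))
        if off < 0 or off + length > len(susp_text):
            issues.append((off, "offset outside document"))
            continue
        # Mid-word start: the characters on both sides of the
        # annotation boundary are alphanumeric.
        if off > 0 and susp_text[off - 1].isalnum() and susp_text[off].isalnum():
            issues.append((off, "starts mid-word"))
    return issues
```

Running such a check over each annotation file during corpus construction would surface the mid-word starts reported above before release.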


2.3   palkovskii15 - English

As shown in Table 3, this corpus consists of English verbatim cases and automatically
obfuscated plagiarism cases of three types: random, translation, and summary. Random
obfuscation is quantified by degrees that measure the level of automatic obfuscation,
implemented through random word reordering. The translation obfuscation cases used a
chain of translations through ten intermediate languages, employing the MyMemory6,
Google7, and Bing8 translators. The summary obfuscation cases were created by means of
an automatic summarization tool. The manual analysis of several cases gave an
average-to-negative impression of the corpus's suitability for practical usage. It seems
that the high level of random obfuscation, the chain of translations, and the unspecified summarization
 6 https://mymemory.translated.net/
 7 https://translate.google.com/
 8 https://www.bing.com/translator/
Table 3. Corpus statistics for 3,125 documents and 1,976 plagiarism cases in palkovskii15’s
English corpus.


Document statistics
  Document purpose:        source documents 62 %; suspicious with plagiarism 18 %; suspicious w/o plagiarism 20 %
  Plagiarism per document: hardly (5%-20%) 82 %; medium (20%-50%) 17 %; much (50%-80%) 0 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 97 %; medium (30k-300k characters) 3 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: artificial obfuscation (summary|random) 69 %; translation-chain 31 %
  Case length:  short (50-1k characters) 96 %; medium (1k-3k characters) 4 %; long (3k-30k characters) 0 %



Table 4. Corpus statistics for 2,744 documents and 2,747 plagiarism cases in mohtaj15’s English
corpus.


Document statistics
  Document purpose:        source documents 71.8 %; suspicious with plagiarism 17.6 %; suspicious w/o plagiarism 10.6 %
  Plagiarism per document: hardly (5%-20%) 86 %; medium (20%-50%) 14 %; much (50%-80%) 0 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 81 %; medium (30k-300k characters) 19 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: verbatim copy 8 %; artificial obfuscation 77 %; manual obfuscation 15 %
  Case length:  short (50-1k characters) 99 %; medium (1k-3k characters) 1 %; long (3k-30k characters) 0 %




tool produced a high number of senseless text fragments and unrelated cases. Finally,
we found overlaps between this corpus and the PAN 2013 text alignment corpus9; e.g.,
suspicious-document00005 and source-document01090 are present in both corpora.
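A degree-controlled random obfuscation of the kind described above can be sketched as follows. This is a toy illustration, since the corpus authors did not document their exact procedure:

```python
import random

def random_reorder(text, degree, seed=0):
    """Swap a fraction `degree` (0.0-1.0) of adjacent word pairs,
    so the degree directly quantifies the obfuscation level."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    n_swaps = int(degree * (len(words) - 1))
    for i in rng.sample(range(len(words) - 1), n_swaps):
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```

At high degrees, most adjacent pairs are swapped, which quickly yields senseless fragments of the kind we observed in this corpus.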



2.4   mohtaj15 - English


This English corpus (see Table 4) contains plagiarism cases of three types: verbatim
copies, random obfuscation, and manual obfuscation. Random obfuscation is performed at
two levels (low and high), with more word reordering and synonym substitution at the
high level. We observed senseless and semantically unrelated cases of this type,
especially at the high level. The manual obfuscation cases were manually paraphrased
and are in general suitable for plagiarism detection evaluation. The random obfuscation
should be improved to make the corpus representative for evaluation.
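A minimal sketch of two-level synonym substitution follows; the lookup table and the low/high rule are illustrative assumptions, not the resources actually used by the authors:

```python
def synonym_substitute(text, synonyms, level):
    """Replace words with synonyms: at the 'high' level every word
    with an entry is replaced; at the 'low' level only words at even
    positions are, giving a weaker obfuscation."""
    out = []
    for i, w in enumerate(text.split()):
        cands = synonyms.get(w.lower())
        if cands and (level == "high" or i % 2 == 0):
            out.append(cands[0])
        else:
            out.append(w)
    return " ".join(out)
```

For example, with `{"big": ["large"], "fast": ["quick"]}`, "big fast car" becomes "large quick car" at the high level but "large fast car" at the low level. A naive table applied without context is exactly what produces the semantically unrelated cases noted above.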
Table 5. Corpus statistics for 82 documents and 109 plagiarism cases in kong15’s Chinese corpus.


Document statistics
  Document purpose:        source documents 95 %; suspicious with plagiarism 5 %; suspicious w/o plagiarism 0 %
  Plagiarism per document: hardly (5%-20%) 0 %; medium (20%-50%) 100 %; much (50%-80%) 0 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 35 %; medium (30k-300k characters) 65 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: real plagiarism 100 %
  Case length:  short (50-1k characters) 92 %; medium (1k-3k characters) 6 %; long (3k-30k characters) 2 %


Table 6. Corpus statistics for 1,522 documents and 411 plagiarism cases in khoshnava15’s Persian
corpus.


Document statistics
  Document purpose:        source documents 53 %; suspicious with plagiarism 21 %; suspicious w/o plagiarism 26 %
  Plagiarism per document: hardly (5%-20%) 47 %; medium (20%-50%) 53 %; much (50%-80%) 0 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 99 %; medium (30k-300k characters) 1 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: verbatim copy 31 %; artificial obfuscation 69 %
  Case length:  short (50-1k characters) 42 %; medium (1k-3k characters) 58 %; long (3k-30k characters) 0 %



2.5    kong15 - Chinese

The corpus in Table 5 consists of real plagiarism cases in Chinese. Unfortunately, the
XML files contain no information about the strategy employed, so it is impossible to
determine how the real cases were created. In addition, the manual analysis of several
cases revealed no topical or structural relatedness between the annotated fragments;
some offset tagging errors may have occurred. Note also the low number of suspicious
documents, which may produce non-significant results when this corpus is used for
evaluation.
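Suspected offset errors of this kind can be screened with a crude lexical-overlap check between the two annotated fragments. This is an illustrative heuristic only: for unsegmented Chinese text, character n-grams should replace whitespace tokens, and cross-language pairs must first be translated:

```python
def fragment_overlap(src_fragment, susp_fragment):
    """Jaccard overlap between the token sets of an annotated source
    fragment and its suspicious counterpart; near-zero overlap on a
    supposedly real plagiarism case suggests mis-tagged offsets."""
    a = set(src_fragment.lower().split())
    b = set(susp_fragment.lower().split())
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)
```

Ranking annotated cases by this score and reviewing the lowest-scoring ones is a cheap way to prioritize manual inspection.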


2.6    khoshnava15 - Persian

The corpus in Table 6 consists of Persian verbatim and random obfuscation cases.
Despite the scarce information about how the corpus was created, we note the high quality
 9 http://www.uni-weimar.de/medien/webis/events/pan-13/pan13-web/plagiarism-detection.html
Table 7. Corpus statistics for 21,429 documents and 5,606 plagiarism cases in asghari15’s
English-Persian corpus.


Document statistics
  Document purpose:        source documents 74 %; suspicious with plagiarism 13 %; suspicious w/o plagiarism 13 %
  Plagiarism per document: hardly (5%-20%) 88 %; medium (20%-50%) 12 %; much (50%-80%) 0 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 85 %; medium (30k-300k characters) 15 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: translated (English to Persian) 100 %
  Case length:  short (50-1k characters) 100 %; medium (1k-3k characters) 0 %; long (3k-30k characters) 0 %



of the cases. Randomly selected and reviewed samples of both case types are well
annotated and semantically and structurally related. Therefore, also given its large
size, we consider this corpus of good quality for Persian plagiarism detection.


3     Cross-language Text Alignment Corpora

In this section we study the submitted cross-language corpora. Each subsection title
gives the name of the team and the source-suspicious document language pair employed.
As with the monolingual plagiarism cases not in English, we used Google Translate on
the following CL text alignment corpora to validate topical and structural relatedness.


3.1   asghari15 - English-Persian

This is a considerably large corpus for CL English-Persian plagiarism detection (see
the caption of Table 7). It consists of documents with encyclopedic content. The
authors generated all the plagiarism cases using obfuscation (we assume by means of
translation) and divided the level of obfuscation into three types: low, medium, and
high. No further details were provided about how this obfuscation and translation were
performed. However, the manual analysis of several random samples showed that topical
and structural relatedness was maintained in the CL plagiarism cases, and their quality
is high enough to consider this corpus for benchmarking English-Persian plagiarism
detection.


3.2   hanif15 - Urdu-English

Table 8 shows the statistics of this Urdu-English plagiarism detection corpus. The
corpus was created using three types of obfuscation by means of manual Urdu-English
translation. Unfortunately, the tags employed in the XML annotation files do
Table 8. Corpus statistics for 500 documents and 135 plagiarism cases in hanif15’s Urdu-English
corpus.


Document statistics
  Document purpose:        source documents 50 %; suspicious with plagiarism 27 %; suspicious w/o plagiarism 23 %
  Plagiarism per document: hardly (5%-20%) 90 %; medium (20%-50%) 9 %; much (50%-80%) 1 %; entirely (>80%) 0 %
  Document length:         short (1-30k characters) 99 %; medium (30k-300k characters) 1 %; long (300k-3M characters) 0 %

Plagiarism case statistics
  Type of case: translated (Urdu to English) 100 %
  Case length:  short (50-1k characters) 100 %; medium (1k-3k characters) 0 %; long (3k-30k characters) 0 %




not make clear what the real difference between these types is. Manual analysis of
several random cases gave an average impression of the corpus: there are semantically
unrelated cases, but correct instances outnumber them. However, we also found some
minor typos in the English text, as well as some cases that start mid-word or at the
last word of a sentence. A future revision fixing these errors could turn this into an
interesting corpus for benchmarking Urdu-English plagiarism detection.



4   Conclusions


In this paper we evaluated the quality of the corpora submitted to the PAN 2015 shared
task on text alignment. Among the eight evaluated corpora, seven used some obfuscation
strategy to generate their plagiarism cases, five also included verbatim cases, and
three contained real plagiarism cases as well. The preferred obfuscation method was
random obfuscation, followed by synonym substitution. Most of the documents and
plagiarism cases were short; medium-length documents and cases appeared in small
amounts, and the corpus authors discarded the use of long ones. In general, suspicious
documents contained hardly any plagiarism, followed by documents with a medium amount;
only two corpora contained a percentage of documents with much plagiarism. Although
English was the most used language (six corpora), the contributions in other languages
are highly appreciated, and some of them show a remarkable effort to create high-quality
corpora for evaluating these languages. It is encouraging to see the high acceptance of
this new initiative of allowing participants to submit corpora for text alignment.
Future editions will require a short summary of the strategies and methodology employed
to create the plagiarism cases, in order to ease the evaluation of the corpora. We will
also work to include statistics on the approximate number of errors per reviewed corpus.
References
1. Clough, P., et al.: Old and new challenges in automatic plagiarism detection.
   National Plagiarism Advisory Service (2003), http://ir.shef.ac.uk/cloughie/index.html
2. Franco-Salvador, M., Gupta, P., Rosso, P.: Cross-language plagiarism detection using a multi-
   lingual semantic network. In: Proc. of the 35th European Conference on Information Retrieval
   (ECIR’13). pp. 710–713. LNCS(7814), Springer-Verlag (2013)
3. Franco-Salvador, M., Gupta, P., Rosso, P.: Knowledge graphs as context models: Improv-
   ing the detection of cross-language plagiarism with paraphrasing. In: Ferro, N. (ed.) Bridg-
   ing Between Information Retrieval and Databases, Lecture Notes in Computer Science,
   vol. 8173, pp. 227–236. Springer Berlin Heidelberg (2014), http://dx.doi.org/10.1007/978-3-642-54798-0_12
4. Maurer, H.A., Kappe, F., Zaka, B.: Plagiarism-a survey. J. UCS 12(8), 1050–1084 (2006)
5. Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection.
   Language Resources and Evaluation 45(1), 45–62 (2011)
6. Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview
   of the 6th international competition on plagiarism detection. In: Working Notes for CLEF
   2014 Conference, Sheffield, UK, September 15-18, 2014. pp. 845–876 (2014)
7. Potthast, M., Hagen, M., Göring, S., Rosso, P., Stein, B.: Towards Data Submissions
   for Shared Tasks: First Experiences for the Task of Text Alignment. In: Working Notes
   Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings, CLEF and
   CEUR-WS.org (Sep 2015), http://www.clef-initiative.eu/publication/working-notes