<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PAN 2015 Shared Task on Plagiarism Detection: Evaluation of Corpora for Text Alignment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Franco-Salvador</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Imene Bensalem</string-name>
          <email>bens.imene@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Flores</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parth Gupta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Constantine 2 University</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>In this paper we describe and evaluate the corpora submitted to the PAN 2015 shared task on plagiarism detection for text alignment. We received mono- and cross-language corpora in the following languages and language pairs: English, Persian, Chinese, Urdu-English, and English-Persian. We devote a separate section to each submitted corpus, including statistics, a discussion of the obfuscation techniques employed, and an assessment of corpus quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Plagiarism detection</kwd>
        <kwd>Text re-use detection</kwd>
        <kwd>Cross-language</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Corpus construction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Plagiarism detection [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ] refers to automatically identifying the plagiarized fragments of
a suspicious document within a set of source documents. When the source of plagiarism is
in a different language, we speak of cross-language (CL) plagiarism detection [
        <xref ref-type="bibr" rid="ref2 ref3 ref5">5, 2, 3</xref>
        ].
Since 2012, the Uncovering Plagiarism, Authorship and Social Software Misuse (PAN)
CLEF Lab has organized the shared task on plagiarism detection, which is divided into
two subtasks: source retrieval and text alignment [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Given a suspicious document
and a web search API, the source retrieval subtask consists of retrieving all plagiarized
sources while minimizing retrieval costs. Given a pair of documents, the text alignment
subtask consists of identifying all contiguous maximal-length passages of plagiarized
text between them.
      </p>
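      <p>For concreteness, plagiarism cases in the PAN text alignment corpora are annotated
in XML files that pair a character span of the suspicious document with a character span
of the source document. The following minimal Python sketch parses one such annotation;
the attribute names follow the usual PAN corpus layout, and the document names and values
are hypothetical.</p>
      <preformat><![CDATA[
import xml.etree.ElementTree as ET

# A minimal PAN-style annotation: one plagiarism case linking a span of the
# suspicious document to a span of the source document (names are hypothetical).
xml_text = """
<document reference="suspicious-document00001.txt">
  <feature name="plagiarism" type="artificial" obfuscation="random"
           this_offset="312" this_length="1024"
           source_reference="source-document00042.txt"
           source_offset="128" source_length="998"/>
</document>
"""

root = ET.fromstring(xml_text)
for case in root.iter("feature"):
    if case.get("name") != "plagiarism":
        continue  # annotation files may also carry other feature types
    susp = (int(case.get("this_offset")),
            int(case.get("this_offset")) + int(case.get("this_length")))
    src = (int(case.get("source_offset")),
           int(case.get("source_offset")) + int(case.get("source_length")))
    print(case.get("source_reference"), "suspicious span:", susp, "source span:", src)
]]></preformat>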
      <p>The PAN 2015 subtask on text alignment offered a new challenge to participants:
the submission of corpora. This new initiative obtained considerably high acceptance,
with a total of six participating teams and eight submissions. The participants applied
different obfuscation techniques to text pairs, or collected real plagiarism fragments,
in order to generate the plagiarism cases of the corpora. Eight corpora have been
submitted: six monolingual (Chinese, Persian and four English) and two CL corpora
(Urdu-English and English-Persian). Evaluating whether a submitted corpus is suitable
for evaluation purposes requires an in-depth analysis of its content. Therefore, in this
paper, we report on our manual assessment of the submitted corpora with regard to the
quality and realism of the plagiarism cases.</p>
    </sec>
    <sec id="sec-2">
      <title>Monolingual Text Alignment Corpora</title>
      <p>In this first part we study the submitted monolingual corpora. Each subsection title
corresponds to the name of the team and the language of the plagiarism cases. The PAN
2015 shared subtask on text alignment encouraged participants to submit corpora in
languages with fewer plagiarism detection resources than English. For the analysis of
the plagiarism cases, we used Google Translate to convert randomly selected cases to
English in order to verify that the plagiarized fragment and the suspicious document
share the same topic and structure. Throughout this paper, we employed four reviewers
and an average of eight cases per dataset and reviewer; random cases were selected
independently for each reviewer.</p>
      <sec id="sec-2-1">
        <title>cheema15 - English</title>
        <p>The corpus statistics are shown in Table 1. The whole corpus consists of English
paraphrasing cases. PhD, MSc and undergraduate students collaborated with the authors
to manually generate and annotate the cases. We found some forced substitutions
(e.g. “PC Project” replaced by “computer program”), as well as minor issues that have
little impact on plagiarism detection, e.g. source and suspicious documents starting
mid-sentence or mid-word. Overall, the manual study of several random samples left a
positive impression of the plagiarism cases and of the corpus' usability for
evaluation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>alvi15 - English</title>
        <p>The authors of this English corpus employed three types of plagiarism (see Table 2):
verbatim, obfuscation and real plagiarism cases. The first type simply inserts copies of
fragments of a source into a suspicious document. The obfuscation cases automatically
replace words with synonyms and nouns with pronouns; however, some cases lose semantic
relatedness, e.g. “already big enough to speak” replaced by “already great adequate to
say”. The authors also used character substitution for this type of plagiarism. The real
plagiarism cases, extracted from the Bible, show a high level of manual modification
while preserving the sense. On the other hand, we found some errors in the encoding of
the XML files of the corpus: wrong case offsets (with starting points at mid-word), and
the attribute “type” set to “real” in all cases, instead of only in the real plagiarism
cases with “artificial” for the rest. Despite these errors, the overall opinion about
this corpus is positive, especially regarding the real plagiarism cases. The quality of
the corpus could be increased in future versions.</p>
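        <p>Offset errors of this kind are easy to screen for automatically. Below is a minimal
sketch, assuming plain-text documents and character offsets as in the annotation files,
that flags cases whose boundaries fall inside a word; the example document and span are
hypothetical.</p>
        <preformat><![CDATA[
def flag_midword(text: str, offset: int, length: int) -> list[str]:
    """Return warnings if a case's span starts or ends mid-word."""
    issues = []
    end = offset + length
    # A boundary falls mid-word when the characters on both sides are alphanumeric.
    if 0 < offset < len(text) and text[offset - 1].isalnum() and text[offset].isalnum():
        issues.append(f"span starts mid-word at {offset}")
    if 0 < end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
        issues.append(f"span ends mid-word at {end}")
    return issues

# The span text[37:41] ("zy d") starts inside "lazy" and ends inside "dog".
doc = "the quick brown fox jumps over the lazy dog"
print(flag_midword(doc, 37, 4))
]]></preformat>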
      </sec>
      <sec id="sec-2-3">
        <title>palkovskii15 - English</title>
        <p>As shown in Table 3, this corpus is composed of English verbatim cases and automatic
obfuscation cases of three types: random, translation and summary. The random obfuscation
is quantified by degrees that measure the level of automatic obfuscation, applied through
random word reordering. The translation obfuscation cases used a chain of translators
across ten intermediate languages, employing the MyMemory (https://mymemory.translated.net/),
Google (https://translate.google.com/) and Bing (https://www.bing.com/translator/)
translators. The summary obfuscation cases were created by means of an automatic
summarization tool. The manual analysis of several cases left an average-to-negative
impression about the quality of the corpus for practical usage. It seems that the high
level of random obfuscation, the chain of translators and the unspecified summarization
tool produced a high number of senseless text fragments and unrelated cases. Finally,
we found overlaps between this corpus and the PAN 2013 text alignment corpus
(http://www.uni-weimar.de/medien/webis/events/pan-13/pan13-web/plagiarism-detection.html);
e.g. suspicious-document00005 and source-document01090 are present in both corpora.</p>
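        <p>To illustrate the random obfuscation, the sketch below reorders a controllable
fraction of the words in a passage. The degree parameter is our own approximation of
the (unspecified) degrees used by the authors; at high values the output quickly becomes
the kind of senseless fragment described above.</p>
        <preformat><![CDATA[
import random

def random_reorder(text: str, degree: float, seed: int = 0) -> str:
    """Shuffle a fraction `degree` (0..1) of the word positions in `text`."""
    rng = random.Random(seed)
    words = text.split()
    positions = rng.sample(range(len(words)), int(len(words) * degree))
    shuffled = positions[:]
    rng.shuffle(shuffled)
    out = words[:]
    # Permute the selected words among the selected positions.
    for src, dst in zip(positions, shuffled):
        out[dst] = words[src]
    return " ".join(out)

sentence = "the committee approved the proposal after a long debate"
print(random_reorder(sentence, degree=0.3))  # mildly reordered
print(random_reorder(sentence, degree=0.9))  # mostly scrambled
]]></preformat>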
      </sec>
      <sec id="sec-2-4">
        <title>mohtaj15 - English</title>
        <p>This English corpus (see Table 4) contains plagiarism cases of three types: verbatim,
random and manual obfuscation. Random obfuscation is performed at two levels (low and
high), with more word reordering and synonym substitution at the second. We observed
that, especially at the high level, this type includes senseless and semantically
unrelated cases. The manual obfuscation cases underwent manual paraphrasing and are in
general suitable for plagiarism detection evaluation. The random obfuscation should be
improved in order to obtain a representative corpus for evaluation.</p>
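        <p>As a rough reconstruction of synonym-substitution obfuscation, and without knowing
which lexical resource the authors actually used, the following sketch replaces words
with WordNet synonyms via NLTK. Because it performs no word-sense disambiguation, it
naturally produces the kind of semantic drift observed in the corpus.</p>
        <preformat><![CDATA[
import random
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def substitute_synonyms(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Replace roughly `rate` of the words that have a one-word WordNet synonym."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        lemmas = [l.name() for s in wn.synsets(word) for l in s.lemmas()
                  if l.name().lower() != word.lower() and "_" not in l.name()]
        if lemmas and rng.random() < rate:
            out.append(rng.choice(lemmas))  # no sense disambiguation: meaning can drift
        else:
            out.append(word)
    return " ".join(out)

print(substitute_synonyms("the big dog ran across the field"))
]]></preformat>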
      </sec>
      <sec id="sec-2-5">
        <title>kong15 - Chinese</title>
        <p>The corpus of Table 5 is formed by real plagiarism cases in Chinese. Unfortunately,
the XML files do not contain information about the strategy employed; therefore, it is
impossible to determine how the real cases were created. In addition, the manual analysis
of several cases showed that there is no topic or structural relatedness between the
annotated cases; possibly some error occurred in the offset tagging. Note also the low
number of suspicious documents, which may produce non-significant results when using
this corpus for evaluation.</p>
      </sec>
      <sec id="sec-2-6">
        <title>khoshnava15 - Persian</title>
        <p>The corpus of Table 6 is formed by Persian verbatim and random obfuscation cases.
Despite the scarce information about how the corpus was created, we note the high quality
of the cases. Randomly selected and revised samples of both types of cases are well
annotated and semantically and structurally related. Therefore, also given its large
size, we consider this corpus of good enough quality to be used for Persian plagiarism
detection.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Cross-language Text Alignment Corpora</title>
      <p>In this section we study the submitted cross-language corpora. Each subsection title
corresponds to the name of the team and the source-suspicious document language pair
employed. As for the monolingual plagiarism cases not in English, in the following CL
text alignment corpora we used Google Translate to validate the topic and structural
relatedness.</p>
      <sec id="sec-3-1">
        <title>asghari15 - English-Persian</title>
        <p>This is a considerably large corpus for CL English-Persian plagiarism detection (see
the Table 7 caption). It is formed by documents with encyclopedic content. The authors
generated all the plagiarism cases using obfuscation (we assume by means of translation)
and divided the level of obfuscation into three types: low, medium and high. No further
details have been provided about how this obfuscation and translation were performed.
However, the manual analysis of several random samples showed that the topic and
structural relatedness have been maintained in the CL plagiarism cases, and their quality
is high enough to consider this corpus for benchmarking English-Persian plagiarism
detection.</p>
      </sec>
      <sec id="sec-3-2">
        <title>hanif15 - Urdu-English</title>
        <p>Table 8 shows the statistics of this Urdu-English plagiarism detection corpus. The
corpus has been created using three types of obfuscation by means of manual Urdu-English
translation. Unfortunately, the tags employed in the XML annotation files do not allow
us to understand the real difference between these types. Manual analysis of several
random cases gave an average impression of the corpus: there are semantically unrelated
cases, but the number of correct instances is higher. However, we also found some minor
typos in the English writing, in addition to some cases that start mid-word or at the
last word of a sentence. A future revision of the corpus fixing these errors could yield
an interesting corpus for benchmarking Urdu-English plagiarism detection.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper we evaluated the quality of the corpora submitted to the PAN 2015 shared
task on text alignment. Among the eight evaluated corpora, seven used some obfuscation
strategy to generate their plagiarism cases, five also used verbatim cases, and three
contained real plagiarism cases as well. The preferred obfuscation method was random
obfuscation, followed by synonym substitution. Most of the documents and plagiarism cases
used were short; documents and cases of average length were present in small numbers, and
the corpus authors avoided long ones. In general, suspicious documents contained few
plagiarism cases, followed by documents with an average amount of them; only two corpora
contained a percentage of documents with much plagiarism. Although English was the most
used language (in six corpora), the contributions in other languages are highly
appreciated, and some of them denote a remarkable effort to create high-quality corpora
for evaluating these languages. It is encouraging to see the high acceptance of this new
initiative of allowing participants to submit new corpora for text alignment. Future
editions will require a short summary of the strategies and methodology employed to
create the plagiarism cases, in order to ease the evaluation of the corpora. We will also
work to include statistics about the approximate number of errors per reviewed
corpus.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Old and new challenges in automatic plagiarism detection</article-title>
          .
          <source>In: National Plagiarism Advisory Service</source>
          ,
          <year>2003</year>
          , http://ir.shef.ac.uk/cloughie/index.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Cross-language plagiarism detection using a multilingual semantic network</article-title>
          .
          <source>In: Proc. of the 35th European Conference on Information Retrieval (ECIR'13)</source>
          . pp.
          <fpage>710</fpage>
          -
          <lpage>713</lpage>
          .
          <source>LNCS(7814)</source>
          , Springer-Verlag (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Knowledge graphs as context models: Improving the detection of cross-language plagiarism with paraphrasing</article-title>
          . In: Ferro, N. (ed.)
          <source>Bridging Between Information Retrieval and Databases, Lecture Notes in Computer Science</source>
          , vol.
          <volume>8173</volume>
          , pp.
          <fpage>227</fpage>
          -
          <lpage>236</lpage>
          . Springer Berlin Heidelberg (
          <year>2014</year>
          ), http://dx.doi.org/10.1007/978-3-642-54798-0_12
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaka</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Plagiarism-a survey</article-title>
          .
          <source>J. UCS</source>
          <volume>12</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1050</fpage>
          -
          <lpage>1084</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barro´</surname>
            n-Ceden˜o,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Cross-language plagiarism detection</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>45</fpage>
          -
          <lpage>62</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th international competition on plagiarism detection</article-title>
          .
          <source>In: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18</source>
          ,
          <year>2014</year>
          . pp.
          <fpage>845</fpage>
          -
          <lpage>876</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Göring</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Towards Data Submissions for Shared Tasks: First Experiences for the Task of Text Alignment</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2015</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>