<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Corpus for Analyzing Text Reuse by People of Different Groups</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author"><string-name>Waqas Arshad Cheema</string-name></contrib>
        <contrib contrib-type="author"><string-name>Fahad Najib</string-name></contrib>
        <contrib contrib-type="author"><string-name>Shakil Ahmed</string-name></contrib>
        <contrib contrib-type="author"><string-name>Syed Husnain Bukhari</string-name></contrib>
        <contrib contrib-type="author"><string-name>Abdul Sittar</string-name></contrib>
        <contrib contrib-type="author"><string-name>Rao Muhammad Adeel Nawab</string-name></contrib>
        <aff id="aff0">
          <institution>Department of Computer Science, COMSATS Institute of Information Technology</institution>
          ,
          <addr-line>Lahore</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Plagiarism, the unattributed reuse of text, is a significant problem, particularly for higher education institutions. Consequently, a number of automated plagiarism detection systems have been developed to address this problem. Comparing these systems is difficult due to the problems involved in collecting real cases of plagiarism by students and scholars. This paper describes the development of a corpus containing simulated cases of plagiarism produced by people with different levels of writing skill. This corpus will be a valuable addition to the set of evaluation resources presently available for the comparison of plagiarism detection systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The unacknowledged reuse of information is generally known as plagiarism [9].
Plagiarism is acknowledged as a significant &amp; increasing problem in higher education [7]
[11] [20] [12] [5]. As a result, plagiarism &amp; its detection have recently received much
attention [1] [8] [21], and higher education institutions now use automated systems
to detect plagiarism in students' and scholars' work. Numerous approaches to plagiarism
detection are available [2] [19]. However, one of the barriers preventing a comparison
among techniques is the lack of a standardised evaluation resource.</p>
      <p>This corpus will be a valuable addition to the set of existing corpora for the
plagiarism detection task. It (1) can be used for the comparison &amp; evaluation of
different plagiarism detection techniques, (2) will support further research in the
field, and (3) will help in understanding the strategies used by students and scholars
when they plagiarise.</p>
      <p>The aim of this corpus collection is to investigate how text is reused by students
and scholars while writing an article, and to determine whether algorithms can be
devised to detect and quantify such reuse automatically. It is hoped that the results will
generalise beyond text reuse &amp; plagiarism in academia and provide broader insights
into the nature of text derivation and paraphrase; but the selected scenario provides an
ideal initial case study, and one with considerable potential practical application.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>There can be three types of plagiarism examples in a benchmark corpus: (1)
artificial plagiarism (automatically generated plagiarised documents), (2) simulated
plagiarism (plagiarised documents created manually by humans), and (3) real cases of
plagiarism [17]. The construction of a benchmark corpus containing real cases of
plagiarism is difficult due to confidentiality issues [4]. The research community has
therefore constructed corpora containing artificial examples of plagiarism [18],
simulated examples of plagiarism [3], and corpora containing both simulated and
artificial cases of plagiarism [17].</p>
      <p>A number of corpora have been constructed for the evaluation of state-of-the-art
plagiarism detection techniques. An outstanding effort in developing plagiarism corpora
is the series of PAN International Competitions on Plagiarism Detection
(http://pan.webis.de/), held as evaluation labs at the CLEF conferences
(http://clef2015.clef-initiative.eu/CLEF2015/). A number of benchmark corpora were
generated as an outcome of this series of competitions [18] [17] [14] [16] [15]. Both
mono-lingual and cross-lingual examples of plagiarism are present in these corpora: 90%
are mono-lingual and the remaining 10% are cross-lingual. The distribution of plagiarised
and non-plagiarised examples is uniform, i.e. 50% of the documents in each corpus are
plagiarised and the remaining 50% are non-plagiarised. The plagiarised documents were
created using different techniques: (1) artificial (automatically generated documents,
further categorised into none, low and high obfuscation), (2) simulated (plagiarised
documents written by humans to simulate plagiarism), (3) cyclic translation (original
English text translated into different languages using automated tools and then
translated back into English), and (4) summarisation (the original text summarised to
create the plagiarised text). The plagiarism cases vary in length from short passages to
very long passages. The mono-lingual plagiarism cases are written in English.</p>
      <p>The Short Answer Corpus [3] contains examples of simulated plagiarism. It was
created by asking participants to answer five questions on different topics from the
Computer Science domain. Each participant answered each of the five questions only once,
producing both non-plagiarised and plagiarised documents; each answer consists of
200-300 words. All documents (answers to the questions) in this corpus were created
manually. The corpus contains a total of 100 documents: 95 suspicious documents and 5
source Wikipedia (http://www.wikipedia.org/) articles. Of the 95 suspicious documents,
57 are plagiarised with different levels of rewrite (near copy = 19, light revision = 19
and heavy revision = 19) and the remaining 38 are non-plagiarised.</p>
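      <p>For clarity, the composition of the Short Answer Corpus described above can be checked with a short sketch (the counts are those reported in [3]; the dictionary layout is ours, for illustration only):</p>

```python
# Composition of the Short Answer Corpus [3]; counts as reported above.
short_answer = {
    "source": 5,           # source Wikipedia articles
    "near_copy": 19,       # plagiarised: near copy
    "light_revision": 19,  # plagiarised: light revision
    "heavy_revision": 19,  # plagiarised: heavy revision
    "non_plagiarised": 38,
}

plagiarised = (short_answer["near_copy"]
               + short_answer["light_revision"]
               + short_answer["heavy_revision"])
suspicious = plagiarised + short_answer["non_plagiarised"]
total = suspicious + short_answer["source"]
print(plagiarised, suspicious, total)  # 57 95 100
```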
      <p>Plagiarism is an unacceptable type of text reuse, but other forms of text reuse
are acceptable, for example the reuse of news agency text by newspapers. The METER
corpus (http://nlp.shef.ac.uk/meter/) [6] is another benchmark corpus, built mainly for
the study of text reuse in journalism; however, it can also be used for the evaluation
of plagiarism detection systems. The METER corpus contains a total of 1,716 documents:
771 are Press Association (PA) articles and the remaining 945 are news stories published
by nine different British newspapers. Each news story (suspicious document) was manually
examined to assess the level of text reuse and, based on the amount of text reused from
the PA article (potential source document), classified at the document level as: (1)
Wholly Derived (301 news stories), (2) Partially Derived (438 news stories), and (3)
Non-derived (206 news stories).</p>
      <p>All of the above-mentioned corpora contain documents with different levels of
rewrite. However, they do not categorise documents on the basis of the writer's level
of writing skill. To the best of our knowledge, no standard evaluation resource is
available for studying the variation in text rewritten by groups of people with
different writing skills.</p>
    </sec>
    <sec id="sec-3">
      <title>Corpus Creation Process</title>
      <sec id="sec-3-1">
        <title>Fragment Generation</title>
        <p>Previous studies have shown that detecting paraphrased plagiarism is a difficult task and
an open challenge [10] [14]. The proposed corpus aims to collect paraphrased examples
of plagiarism from participants, i.e. the collection contains simulated cases of plagiarism.</p>
        <p>In previous studies, simulated examples were generated by university students [3]
or by paid workers on Amazon Mechanical Turk [13]. However, none of these corpora
contain paraphrased examples of plagiarism generated by different groups of people. This
study aims to collect paraphrased examples of plagiarism from different groups. We
selected the following four groups:
i. Undergrad in progress: students of an undergraduate program who have not yet written
a final year project report.
ii. Undergrad: people who have completed an undergraduate degree and have written a
report for their final year project. This group also includes students of a Masters
program who have written the report for their undergraduate final year project but have
not yet written their Master's thesis.
iii. Masters: people who have completed a Masters degree and have written a Masters
thesis. This group also includes students of a PhD program who have written their
Masters thesis but have not yet written their PhD thesis.
iv. PhD: people who have completed their PhD degree.</p>
        <p>Another important point is that participants were asked to select text from their
own research area, i.e. one in which they have sound knowledge and experience, because
efficiently paraphrasing a text requires domain knowledge. The participants were asked
to generate paraphrased plagiarism examples with different amounts of text, because
people may vary in the amount of text they reuse when plagiarising. The three size
variants were: small, medium and large.</p>
        <p>Documents were collected from three domains: (1) Technology, (2) Life Sciences,
and (3) Humanities, abbreviated in the XML annotation files as technology,
life_sciences, and humanities respectively. A total of 250 pairs of text fragments were
collected from the four groups of people. Table 1 shows detailed statistics of the text
fragment pairs. After generating the source-plagiarised fragment pairs, we collected 500
document pairs from Wikipedia (http://www.wikipedia.org/) and Project Gutenberg
(http://www.gutenberg.org/) on the same topics that were used in the fragment pairs.</p>
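        <p>As an illustration only, each collected fragment pair can be thought of as a record like the following (the field names are hypothetical and do not reflect the corpus's actual annotation schema):</p>

```python
from dataclasses import dataclass

@dataclass
class FragmentPair:
    """One source-plagiarised fragment pair (illustrative structure only)."""
    group: str             # "undergrad_in_progress", "undergrad", "masters" or "phd"
    domain: str            # "technology", "life_sciences" or "humanities"
    size: str              # "small", "medium" or "large"
    source_text: str       # original fragment chosen by the participant
    plagiarised_text: str  # the participant's paraphrase of it

pair = FragmentPair("masters", "technology", "small",
                    "Original passage chosen by the participant.",
                    "The participant's paraphrased version.")
print(pair.domain)  # technology
```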
      </sec>
      <sec id="sec-3-2">
        <title>Generating Text Alignment Corpus</title>
        <p>The proposed corpus contains a total of 1,000 documents (500 source documents
and 500 suspicious documents). Of these 500 document pairs, 250 are plagiarised and the
remaining 250 are non-plagiarised. The 250 plagiarised pairs were created by inserting
text fragments into documents; exactly one source-plagiarised fragment pair was inserted
into each source-suspicious document pair. Source-plagiarised fragment pairs belonging
to the Technology domain were inserted into source-suspicious document pairs of the same
domain, i.e. Technology, and likewise for the other domains. Table 2 presents the
domain-wise statistics of fragment pairs in the corpus. To collect the corpus while
ensuring authenticity and minimal dependence on the contributors, three types of
contributors were selected for our study: (1) family, (2) colleagues and friends, and
(3) university students. Note that all contributors were volunteers and were not paid
for the data collection.</p>
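        <p>A minimal sketch of the insertion step, assuming each fragment is placed at a random paragraph boundary (the actual insertion positions used in the corpus are not specified here):</p>

```python
import random

def insert_fragment(document: str, fragment: str, rng: random.Random) -> str:
    """Insert a text fragment between two paragraphs of a document."""
    paragraphs = document.split("\n\n")
    pos = rng.randint(0, len(paragraphs))
    return "\n\n".join(paragraphs[:pos] + [fragment] + paragraphs[pos:])

# The source fragment goes into the source document and its paraphrase
# into the corresponding suspicious document of the same domain.
rng = random.Random(0)
suspicious_doc = insert_fragment("First paragraph.\n\nSecond paragraph.",
                                 "Paraphrased fragment.", rng)
```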
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Peer Review</title>
      <p>In this section we review the other participants' corpora. We review only those
corpora which are entirely in English, i.e. both the source and suspicious documents are
in English, as we are unable to review corpora in languages unfamiliar to us. We
randomly inspected some of the document pairs of each corpus and report our observations
here. Alvi15's corpus consists of the obfuscation strategies 'non plagiarized', 'human
retelling', 'synonym replacement' and 'character substitution'. In the 'non plagiarized'
class, we could not find any matching pairs, as is supposed to be the case. In the
'human retelling' class, the text inserted into the source document has been paraphrased
in the suspicious document. In the 'synonym replacement' category, most of the words in
the inserted text have been replaced by their synonyms. In 'character substitution', the
substitution we found most often is 'the' replaced by 'thy' at some points in the
corpus. Mohtaj15's corpus comprises the obfuscation strategies 'non plagiarized',
'no-obfuscation', 'random obfuscation', and 'simulated obfuscation'. In the 'non
plagiarized' class, we could not find any matching strings. In 'no-obfuscation', text
has been inserted into the document pairs with no obfuscation, i.e. the inserted text is
exactly the same in both the source and suspicious documents. In 'random obfuscation',
the text has been randomly obfuscated, i.e. the words of the matched string have been
reordered randomly, making it grammatically ill-formed and meaningless. In 'simulated
obfuscation', the text inserted into the document pairs has been paraphrased.
Palkovskii15's corpus consists of the obfuscation strategies 'non plagiarized',
'no-obfuscation', 'random obfuscation', 'translation obfuscation' and 'summary
obfuscation'. In the 'non plagiarized' class, again we could not find any matching text
in the observed pairs. Again, in 'no-obfuscation', text has been inserted into the
document pairs with no change at all. Similarly, in 'random obfuscation', the words of
the matched string have been reordered randomly. In 'summary obfuscation', the source
document is essentially a short summary of the suspicious document. In 'translation
obfuscation', the inserted text has been paraphrased in the suspicious document.
Overall, we found all three corpora to be error-free and realistic.</p>
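      <p>Two of the simpler obfuscation strategies mentioned above can be sketched roughly as follows (a rough approximation for illustration, not the corpus builders' actual code):</p>

```python
import random

def random_obfuscation(text: str, seed: int = 0) -> str:
    """Randomly reorder the words; the result is usually ungrammatical."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def character_substitution(text: str) -> str:
    """Replace the word 'the' with 'thy', as observed in Alvi15's corpus."""
    return text.replace("the ", "thy ")

print(character_substitution("the quick brown fox"))  # thy quick brown fox
```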
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper described the construction of a new corpus for text reuse &amp; plagiarism
detection research. The corpus contains examples of simulated plagiarism and was created
manually. It is available to others for the evaluation of techniques developed for
plagiarism &amp; text reuse detection, and it allows a much deeper analysis of the different
strategies used by people with different levels of education.</p>
      <p>In the future, we plan to gather more document pairs to increase the size of the
corpus. We will also apply &amp; evaluate different text reuse &amp; plagiarism detection
techniques on this corpus.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We thank all the volunteers for their contribution to the corpus construction.</p>
      <p>18. Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st
International Competition on Plagiarism Detection. In: Stein, B., Rosso, P., Stamatatos, E.,
Koppel, M., Agirre, E. (eds.) SEPLN 09 Workshop on Uncovering Plagiarism, Authorship,
and Social Software Misuse (PAN 09). pp. 1-9. CEUR-WS.org (Sep 2009),
http://ceur-ws.org/Vol-502</p>
      <p>19. White, D.R., Joy, M.S.: Sentence-based natural language plagiarism detection. Journal on
Educational Resources in Computing (JERIC) 4(4), 2 (2004)</p>
      <p>20. Zobel, J.: Uni cheats racket: A case study in plagiarism investigation. In: Proceedings of the
Sixth Australasian Conference on Computing Education - Volume 30. pp. 357-365.
Australian Computer Society, Inc. (2004)</p>
      <p>21. Zu Eissen, S.M., Stein, B.: Intrinsic plagiarism detection. In: Advances in Information
Retrieval, pp. 565-569. Springer (2006)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Boisvert</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irwin</surname>
            ,
            <given-names>M.J.:</given-names>
          </string-name>
          <article-title>Plagiarism on the rise</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>49</volume>
          (
          <issue>6</issue>
          ),
          <fpage>23</fpage>
          -
          <lpage>24</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Plagiarism in natural and programming languages: an overview of current tools and technologies</article-title>
          .
          <source>Research Memoranda: CS-00-05</source>
          , Department of Computer Science, University of Sheffield, UK pp.
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Developing a corpus of plagiarised short answers</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>24</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Old and new challenges in automatic plagiarism detection</article-title>
          .
          <source>In: National Plagiarism Advisory Service</source>
          ,
          <year>2003</year>
          ; http://ir. shef. ac. uk/cloughie/index. html.
          <source>Citeseer</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Culwin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lancaster</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Plagiarism issues for higher education</article-title>
          .
          <source>Vine</source>
          <volume>31</volume>
          (
          <issue>2</issue>
          ),
          <fpage>36</fpage>
          -
          <lpage>41</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilks</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arundel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piao</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The meter corpus: a corpus for analysing journalistic text reuse</article-title>
          .
          <source>In: Proceedings of the Corpus Linguistics 2001 Conference</source>
          . pp.
          <fpage>214</fpage>
          -
          <lpage>223</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Judge</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Plagiarism: Bringing economics and education together (with a little help from IT)</article-title>
          .
          <source>Computers in Higher Education Economics Reviews (Virtual edition)</source>
          <volume>20</volume>
          ,
          <fpage>21</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lyon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malcolm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dickerson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Detecting short passages of similar text in large document collections</article-title>
          .
          <source>In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>118</fpage>
          -
          <lpage>125</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Plagiarism: a misplaced emphasis</article-title>
          .
          <source>Journal of Information Ethics</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <fpage>36</fpage>
          -
          <lpage>47</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaka</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Plagiarism-a survey</article-title>
          .
          <source>J. UCS</source>
          <volume>12</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1050</fpage>
          -
          <lpage>1084</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>McCabe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Research report of the center for academic integrity</article-title>
          . http://www.academicintegrity.org (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>In other (people's) words: Plagiarism by university students - literature and lessons</article-title>
          .
          <source>Assessment &amp; Evaluation in Higher Education</source>
          <volume>28</volume>
          (
          <issue>5</issue>
          ),
          <fpage>471</fpage>
          -
          <lpage>488</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 2nd international competition on plagiarism detection</article-title>
          . In: CLEF (Notebook Papers/LABs/Workshops) (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd international competition on plagiarism detection</article-title>
          .
          <source>In: Notebook Papers of CLEF 11 Labs and Workshops</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th international competition on plagiarism detection</article-title>
          . In: CLEF (Online Working Notes/Labs/Workshop) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th international competition on plagiarism detection</article-title>
          . In: CLEF (Online Working Notes/Labs/Workshop) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>An evaluation framework for plagiarism detection</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on computational linguistics: Posters</source>
          . pp.
          <fpage>997</fpage>
          -
          <lpage>1005</lpage>
          . Association for Computational Linguistics (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>