<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Study of Reuse and Plagiarism in Speech and Natural Language Processing papers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joseph Mariani</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gil Francopoulo</string-name>
          <email>gil.francopoulo@wanadoo.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Paroubek</string-name>
        </contrib>
        <aff>LIMSI, Université Paris-Saclay (France)</aff>
        <aff>Tagmatica (France)</aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>72</fpage>
      <lpage>83</lpage>
      <abstract>
        <p>The aim of this experiment is to present an easy way to compare fragments of texts in order to detect (supposed) results of copy &amp; paste operations between articles in the domain of Natural Language Processing, including Speech Processing (NLP). The search space of the comparisons is a corpus labelled NLP4NLP, which includes 34 different sources and gathers a large part of the publications in the NLP field over the past 50 years. This study considers the similarity between the papers of each individual source and the complete set of papers in the whole corpus, according to four different types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a source paper borrowing a fragment of text from another paper of the collection or, in the reverse direction, fragments of text from the source paper being borrowed and inserted into another paper of the collection. Everything starts with a copy &amp; paste and, of course, the flood of documents that we see today could not exist without the practical ease of copy &amp; paste. This is not new; what is new is that the availability of archives allows us to study a vast amount of papers in our domain (i.e. Natural Language Processing, both for written and spoken materials) and to assess the level of reuse and plagiarism in this area. Our work follows the various studies initiated in the workshop “Rediscovering 50 Years of Discoveries in Natural Language Processing”, held on the occasion of ACL's 50th anniversary in 2012 [Radev et al 2013], where a group of researchers studied the content of the corpus recorded in the ACL Anthology [Bird et al 2008]. Among these studies, one was devoted to reuse, and it is worth quoting Gupta and Rosso [Gupta et al 2012]: “It becomes essential to check the authenticity and the novelty of the submitted text before the acceptance. It becomes nearly impossible for a human judge (reviewer) to discover the source of the submitted work, if any, unless the source is already known. Automatic plagiarism detection applications identify such potential sources for the submitted work and based on it a human judge can easily take the decision”. Let us add that this subject is a specific and active domain, addressed yearly by the PAN international plagiarism detection competition. On our side, we also conducted a specific study of reuse and plagiarism in the papers published at the Language Resources and Evaluation Conference (LREC) from 1998 to 2014 [Francopoulo et al 2016]. Our aim is not to present the state of the art or to compare the various metrics and algorithms for reuse and plagiarism detection; see [Hoad et al 2003] [HaCohen-Kerner et al 2010] for instance. We position our work as extrinsic detection, the aim of which is to find near-matches between texts, as opposed to intrinsic detection, whose aim is to show that different parts of a presumably single-author text could not have been written by the same author [Stamatatos et al 2011a], [Stein et al 2011], [Bensalem et al 2014]. Our main objective is to address the entry level of the detection. The main question is: is there a meaningful difference between comparing the verbatim raw strings and comparing the result of a linguistic parsing? A secondary objective is to present and study a series of observations about the practices of our specific field.</p>
      </abstract>
      <kwd-group>
        <kwd>Plagiarism Detection</kwd>
        <kwd>Text reuse</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Speech Processing</kwd>
        <kwd>Scientometrics</kwd>
        <kwd>Informetrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Context</title>
    </sec>
    <sec id="sec-3">
      <title>Objectives</title>
      <p>Roughly speaking, and aside from the small corpora, one third of the documents comes from the ACL Anthology3, one third from the ISCA
Archive4 and one third from IEEE5.</p>
      <p>The detail of NLP4NLP is presented in Table 1, which lists, for each of the 34 sources (acl, acmtslp, alta, anlp, cath, cl, coling, conll, csal, eacl, emnlp, hlt, icassp, ijcnlp, inlg, isca, jep, lre, lrec, ltc, modulad, mts, muc, naacl, paclic, ranlp, sem, speechc, tacl, tal, taln, taslp, tipster, trec), the short name, the number of documents, the format (conference or journal), the language and the access to content (open or private). The total is 67,9376 documents, or 65,003 when duplicated papers are counted only once.</p>
      <p>
A phase of preprocessing has been applied to represent the various sources in a common format. This format
follows the organization of the ACL Anthology with two parts in parallel for each document: the metadata and
the content. Each document is labeled with a unique identifier, for instance “lrec2000_1” is reified on the hard
disk as two files: “lrec2000_1.bib” and “lrec2000_1.pdf”.</p>
      <p>For the metadata, we faced four different types of sources, with different flavors and character encodings:
BibTeX (e.g. the ACL Anthology), custom XML (e.g. TALN), database downloads (e.g. IEEE) or the HTML program
of the conference (e.g. TREC). We wrote a series of small Java programs to transform these metadata into a
common BibTeX format under UTF-8. Each file comprises the author names and the title, and is located in a
directory which designates the year and the corpus.</p>
      <p>Concerning the content, we faced different formats, possibly within the same corpus, and since the amount of documents
is huge, we could not determine the file type by hand for each document. To deal with this, we wrote a program to
detect the type and sub-type automatically, as follows:</p>
      <p>A small number of texts are in raw text format: we keep them as they are.</p>
      <p>The vast majority of the documents are in PDF format of different sub-types. First, we used PDFBox7
to determine the sub-type of the PDF content: when the content is a textual content, we use PDFBox</p>
      <sec id="sec-3-1">
        <title>3 http://aclweb.org/anthology</title>
        <p>4 www.isca-speech.org/iscaweb/index.php/archive/online-archive
5 https://www.ieee.org/index.html
6 In the case of a joint conference, the papers are counted twice. This number reduces to 65,003, if we count only once duplicated papers.
Similarly, the number of venues is 577 when all venues are counted, but this number reduces to 558 when the 19 joint conferences are
counted only once.
again to extract the text, possibly with the help of the “Legion of the Bouncy Castle”8 library to
decrypt the encrypted content. When the PDF is a text under the form of an image, we use PDFBox to extract the
images and then Tesseract OCR9 to transform the images into textual content.</p>
        <p>Then, and after some experiments, two filters are applied to avoid getting rubbish content:</p>
        <p>The content should be at least 900 characters long.</p>
        <p>
          The content should be of good quality. In order to evaluate this quality, the content is analyzed by the
morphological module of TagParser [Francopoulo 2007], a deep industrial parser based on a broad
English lexicon and Global Atlas (a knowledge base containing more than one million words from 18
Wikipedias) [
          <xref ref-type="bibr" rid="ref13">Francopoulo et al. 2013</xref>
          ] to detect out-of-the-vocabulary (OOV) words. Based on the
hypothesis that rubbish strings are OOV words, we retain a text when the ratio OOV / number of words
is less than 9%.
        </p>
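        <p>These two filters amount to a simple predicate over the extracted text. The sketch below is only illustrative: the vocabulary lookup is a hypothetical stand-in for TagParser's morphological module, which we cannot reproduce here.</p>

```python
def keep_extraction(text: str, vocabulary: set, min_chars: int = 900,
                    max_oov_ratio: float = 0.09) -> bool:
    """Apply the two quality filters described above: a minimum length
    of 900 characters and an out-of-vocabulary (OOV) ratio below 9%.
    `vocabulary` is a hypothetical stand-in for TagParser's lexicon."""
    if len(text) < min_chars:
        return False
    words = text.split()
    if not words:
        return False
    # Count words unknown to the lexicon; rubbish strings from bad
    # PDF extraction or OCR are assumed to fall in this category.
    oov = sum(1 for w in words if w.lower() not in vocabulary)
    return oov / len(words) < max_oov_ratio
```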
        <p>
          We then apply a set of symbolic rules to split the abstract, body and reference sections. The file is recorded in
XML. It should be noted that we experimented with other strategies, since we are able to
compare them through a quantitative evaluation of the quality, as explained before. The first experiment
was to use ParsCit10 [
          <xref ref-type="bibr" rid="ref12">Councill et al. 2008</xref>
          ], but the evaluated quality was poor, especially when the content
is not pure ASCII: the results on accented Latin strings, as well as Arabic and Russian content, were very bad. We also
tried Grobid11, but we did not succeed in running it correctly on Windows.
        </p>
        <p>A semi-automatic cleaning process was applied to the metadata in order to avoid false duplicates caused by
middle names (for “X Y Z”, is Y a second given name or the first part of the family name?); for this purpose,
we use the specific BibTeX convention where the given name is separated from the family name by a comma. Then
typographic variants (e.g. “Jean-Luc” versus “Jean Luc”, or “Herve” versus “Hervé”) were searched for in a tedious
process, and false duplicates were normalized and merged. The resulting number of distinct authors is
48,894.</p>
        <p>
          Figures are not extracted because we are unable to compare images. See [
          <xref ref-type="bibr" rid="ref14">Francopoulo et al 2015</xref>
          ] for more
details about the extraction process as well as the solutions for some tricky problems like joint conferences
management or abstract / body / reference sections detection.
        </p>
        <p>The majority (90%) of the documents come from conferences, the rest coming from journals. The overall
number of words is roughly 270M. Initially, the texts are in four languages: English, French, German and
Russian. The number of texts in German and Russian is less than 0.5%. They are detected automatically and are
ignored. The texts in French are a little bit more numerous (3%), and are kept with the same status as the English
ones. This is not a problem as our tool is able to process English and French.</p>
        <p>The corpus is a collection of documents of a single technical domain, which is NLP in the broad sense, and of
course, some conferences are specialized in certain topics like written language processing, spoken language
processing, including signal processing, information retrieval or machine translation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Definitions</title>
      <p>As the terminology is fuzzy and contradictory among the scientific literature, we need first to define four
important terms in order to avoid any misunderstanding.</p>
      <p>The term “self-reuse” is used for a copy &amp; paste when the source of the copy has an author who belongs to the
group of authors of the text of the paste and when the source is cited.</p>
      <p>The term “self-plagiarism” is used for a copy &amp; paste when the source of the copy has similarly an author who
belongs to the group of authors of the text of the paste, but when the source is not cited.</p>
      <p>The term “reuse” is used for a copy &amp; paste when the source of the copy has no author in the group of authors of
the paste and when the source is cited.</p>
      <p>The term “plagiarism” is used for a copy &amp; paste when the source of the copy has no author in the group of
authors of the paste and when the source is not cited.</p>
      <p>
        In other words, the terms “self-reuse” and “reuse” qualify a situation with a proper source citation, in
contrast to “self-plagiarism” and “plagiarism”. Note that although the term “self-plagiarism”
may seem contradictory, as authors should be free to reuse their own wordings, we use it because it is the
usual term within the plagiarism detection community - some authors also use the term “recycling”, for
instance [
        <xref ref-type="bibr" rid="ref22">HaCohen-Kerner et al 2010</xref>
        ].
      </p>
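      <p>These four definitions reduce to two binary criteria: whether the source and the borrowing paper share an author, and whether the source is cited. As an illustrative sketch (not the authors' code), the mapping can be written as:</p>

```python
def classify(shares_author: bool, cites_source: bool) -> str:
    """Map a detected copy & paste to one of the four categories
    defined above, from the two binary criteria of the paper."""
    if shares_author:
        return "self-reuse" if cites_source else "self-plagiarism"
    return "reuse" if cites_source else "plagiarism"
```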
    </sec>
    <sec id="sec-5">
      <title>Directions</title>
      <p>Another point to clarify concerns the expression “source papers”. As a convention, we call “focus” the corpus
corresponding to the source which is studied. The whole NLP4NLP collection is the “search space”. We examine</p>
      <sec id="sec-5-1">
        <title>7 https://pdfbox.apache.org</title>
        <p>8 http://www.bouncycastle.org/
9 https://code.google.com/p/tesseract-ocr
10 https://github.com/knmnyn/ParsCit
11 https://github.com/kermitt2/grobid
the copy &amp; paste operations in both directions: we study the configuration with a source paper borrowing
fragments of text from other papers of the NLP4NLP collection, in other words, a backward study, and we also
study in the reverse direction the fragments of the source paper being borrowed by papers of the NLP4NLP
collection, in other words, a forward study.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Algorithm</title>
      <p>
        Comparison of word sequences has proven to be an effective method for the detection of copy &amp; paste [
        <xref ref-type="bibr" rid="ref10">Clough et al
2002</xref>
        a] and, on several occasions, this method has won the PAN contest [
        <xref ref-type="bibr" rid="ref1">Barron-Cedeno et al 2010</xref>
        ], so we adopt
this strategy. In our case, the corpus is first processed with the deep NLP parser TagParser [Francopoulo 2007]
to produce a Passage format [
        <xref ref-type="bibr" rid="ref33">Vilnat et al 2010</xref>
        ] with lemma and part-of-speech (POS) indications.
The algorithm is as follows:
      </p>
      <p>For each document of the focus (the source corpus), all the sliding windows12 of lemmas (typically 5 to 7,
excluding punctuation) are built and recorded, under the form of character string keys, in an index local to
the document.</p>
      <p>An index gathering all these local indexes is built and is called the “focus index”.</p>
      <p>For each document outside the focus (i.e. outside the source corpus), all the sliding windows are built,
and only the windows contained in the focus index are recorded in an index local to this document. This
filtering optimizes the comparison phase, as there is no need to compare windows absent from
the focus index.</p>
      <p>
        Then, the keys are compared to compute a similarity overlap score [
        <xref ref-type="bibr" rid="ref24">Lyon et al 2001</xref>
        ] between
documents D1 and D2, using the Jaccard distance: score(D1, D2) = #shared windows / #union(D1
windows, D2 windows). The pairs of documents D1 / D2 are then filtered according to a threshold in order
to retain only pairs with a significant similarity score.
      </p>
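      <p>The core of the comparison can be sketched as follows. This is an illustrative re-implementation, not the authors' program: the real system first builds the focus index so that only windows present in the focus need to be compared.</p>

```python
def windows(lemmas, n=7):
    """All sliding windows of n lemmas (punctuation assumed removed
    upstream), each recorded as a character string key."""
    return {" ".join(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}

def score(d1, d2, n=7):
    """Jaccard similarity between two lemmatized documents:
    #shared windows / #union of windows."""
    w1, w2 = windows(d1, n), windows(d2, n)
    union = w1 | w2
    return len(w1 & w2) / len(union) if union else 0.0
```

A variant used by some authors replaces the union by min(#D1 windows, #D2 windows), which is faster since the union need not be computed; the paper keeps the Jaccard distance.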
    </sec>
    <sec id="sec-7">
      <title>Algorithm comments and evaluation</title>
      <p>In a first implementation, we compared the raw character strings with a segmentation based on spaces and
punctuation. But because the input is the result of PDF formatting, the texts may contain variable
caesura at line endings and other small textual variations. Our objective is to compare at a higher level than:
hyphen variation (there are different sorts of hyphens); caesura (the sequence X/-/endOfLine/Y needs to match
an entry XY in the lexicon to be distinguished from a hyphen binding a compound); upper/lower case variation;
plural; orthographic variation (“normalise” versus “normalize”); spelling errors (spellchecking is particularly useful when the PDF
is an image and the extraction is of low quality); and abbreviation (“NP” versus “Noun Phrase” or “HMM”
versus “Hidden Markov Model”). Some rubbish sequences of characters (e.g. a series of hyphens) were also
detected and cleaned.</p>
      <p>
        Given that a parser takes all these variations and cleanings into account, we decided to apply a full linguistic
parsing as a second strategy. The syntactic structures and relations are ignored. Then a module is called in order to bind the different names referring to the same entity, a process often labeled “entity linking”
in the literature [
        <xref ref-type="bibr" rid="ref19">Guo et al 2011</xref>
        ][
        <xref ref-type="bibr" rid="ref25">Moro et al 2014</xref>
        ]. This process is based on the “Global Atlas” knowledge base
[
        <xref ref-type="bibr" rid="ref13">Francopoulo et al 2013</xref>
        ], which comprises the LRE Map [Calzolari et al 2012]. Thus “British National Corpus”
is considered as possibly abbreviated to “BNC”, as are less regular names like “ItalWordNet”, possibly
abbreviated to “IWN”. Each entry of the knowledge base has a canonical form, possibly associated with
different variants: the aim is to normalize to the canonical form in order to neutralize proper noun obfuscations based on
variant substitutions. After this processing, only the sentences containing at least one verb are considered.
We examined the differences between these two strategies for all types of copy &amp; paste situations above
the threshold, choosing the LREC source as the focus. The results are presented in Table 2, with the last column
adding the two other columns without the duplicates produced by same-year couples.
The strategy based on linguistic processing yields more pairs (+158), and we examined these differences.
Among these pairs, the vast majority (80%) concern caesura: this is expected, because most conferences demand a
double-column format, so authors frequently use caesura to save space13. The other differences (20%) are
12 Also called “n-grams” in some NLP publications.
13 Concerning this specific problem, for instance, PACLIC and COLING, which are one-column formatted, give much better extraction quality
than LREC and ACL, which are two-column formatted.
mainly caused by lexical variations and spelling corrections. Thus, the results show that using raw texts gives a more
“silent” system. The drawback of linguistic processing is that the computation takes much longer14, but we think it is worth the cost.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Tuning parameters</title>
      <p>There are three parameters that had to be tuned: the window size, the distance function and the threshold. The
main problem we had was that we did not have any gold standard to evaluate the quality specifically on our
corpus and the burden to annotate a corpus was too heavy. We therefore decided to start from the parameters
presented in the articles related to the PAN contest. We then computed the results, picked a random selection of
pairs that we examined and tuned the parameters accordingly. All experiments were conducted with LREC as the
focus and NLP4NLP as the search space.</p>
      <p>
        In the PAN related articles, different window sizes are used. A window of five is the most frequent one
[
        <xref ref-type="bibr" rid="ref23">Kasprzak et al 2010</xref>
        ], but our results show that a lot of common sequences like “the linguistic unit is the”
overload the pairwise score. After some trials, we decided to select a size of seven tokens, in agreement with
[
        <xref ref-type="bibr" rid="ref9">Citron and Ginsparg 2014</xref>
        ].
      </p>
      <p>
        Concerning the distance function, the Jaccard distance is frequently used but let’s note that other formulas are
applicable and documented in the literature. For instance, some authors use an approximation with the following
formula: score(D1,D2) = shared windows# / min(D1 windows#, D2 windows#) [
        <xref ref-type="bibr" rid="ref11">Clough et al 2009</xref>
        ], which is
faster to compute, because there is no need to compute the union. Given that computation time is not a problem
for us, we kept the most used function which is the Jaccard distance.
      </p>
      <p>Concerning the threshold, we tried thresholds of 0.03 and 0.04 (3 to 4%) and compared the results. The latter
value gave more significant results, as it reduced noise while still allowing the detection of meaningful pairs of similar
papers.</p>
      <p>After running the first trials, we discovered that using the Jaccard distance could mark as similar a pair of
papers in which one has very little content. This may be the case for invited talks, for example, when the
author only provides a short abstract. In such a case, a simple acknowledgement of the same institution may produce
a similarity score higher than the threshold. The same happens for some of the oldest papers, when the OCR produced a
truncated document. To solve this problem, we added a second threshold on the minimum number of
shared windows, which we set at 50 after considering the corresponding erroneous cases.</p>
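      <p>The two thresholds can be combined into a single filtering predicate; the following sketch (illustrative, not the authors' code) uses the values retained above:</p>

```python
def retain_pair(shared_windows: int, union_windows: int,
                min_score: float = 0.04, min_shared: int = 50) -> bool:
    """Keep a pair of documents only if the Jaccard score reaches the
    0.04 threshold AND at least 50 windows are shared, so that very
    short documents (invited-talk abstracts, truncated OCR output)
    cannot pass on score alone."""
    if union_windows == 0:
        return False
    return (shared_windows / union_windows >= min_score
            and shared_windows >= min_shared)
```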
      <p>
        10. Special considerations concerning authorship and citations
As previously explained, our aim is to distinguish a copy &amp; paste fragment associated with a citation from
a fragment without any citation. To this end, we proceed with an approximation: we do not bind the exact
anchor in the text, but we parse the reference section and consider that, globally, the document cites
(or does not cite) the other document. Because we have proper author identification for each document, the
corpus forms a complex web of citations. We are thus able to distinguish self-reuse from self-plagiarism and
reuse from plagiarism. Our situation differs slightly from METER, where the references are not
linked. Recall that METER is the corpus usually involved in plagiarism detection competitions [
        <xref ref-type="bibr" rid="ref18">Gaizauskas
et al 2001</xref>
        ][
        <xref ref-type="bibr" rid="ref10">Clough et al 2002</xref>
        b].
      </p>
      <p>11. Clarification about the anteriority test
Given that papers and drafts of papers can circulate among researchers before the official
publication date, it is impossible to verify exactly when a document was issued; moreover, we have no
time indication more detailed than the year, as we do not know the precise date of submission. This is why we also
consider same-year pairs in the comparisons. In such cases, it is difficult to determine which paper is borrowing
and which is borrowed, and they may even have been written simultaneously. However, if one paper
cites the second while not being cited by it, this may serve as a sign that it is the
borrowing paper.</p>
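      <p>This heuristic can be stated compactly. The following sketch is illustrative only; it decides the borrowing direction for one pair of similar papers under the constraints described above (year granularity, one-way citation as a tie-breaker):</p>

```python
def borrowing_paper(year_a, year_b, a_cites_b, b_cites_a):
    """Decide which paper of a similar pair is the borrowing one.
    With no date finer than the publication year, same-year pairs
    are resolved only by a one-way citation."""
    if year_a != year_b:
        return "a" if year_a > year_b else "b"  # the later paper borrows
    if a_cites_b and not b_cites_a:
        return "a"
    if b_cites_a and not a_cites_b:
        return "b"
    return "undetermined"  # mutual, or no, citation within the same year
```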
      <p>12. Resulting files
The program computes a detailed result for each individual source as an HTML page where all similar pairs of
documents are listed with their similarity score, with the common fragments displayed as red highlighted
snippets and HTML links back to the original 67,937 documents15. For each of the 4 categories (Self-reuse,
Self-plagiarism, Reuse and Plagiarism), the program produces the list of couples of “similar” papers according to our
criteria, with their similarity score, and the global results in the form of matrices displaying the number of papers
14 It takes 25 hours instead of 3 hours on a mid-range mono-processor Xeon E3-1270 V2 with 32 GB of RAM.
15 Space limitations do not allow us to present these results in detail. Furthermore, we do not want to display personal results.
that are similar for each couple of the 34 sources, in the forward and backward directions (the using sources are
on the X axis, while the used sources are on the Y axis). The totals of used and using papers, and the difference
between these totals, are presented, while the 7 (Table 3) or 5 (Table 4) top using or used sources are indicated
in green.</p>
      <p>We conducted a manual check of the couples of papers showing a very high similarity: the 14 couples that
showed a similarity of 1 were duplications of a paper due to an error in editing the proceedings of a
conference. After those first trials, we also found erroneous OCR results for some of the oldest papers, which
resulted in files containing several papers, in full or in fragments, or in which blanks were inserted after each
individual character. We excluded those 86 documents from the corpus.</p>
      <p>Checking those results, we also noticed several cases where the author was the same but with a different
spelling, or where references were properly quoted but with a different wording, a different spelling (American
English versus British English, for example) or an improper reference to the source. We had to correct those
cases manually and move the corresponding couples of papers to the correct category (from reuse or plagiarism to
self-reuse or self-plagiarism in the case of author names; from plagiarism to reuse in the case of references).</p>
      <p>13. Self-reuse and Self-Plagiarism
Table 3 provides the results of merging self-reuse (authors reusing their own text while quoting the source paper)
and self-plagiarism (authors reusing their own text without quoting the source paper). As we see, it is a rather
frequent phenomenon, with a total of 12,493 documents (i.e. 18% of the 67,937 documents!). In 61% of the
cases (7,650 self-plagiarisms over 12,493), the authors do not quote the source paper. We found that 205 papers
have exactly the same title, and that 130 papers have both the same title and the same list of authors! Also 3,560
papers have exactly the same list of authors. Given the large number of documents, it is impossible to conduct a
manual checking of all the couples.</p>
      <p>We see that the most used sources are the large conferences: ISCA, IEEE-ICASSP, ACL, COLING, HLT,
EMNLP and LREC. The most using sources are not only those large conferences, but also the journals:
IEEE Transactions on Acoustics, Speech and Language Processing (under its various names) (TASLP), Computer
Speech and Language (CSAL), Computational Linguistics (CL) and Speech Communication. If we consider the balance
between the using and the used sources, we clearly see that the flow of papers goes from conferences to journals.</p>
      <p>The largest flows of self-reuse and self-plagiarism concern ISCA and ICASSP, in both directions, but especially
from ISCA to ICASSP, ICASSP and ISCA to TASLP (also in the reverse direction) and to CSAL, ISCA to
Speech Com, ACL to Computational Linguistics, ISCA to LREC and EMNLP to ACL.</p>
      <p>If we want to study the influence a given conference (or journal) has on another one, we must however recall
that these figures are raw figures in terms of number of documents, and we must not forget that some
conferences (or journals) are much bigger than others. For instance, LREC is a conference with more than 4,500
documents compared to LRE which is a journal with only 308 documents. If we relate the number of published
papers that reuse another paper to the total number of published papers, we may see that 17% of the LRE papers
(52 over 308) use content coming from the LREC conferences, without quoting them in 66% of the cases. Also
the frequency of the conferences (annual or biennial) and the calendar (date of the conference and of the
submission deadline) may influence the flow of papers between the sources.</p>
      <p>The similarity scores range from 4% to 97% (Fig. 1). We see that about 4,500 couples of papers have a similarity
score of at least 10%; about 900 (1.3% of the total number of papers) have a score of at least
30%. Looking at the couples with the largest similarity scores, we found a few examples of important variants in the
spelling of the same authors’ names, and cases of republishing the corrigendum of a previously published paper
or of republishing a paper with a small difference in the title and one author missing from the authors’ list. In one
case, the same research center is described by the same author in two different conferences with an overlap of
90%. In another case, the two papers differ primarily in the name of the systems being presented,
funded by the same project agency in two different contracts, while the description has a 45% overlap!
[Figure 1. Distribution of the similarity scores (0.00 to 1.00) of the self-reuse and self-plagiarism couples.]</p>
      <p>[Table 3. Matrix of self-reuse and self-plagiarism between the 34 sources: used sources versus using sources, with the totals of used and using papers and their difference.]</p>
      <p>14. Reuse and Plagiarism
Table 4 provides the results of merging reuse (authors reusing fragments of the texts of other authors while
quoting the source paper) and plagiarism (authors reusing fragments of the texts of other authors without quoting
the source paper). As we see, there are very few cases altogether. Only 261 papers (i.e. less than 0.4% of the
67,937 documents) reuse a fragment of papers written by other authors. In 60% of the cases (156
plagiarisms over 261), the authors do not quote the source paper, but these possible cases of plagiarism only
represent 0.23% of the total number of papers. Given those small numbers, we were able to conduct a manual
check of those couples.</p>
      <p>Among the couples of papers placed in the “Reuse” category, it appeared that 12 have at least one author in common,
but with a somewhat different spelling, and should therefore be placed in the “Self-reuse” category. Among the
couples of papers placed in the “Plagiarism” category, 25 have at least one author in common, but with a
somewhat different spelling, and should therefore be placed in the “Self-plagiarism” category; 14 correctly
quote the source paper, but with variants in the spelling of the authors’ names, of the paper’s title or of the
conference or journal source, or forget to place the source paper in the references, and should therefore be
placed in the “Reuse” category. This resulted in 107 cases of “reuse” and 117 possible cases of plagiarism
(0.17% of the papers) that we studied more closely. We found the following explanations:</p>
      <p>The paper cites another reference by the same authors as the source paper (typically an earlier
version, or a paper published in a journal) (46 cases)
Both papers use extracts of a third paper that they both cite (31 cases)
The authors of the two papers are different, but belong to the same laboratory (typically industrial
laboratories or funding agencies) (11 cases)
The authors previously co-authored papers (typically as supervisor and PhD student or postdoc) but are
now in different laboratories (11 cases)
The authors of the papers are different, but collaborated in the same project, which is presented in the
two papers (2 cases)
The two papers present the same short example, result or definition coming from another source (13
cases)
If we exclude those cases, only 3 possible cases of plagiarism remain, all corresponding to the same paper, which
appears as a patchwork of 3 other papers while sharing several references with them.</p>
      <p>The similarity scores range from 4% to 42% (Fig. 2). Only 34 pairs of papers have a similarity score equal to or
higher than 10%. For example, the pair showing the highest similarity score comprises a paper published in
1998 and a paper published in 2000, which both describe chart parsing using the wording of the original paper,
published 20 years earlier in 1980, that they both properly quote. Among the three remaining possible cases of
plagiarism, the highest similarity score is 10%, with a shared window of 200 tokens.</p>
      <p>[Figure 2: similarity scores of the paper pairs, sorted in decreasing order.]</p>
      <p>
15. Time delay between publication and reuse
We now consider the delay between the publication of a paper and its reuse (in all 4 categories) in another
publication. It appears that 38% of the similar papers were published in the same year, 71% within the next
year, 83% within 2 years and 93% within 3 years (Figures 3 and 4). Only 7% reuse material that is more than 3
years old. The average delay is 1.22 years. 30% of the similar papers published in the same year concern the
pair of conferences ISCA-ICASSP.</p>
      <p>[Figures 5 and 6: distribution of the delay between conference publication and journal reuse.]</p>
      <p>
We now consider the reuse of conference papers in journal papers (Figures 5 and 6). We observe here a similar
pattern, shifted by one year: 12% of the reused papers were published in the same year, 41% within
the next year, 68% within 2 years, 85% within 3 years and 93% within 4 years. Only 7% reuse material that is
more than 4 years old. The average delay is 2.07 years.</p>
      <p>Self-reuse and self-plagiarism are of a different nature. Let’s recall that they concern papers that have at least
one author in common. Of course, a copy &amp; paste operation is easy and frequent, but there is another phenomenon
to take into account which is difficult to distinguish from copy &amp; paste: the style of the author. Everybody
has habits when formulating their ideas and, even over a long period, most authors seem to keep the same
prepared chunks of words. As we’ve seen, almost 40% of the cases concern papers that are published in the same
year: authors submit two similar papers to two different conferences in the same year, and publish the two papers
at both conferences if both are accepted. It is very difficult to prevent those cases, as neither of the papers is
published when the other is submitted. Another frequent case is the publication of a paper in a journal after its
publication at a conference. Here also, it is a natural and usual process, sometimes even encouraged by
journal editors after a pre-selection of the best papers of a conference.
16 https://en.wikipedia.org/wiki/Right_to_quote</p>
      <p>
        In an attempt to moderate these figures and to identify legitimate justifications for the self-reuse and
self-plagiarism of previously published material, it is worth quoting Pamela
        <xref ref-type="bibr" rid="ref29">Samuelson [Samuelson 1994</xref>
        ]:
      </p>
      <p>The previous work must be restated to lay the groundwork for a new contribution in the second work,
Portions of the previous work must be repeated to deal with new evidence or arguments,
The audience for each work is so different that publishing the same work in different places is necessary to
get the message out,</p>
      <p>The authors think they said it so well the first time that it makes no sense to say it differently a second time.</p>
      <p>She considers that 30% is an upper limit for the reuse of parts of a previously published paper.</p>
      <p>We believe that following these two sets of principles regarding (self-)reuse and plagiarism will help maintain
an ethical behavior in our community.</p>
      <p>
        17. Further developments
A limitation of our approach is that it fails to identify copy &amp; paste when the original text has been strongly
altered. Our study of graphical variations of a common meaning is presently limited to geographical variants,
technical abbreviations (e.g. HMM versus Hidden Markov Model) and resource name aliases from the LRE
Map. We plan to deal with “rogeting”, the practice of replacing words with supposedly synonymous
alternatives in order to disguise plagiarism17 by obfuscation, see [
        <xref ref-type="bibr" rid="ref27">Potthast et al 2010</xref>
        ][
        <xref ref-type="bibr" rid="ref8">Chong et al 2011</xref>
        ][
        <xref ref-type="bibr" rid="ref7">Ceska et
al 2009</xref>
        ] for another presentation. Detecting paraphrases and transpositions of passive / active sentences seems,
in contrast, rather difficult to implement [
        <xref ref-type="bibr" rid="ref2">Barron-Cedeno et al 2013</xref>
        ]. A more tractable development would be to
artificially modify the n-grams to be matched, as presented in [
        <xref ref-type="bibr" rid="ref26">Nawab et al 2012</xref>
        ]. Another track of development could
be to simplify the input to retain only the stopwords, a process labeled “stopword n-grams” by [
        <xref ref-type="bibr" rid="ref30 ref31">Stamatatos
2011</xref>
        b].
      </p>
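      <p>The “stopword n-gram” idea mentioned above can be sketched in a few lines. This is an illustrative reconstruction of the general technique, not the authors’ implementation: the stopword list below is a small sample and the window size n=3 is an arbitrary choice.</p>
      <p>
```python
import re

# Small illustrative sample; a real system would use a curated list
# of the most frequent function words of the language.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "that"}

def stopword_ngrams(text, n=3):
    """Keep only the stopwords of the text, in order of appearance,
    then form the n-grams of that reduced sequence."""
    kept = [w for w in re.findall(r"\w+", text.lower()) if w in STOPWORDS]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]
```
Because content words are discarded, two texts that paraphrase each other with synonyms can still share the same stopword n-gram sequence, which is what makes the representation robust to obfuscation.</p>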
      <p>
        Another direction of improvement is to isolate and ignore tables in order to reduce noise, but this is a complex
task as documented in [
        <xref ref-type="bibr" rid="ref17">Frey et al 2015</xref>
        ]. Let’s note that this is not a big problem in our approach, as we ignore
sentences without any verb, and verbs are not very frequent within a table.
      </p>
      <p>More generally, we could also study the position and rhetorical structure of the copied fragments in order to
identify and justify their function.</p>
      <p>We may finally explore whether copy &amp; paste is more common among non-native English speakers, as they
frequently publish first in their native language at a national conference and then in English at an
international conference or in an international journal, in order to broaden their audience.</p>
      <p>18. Conclusions
To our knowledge, this paper is the first to report results on the study of copy &amp; paste operations on corpora
of NLP archives of this size. Based on a simple n-gram comparison after NLP-based text processing, the
method is easy to implement. Of course, the process makes a large number of pairwise comparisons
(65,000 × 65,000), which still represents a practical computing limitation.</p>
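      <p>The core of such an n-gram comparison can be sketched as follows. This is a minimal illustration of the general technique, under assumptions of our own (word 6-grams and a containment-style score); the paper’s exact window size, preprocessing and scoring are not reproduced here.</p>
      <p>
```python
import re

def word_ngrams(text, n=6):
    """Lowercase, tokenize on word characters, and return the set of word n-grams."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(text_a, text_b, n=6):
    """Fraction of text_a's n-grams that also occur in text_b (0.0 to 1.0)."""
    grams_a, grams_b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)
```
Scoring every pair of a 65,000-document collection with such a function is quadratic in the number of documents, which is exactly the practical computing limitation noted above; hashing the n-grams into an inverted index is the usual way to avoid comparing all pairs explicitly.</p>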
      <p>As our measures show, self-reuse and self-plagiarism are common practices. This is not specific to our field and
is certainly related to the current tendency of “salami-slicing” publication caused by the
publish-and-perish demand18. We gladly note, however, that plagiarism is very uncommon in our community.</p>
      <p>19. Bibliographical references
17 https://en.wikipedia.org/wiki/Rogeting
18 In this regard, we must ourselves admit that the reader will find a certain degree of overlap between this paper and the one we
published at LREC 2016, also on reuse and plagiarism but specifically related to the LREC papers, at least in the description of the
NLP4NLP corpus.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Barron-Cedeno</surname>
            <given-names>Alberto</given-names>
          </string-name>
          , Potthast Martin,
          <string-name>
            <given-names>Rosso Paolo</given-names>
            , Stein Benno, Eiselt
            <surname>Andreas</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Corpus and Evaluation Measures for Automatic Plagiarism Detection</article-title>
          , Proceedings of LREC, Valletta, Malta.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Barron-Cedeno</surname>
            <given-names>Alberto</given-names>
          </string-name>
          , Vila Marta, Marti Maria Antonia, Rosso
          <string-name>
            <surname>Paolo</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Plagiarism Meets Paraphrasing Insights for the Next Generation in Automatic Plagiarism Detection</article-title>
          , Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Bensalem</given-names>
            <surname>Imene</surname>
          </string-name>
          , Rosso Paolo, Chikhi
          <string-name>
            <surname>Salim</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Intrinsic Plagiarism Detection using N-gram Classes</article-title>
          ,
          <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , Doha, Qatar.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Bird</given-names>
            <surname>Steven</surname>
          </string-name>
          , Dale Robert,
          <string-name>
            <surname>Dorr Bonnie</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            <given-names>Bryan</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joseph Mark</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
          </string-name>
          Min-Yen, Lee Dongwon, Powley Brett,
          <string-name>
            <surname>Radev Dragomir</surname>
            <given-names>R</given-names>
          </string-name>
          , Tan Yee Fan (
          <year>2008</year>
          ).
          <article-title>The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics</article-title>
          , Proceedings of LREC, Marrakech, Morocco.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Calzolari</given-names>
            <surname>Nicoletta</surname>
          </string-name>
          ,
          <string-name>
            <surname>Del Gratta</surname>
            <given-names>Riccardo</given-names>
          </string-name>
          , Francopoulo Gil, Mariani Joseph, Rubino Francesco,
          Russo Irene, Soria
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Claudia</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <source>The LRE Map. Harmonising Community Descriptions of Resources, Proceedings of LREC</source>
          , Istanbul, Turkey.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Ceska</given-names>
            <surname>Zdenek</surname>
          </string-name>
          , Fox
          <string-name>
            <surname>Chris</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <source>The Influence of Text Pre-processing on Plagiarism Detection, Proceedings of the Recent Advances in Natural Language Processing</source>
          , Borovets, Bulgaria.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Chong</given-names>
            <surname>Miranda</surname>
          </string-name>
          , Specia
          <string-name>
            <surname>Lucia</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Lexical Generalisation for Word-level Matching in Plagiarism Detection</article-title>
          ,
          <source>Proceedings of Recent Advances in Natural Language Processing</source>
          , Hissar, Bulgaria.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Citron Daniel</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginsparg Paul</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Patterns of text reuse in a scientific corpus</article-title>
          ,
          <source>PNAS</source>
          <year>2015</year>
          112 (
          <issue>1</issue>
          )
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          ; published ahead of print December 8, 2014, doi:10.1073/pnas.1415135111. Clough Paul, Gaizauskas Robert,
          <string-name>
            <given-names>Piao Scott S L</given-names>
            ,
            <surname>Wilks Yorick</surname>
          </string-name>
          (
          <year>2002a</year>
          ).
          <source>Measuring Text Reuse. Proceedings of ACL'02</source>
          , Philadelphia, USA.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Clough</given-names>
            <surname>Paul</surname>
          </string-name>
          , Gaizauskas Robert,
          <string-name>
            <surname>Piao Scott S L</surname>
          </string-name>
          , (
          <year>2002b</year>
          ).
          <article-title>Building and annotating a corpus for the study of journalistic text reuse</article-title>
          ,
          <source>Proceedings of LREC</source>
          , Las Palmas, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Clough</given-names>
            <surname>Paul</surname>
          </string-name>
          , Stevenson Mark (
          <year>2009</year>
          ).
          <article-title>Developing a Corpus of Plagiarised Short Answers, Language Resources</article-title>
          and Evaluation, Springer.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Councill</surname>
          </string-name>
          , Isaac G.,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          and Kan,
          <string-name>
            <surname>Min-Yen</surname>
          </string-name>
          (
          <year>2008</year>
          ),
          <article-title>ParsCit: An open-source CRF reference string parsing package</article-title>
          .
          <source>In Proceedings of the Language Resources and Evaluation Conference (LREC</source>
          <year>2008</year>
          ), Marrakesh, Morocco, May 2008
          <string-name>
            <given-names>Francopoulo</given-names>
            <surname>Gil</surname>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>TagParser: well on the way to ISO-TC37 conformance</article-title>
          .
          <source>Proceedings of ICGL (International Conference on Global Interoperability for Language Resources)</source>
          ,
          <source>Hong Kong.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Francopoulo</given-names>
            <surname>Gil</surname>
          </string-name>
          , Marcoul Frédéric, Causse David,
          <string-name>
            <given-names>Piparo</given-names>
            <surname>Grégory</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Global Atlas: Proper Nouns, from Wikipedia to LMF, in LMF Lexical Markup Framework (Francopoulo</article-title>
          , ed), ISTE Wiley.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Francopoulo</given-names>
            <surname>Gil</surname>
          </string-name>
          , Mariani Joseph, Paroubek
          <string-name>
            <surname>Patrick</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>NLP4NLP: the cobbler's children won't go unshod, in DLib Magazine: The magazine of Digital Library Research19</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Francopoulo</given-names>
            <surname>Gil</surname>
          </string-name>
          , Mariani Joseph, Paroubek
          <string-name>
            <surname>Patrick</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>A Study of Reuse and Plagiarism in LREC papers</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>Proceedings of LREC</source>
          <year>2016</year>
          , Portorož, Slovenia.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Frey</given-names>
            <surname>Matthias</surname>
          </string-name>
          , Kern
          <string-name>
            <surname>Roman</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Efficient Table Annotation for Digital Articles, in D-Lib Magazine: The magazine of Digital Library Research20</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Gaizauskas</given-names>
            <surname>Robert</surname>
          </string-name>
          , Foster Jonathan, Wilks Yorick,
          <string-name>
            <surname>Arundel John</surname>
          </string-name>
          , Clough Paul,
          <string-name>
            <surname>Piao Scott S L</surname>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>The METER Corpus: A Corpus for Analysing Journalistic Text Reuse</article-title>
          .
          <source>Proceedings of the Corpus Linguistics Conference</source>
          , Lancaster, UK.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Guo</given-names>
            <surname>Yuhang</surname>
          </string-name>
          , Che Wanxiang, Liu Ting, Li
          <string-name>
            <surname>Sheng</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>A Graph-based Method for Entity Linking</article-title>
          , International Joint Conference on NLP, Chiang Mai, Thailand.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Gupta</given-names>
            <surname>Parth</surname>
          </string-name>
          , Rosso
          <string-name>
            <surname>Paolo</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Text Reuse with ACL: (Upward) Trends</article-title>
          ,
          <source>Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries</source>
          , Jeju, Republic of Korea.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Hoad Timothy</surname>
            <given-names>C</given-names>
          </string-name>
          , Zobel
          <string-name>
            <surname>Justin</surname>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Methods for identifying Versioned and Plagiarised Documents</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>HaCohen-Kerner</surname>
            <given-names>Yaakov</given-names>
          </string-name>
          , Tayeb Aharon,
          <string-name>
            <surname>Ben-Dror Natan</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Detection of Simple Plagiarism in Computer Science Papers</article-title>
          ,
          <source>in Proceedings of the 23rd International Conference on Computational Linguistics (COLING)</source>
          , Beijing, PRC.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Kasprzak</given-names>
            <surname>Jan</surname>
          </string-name>
          , Brandejs
          <string-name>
            <surname>Michal</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Improving the Reliability of the Plagiarism Detection System Lab</article-title>
          ,
          <source>in Proceedings of the Uncovering Plagiarism, Authorship and Social Software Misuse (PAN)</source>
          , Padua, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Lyon</given-names>
            <surname>Caroline</surname>
          </string-name>
          , Malcolm James, Dickerson
          <string-name>
            <surname>Bob</surname>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Detecting Short Passages of Similar Text in large document collections</article-title>
          ,
          <source>Proc. of the Empirical Methods in Natural Language Processing Conference</source>
          , Pittsburgh, PA USA.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Moro</given-names>
            <surname>Andrea</surname>
          </string-name>
          , Raganato Alessandro, Navigli
          <string-name>
            <surname>Roberto</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Entity Linking meets Word Sense Disambiguation: a Unified Approach, Transactions of the Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Nawab</given-names>
            <surname>Rao Muhammad Adeel</surname>
          </string-name>
          , Stevenson Mark, Clough Paul (
          <year>2012</year>
          ).
          <article-title>Detecting Text Reuse with Modified and Weighted N-grams</article-title>
          ,
          <source>First Joint Conference on Lexical and Computational Semantics</source>
          , Montréal, Canada.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Potthast</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stein Benno</given-names>
            ,
            <surname>Barron-Cedeno</surname>
          </string-name>
          <string-name>
            <given-names>Alberto</given-names>
            , Rosso
            <surname>Paolo</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>An Evaluation Framework for Plagiarism Detection</article-title>
          ,
          <source>in Proceedings of the 23rd International Conference on Computational Linguistics (COLING)</source>
          , Beijing, PRC.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Radev Dragomir</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muthukrishnan</surname>
            <given-names>Pradeep</given-names>
          </string-name>
          , Qazvinian Vahed, Abu-Jbara Amjad (
          <year>2013</year>
          ).
          <source>The ACL Anthology Network Corpus, Language Resources and Evaluation</source>
          <volume>47</volume>
          :
          <fpage>919</fpage>
          -
          <lpage>944</lpage>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Samuelson</given-names>
            <surname>Pamela</surname>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Self-plagiarism or fair use?</article-title>
          <source>Communications of the ACM</source>
          <volume>37</volume>
          (
          <issue>8</issue>
          ):
          <fpage>21</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Stamatatos</given-names>
            <surname>Efstathios</surname>
          </string-name>
          , Koppel
          <string-name>
            <surname>Moshe</surname>
          </string-name>
          (
          <year>2011a</year>
          ).
          <article-title>Plagiarism and authorship analysis: introduction to the special issue</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Stamatatos</given-names>
            <surname>Efstathios</surname>
          </string-name>
          (
          <year>2011b</year>
          ).
          <article-title>Plagiarism detection using stopword n-grams</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Stein</given-names>
            <surname>Benno</surname>
          </string-name>
          , Lipka Nedim, Prettenhofer
          <string-name>
            <surname>Peter</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Intrinsic plagiarism analysis</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Vilnat</given-names>
            <surname>Anne</surname>
          </string-name>
          , Paroubek Patrick, Villemonte de la Clergerie Eric, Francopoulo Gil, Guénot
          <string-name>
            <surname>Marie-Laure</surname>
          </string-name>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>PASSAGE Syntactic Representation: a Minimal Common Ground for Evaluation</article-title>
          .
          <source>Proceedings of LREC</source>
          <year>2010</year>
          , Valletta, Malta.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>